# Report

| Implementation | Elapsed Time (ms) | MFLOP/s | Memory Bandwidth (GB/s) |
| --- | --- | --- | --- |
| CPU Vector Addition | xx | xx | xx |
| CUDA 1 thread, 1 thread block | 1,203.31 | 425.59 | 5.35 |
| CUDA 256 threads, 1 thread block | 1,212.36 | 422.76 | 5.31 |
| CUDA 256 threads/block, many thread blocks | 1,232.73 | 415.16 | 5.24 |
| CUDA 256 threads/block, many blocks, prefetching | 4.77 | 112,591.01 | 1,349.96 |
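
The MFLOP/s and bandwidth columns are derived from the elapsed time. The report does not state how, so the following is only a sketch of one common way to compute them for a vector addition `c[i] = a[i] + b[i]`. It assumes one floating-point operation per element, 4-byte floats, and three memory accesses per element (two reads, one write). The function name `vector_add_metrics`, its parameters, and the example values are hypothetical and not taken from the project code.

```python
def vector_add_metrics(n_elements, elapsed_ms, bytes_per_element=4):
    """Return (MFLOP/s, GB/s) for one vector addition c[i] = a[i] + b[i].

    Assumptions (not confirmed by the report): one add per element,
    and each element costs two reads plus one write.
    """
    seconds = elapsed_ms / 1000.0
    flops = n_elements                                 # one addition per element
    bytes_moved = 3 * n_elements * bytes_per_element   # read a, read b, write c
    mflops = flops / seconds / 1e6                     # decimal mega (10^6)
    gb_per_s = bytes_moved / seconds / 1e9             # decimal giga (10^9)
    return mflops, gb_per_s


# Hypothetical example values, not measurements from the report.
mflops, gbps = vector_add_metrics(n_elements=1_000_000, elapsed_ms=2.0)
print(f"{mflops:.2f} MFLOP/s, {gbps:.2f} GB/s")
```

Under these assumptions both metrics scale inversely with elapsed time, so a run that is several hundred times faster shows a correspondingly higher MFLOP/s and bandwidth figure.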