Performance Table

| Implementation | Execution Time (ms) | MFLOP/s | Memory Bandwidth (B/s) | Memory Bandwidth (GB/s) |
| --- | --- | --- | --- | --- |
| CPU | 985.325 | 519.6255043 | 6538401993 | 6.5384 |
| CUDA: 1t 1b | 50918.5392 | 10.05527668 | 126524661.7 | 0.1265 |
| CUDA: 256t 1b | 1689.7723 | 302.9994041 | 3812614838 | 3.8126 |
| CUDA: 256t many b | 1219.1305 | 419.9714469 | 5284463758 | 5.2845 |
| CUDA prefetch | 329.9085 | 1551.945464 | 19527993198 | 19.528 |
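
The derived columns in the table can be reproduced from the execution times alone. The sketch below is a consistency check, not the author's measurement code: the constants `FLOPS_PER_ELEMENT`, `BYTES_PER_ELEMENT`, and `MEGA` are assumptions inferred from the table values, and `N` is taken from the answer to question 3.

```python
# Consistency check for the performance table.
# ASSUMPTIONS (inferred from the table, not stated in the write-up):
N = 536_870_912          # problem size, taken from analysis question 3
FLOPS_PER_ELEMENT = 1    # assumed: one floating-point operation per element
BYTES_PER_ELEMENT = 12   # assumed: 12 bytes of memory traffic per element
MEGA = 2**20             # assumed: the table's "M" means 2^20, not 10^6


def mflops(time_ms: float) -> float:
    """MFLOP/s for a run that took time_ms milliseconds."""
    seconds = time_ms / 1000.0
    return N * FLOPS_PER_ELEMENT / MEGA / seconds


def bandwidth_bytes_per_s(time_ms: float) -> float:
    """Memory bandwidth in B/s for a run that took time_ms milliseconds."""
    seconds = time_ms / 1000.0
    return N * BYTES_PER_ELEMENT / seconds


# Execution times copied from the table above.
TIMES_MS = {
    "CPU": 985.325,
    "CUDA: 1t 1b": 50918.5392,
    "CUDA: 256t 1b": 1689.7723,
    "CUDA: 256t many b": 1219.1305,
    "CUDA prefetch": 329.9085,
}

if __name__ == "__main__":
    for name, t in TIMES_MS.items():
        bw = bandwidth_bytes_per_s(t)
        print(f"{name:20s} {mflops(t):14.4f} MFLOP/s {bw / 1e9:8.4f} GB/s")
```

Under these assumptions every row depends only on its execution time, so any ratio between two rows is the same for MFLOP/s and for bandwidth.
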

Analysis Questions

  1. What is the MFLOP/s performance gain going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

Answer:


$$
\text{Performance gain} = \frac{1551.945464}{519.6255043} = 2.98666145 \approx \boxed{\text{3x gain}}
$$

  2. What is the memory bandwidth performance gain (or loss) going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

Answer:


$$
\text{Mem bandwidth gain/loss} = \frac{19527993198}{6538401993} = 2.98666145 \approx \boxed{\text{3x gain}}
$$

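
The MFLOP/s gain and the bandwidth gain come out identical. A plausible reason, assuming both versions perform the same number of operations and move the same number of bytes, is that each metric is a fixed quantity divided by the execution time, so both gains reduce to the ratio of the two execution times. The sketch below checks this using only values copied from the table; the variable names are illustrative.

```python
# Values copied from the performance table (CPU row and CUDA prefetch row).
cpu = {"time_ms": 985.325, "mflops": 519.6255043, "bw": 6538401993}
gpu = {"time_ms": 329.9085, "mflops": 1551.945464, "bw": 19527993198}

# Throughput gains: improved value divided by baseline value.
mflops_gain = gpu["mflops"] / cpu["mflops"]
bw_gain = gpu["bw"] / cpu["bw"]

# Time speedup: baseline time divided by improved time (lower is better).
time_speedup = cpu["time_ms"] / gpu["time_ms"]

if __name__ == "__main__":
    print(f"MFLOP/s gain:   {mflops_gain:.8f}x")
    print(f"Bandwidth gain: {bw_gain:.8f}x")
    print(f"Time speedup:   {time_speedup:.8f}x")
```

All three ratios agree to within the rounding of the table values.
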
  3. For the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call), what is the total number of concurrent threads being run? Show your work on how you arrive at this result.

Answer:


Given:

- 256 threads per block
- 2,097,152 thread blocks
- N = 536,870,912

$$
\text{Total concurrent threads} = 256 \times 2{,}097{,}152 = 536{,}870{,}912
$$

This total equals N.

$$
\boxed{\text{536,870,912 threads}}
$$
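
The block count and thread total can be checked with the ceiling-division idiom commonly used to size a CUDA grid. This is a sketch of the arithmetic only; it assumes the launch uses 256 threads per block, and `num_blocks` is a hypothetical helper, not a function from the project code.

```python
N = 536_870_912          # problem size from the answer above
THREADS_PER_BLOCK = 256  # assumed block size, matching the "256t" table rows


def num_blocks(n: int, threads_per_block: int) -> int:
    """Smallest number of blocks whose threads cover all n elements."""
    return (n + threads_per_block - 1) // threads_per_block


blocks = num_blocks(N, THREADS_PER_BLOCK)
total_threads = blocks * THREADS_PER_BLOCK

if __name__ == "__main__":
    print(f"blocks        = {blocks:,}")
    print(f"total threads = {total_threads:,}")
```

Because N is an exact multiple of 256, the ceiling division adds no extra threads and the total equals N.
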