Performance Table

| Implementation | Execution Time (ms) | MFLOP/s | Memory Bandwidth (B/s) | Memory Bandwidth (GB/s) |
| --- | --- | --- | --- | --- |
| CPU | 985.325 | 519.6255043 | 6538401993 | 6.5384 |
| CUDA: 1 thread, 1 block | 50918.5392 | 10.05527668 | 126524661.7 | 0.1265 |
| CUDA: 256 threads, 1 block | 1689.7723 | 302.9994041 | 3812614838 | 3.8126 |
| CUDA: 256 threads/block, many blocks | 1219.1305 | 419.9714469 | 5284463758 | 5.2845 |
| CUDA prefetch | 329.9085 | 1551.945464 | 19527993198 | 19.528 |
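The derived columns can be reproduced from the execution times alone. The sketch below is not the original measurement code; it assumes one floating-point operation and 12 bytes of memory traffic per element (two 4-byte reads and one 4-byte write), and that "MFLOP" means 2^20 operations. These assumptions are inferred from the table values, which they reproduce, rather than stated in the source.

```python
# Sketch: recompute the derived table columns from execution time.
# Assumptions (inferred from the table, not stated in the source):
#   - 1 FLOP per element
#   - 12 bytes moved per element (2 float reads + 1 float write)
#   - "MFLOP" = 2**20 FLOPs
N = 536_870_912           # number of elements, from the thread-count answer
FLOP_PER_ELEM = 1         # assumed
BYTES_PER_ELEM = 12       # assumed
MEBI = 2 ** 20


def mflops(time_ms: float) -> float:
    """MFLOP/s for one pass over N elements taking time_ms milliseconds."""
    return N * FLOP_PER_ELEM / MEBI / (time_ms / 1000.0)


def bandwidth_bytes_per_s(time_ms: float) -> float:
    """Memory bandwidth in bytes per second for the same pass."""
    return N * BYTES_PER_ELEM / (time_ms / 1000.0)


timings_ms = {
    "CPU": 985.325,
    "CUDA: 1 thread, 1 block": 50918.5392,
    "CUDA: 256 threads, 1 block": 1689.7723,
    "CUDA: 256 threads/block, many blocks": 1219.1305,
    "CUDA prefetch": 329.9085,
}

for name, t in timings_ms.items():
    bw = bandwidth_bytes_per_s(t)
    print(f"{name}: {mflops(t):.4f} MFLOP/s, {bw / 1e9:.4f} GB/s")
```

Because both metrics divide a fixed amount of work by the execution time, every row's MFLOP/s and bandwidth scale by the same factor relative to the CPU row.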

Analysis Questions

What is the MFLOP/s performance gain going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

Answer:


$$
\text{Performance gain} = \frac{1551.945464 \text{ MFLOP/s}}{519.6255043 \text{ MFLOP/s}} = 2.98666145 \approx \boxed{3\times \text{ gain}}
$$

What is the memory bandwidth performance gain (or loss) going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

Answer:


$$
\text{Memory bandwidth gain} = \frac{19527993198 \text{ B/s}}{6538401993 \text{ B/s}} = 2.98666145 \approx \boxed{3\times \text{ gain}}
$$
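The MFLOP/s gain and the memory bandwidth gain come out identical, which is expected if both versions do the same work and move the same number of bytes: each ratio then reduces to the ratio of execution times. A minimal check using only the values from the performance table (the variable names are illustrative):

```python
# Sketch: the two gains and the execution-time ratio should agree.
# All numbers are copied from the performance table.
cpu_ms, prefetch_ms = 985.325, 329.9085
cpu_mflops, prefetch_mflops = 519.6255043, 1551.945464
cpu_bw, prefetch_bw = 6538401993, 19527993198

flops_gain = prefetch_mflops / cpu_mflops
bandwidth_gain = prefetch_bw / cpu_bw
time_gain = cpu_ms / prefetch_ms   # speedup measured directly from timings

print(f"MFLOP/s gain:   {flops_gain:.4f}x")
print(f"Bandwidth gain: {bandwidth_gain:.4f}x")
print(f"Time ratio:     {time_gain:.4f}x")
```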

For the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call), what is the total number of concurrent threads being run? Show your work on how you arrive at this result.

Answer:


Given:

- Threads per block: 256
- N = 536,870,912 elements
- Thread blocks: N / 256 = 2,097,152

$$
\text{Total concurrent threads} = 256 \times 2{,}097{,}152 = \boxed{536{,}870{,}912 \text{ threads}}
$$

This counts every thread launched in the grid, one per element of N.
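The thread count follows from the launch configuration. The sketch below assumes the block count is computed with the usual round-up division so that every element gets a thread; the source only reports the resulting numbers, so the formula itself is an assumption.

```python
# Sketch: grid size and total launched threads for N elements.
# The ceiling-division formula is assumed; the source gives only the results.
N = 536_870_912            # number of elements
THREADS_PER_BLOCK = 256

# Round up so a partial final block would still be launched.
num_blocks = (N + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
total_threads = num_blocks * THREADS_PER_BLOCK

print(f"blocks = {num_blocks:,}, total threads = {total_threads:,}")
```

Since N is an exact multiple of 256 here, the total number of threads equals N and no thread is left without an element.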
