csc-656-coding-project-4/analysis.md

# Performance Table

| Implementation      | Execution Time (ms) | MFLOP/s      | Mem Bandwidth B/s | Mem Bandwidth GB/s |
|---------------------|---------------------|--------------|-------------------|--------------------|
| CPU                 | 985.325             | 519.6255043  | 6538401993        | 6.5384             |
| CUDA: 1t 1b         | 50918.5392          | 10.05527668  | 126524661.7       | 0.1265             |
| CUDA: 256t 1b       | 1689.7723           | 302.9994041  | 3812614838        | 3.8126             |
| CUDA: 256t many b   | 1219.1305           | 419.9714469  | 5284463758        | 5.2845             |
| CUDA prefetch       | 329.9085            | 1551.945464  | 19527993198       | 19.5280            |

# Analysis Questions

1. What is the MFLOP/s performance gain going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

*Answer*:

$$
\text{Performance gain} = \frac{1551.945464}{519.6255043} = 2.98666145 \approx \boxed{\text{3x gain}}
$$
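The same division can be sketched in Python using the MFLOP/s values from the performance table. The helper name `gain` is hypothetical and exists only for illustration.

```python
# MFLOP/s values copied from the performance table.
CPU_MFLOPS = 519.6255043
CUDA_PREFETCH_MFLOPS = 1551.945464

def gain(new: float, baseline: float) -> float:
    """Return how many times faster `new` is than `baseline` (hypothetical helper)."""
    return new / baseline

mflops_gain = gain(CUDA_PREFETCH_MFLOPS, CPU_MFLOPS)
print(f"{mflops_gain:.2f}x")  # prints 2.99x, i.e. roughly 3x
```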

2. What is the memory bandwidth performance gain (or loss) going from the CPU-only code to the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call)? Please report this gain in terms of a multiplier, e.g., 2.5x, rather than with an absolute number or a percentage. Show your work on how you compute this result.

*Answer*:

$$
\text{Mem bandwidth gain/loss} = \frac{19527993198}{6538401993} = 2.98666145 \approx \boxed{\text{3x gain}}
$$
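The bandwidth multiplier matches the MFLOP/s multiplier to every digit shown, which is what one would expect if both metrics are derived from the same measured run time over a fixed amount of work. A minimal Python sketch of that cross-check follows, using only values from the performance table. The variable names are illustrative, and the fixed-work premise is an assumption rather than something the table states.

```python
# Values copied from the performance table.
cpu_time_ms, gpu_time_ms = 985.325, 329.9085
cpu_bw_bytes_per_s, gpu_bw_bytes_per_s = 6538401993, 19527993198

bandwidth_gain = gpu_bw_bytes_per_s / cpu_bw_bytes_per_s
time_ratio = cpu_time_ms / gpu_time_ms  # shorter run time means larger gain

# Assumption: both runs move the same number of bytes, so the two ratios
# should agree up to the rounding of the table entries.
print(f"bandwidth gain: {bandwidth_gain:.4f}x, time ratio: {time_ratio:.4f}x")
```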

3. For the final version of your CUDA code (the one with the cudaMemPrefetchAsync() call), what is the total number of concurrent threads being run? Show your work on how you arrive at this result.

*Answer*:

Given:

- 256 threads per block
- 2,097,152 thread blocks
- N = 536,870,912

$$
\begin{aligned}
\text{Total concurrent threads} &= 256 \times 2{,}097{,}152 \\
&= \boxed{536{,}870{,}912 \text{ threads}}
\end{aligned}
$$
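The arithmetic can be reproduced with a short Python sketch. It takes the 256 threads per block used in the multiplication above and assumes the grid is sized with a ceiling division, a common CUDA launch idiom that this write-up does not itself show. The variable names are illustrative, and the result counts the threads launched across the whole grid.

```python
N = 536_870_912          # problem size given in the answer
THREADS_PER_BLOCK = 256  # block size used in the multiplication

# Ceiling division so every element is covered even when N is not a
# multiple of the block size (assumed launch idiom, not from the source).
num_blocks = (N + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
total_threads = num_blocks * THREADS_PER_BLOCK

print(num_blocks)     # 2097152
print(total_threads)  # 536870912
```

Because N is an exact multiple of 256 here, the ceiling division adds no extra block and the total equals N.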