# Profiler Outputs

CPU-Only
---

985.325 ms

CUDA: 1 Thread, 1 Block
---

```
uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_1t
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_1t
WARNING: vecadd_gpu_1t and any of its children processes will be profiled.

Max error: 0
Generating '/tmp/nsys-report-135f.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls          Avg (ns)          Med (ns)        Min (ns)        Max (ns)    StdDev (ns)  Name
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  -------------  ----------------------
     98.5   50,918,539,217          1  50,918,539,217.0  50,918,539,217.0  50,918,539,217  50,918,539,217            0.0  cudaDeviceSynchronize
      0.6      335,502,692          2     167,751,346.0     167,751,346.0          70,647     335,432,045  237,136,318.7  cudaMallocManaged
      0.5      247,516,551          1     247,516,551.0     247,516,551.0     247,516,551     247,516,551            0.0  cudaLaunchKernel
      0.4      198,199,002          2      99,099,501.0      99,099,501.0      68,308,537     129,890,465   43,544,998.9  cudaFree
      0.0            1,293          1           1,293.0           1,293.0           1,293           1,293            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances          Avg (ns)          Med (ns)        Min (ns)        Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  -----------  --------------------------
    100.0   50,918,525,966          1  50,918,525,966.0  50,918,525,966.0  50,918,525,966  50,918,525,966          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)   Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     65.2      206,887,363  24,576   8,418.3   3,647.5     1,982    41,312     11,229.0  [CUDA memcpy Unified Host-to-Device]
     34.8      110,515,732  12,288   8,993.8   3,135.5     1,726    48,416     12,346.7  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)   Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)  Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967  24,576     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
/pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
/pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite
```

CUDA: 256 Threads, One Block
---

```
uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t
WARNING: vecadd_gpu_256t and any of its children processes will be profiled.

Max error: 0
Generating '/tmp/nsys-report-a2a6.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls         Avg (ns)         Med (ns)       Min (ns)       Max (ns)    StdDev (ns)  Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  ----------------------
     71.3    1,689,772,282          1  1,689,772,282.0  1,689,772,282.0  1,689,772,282  1,689,772,282            0.0  cudaDeviceSynchronize
     13.2      312,348,647          2    156,174,323.5    156,174,323.5         66,999    312,281,648  220,769,095.5  cudaMallocManaged
      8.4      200,018,555          2    100,009,277.5    100,009,277.5     70,954,629    129,063,926   41,089,478.0  cudaFree
      7.0      166,925,228          1    166,925,228.0    166,925,228.0    166,925,228    166,925,228            0.0  cudaLaunchKernel
      0.0            1,031          1          1,031.0          1,031.0          1,031          1,031            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances         Avg (ns)         Med (ns)       Min (ns)       Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  --------------------------
    100.0    1,689,768,383          1  1,689,768,383.0  1,689,768,383.0  1,689,768,383  1,689,768,383          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)   Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     65.4      208,609,295  24,576   8,488.3   3,920.5     1,982    41,568     11,229.4  [CUDA memcpy Unified Host-to-Device]
     34.6      110,590,888  12,288   8,999.9   3,167.5     1,726    48,384     12,351.0  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)   Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)  Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967  24,576     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
/pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
/pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite
```

CUDA: 256 Threads, Many Blocks
---

```
uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t_mb
WARNING: vecadd_gpu_256t_mb and any of its children processes will be profiled.

Number of thread blocks: 2097152
Max error: 0
Generating '/tmp/nsys-report-cbad.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls         Avg (ns)         Med (ns)       Min (ns)       Max (ns)    StdDev (ns)  Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  ----------------------
     68.5    1,219,130,474          1  1,219,130,474.0  1,219,130,474.0  1,219,130,474  1,219,130,474            0.0  cudaDeviceSynchronize
     17.8      316,925,924          2    158,462,962.0    158,462,962.0         55,287    316,870,637  224,022,282.4  cudaMallocManaged
     11.4      203,677,016          2    101,838,508.0    101,838,508.0     68,914,217    134,762,799   46,561,978.9  cudaFree
      2.3       40,488,959          1     40,488,959.0     40,488,959.0     40,488,959     40,488,959            0.0  cudaLaunchKernel
      0.0            1,072          1          1,072.0          1,072.0          1,072          1,072            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances         Avg (ns)         Med (ns)       Min (ns)       Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  --------------------------
    100.0    1,219,123,989          1  1,219,123,989.0  1,219,123,989.0  1,219,123,989  1,219,123,989          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)    Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Operation
 --------  ---------------  -------  --------  --------  --------  --------  -----------  ------------------------------------
     80.3      448,633,690  155,241   2,889.9   2,175.0     1,663    63,231      3,651.9  [CUDA memcpy Unified Host-to-Device]
     19.7      109,875,418   12,286   8,943.1   3,214.5     1,726    48,352     12,278.7  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)    Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)  Operation
 ----------  -------  --------  --------  --------  --------  -----------  ------------------------------------
  3,996.164  155,241     0.026     0.008     0.004     1.044        0.097  [CUDA memcpy Unified Host-to-Device]
  2,147.418   12,286     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
/pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
/pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite
```
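
The reported block count pins the problem size down to within one block. As a rough check, the sketch below assumes one thread per element and 4-byte floats; the element count itself is never printed in the output, so `n_max` is an inference, and `blocks_for` is a hypothetical helper that mirrors the usual round-up launch calculation rather than the program's actual code.

```python
# Assumption: the vector length is inferred from the printed block count;
# it is not stated anywhere in the profiler output.
THREADS_PER_BLOCK = 256
NUM_BLOCKS = 2_097_152  # "Number of thread blocks" printed by the run above


def blocks_for(n, threads=THREADS_PER_BLOCK):
    """Round-up division: the number of blocks needed to cover n elements."""
    return (n + threads - 1) // threads


# Largest element count this grid can cover with one thread per element.
n_max = NUM_BLOCKS * THREADS_PER_BLOCK

# Size of one float array of that length (4 bytes per element).
bytes_per_array = n_max * 4

print(f"max elements covered : {n_max} (2**29: {n_max == 1 << 29})")
print(f"one array            : {bytes_per_array / 1e6:,.3f} MB")
print(f"two arrays           : {2 * bytes_per_array / 1e6:,.3f} MB")
print(f"blocks for n_max     : {blocks_for(n_max)}")
```

Under those assumptions, one array and two arrays correspond to the Device-to-Host and Host-to-Device totals reported in the single-block runs. The many-block run above reports slightly smaller totals (2,147.418 MB and 3,996.164 MB).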

CUDA: 256 Threads, Many Blocks with Prefetch
---

```
uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb_prefetch
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t_mb_prefetch
WARNING: vecadd_gpu_256t_mb_prefetch and any of its children processes will be profiled.

Number of thread blocks: 2097152
Max error: 0
Generating '/tmp/nsys-report-3752.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls       Avg (ns)       Med (ns)     Min (ns)     Max (ns)    StdDev (ns)  Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -------------  ----------------------
     43.5      329,908,462          2  164,954,231.0  164,954,231.0       50,698  329,857,764  233,208,812.9  cudaMallocManaged
     25.1      190,030,072          2   95,015,036.0   95,015,036.0   62,035,958  127,994,114   46,639,459.4  cudaFree
     17.2      130,653,654          1  130,653,654.0  130,653,654.0  130,653,654  130,653,654            0.0  cudaLaunchKernel
     13.5      102,455,160          2   51,227,580.0   51,227,580.0      377,189  102,077,971   71,913,312.6  cudaMemPrefetchAsync
      0.6        4,773,525          1    4,773,525.0    4,773,525.0    4,773,525    4,773,525            0.0  cudaDeviceSynchronize
      0.0            1,312          1        1,312.0        1,312.0        1,312        1,312            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)     Med (ns)   Min (ns)   Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  --------------------------
    100.0        4,771,134          1  4,771,134.0  4,771,134.0  4,771,134  4,771,134          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)   Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     59.9      165,077,448   2,048  80,604.2  80,576.0    80,511    81,055         68.8  [CUDA memcpy Unified Host-to-Device]
     40.1      110,650,214  12,288   9,004.7   3,135.5     1,726    48,352     12,355.0  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)   Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)  Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967   2,048     2.097     2.097     2.097     2.097        0.000  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
/pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
/pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite
```
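
To compare the four GPU runs directly, the kernel times from the `cuda_gpu_kern_sum` reports can be divided against each other. The sketch below only restates numbers already shown in the outputs above. The final comparison with the CPU figure assumes that 985.325 ms times the same vector addition, which the output does not confirm.

```python
# Kernel times for add(int, float *, float *), copied from the
# 'cuda_gpu_kern_sum' tables above (nanoseconds).
kernel_ns = {
    "1 thread, 1 block": 50_918_525_966,
    "256 threads, 1 block": 1_689_768_383,
    "256 threads, many blocks": 1_219_123_989,
    "256 threads, many blocks, prefetch": 4_771_134,
}

# CPU-only figure from the top of the document (milliseconds).
# Assumption: it measures the same vector addition as the GPU kernel.
CPU_MS = 985.325

baseline_ns = kernel_ns["1 thread, 1 block"]

# Speedup of each configuration relative to the single-thread kernel.
speedups = {name: baseline_ns / ns for name, ns in kernel_ns.items()}

# Kernel time in milliseconds, and the ratio of the CPU time to it.
kernel_ms = {name: ns / 1e6 for name, ns in kernel_ns.items()}
vs_cpu = {name: CPU_MS / ms for name, ms in kernel_ms.items()}

for name in kernel_ns:
    print(f"{name:36s} {kernel_ms[name]:12.3f} ms  "
          f"x{speedups[name]:10.2f} vs 1 thread  "
          f"x{vs_cpu[name]:8.3f} vs CPU")
```

A ratio below 1 in the last column means the kernel took longer than the CPU figure. These are kernel times only: in the first three runs the kernel duration also absorbs on-demand unified-memory page migration, while in the prefetch run the Host-to-Device transfer happens before the kernel starts and is not counted in its 4.77 ms.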

## Prefetch Thread Count

```
uzylol@nid001013:/pscratch/sd/u/uzylol/cuda_vecadd> compute-sanitizer vecadd_gpu_256t_mb_prefetch
========= COMPUTE-SANITIZER
Number of thread blocks: 2097152
Max error: 0
========= ERROR SUMMARY: 0 errors
```