# Profiler Outputs CPU-Only --- 985.325 ms CUDA: 1 Thread, 1 Block --- ``` uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_1t Choose an option: 1. Build and profile 2. Clean Enter your choice (1 or 2): 1 Built executable: vecadd_gpu_1t WARNING: vecadd_gpu_1t and any of its children processes will be profiled. Max error: 0 Generating '/tmp/nsys-report-135f.qdstrm' [1/7] [========================100%] report1.nsys-rep [2/7] [========================100%] report1.sqlite [3/7] Executing 'nvtx_sum' stats report SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data. [4/7] Executing 'cuda_api_sum' stats report Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- ---------------- ---------------- -------------- -------------- ------------- ---------------------- 98.5 50,918,539,217 1 50,918,539,217.0 50,918,539,217.0 50,918,539,217 50,918,539,217 0.0 cudaDeviceSynchronize 0.6 335,502,692 2 167,751,346.0 167,751,346.0 70,647 335,432,045 237,136,318.7 cudaMallocManaged 0.5 247,516,551 1 247,516,551.0 247,516,551.0 247,516,551 247,516,551 0.0 cudaLaunchKernel 0.4 198,199,002 2 99,099,501.0 99,099,501.0 68,308,537 129,890,465 43,544,998.9 cudaFree 0.0 1,293 1 1,293.0 1,293.0 1,293 1,293 0.0 cuModuleGetLoadingMode [5/7] Executing 'cuda_gpu_kern_sum' stats report Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- ---------------- ---------------- -------------- -------------- ----------- -------------------------- 100.0 50,918,525,966 1 50,918,525,966.0 50,918,525,966.0 50,918,525,966 50,918,525,966 0.0 add(int, float *, float *) [6/7] Executing 'cuda_gpu_mem_time_sum' stats report Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation -------- --------------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 65.2 206,887,363 24,576 8,418.3 3,647.5 1,982 41,312 11,229.0 [CUDA memcpy Unified Host-to-Device] 34.8 110,515,732 12,288 8,993.8 3,135.5 1,726 48,416 12,346.7 [CUDA memcpy Unified Device-to-Host] [7/7] Executing 'cuda_gpu_mem_size_sum' stats report Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation ---------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 4,294.967 24,576 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Host-to-Device] 2,147.484 12,288 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Device-to-Host] Generated: /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite ``` CUDA: 256 Threads, One Block --- ``` uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t Choose an option: 1. Build and profile 2. Clean Enter your choice (1 or 2): 1 Built executable: vecadd_gpu_256t WARNING: vecadd_gpu_256t and any of its children processes will be profiled. Max error: 0 Generating '/tmp/nsys-report-a2a6.qdstrm' [1/7] [========================100%] report1.nsys-rep [2/7] [========================100%] report1.sqlite [3/7] Executing 'nvtx_sum' stats report SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data. [4/7] Executing 'cuda_api_sum' stats report Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- --------------- --------------- ------------- ------------- ------------- ---------------------- 71.3 1,689,772,282 1 1,689,772,282.0 1,689,772,282.0 1,689,772,282 1,689,772,282 0.0 cudaDeviceSynchronize 13.2 312,348,647 2 156,174,323.5 156,174,323.5 66,999 312,281,648 220,769,095.5 cudaMallocManaged 8.4 200,018,555 2 100,009,277.5 100,009,277.5 70,954,629 129,063,926 41,089,478.0 cudaFree 7.0 166,925,228 1 166,925,228.0 166,925,228.0 166,925,228 166,925,228 0.0 cudaLaunchKernel 0.0 1,031 1 1,031.0 1,031.0 1,031 1,031 0.0 cuModuleGetLoadingMode [5/7] Executing 'cuda_gpu_kern_sum' stats report Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- --------------- --------------- ------------- ------------- ----------- -------------------------- 100.0 1,689,768,383 1 1,689,768,383.0 1,689,768,383.0 1,689,768,383 1,689,768,383 0.0 add(int, float *, float *) [6/7] Executing 'cuda_gpu_mem_time_sum' stats report Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation -------- --------------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 65.4 208,609,295 24,576 8,488.3 3,920.5 1,982 41,568 11,229.4 [CUDA memcpy Unified Host-to-Device] 34.6 110,590,888 12,288 8,999.9 3,167.5 1,726 48,384 12,351.0 [CUDA memcpy Unified Device-to-Host] [7/7] Executing 'cuda_gpu_mem_size_sum' stats report Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation ---------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 4,294.967 24,576 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Host-to-Device] 2,147.484 12,288 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Device-to-Host] Generated: /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite ``` CUDA: 256 Threads, Many Blocks --- ``` uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb Choose an option: 1. Build and profile 2. Clean Enter your choice (1 or 2): 1 Built executable: vecadd_gpu_256t_mb WARNING: vecadd_gpu_256t_mb and any of its children processes will be profiled. Number of thread blocks: 2097152 Max error: 0 Generating '/tmp/nsys-report-cbad.qdstrm' [1/7] [========================100%] report1.nsys-rep [2/7] [========================100%] report1.sqlite [3/7] Executing 'nvtx_sum' stats report SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data. [4/7] Executing 'cuda_api_sum' stats report Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- --------------- --------------- ------------- ------------- ------------- ---------------------- 68.5 1,219,130,474 1 1,219,130,474.0 1,219,130,474.0 1,219,130,474 1,219,130,474 0.0 cudaDeviceSynchronize 17.8 316,925,924 2 158,462,962.0 158,462,962.0 55,287 316,870,637 224,022,282.4 cudaMallocManaged 11.4 203,677,016 2 101,838,508.0 101,838,508.0 68,914,217 134,762,799 46,561,978.9 cudaFree 2.3 40,488,959 1 40,488,959.0 40,488,959.0 40,488,959 40,488,959 0.0 cudaLaunchKernel 0.0 1,072 1 1,072.0 1,072.0 1,072 1,072 0.0 cuModuleGetLoadingMode [5/7] Executing 'cuda_gpu_kern_sum' stats report Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- --------------- --------------- ------------- ------------- ----------- -------------------------- 100.0 1,219,123,989 1 1,219,123,989.0 1,219,123,989.0 1,219,123,989 1,219,123,989 0.0 add(int, float *, float *) [6/7] Executing 'cuda_gpu_mem_time_sum' stats report Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation -------- --------------- ------- -------- -------- -------- -------- ----------- ------------------------------------ 80.3 448,633,690 155,241 2,889.9 2,175.0 1,663 63,231 3,651.9 [CUDA memcpy Unified Host-to-Device] 19.7 109,875,418 12,286 8,943.1 3,214.5 1,726 48,352 12,278.7 [CUDA memcpy Unified Device-to-Host] [7/7] Executing 'cuda_gpu_mem_size_sum' stats report Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation ---------- ------- -------- -------- -------- -------- ----------- ------------------------------------ 3,996.164 155,241 0.026 0.008 0.004 1.044 0.097 [CUDA memcpy Unified Host-to-Device] 2,147.418 12,286 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Device-to-Host] Generated: /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite ``` CUDA: 256 Threads, Many Blocks with Prefetch --- ``` uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb_prefetch Choose an option: 1. Build and profile 2. Clean Enter your choice (1 or 2): 1 Built executable: vecadd_gpu_256t_mb_prefetch WARNING: vecadd_gpu_256t_mb_prefetch and any of its children processes will be profiled. Number of thread blocks: 2097152 Max error: 0 Generating '/tmp/nsys-report-3752.qdstrm' [1/7] [========================100%] report1.nsys-rep [2/7] [========================100%] report1.sqlite [3/7] Executing 'nvtx_sum' stats report SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data. [4/7] Executing 'cuda_api_sum' stats report Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- ------------- ------------- ----------- ----------- ------------- ---------------------- 43.5 329,908,462 2 164,954,231.0 164,954,231.0 50,698 329,857,764 233,208,812.9 cudaMallocManaged 25.1 190,030,072 2 95,015,036.0 95,015,036.0 62,035,958 127,994,114 46,639,459.4 cudaFree 17.2 130,653,654 1 130,653,654.0 130,653,654.0 130,653,654 130,653,654 0.0 cudaLaunchKernel 13.5 102,455,160 2 51,227,580.0 51,227,580.0 377,189 102,077,971 71,913,312.6 cudaMemPrefetchAsync 0.6 4,773,525 1 4,773,525.0 4,773,525.0 4,773,525 4,773,525 0.0 cudaDeviceSynchronize 0.0 1,312 1 1,312.0 1,312.0 1,312 1,312 0.0 cuModuleGetLoadingMode [5/7] Executing 'cuda_gpu_kern_sum' stats report Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- ----------- ----------- --------- --------- ----------- -------------------------- 100.0 4,771,134 1 4,771,134.0 4,771,134.0 4,771,134 4,771,134 0.0 add(int, float *, float *) [6/7] Executing 'cuda_gpu_mem_time_sum' stats report Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation -------- --------------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 59.9 165,077,448 2,048 80,604.2 80,576.0 80,511 81,055 68.8 [CUDA memcpy Unified Host-to-Device] 40.1 110,650,214 12,288 9,004.7 3,135.5 1,726 48,352 12,355.0 [CUDA memcpy Unified Device-to-Host] [7/7] Executing 'cuda_gpu_mem_size_sum' stats report Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation ---------- ------ -------- -------- -------- -------- ----------- ------------------------------------ 4,294.967 2,048 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Unified Host-to-Device] 2,147.484 12,288 0.175 0.033 0.004 1.044 0.301 [CUDA memcpy Unified Device-to-Host] Generated: /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite ``` ## Prefetch Thread Count ``` uzylol@nid001013:/pscratch/sd/u/uzylol/cuda_vecadd> compute-sanitizer vecadd_gpu_256t_mb_prefetch ========= COMPUTE-SANITIZER Number of thread blocks: 2097152 Max error: 0 ========= ERROR SUMMARY: 0 errors ```