Profiler Outputs

CPU-Only

Elapsed time: 985.325 ms

CUDA: 1 Thread, 1 Block

uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_1t
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_1t
WARNING: vecadd_gpu_1t and any of its children processes will be profiled.

Max error: 0
Generating '/tmp/nsys-report-135f.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls      Avg (ns)          Med (ns)         Min (ns)        Max (ns)      StdDev (ns)            Name
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  -------------  ----------------------
     98.5   50,918,539,217          1  50,918,539,217.0  50,918,539,217.0  50,918,539,217  50,918,539,217            0.0  cudaDeviceSynchronize
      0.6      335,502,692          2     167,751,346.0     167,751,346.0          70,647     335,432,045  237,136,318.7  cudaMallocManaged
      0.5      247,516,551          1     247,516,551.0     247,516,551.0     247,516,551     247,516,551            0.0  cudaLaunchKernel
      0.4      198,199,002          2      99,099,501.0      99,099,501.0      68,308,537     129,890,465   43,544,998.9  cudaFree
      0.0            1,293          1           1,293.0           1,293.0           1,293           1,293            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances      Avg (ns)          Med (ns)         Min (ns)        Max (ns)     StdDev (ns)             Name
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  -----------  --------------------------
    100.0   50,918,525,966          1  50,918,525,966.0  50,918,525,966.0  50,918,525,966  50,918,525,966          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     65.2      206,887,363  24,576   8,418.3   3,647.5     1,982    41,312     11,229.0  [CUDA memcpy Unified Host-to-Device]
     34.8      110,515,732  12,288   8,993.8   3,135.5     1,726    48,416     12,346.7  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count   Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967  24,576     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite

CUDA: 256 Threads, 1 Block

uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t
WARNING: vecadd_gpu_256t and any of its children processes will be profiled.

Max error: 0
Generating '/tmp/nsys-report-a2a6.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)        Min (ns)       Max (ns)      StdDev (ns)            Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  ----------------------
     71.3    1,689,772,282          1  1,689,772,282.0  1,689,772,282.0  1,689,772,282  1,689,772,282            0.0  cudaDeviceSynchronize
     13.2      312,348,647          2    156,174,323.5    156,174,323.5         66,999    312,281,648  220,769,095.5  cudaMallocManaged
      8.4      200,018,555          2    100,009,277.5    100,009,277.5     70,954,629    129,063,926   41,089,478.0  cudaFree
      7.0      166,925,228          1    166,925,228.0    166,925,228.0    166,925,228    166,925,228            0.0  cudaLaunchKernel
      0.0            1,031          1          1,031.0          1,031.0          1,031          1,031            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)             Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  --------------------------
    100.0    1,689,768,383          1  1,689,768,383.0  1,689,768,383.0  1,689,768,383  1,689,768,383          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     65.4      208,609,295  24,576   8,488.3   3,920.5     1,982    41,568     11,229.4  [CUDA memcpy Unified Host-to-Device]
     34.6      110,590,888  12,288   8,999.9   3,167.5     1,726    48,384     12,351.0  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count   Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967  24,576     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite

CUDA: 256 Threads, Many Blocks

uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t_mb
WARNING: vecadd_gpu_256t_mb and any of its children processes will be profiled.

Number of thread blocks: 2097152
Max error: 0
Generating '/tmp/nsys-report-cbad.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)        Min (ns)       Max (ns)      StdDev (ns)            Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  ----------------------
     68.5    1,219,130,474          1  1,219,130,474.0  1,219,130,474.0  1,219,130,474  1,219,130,474            0.0  cudaDeviceSynchronize
     17.8      316,925,924          2    158,462,962.0    158,462,962.0         55,287    316,870,637  224,022,282.4  cudaMallocManaged
     11.4      203,677,016          2    101,838,508.0    101,838,508.0     68,914,217    134,762,799   46,561,978.9  cudaFree
      2.3       40,488,959          1     40,488,959.0     40,488,959.0     40,488,959     40,488,959            0.0  cudaLaunchKernel
      0.0            1,072          1          1,072.0          1,072.0          1,072          1,072            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)             Name
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  --------------------------
    100.0    1,219,123,989          1  1,219,123,989.0  1,219,123,989.0  1,219,123,989  1,219,123,989          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)   Count   Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  -------  --------  --------  --------  --------  -----------  ------------------------------------
     80.3      448,633,690  155,241   2,889.9   2,175.0     1,663    63,231      3,651.9  [CUDA memcpy Unified Host-to-Device]
     19.7      109,875,418   12,286   8,943.1   3,214.5     1,726    48,352     12,278.7  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)   Count   Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  -------  --------  --------  --------  --------  -----------  ------------------------------------
  3,996.164  155,241     0.026     0.008     0.004     1.044        0.097  [CUDA memcpy Unified Host-to-Device]
  2,147.418   12,286     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite

CUDA: 256 Threads, Many Blocks with Prefetch

uzylol@nid001133:/pscratch/sd/u/uzylol/cuda_vecadd> ./helper.sh vecadd_gpu_256t_mb_prefetch
Choose an option:
1. Build and profile
2. Clean
Enter your choice (1 or 2): 1
Built executable: vecadd_gpu_256t_mb_prefetch
WARNING: vecadd_gpu_256t_mb_prefetch and any of its children processes will be profiled.

Number of thread blocks: 2097152
Max error: 0
Generating '/tmp/nsys-report-3752.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -------------  ----------------------
     43.5      329,908,462          2  164,954,231.0  164,954,231.0       50,698  329,857,764  233,208,812.9  cudaMallocManaged
     25.1      190,030,072          2   95,015,036.0   95,015,036.0   62,035,958  127,994,114   46,639,459.4  cudaFree
     17.2      130,653,654          1  130,653,654.0  130,653,654.0  130,653,654  130,653,654            0.0  cudaLaunchKernel
     13.5      102,455,160          2   51,227,580.0   51,227,580.0      377,189  102,077,971   71,913,312.6  cudaMemPrefetchAsync
      0.6        4,773,525          1    4,773,525.0    4,773,525.0    4,773,525    4,773,525            0.0  cudaDeviceSynchronize
      0.0            1,312          1        1,312.0        1,312.0        1,312        1,312            0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)             Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  --------------------------
    100.0        4,771,134          1  4,771,134.0  4,771,134.0  4,771,134  4,771,134          0.0  add(int, float *, float *)

[6/7] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)               Operation
 --------  ---------------  ------  --------  --------  --------  --------  -----------  ------------------------------------
     59.9      165,077,448   2,048  80,604.2  80,576.0    80,511    81,055         68.8  [CUDA memcpy Unified Host-to-Device]
     40.1      110,650,214  12,288   9,004.7   3,135.5     1,726    48,352     12,355.0  [CUDA memcpy Unified Device-to-Host]

[7/7] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count   Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)               Operation
 ----------  ------  --------  --------  --------  --------  -----------  ------------------------------------
  4,294.967   2,048     2.097     2.097     2.097     2.097        0.000  [CUDA memcpy Unified Host-to-Device]
  2,147.484  12,288     0.175     0.033     0.004     1.044        0.301  [CUDA memcpy Unified Device-to-Host]

Generated:
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.nsys-rep
    /pscratch/sd/u/uzylol/cuda_vecadd/report1.sqlite

Prefetch Thread Count

uzylol@nid001013:/pscratch/sd/u/uzylol/cuda_vecadd> compute-sanitizer vecadd_gpu_256t_mb_prefetch
========= COMPUTE-SANITIZER
Number of thread blocks: 2097152
Max error: 0
========= ERROR SUMMARY: 0 errors
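
The runs above are easier to compare side by side. The sketch below is not part of the original project: it copies the kernel durations from the `cuda_gpu_kern_sum` tables and the CPU-only figure, then computes each kernel's speedup over the CPU run. It assumes the 985.325 ms CPU figure times the same vector-add loop as the kernels, which the logs do not state. It also treats the problem size as exactly blocks × threads per block, with 4-byte floats. The names `kernel_ns`, `cpu_ms`, and `speedup_vs_cpu` are illustrative only.

```python
# Kernel durations in nanoseconds, copied from the cuda_gpu_kern_sum tables above.
kernel_ns = {
    "vecadd_gpu_1t": 50_918_525_966,
    "vecadd_gpu_256t": 1_689_768_383,
    "vecadd_gpu_256t_mb": 1_219_123_989,
    "vecadd_gpu_256t_mb_prefetch": 4_771_134,
}

# CPU-only figure reported above (assumed to time the same vector-add loop).
cpu_ms = 985.325


def speedup_vs_cpu(kernel_time_ns):
    """Return CPU time divided by kernel time; values below 1 mean slower than CPU."""
    return cpu_ms / (kernel_time_ns / 1e6)


speedups = {name: speedup_vs_cpu(ns) for name, ns in kernel_ns.items()}

for name, factor in speedups.items():
    print(f"{name:30s} {kernel_ns[name] / 1e6:12.3f} ms  {factor:9.3f}x vs CPU")

# Problem size implied by the launch configuration printed in the logs.
# Assumption: N is an exact multiple of the block size and each element is a 4-byte float.
blocks = 2_097_152
threads_per_block = 256
n_elements = blocks * threads_per_block
bytes_per_array = n_elements * 4

print(f"elements: {n_elements}, bytes per array: {bytes_per_array}")
```

Under those assumptions, one array is 2,147,483,648 bytes, which matches the 2,147.484 MB Device-to-Host total in the `cuda_gpu_mem_size_sum` tables. The three runs without prefetch include on-demand page migration in their kernel time, so their figures measure more than the arithmetic itself.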