diff --git a/output.log b/output.log
new file mode 100644
index 0000000..fd71a9f
--- /dev/null
+++ b/output.log
@@ -0,0 +1,118 @@
+ACC: Initialize CUDA
+==14394== NVPROF is profiling process 14394, command: ./Jacobi_CCEoffload
+ACC: Get Device 0
+ACC: Create Context
+ACC: Set Thread Context
+ACC: Start transfer 2 items from Jacobi_simple.F90:74
+ACC: allocate, copy to acc 'a_omp(:,:)' (536870912 bytes)
+ACC: allocate, copy to acc 'anew_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:84
+ACC: present 'tab(:,:)' (536870912 bytes)
+ACC: present 'tabnew(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L84_14 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:84
+ACC: Wait async(auto) from Jacobi_simple.F90:84
+ACC: Start transfer 2 items from Jacobi_simple.F90:84
+ACC: release present 'tab(:,:)' (536870912 bytes)
+ACC: release present 'tabnew(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:84
+ACC: present 'tab(:,:)' (536870912 bytes)
+ACC: present 'tabnew(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L84_14 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:84
+ACC: Wait async(auto) from Jacobi_simple.F90:84
+ACC: Start transfer 2 items from Jacobi_simple.F90:84
+ACC: release present 'tab(:,:)' (536870912 bytes)
+ACC: release present 'tabnew(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Wait async(auto) from Jacobi_simple.F90:98
+ACC: Start transfer 1 items from Jacobi_simple.F90:98
+ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:106
+ACC: copy to acc 'a_omp(:,:)' (536870912 bytes)
+ACC: copy to acc 'anew_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:111
+ACC: present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L111_17 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:111
+ACC: Wait async(auto) from Jacobi_simple.F90:118
+ACC: Start transfer 2 items from Jacobi_simple.F90:118
+ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:111
+ACC: present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L111_17 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:111
+ACC: Wait async(auto) from Jacobi_simple.F90:118
+ACC: Start transfer 2 items from Jacobi_simple.F90:118
+ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Wait async(auto) from Jacobi_simple.F90:129
+ACC: Start transfer 1 items from Jacobi_simple.F90:129
+ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:137
+ACC: copy to acc 'a_omp(:,:)' (536870912 bytes)
+ACC: copy to acc 'anew_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:142
+ACC: present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L142_20 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:142
+ACC: Wait async(auto) from Jacobi_simple.F90:149
+ACC: Start transfer 2 items from Jacobi_simple.F90:149
+ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Start transfer 2 items from Jacobi_simple.F90:142
+ACC: present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Execute kernel jacobi_$ck_L142_20 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:142
+ACC: Wait async(auto) from Jacobi_simple.F90:149
+ACC: Start transfer 2 items from Jacobi_simple.F90:149
+ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
+ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 0 bytes)
+ACC: Wait async(auto) from Jacobi_simple.F90:160
+ACC: Start transfer 1 items from Jacobi_simple.F90:160
+ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
+ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
+==14394== Profiling application: ./Jacobi_CCEoffload
+==14394== Profiling result:
+            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
+ GPU activities:   69.05%  314.14ms         7  44.878ms  1.3120us  52.706ms  [CUDA memcpy HtoD]
+                   27.39%  124.62ms         3  41.540ms  41.413ms  41.619ms  [CUDA memcpy DtoH]
+                    1.64%  7.4409ms         2  3.7204ms  3.7064ms  3.7345ms  jacobi_$ck_L84_14
+                    0.96%  4.3781ms         2  2.1891ms  2.1882ms  2.1899ms  jacobi_$ck_L142_20
+                    0.96%  4.3767ms         2  2.1884ms  2.1874ms  2.1893ms  jacobi_$ck_L111_17
+      API calls:   47.14%  314.54ms         7  44.934ms  9.2410us  52.755ms  cuMemcpyHtoD
+                   29.93%  199.72ms         1  199.72ms  199.72ms  199.72ms  cuCtxCreate
+                   18.72%  124.89ms         3  41.631ms  41.504ms  41.709ms  cuMemcpyDtoH
+                    2.42%  16.174ms         9  1.7971ms  2.9100us  3.7318ms  cuStreamSynchronize
+                    1.32%  8.8137ms         1  8.8137ms  8.8137ms  8.8137ms  cuModuleLoadData
+                    0.19%  1.2974ms         6  216.23us  9.2680us  1.2018ms  cuLaunchKernel
+                    0.18%  1.2279ms         2  613.93us  556.27us  671.60us  cuMemAlloc
+                    0.08%  548.27us         1  548.27us  548.27us  548.27us  cuMemHostAlloc
+                    0.00%  15.825us        34     465ns     279ns  3.3010us  cuEventCreate
+                    0.00%  14.925us         1  14.925us  14.925us  14.925us  cuStreamCreate
+                    0.00%  6.9020us         3  2.3000us  1.9630us  2.5030us  cuModuleGetFunction
+                    0.00%  5.0140us         6     835ns     137ns  2.4530us  cuFuncGetAttribute
+                    0.00%  4.3160us         1  4.3160us  4.3160us  4.3160us  cuDeviceGetPCIBusId
+                    0.00%  3.5470us         7     506ns     128ns  2.0420us  cuDeviceGetAttribute
+                    0.00%  2.4630us         1  2.4630us  2.4630us  2.4630us  cuMemHostGetDevicePointer
+                    0.00%  2.3330us         1  2.3330us  2.3330us  2.3330us  cuModuleGetGlobal
+                    0.00%  2.0370us         1  2.0370us  2.0370us  2.0370us  cuCtxGetCurrent
+                    0.00%  1.4810us         2     740ns     244ns  1.2370us  cuDeviceGet
+                    0.00%  1.4210us         3     473ns     215ns     838ns  cuDeviceGetCount
+                    0.00%  1.3680us         3     456ns     283ns     570ns  cuFuncSetCacheConfig
+                    0.00%     616ns         1     616ns     616ns     616ns  cuCtxSetCurrent