ACC: Initialize CUDA
==14394== NVPROF is profiling process 14394, command: ./Jacobi_CCEoffload
ACC: Get Device 0
ACC: Create Context
ACC: Set Thread Context
ACC: Start transfer 2 items from Jacobi_simple.F90:74
ACC: allocate, copy to acc 'a_omp(:,:)' (536870912 bytes)
ACC: allocate, copy to acc 'anew_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:84
ACC: present 'tab(:,:)' (536870912 bytes)
ACC: present 'tabnew(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L84_14 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:84
ACC: Wait async(auto) from Jacobi_simple.F90:84
ACC: Start transfer 2 items from Jacobi_simple.F90:84
ACC: release present 'tab(:,:)' (536870912 bytes)
ACC: release present 'tabnew(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:84
ACC: present 'tab(:,:)' (536870912 bytes)
ACC: present 'tabnew(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L84_14 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:84
ACC: Wait async(auto) from Jacobi_simple.F90:84
ACC: Start transfer 2 items from Jacobi_simple.F90:84
ACC: release present 'tab(:,:)' (536870912 bytes)
ACC: release present 'tabnew(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Wait async(auto) from Jacobi_simple.F90:98
ACC: Start transfer 1 items from Jacobi_simple.F90:98
ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:106
ACC: copy to acc 'a_omp(:,:)' (536870912 bytes)
ACC: copy to acc 'anew_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:111
ACC: present 'a_pointer1(:,:)' (536870912 bytes)
ACC: present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L111_17 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:111
ACC: Wait async(auto) from Jacobi_simple.F90:118
ACC: Start transfer 2 items from Jacobi_simple.F90:118
ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:111
ACC: present 'a_pointer1(:,:)' (536870912 bytes)
ACC: present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L111_17 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:111
ACC: Wait async(auto) from Jacobi_simple.F90:118
ACC: Start transfer 2 items from Jacobi_simple.F90:118
ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Wait async(auto) from Jacobi_simple.F90:129
ACC: Start transfer 1 items from Jacobi_simple.F90:129
ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:137
ACC: copy to acc 'a_omp(:,:)' (536870912 bytes)
ACC: copy to acc 'anew_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 1073741824 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:142
ACC: present 'a_pointer1(:,:)' (536870912 bytes)
ACC: present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L142_20 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:142
ACC: Wait async(auto) from Jacobi_simple.F90:149
ACC: Start transfer 2 items from Jacobi_simple.F90:149
ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Start transfer 2 items from Jacobi_simple.F90:142
ACC: present 'a_pointer1(:,:)' (536870912 bytes)
ACC: present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_$ck_L142_20 blocks:8190 threads:128 async(auto) from Jacobi_simple.F90:142
ACC: Wait async(auto) from Jacobi_simple.F90:149
ACC: Start transfer 2 items from Jacobi_simple.F90:149
ACC: release present 'a_pointer1(:,:)' (536870912 bytes)
ACC: release present 'a_pointer2(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Wait async(auto) from Jacobi_simple.F90:160
ACC: Start transfer 1 items from Jacobi_simple.F90:160
ACC: copy to host 'a_omp(:,:)' (536870912 bytes)
ACC: End transfer (to acc 0 bytes, to host 536870912 bytes)
==14394== Profiling application: ./Jacobi_CCEoffload
==14394== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   69.05%  314.14ms         7  44.878ms  1.3120us  52.706ms  [CUDA memcpy HtoD]
                   27.39%  124.62ms         3  41.540ms  41.413ms  41.619ms  [CUDA memcpy DtoH]
                    1.64%  7.4409ms         2  3.7204ms  3.7064ms  3.7345ms  jacobi_$ck_L84_14
                    0.96%  4.3781ms         2  2.1891ms  2.1882ms  2.1899ms  jacobi_$ck_L142_20
                    0.96%  4.3767ms         2  2.1884ms  2.1874ms  2.1893ms  jacobi_$ck_L111_17
      API calls:   47.14%  314.54ms         7  44.934ms  9.2410us  52.755ms  cuMemcpyHtoD
                   29.93%  199.72ms         1  199.72ms  199.72ms  199.72ms  cuCtxCreate
                   18.72%  124.89ms         3  41.631ms  41.504ms  41.709ms  cuMemcpyDtoH
                    2.42%  16.174ms         9  1.7971ms  2.9100us  3.7318ms  cuStreamSynchronize
                    1.32%  8.8137ms         1  8.8137ms  8.8137ms  8.8137ms  cuModuleLoadData
                    0.19%  1.2974ms         6  216.23us  9.2680us  1.2018ms  cuLaunchKernel
                    0.18%  1.2279ms         2  613.93us  556.27us  671.60us  cuMemAlloc
                    0.08%  548.27us         1  548.27us  548.27us  548.27us  cuMemHostAlloc
                    0.00%  15.825us        34     465ns     279ns  3.3010us  cuEventCreate
                    0.00%  14.925us         1  14.925us  14.925us  14.925us  cuStreamCreate
                    0.00%  6.9020us         3  2.3000us  1.9630us  2.5030us  cuModuleGetFunction
                    0.00%  5.0140us         6     835ns     137ns  2.4530us  cuFuncGetAttribute
                    0.00%  4.3160us         1  4.3160us  4.3160us  4.3160us  cuDeviceGetPCIBusId
                    0.00%  3.5470us         7     506ns     128ns  2.0420us  cuDeviceGetAttribute
                    0.00%  2.4630us         1  2.4630us  2.4630us  2.4630us  cuMemHostGetDevicePointer
                    0.00%  2.3330us         1  2.3330us  2.3330us  2.3330us  cuModuleGetGlobal
                    0.00%  2.0370us         1  2.0370us  2.0370us  2.0370us  cuCtxGetCurrent
                    0.00%  1.4810us         2     740ns     244ns  1.2370us  cuDeviceGet
                    0.00%  1.4210us         3     473ns     215ns     838ns  cuDeviceGetCount
                    0.00%  1.3680us         3     456ns     283ns     570ns  cuFuncSetCacheConfig
                    0.00%     616ns         1     616ns     616ns     616ns  cuCtxSetCurrent