Performance study of memory allocation and data transfer on the GH200, comparing regular host allocations with CUDA host allocations (page-locked and directly accessible from device), and comparing the GH200 runs with A100 runs
# Memory allocation and data transfer on the GH200

## Introduction
A significant performance loss was observed in some CPU-GPU codes when moving the runs from an AMD EPYC 7402 node with two H100 GPUs to a GH200 node. The aim of this repository is to showcase the encountered issue, which we isolated in a very small code example. We found that, in order to gain performance when moving to the GH200 nodes, host variables involved in device-to-host/host-to-device memory transfers need to be allocated in a page-locked fashion and must be directly accessible from the device.

In a C code, this means that host allocations need to be made with the `cudaMallocHost` function, while in PyTorch codes they need to be made with the `pin_memory` flag set and transfers to the host need to be made with the `non_blocking` flag set. Alternatively, the global environment variable `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` can be set.

Failing to do so results in a significant penalty on the GH200, making memory transfers up to 20x slower.
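As an illustration of the PyTorch pattern described above, here is a minimal sketch (this is not the repository's `DtoH.py`/`HtoD.py`; the tensor size and variable names are arbitrary) contrasting a pageable host tensor with a pinned one for a device-to-host transfer:

```python
import torch

n = 1 << 26                        # ~64M float32 elements (~256 MiB), arbitrary size
device = torch.device("cuda")
src = torch.randn(n, device=device)

# Pageable host tensor: copies to/from pageable memory are staged through an
# intermediate pinned buffer by the CUDA driver.
dst_pageable = torch.empty(n)
dst_pageable.copy_(src)

# Page-locked (pinned) host tensor: the copy can run as a direct, asynchronous DMA.
dst_pinned = torch.empty(n, pin_memory=True)
dst_pinned.copy_(src, non_blocking=True)
torch.cuda.synchronize()           # wait for the asynchronous copy to finish
```

The third variant mentioned above, setting `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync`, requires no code change and is exercised as-is in the profiling runs below.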
## Example runs
All test runs have been made within the NVIDIA NGC PyTorch container, version 24.06-py3. Details on the setup and code compilation can be found in INSTALL.md.
### Runs on the GH200 node
```bash
alias nsys=/work/CTR/CI/DCSR/rfabbret/default/fcalvo/gh200/nsight-systems-cli-2024.4.1/target-linux-sbsa-armv8/nsys
mkdir -p reports

# C code
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-cudaMalloc_aarch64.nsys-rep ./bin/DtoH-cudaMalloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-malloc_aarch64.nsys-rep ./bin/DtoH-malloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-cudaMalloc_aarch64.nsys-rep ./bin/HtoD-cudaMalloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-malloc_aarch64.nsys-rep ./bin/HtoD-malloc_aarch64

# Python code
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-cudaMalloc_aarch64.nsys-rep python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-pin_aarch64 python python/DtoH.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-malloc_aarch64 python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-cudaMalloc_aarch64.nsys-rep python python/HtoD.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-pin_aarch64 python python/HtoD.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-malloc_aarch64 python python/HtoD.py
```
### Runs on the AMD EPYC 7402 node
```bash
alias nsys=/dcsrsoft/spack/external/nsight-systems-2024.4.1/bin/nsys

# C code
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-cudaMalloc_x64.nsys-rep ./bin/DtoH-cudaMalloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-malloc_x64.nsys-rep ./bin/DtoH-malloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-cudaMalloc_x64.nsys-rep ./bin/HtoD-cudaMalloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-malloc_x64.nsys-rep ./bin/HtoD-malloc_x64

# Python code
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-cudaMalloc_x64.nsys-rep python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-pin_x64 python python/DtoH.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-malloc_x64 python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-cudaMalloc_x64.nsys-rep python python/HtoD.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-pin_x64 python python/HtoD.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-malloc_x64 python python/HtoD.py
```
## Generating average reports and visualizing
CSV reports (averaged over all runs in the `reports/run?` folders) can be generated from all the `.nsys-rep` files with the script `report2csv.sh`. A Python 3 installation with the `pandas` package is required:
```bash
./report2csv.sh
```
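For reference, the averaging step is conceptually of the following form. This pandas sketch is purely illustrative and is not taken from `report2csv.sh`; it assumes that every per-run CSV file shares the same rows and columns, which may not match what the script actually does.

```python
import glob
import pandas as pd

# Hypothetical example: average one metric file across reports/run1, reports/run2, ...
name = "DtoH-cudaMalloc_aarch64.csv"
frames = [pd.read_csv(path) for path in sorted(glob.glob(f"reports/run?/{name}"))]

# Stack all runs and average the numeric columns row by row (rows keep their
# per-run index 0..n-1, so grouping by the index aligns matching rows).
stacked = pd.concat(frames)
numeric = stacked.groupby(stacked.index).mean(numeric_only=True)

# Keep the non-numeric columns from the first run and substitute the averages.
averaged = frames[0].copy()
averaged[numeric.columns] = numeric
averaged.to_csv(f"reports/{name}", index=False)
```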
CSV report files can then be conveniently visualized as follows (example for the `DtoH-cudaMalloc_aarch64.csv` file):
```bash
source .bash_aliases
cat reports/DtoH-cudaMalloc_aarch64.csv | colview
```