Performance study of memory allocation and data transfer on the GH200, comparing regular host allocations with CUDA host allocations (page-locked and directly accessible from device), and comparing the GH200 runs with A100 runs
# Memory allocation and data transfer on the GH200

## Introduction
A significant performance loss was observed in some CPU-GPU codes when moving the runs from an AMD EPYC 7402 node with two H100 GPUs to a GH200 node. The aim of this repository is to showcase the encountered issue, which we isolated in a very small code example. We found that, in order to gain performance when moving to the GH200 nodes, host variables involved in device-to-host/host-to-device memory transfers need to be allocated in a page-locked fashion and must be directly accessible from the device.

In a C code, this means that host allocations need to be made with the `cudaMallocHost` function, while in PyTorch codes they need to be made with the `pin_memory` flag set and transfers to the host need to be made with the `non_blocking` flag set. Alternatively, the global environment variable `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` can be set.

Failing to do so results in a significant penalty on the GH200, making memory transfers up to 20x slower.
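As an illustration of the PyTorch pattern described above, here is a minimal sketch (this is not the repository's `DtoH.py`/`HtoD.py`; the tensor size and variable names are arbitrary) contrasting a pageable host tensor with a pinned one for a device-to-host transfer:

```python
import torch

n = 1 << 26                        # ~64M float32 elements (~256 MiB), arbitrary size
device = torch.device("cuda")
src = torch.randn(n, device=device)

# Pageable host tensor: copies to/from pageable memory are staged through an
# intermediate pinned buffer by the CUDA driver.
dst_pageable = torch.empty(n)
dst_pageable.copy_(src)

# Page-locked (pinned) host tensor: the copy can run as a direct, asynchronous DMA.
dst_pinned = torch.empty(n, pin_memory=True)
dst_pinned.copy_(src, non_blocking=True)
torch.cuda.synchronize()           # wait for the asynchronous copy to finish
```

The third variant mentioned above, setting `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync`, requires no code change and is exercised as-is in the profiling runs below.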
## Example runs
All test runs have been made within the NVIDIA NGC PyTorch container, version 24.06-py3. Details on the setup and code compilation can be found in INSTALL.md.
### Runs on the GH200 node
```bash
alias nsys=/work/CTR/CI/DCSR/rfabbret/default/fcalvo/gh200/nsight-systems-cli-2024.4.1/target-linux-sbsa-armv8/nsys
mkdir -p reports

# C code
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-cudaMalloc_aarch64.nsys-rep ./bin/DtoH-cudaMalloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-malloc_aarch64.nsys-rep ./bin/DtoH-malloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-cudaMalloc_aarch64.nsys-rep ./bin/HtoD-cudaMalloc_aarch64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-malloc_aarch64.nsys-rep ./bin/HtoD-malloc_aarch64

# Python code
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-cudaMalloc_aarch64.nsys-rep python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-pin_aarch64 python python/DtoH.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-malloc_aarch64 python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-cudaMalloc_aarch64.nsys-rep python python/HtoD.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-pin_aarch64 python python/HtoD.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-malloc_aarch64 python python/HtoD.py
```
### Runs on the AMD EPYC 7402 node
```bash
alias nsys=/dcsrsoft/spack/external/nsight-systems-2024.4.1/bin/nsys

# C code
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-cudaMalloc_x64.nsys-rep ./bin/DtoH-cudaMalloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/DtoH-malloc_x64.nsys-rep ./bin/DtoH-malloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-cudaMalloc_x64.nsys-rep ./bin/HtoD-cudaMalloc_x64
time nsys profile -t cuda --cuda-memory-usage=true --output=reports/HtoD-malloc_x64.nsys-rep ./bin/HtoD-malloc_x64

# Python code
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-cudaMalloc_x64.nsys-rep python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-pin_x64 python python/DtoH.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-DtoH-malloc_x64 python python/DtoH.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-cudaMalloc_x64.nsys-rep python python/HtoD.py
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-pin_x64 python python/HtoD.py pin
time PYTORCH_NO_CUDA_MEMORY_CACHING=1 nsys profile -t cuda --cuda-memory-usage=true --output=reports/python-HtoD-malloc_x64 python python/HtoD.py
```
## Generating average reports and visualizing
CSV reports (averaged over all runs in the `reports/run?` folders) can be generated from all the `.nsys-rep` files with the script `report2csv.sh`. A Python 3 installation with the `pandas` package is required:
```bash
./report2csv.sh
```
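For reference, the averaging step is conceptually of the following form. This pandas sketch is purely illustrative and is not taken from `report2csv.sh`; it assumes that every per-run CSV file shares the same rows and columns, which may not match what the script actually does.

```python
import glob
import pandas as pd

# Hypothetical example: average one metric file across reports/run1, reports/run2, ...
name = "DtoH-cudaMalloc_aarch64.csv"
frames = [pd.read_csv(path) for path in sorted(glob.glob(f"reports/run?/{name}"))]

# Stack all runs and average the numeric columns row by row (rows keep their
# per-run index 0..n-1, so grouping by the index aligns matching rows).
stacked = pd.concat(frames)
numeric = stacked.groupby(stacked.index).mean(numeric_only=True)

# Keep the non-numeric columns from the first run and substitute the averages.
averaged = frames[0].copy()
averaged[numeric.columns] = numeric
averaged.to_csv(f"reports/{name}", index=False)
```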
CSV report files can then be conveniently visualized as follows (example for the `DtoH-cudaMalloc_aarch64.csv` file):
```bash
source .bash_aliases
cat reports/DtoH-cudaMalloc_aarch64.csv | colview
```