<p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
<div class="section" id="gpu-package">
<h1>5. GPU package<a class="headerlink" href="#gpu-package" title="Permalink to this headline">¶</a></h1>
<p>The GPU package was developed by Mike Brown at ORNL and his
collaborators, particularly Trung Nguyen (ORNL). It provides GPU
versions of many pair styles, including the 3-body Stillinger-Weber
pair style, and of <a class="reference internal" href="kspace_style.html"><em>kspace_style pppm</em></a> for
long-range Coulombics. It has the following general features:</p>
<ul class="simple">
<li>It is designed to exploit common GPU hardware configurations where one
or more GPUs are coupled to many cores of one or more multi-core CPUs,
e.g. within a node of a parallel machine.</li>
<li>Atom-based data (e.g. coordinates, forces) moves back-and-forth
between the CPU(s) and GPU every timestep.</li>
<li>Neighbor lists can be built on the CPU or on the GPU.</li>
<li>The charge assignment and force interpolation portions of PPPM can be
run on the GPU. The FFT portion, which requires MPI communication
between processors, runs on the CPU.</li>
<li>Asynchronous force computations can be performed simultaneously on the
CPU(s) and GPU.</li>
<li>It allows for GPU computations to be performed in single or double
precision, or in mixed-mode precision, where pairwise forces are
computed in single precision, but accumulated into double-precision
force vectors.</li>
<li>LAMMPS-specific code is in the GPU package. It makes calls to a
generic GPU library in the lib/gpu directory. This library provides
NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware.</li>
</ul>
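<p>The mixed-precision mode in the list above can be illustrated with a small sketch. This is not code from the GPU library: the helper names “to_single”, “accumulate_single” and “accumulate_mixed” are hypothetical, the input data is made up, and single precision is only emulated by rounding through a 32-bit float with the Python “struct” module. Under those assumptions, it compares accumulating many single-precision terms into a single-precision total against accumulating the same terms into a double-precision total.</p>

```python
import math
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_single(terms):
    """All-single mode: the running total is rounded to single at every step."""
    total = 0.0
    for t in terms:
        total = to_single(total + to_single(t))
    return total


def accumulate_mixed(terms):
    """Mixed mode: each term is single precision, the running total stays double."""
    total = 0.0
    for t in terms:
        total += to_single(t)
    return total


# Hypothetical data: many equal contributions to one force component.
terms = [0.1] * 100000

# Correctly rounded sum of the single-precision terms, used as the reference.
reference = math.fsum(to_single(t) for t in terms)

err_single = abs(accumulate_single(terms) - reference)
err_mixed = abs(accumulate_mixed(terms) - reference)
print("single-precision accumulation error:", err_single)
print("mixed-precision accumulation error: ", err_mixed)
```

<p>In this toy case the double-precision accumulator stays far closer to the reference than the single-precision one, which is the motivation for the mixed mode. Whether the difference matters for a real simulation depends on the system and should be checked by experiment.</p>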
<p>Here is a quick overview of how to use the GPU package:</p>
<ul class="simple">
<li>build the library in lib/gpu for your GPU hardware with the desired precision</li>
<li>include the GPU package and build LAMMPS</li>
<li>use the mpirun command to set the number of MPI tasks per node, which determines the number of MPI tasks per GPU</li>
<li>specify the # of GPUs per node</li>
<li>use GPU styles in your input script</li>
</ul>
<p>The latter two steps can be done using the “-pk gpu” and “-sf gpu”
<a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a>, respectively. Alternatively,
the effect of the “-pk” or “-sf” switches can be duplicated by adding
the <a class="reference internal" href="package.html"><em>package gpu</em></a> or <a class="reference internal" href="suffix.html"><em>suffix gpu</em></a> commands
to your input script.</p>
<p>The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in
lib/gpu) appropriate for your system. You should pay special
attention to 3 settings in this makefile.</p>
<ul class="simple">
<li>CUDA_HOME = needs to be where NVIDIA CUDA software is installed on your system</li>
<li>CUDA_ARCH = needs to be appropriate to your GPUs</li>
<li>CUDA_PREC = precision (double, mixed, single) you desire</li>
</ul>
<p>See lib/gpu/Makefile.linux.double for examples of the ARCH settings
for different GPU choices, e.g. Fermi vs Kepler. It also lists the
possible precision settings:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_SINGLE</span>  <span class="c"># single precision for all calculations</span>
<span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_DOUBLE_DOUBLE</span>  <span class="c"># double precision for all calculations</span>
<span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_DOUBLE</span>  <span class="c"># accumulation of forces, etc, in double</span>
</pre></div>
</div>
<p>The last setting is the mixed mode referred to above. Note that your
GPU must support double precision to use either the 2nd or 3rd of
these settings.</p>
<p>You must also use the <a class="reference internal" href="package.html"><em>package gpu</em></a> command to enable the
GPU package, unless the “-sf gpu” or “-pk gpu” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used. It specifies the
number of GPUs/node to use, as well as other options.</p>
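<p>As a minimal sketch of how the mpirun command and the “-sf gpu” and “-pk gpu” switches fit together on a single node, the helper below assembles a command line. The function name “lammps_gpu_command”, the executable name “lmp_machine” and the input file name “in.script” are placeholders chosen for illustration; substitute the names used on your system.</p>

```python
def lammps_gpu_command(mpi_tasks, gpus_per_node,
                       exe="lmp_machine", script="in.script"):
    """Build a single-node launch command that enables the GPU package.

    exe and script are placeholder names, not fixed LAMMPS names.
    """
    if mpi_tasks < 1 or gpus_per_node < 1:
        raise ValueError("mpi_tasks and gpus_per_node must be positive")
    return [
        "mpirun", "-np", str(mpi_tasks),   # number of MPI tasks
        exe,
        "-sf", "gpu",                      # use GPU styles where available
        "-pk", "gpu", str(gpus_per_node),  # number of GPUs per node
        "-in", script,                     # LAMMPS input script
    ]


command = lammps_gpu_command(mpi_tasks=12, gpus_per_node=2)
print(" ".join(command))
print("MPI tasks per GPU:", 12 / 2)
```

<p>With 12 MPI tasks and 2 GPUs on the node, each GPU is shared by 6 MPI tasks. The same settings could instead be placed in the input script with the package gpu and suffix gpu commands.</p>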
<p><strong>Speed-ups to expect:</strong></p>
<p>The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).</p>
<p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
LAMMPS web site for performance of the GPU package on various
hardware, including the Titan HPC platform at ORNL.</p>
<p>You should also experiment with how many MPI tasks per GPU to use to
give the best performance for your problem and machine. This is also
a function of the problem size and the pair style being used.
Likewise, you should experiment with the precision setting for the GPU
library to see if single or mixed precision will give accurate
results, since they will typically be faster.</p>
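<p>One way to organize the experiment with MPI tasks per GPU is to list the task counts worth timing on a node. The sketch below is only a planning aid under two assumptions that are not requirements stated here: that tasks are split evenly across the GPUs, and that no more tasks are used than there are cores. The function name “candidate_task_counts” and the core and GPU counts are hypothetical.</p>

```python
def candidate_task_counts(cores_per_node, gpus_per_node):
    """Return (tasks_per_node, tasks_per_gpu) pairs to benchmark.

    Assumes an even split of MPI tasks across GPUs and at most one
    MPI task per core; both are simplifications for this sketch.
    """
    if cores_per_node < 1 or gpus_per_node < 1:
        raise ValueError("cores_per_node and gpus_per_node must be positive")
    return [
        (tasks, tasks // gpus_per_node)
        for tasks in range(gpus_per_node, cores_per_node + 1, gpus_per_node)
    ]


# Hypothetical node: 16 cores and 2 GPUs.
for tasks, per_gpu in candidate_task_counts(16, 2):
    print(tasks, "MPI tasks per node,", per_gpu, "per GPU")
```

<p>Each listed configuration can then be run and timed for the actual problem, repeating the sweep for each precision setting of interest.</p>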
<p><strong>Guidelines for best performance:</strong></p>
<ul class="simple">
<li>Using multiple MPI tasks per GPU will often give the best performance,
as allowed by most multi-core CPU/GPU configurations.</li>
<li>If the number of particles per MPI task is small (e.g. 100s of
particles), it can be more efficient to run with fewer MPI tasks per
GPU, even if you do not use all the cores on the compute node.</li>
<li>The <a class="reference internal" href="package.html"><em>package gpu</em></a> command has several options for tuning
performance. Neighbor lists can be built on the GPU or CPU. Force
calculations can be dynamically balanced across the CPU cores and
GPUs. GPU-specific settings can be made which can be optimized
for different hardware. See the <a class="reference internal" href="package.html"><em>package</em></a> command
doc page for details.</li>
<li>As described by the <a class="reference internal" href="package.html"><em>package gpu</em></a> command, GPU
accelerated pair styles can perform computations asynchronously with
CPU computations. The “Pair” time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with <a class="reference internal" href="bond_style.html"><em>bond</em></a>,
<a class="reference internal" href="improper_style.html"><em>improper</em></a>, and <a class="reference internal" href="kspace_style.html"><em>long-range</em></a>
calculations will not be included in the “Pair” time.</li>
<li>When the <em>mode</em> setting for the package gpu command is force/neigh,
the time for neighbor list calculations on the GPU will be added into
the “Pair” time, not the “Neigh” time. An additional breakdown of the
times required for various tasks on the GPU (data copy, neighbor
calculations, force computations, etc.) is output only with the LAMMPS
screen output (not in the log file) at the end of each run. These
timings represent total time spent on the GPU for each routine,
regardless of asynchronous CPU calculations.</li>
<li>The output section “GPU Time Info (average)” reports “Max Mem / Proc”.
This is the maximum memory used at one time on the GPU for data
storage by a single MPI process.</li>
</ul>
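<p>The rule for the reported “Pair” time under asynchronous execution can be written as a one-line model. The function name “reported_pair_time” and the timings are hypothetical, and the sketch covers only the maximum-of-two rule described above, not the handling of neighbor list time in force/neigh mode.</p>

```python
def reported_pair_time(cpu_pair_seconds, gpu_pair_seconds):
    """Model of the reported "Pair" time when CPU and GPU pair work overlap.

    Because the two computations run at the same time, the reported value
    is the longer of the two, not their sum.
    """
    return max(cpu_pair_seconds, gpu_pair_seconds)


# Hypothetical timings in seconds for one run.
cpu_pair = 3.2
gpu_pair = 4.5
print("reported Pair time:", reported_pair_time(cpu_pair, gpu_pair))
```

<p>In this example the GPU is the slower side, so shifting more work to the CPU could shorten the reported time, which is the idea behind balancing force calculations across CPU cores and GPUs.</p>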
<div class="section" id="restrictions">
<h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>