<p>No additional compile/link flags are needed in Makefile.machine.</p>
<p>Note that if you change the USER-CUDA library precision (discussed
above) and rebuild the USER-CUDA library, then you also need to
re-install the USER-CUDA package and re-build LAMMPS, so that all
affected files are re-compiled and linked to the new USER-CUDA
library.</p>
<p><strong>Run with the USER-CUDA package from the command line:</strong></p>
<p>The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. For example, the mpirun command in MPICH does this
via its -np and -ppn switches; OpenMPI does the same via -np and -npernode.</p>
<p>When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.</p>
<p>You must use the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package.
The “-c on” switch also issues a default <a class="reference internal" href="package.html"><em>package cuda 1</em></a>
command which sets various USER-CUDA options to default values, as
discussed on the <a class="reference internal" href="package.html"><em>package</em></a> command doc page.</p>
<p>Use the “-sf cuda” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
which will automatically append “cuda” to styles that support it. Use
the “-pk cuda Ng” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to
set Ng = # of GPUs per node to a different value than the default set
by the “-c on” switch (1 GPU) or change other <a class="reference internal" href="package.html"><em>package cuda</em></a> options.</p>
<div class="highlight-python"><div class="highlight"><pre>lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes
</pre></div>
</div>
<p>The syntax for the &#8220;-pk&#8221; switch is the same as for the &#8220;package
cuda&#8221; command. See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for
details, including the default values used for all its options if it
is not specified.</p>
<p>Note that the default for the <a class="reference internal" href="package.html"><em>package cuda</em></a> command is
to set the Newton flag to “off” for both pairwise and bonded
interactions. This typically gives fastest performance. If the
<a class="reference internal" href="newton.html"><em>newton</em></a> command is used in the input script, it can
override these defaults.</p>
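<p>For example, to turn the Newton flag off explicitly for both pairwise
and bonded interactions in the input script (a minimal sketch; see the
<a class="reference internal" href="newton.html"><em>newton</em></a> doc page for the full syntax):</p>
<div class="highlight-python"><div class="highlight"><pre>newton off    # Newton&#39;s 3rd law off for both pairwise and bonded interactions
</pre></div>
</div>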
<p><strong>Or run with the USER-CUDA package by editing an input script:</strong></p>
<p>The discussion above regarding the mpirun/mpiexec command and the
requirement of one MPI task per physical GPU applies here as well.</p>
<p>You must still use the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package.</p>
<p>Use the <a class="reference internal" href="suffix.html"><em>suffix cuda</em></a> command, or you can explicitly add a
“cuda” suffix to individual styles in your input script, e.g.</p>
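<p>A minimal sketch, assuming a Lennard-Jones system (any style with a
cuda variant works the same way):</p>
<div class="highlight-python"><div class="highlight"><pre>suffix cuda
pair_style lj/cut 2.5        # becomes lj/cut/cuda via the suffix command
</pre></div>
</div>
<p>or, equivalently, appending the suffix to an individual style by hand:</p>
<div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/cuda 2.5   # explicit &quot;cuda&quot; suffix on one style
</pre></div>
</div>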
<p>You only need to use the <a class="reference internal" href="package.html"><em>package cuda</em></a> command if you
wish to change any of its option defaults, including the number of
GPUs/node (default = 1), as set by the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>.</p>
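<p>For instance, to use 2 GPUs per node instead of the default of 1, a
line like the following could appear near the top of the input script
(a sketch; see the <a class="reference internal" href="package.html"><em>package</em></a> doc page for all options):</p>
<div class="highlight-python"><div class="highlight"><pre>package cuda 2    # use 2 GPUs per node
</pre></div>
</div>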
<p><strong>Speed-ups to expect:</strong></p>
<p>The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).</p>
<p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
LAMMPS web site for performance of the USER-CUDA package on different
hardware.</p>
<p><strong>Guidelines for best performance:</strong></p>
<ul class="simple">
<li>The USER-CUDA package offers more speed-up relative to CPU performance
when the number of atoms per GPU is large, e.g. on the order of tens
or hundreds of thousands.</li>
<li>As noted above, this package will continue to run a simulation
entirely on the GPU(s) (except for inter-processor MPI communication),
for multiple timesteps, until a CPU calculation is required, either by
a fix or compute that is non-GPU-ized, or until output is performed
(thermo or dump snapshot or restart file). The less often this
occurs, the faster your simulation will run.</li>
</ul>
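<p>As an illustration of the second point, output that is written
infrequently keeps the simulation running on the GPU for longer
stretches; the interval values below are hypothetical and should be
tuned to your needs:</p>
<div class="highlight-python"><div class="highlight"><pre>thermo 1000                           # print thermo output only every 1000 steps
dump 1 all atom 5000 dump.lammpstrj   # write a snapshot only every 5000 steps
</pre></div>
</div>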
<div class="section" id="restrictions">
<h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>