<li class="toctree-l2"><a class="reference internal" href="#comparison-of-various-accelerator-packages">5.4. Comparison of various accelerator packages</a></li>
<p>Before trying to make your simulation run faster, you should
understand how it currently performs and where the bottlenecks are.</p>
<p>The best way to do this is run the your system (actual number of
atoms) for a modest number of timesteps (say 100 steps) on several
different processor counts, including a single processor if possible.
Do this for an equilibrium version of your system, so that the
100-step timings are representative of a much longer run. There is
typically no need to run for 1000s of timesteps to get accurate
timings; you can simply extrapolate from short runs.</p>
<p>For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. <a class="reference internal" href="Section_start.html#start-8"><span class="std std-ref">This section</span></a> of the manual has an overview.</p>
<p>Running on one (or a few processors) should give a good estimate of
the serial performance and what portions of the timestep are taking
the most time. Running the same problem on a few different processor
counts should give an estimate of parallel scalability. I.e. if the
simulation runs 16x faster on 16 processors, its 100% parallel
efficient; if it runs 8x faster on 16 processors, it’s 50% efficient.</p>
<p>The most important data to look at in the timing info is the timing
breakdown and relative percentages. For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time. If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below. Comparing how the percentages change as
you increase the processor count gives you a sense of how different
operations within the timestep are scaling. Note that if you are
running with a Kspace solver, there is additional output on the
breakdown of the Kspace time. For PPPM, this includes the fraction
spent on FFTs, which can be communication intensive.</p>
<p>Another important detail in the timing info are the histograms of
atoms counts and neighbor counts. If these vary widely across
processors, you have a load-imbalance issue. This often results in
inaccurate relative timing data, because processors have to wait when
communication occurs for other processors to catch up. Thus the
reported times for “Communication” or “Other” may be higher than they
really are, due to load-imbalance. If this is an issue, you can
uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile
LAMMPS, to obtain synchronized timings.</p>
<hr class="docutils" />
</div>
<div class="section" id="general-strategies">
<span id="acc-2"></span><h2>5.2. General strategies</h2>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">this section 5.2 is still a work in progress</p>
</div>
<p>Here is a list of general ideas for improving simulation performance.
Most of them are only applicable to certain models and certain
bottlenecks in the current performance, so let the timing data you
generate be your guide. It is hard, if not impossible, to predict how
much difference these options will make, since it is a function of
problem size, number of processors used, and your machine. There is
no substitute for identifying performance bottlenecks, and trying out
various options.</p>
<ul class="simple">
<li>rRESPA</li>
<li>2-FFT PPPM</li>
<li>Staggered PPPM</li>
<li>single vs double PPPM</li>
<li>partial charge PPPM</li>
<li>verlet/split run style</li>
<li>processor command for proc layout and numa layout</li>
<li>load-balancing: balance and fix balance</li>
</ul>
<p>2-FFT PPPM, also called <em>analytic differentiation</em> or <em>ad</em> PPPM, uses
2 FFTs instead of the 4 FFTs used by the default <em>ik differentiation</em>
PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to
achieve the same accuracy as 4-FFT PPPM. For problems where the FFT
cost is the performance bottleneck (typically large problems running
on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.</p>
<p>Staggered PPPM performs calculations using two different meshes, one
shifted slightly with respect to the other. This can reduce force
aliasing errors and increase the accuracy of the method, but also
doubles the amount of work required. For high relative accuracy, using
staggered PPPM allows one to half the mesh size in each dimension as
compared to regular PPPM, which can give around a 4x speedup in the
kspace time. However, for low relative accuracy, using staggered PPPM
gives little benefit and can be up to 2x slower in the kspace
time. For example, the rhodopsin benchmark was run on a single
processor, and results for kspace time vs. relative accuracy for the
different methods are shown in the figure below. For this system,
staggered PPPM (using ik differentiation) becomes useful when using a
relative accuracy of slightly greater than 1e-5 and above.</p>
<span id="acc-3"></span><h2>5.3. Packages with optimized styles</h2>
<p>Accelerated versions of various <a class="reference internal" href="pair_style.html"><span class="doc">pair_style</span></a>,
<a class="reference internal" href="fix.html"><span class="doc">fixes</span></a>, <a class="reference internal" href="compute.html"><span class="doc">computes</span></a>, and other commands have
been added to LAMMPS, which will typically run faster than the
standard non-accelerated versions. Some require appropriate hardware
to be present on your system, e.g. GPUs or Intel Xeon Phi
coprocessors.</p>
<p>All of these commands are in packages provided with LAMMPS. An
overview of packages is give in <a class="reference internal" href="Section_packages.html"><span class="doc">Section packages</span></a>.</p>
<p>These are the accelerator packages
currently in LAMMPS, either as standard or user packages:</p>
<tr class="row-even"><td>enable specific accelerator support via ‘-k on’ <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a>,</td>
<td>only needed for KOKKOS package</td>
</tr>
<tr class="row-odd"><td>set any needed options for the package via “-pk” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> or <a class="reference internal" href="package.html"><span class="doc">package</span></a> command,</td>
<td>only if defaults need to be changed</td>
</tr>
<tr class="row-even"><td>use accelerated styles in your input via “-sf” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> or <a class="reference internal" href="suffix.html"><span class="doc">suffix</span></a> command</td>
<td>lmp_machine -in in.script -sf gpu</td>
</tr>
</tbody>
</table>
<p>Note that the first 4 steps can be done as a single command, using the
src/Make.py tool. This tool is discussed in <a class="reference internal" href="Section_start.html#start-4"><span class="std std-ref">Section 2.4</span></a> of the manual, and its use is
illustrated in the individual accelerator sections. Typically these
steps only need to be done once, to create an executable that uses one
or more accelerator packages.</p>
<p>The last 4 steps can all be done from the command-line when LAMMPS is
launched, without changing your input script, as illustrated in the
individual accelerator sections. Or you can add
<a class="reference internal" href="package.html"><span class="doc">package</span></a> and <a class="reference internal" href="suffix.html"><span class="doc">suffix</span></a> commands to your input
script.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">With a few exceptions, you can build a single LAMMPS executable
with all its accelerator packages installed. Note however that the
USER-INTEL and KOKKOS packages require you to choose one of their
hardware options when building for a specific platform. I.e. CPU or
Phi option for the USER-INTEL package. Or the OpenMP, Cuda, or Phi
option for the KOKKOS package.</p>
</div>
<p>These are the exceptions. You cannot build a single executable with:</p>
<ul class="simple">
<li>both the USER-INTEL Phi and KOKKOS Phi options</li>
<li>the USER-INTEL Phi or Kokkos Phi option, and the GPU package</li>
</ul>
<p>See the examples/accelerate/README and make.list files for sample
Make.py commands that build LAMMPS with any or all of the accelerator
packages. As an example, here is a command that builds with all the
GPU related packages installed (GPU, KOKKOS with Cuda), including
settings to build the needed auxiliary GPU libraries for Kepler GPUs:</p>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.