<li class="toctree-l2"><a class="reference internal" href="Section_accelerate.html#comparison-of-various-accelerator-packages">5.4. Comparison of various accelerator packages</a></li>
<p>Results are speedups obtained on Intel Xeon E5-2697v4 processors
(code-named Broadwell) and Intel Xeon Phi 7250 processors
(code-named Knights Landing) with “18 Jun 2016” LAMMPS built with
Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
per physical core. See <em>src/USER-INTEL/TEST/README</em> for the raw
simulation rates and instructions to reproduce.</p>
<hr class="docutils" />
<p><strong>Quick Start for Experienced Users:</strong></p>
<p>LAMMPS should be built with the USER-INTEL package installed.
Simulations should be run with 1 MPI task per physical <em>core</em>,
not <em>hardware thread</em>.</p>
<p>For Intel Xeon CPUs:</p>
<ul class="simple">
<li>Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.</li>
<li>If using <em>kspace_style pppm</em> in the input script, add “neigh_modify binsize 3” and “kspace_modify diff ad” to the input script for better
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision
</pre>
<p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p>
<p>As an alternative to adding command-line arguments, the input script
can be edited to enable the USER-INTEL package. This requires adding
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command to the top of the input
script. For the second example above, this would be:</p>
<p>To enable the USER-INTEL package only for individual styles, you can
add an “intel” suffix to the individual style, e.g.:</p>
<pre class="literal-block">
pair_style lj/cut/intel 2.5
</pre>
<p>Alternatively, the <a class="reference internal" href="suffix.html"><span class="doc">suffix intel</span></a> command can be added to
the input script to enable USER-INTEL styles for the commands that
follow in the input script.</p>
<p><strong>Tuning for Performance:</strong></p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The USER-INTEL package will perform better with modifications
to the input script when <a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> is used:
<a class="reference internal" href="kspace_modify.html"><span class="doc">kspace_modify diff ad</span></a> and <a class="reference internal" href="neigh_modify.html"><span class="doc">neigh_modify binsize 3</span></a> should be added to the input script.</p>
</div>
<p>Long-Range Thread (LRT) mode is an option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command that can improve performance when using
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> for long-range electrostatics on processors
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. On Intel Xeon Phi x200 series CPUs, this will likely
always improve performance, even on a single node. On Intel Xeon
processors, using this mode might result in better performance when
using multiple nodes, depending on the machine. To use this mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run and add the “lrt yes” option to the “-pk”
command-line suffix or “package intel” command. For example, if a run
would normally perform best with “-pk intel 0 omp 4”, instead use
“-pk intel 0 omp 3 lrt yes”. When using LRT, you should set the
environment variable “KMP_AFFINITY=none”. LRT mode is not supported
when using offload.</p>
<p>Not all styles are supported in the USER-INTEL package. You can mix
the USER-INTEL package with styles from the <a class="reference internal" href="accelerate_opt.html"><span class="doc">OPT</span></a>
package or the <a class="reference external" href="accelerate_omp.html"">USER-OMP package</a>. Of course,
this requires that these packages were installed at build time. This
can performed automatically by using “-sf hybrid intel opt” or
“-sf hybrid intel omp” command-line options. Alternatively, the “opt”
and “omp” suffixes can be appended manually in the input script. For
the latter, the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command must be in the
input script or the “-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> must be used where Nt is the
number of OpenMP threads. The number of OpenMP threads should not be
set differently for the different packages. Note that the <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> command can also be used within the
input script to automatically append the “omp” suffix to styles when
USER-INTEL styles are not available.</p>
<p>When running on many nodes, performance might be better when using
fewer OpenMP threads and more MPI tasks. This will depend on the
simulation and the machine. Using the <a class="reference internal" href="run_style.html"><span class="doc">verlet/split</span></a>
run style might also give better performance for simulations with
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> electrostatics. Note that this is an
alternative to LRT mode and the two cannot be used together.</p>
<p>Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable “I_MPI_SHM_LMT=shm” for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
that MPI runs are performed in MCDRAM.</p>
<p><strong>Tuning for Offload Performance:</strong></p>
<p>The default settings for offload should give good performance.</p>
<p>When using LAMMPS with offload to Intel coprocessors, best performance
will typically be achieved with concurrent calculations performed on
both the CPU and the coprocessor. This is achieved by offloading only
a fraction of the neighbor and pair computations to the coprocessor or
using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> pair styles where only one style uses
the “intel” suffix. For simulations with long-range electrostatics or
bond, angle, dihedral, improper calculations, computation and data
transfer to the coprocessor will run concurrently with computations
and MPI communications for these calculations on the host CPU. This
is illustrated in the figure below for the rhodopsin protein benchmark
running on E5-2697v2 processors with a Intel Xeon Phi 7120p
coprocessor. In this plot, the vertical access is time and routines
running at the same time are running concurrently on both the host and
<p>The fraction of the offloaded work is controlled by the <em>balance</em>
keyword in the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. A balance of 0
runs all calculations on the CPU. A balance of 1 runs all
supported calculations on the coprocessor. A balance of 0.5 runs half
of the calculations on the coprocessor. Setting the balance to -1
(the default) will enable dynamic load balancing that continously
adjusts the fraction of offloaded work throughout the simulation.
Because data transfer cannot be timed, this option typically produces
results within 5 to 10 percent of the optimal fixed balance.</p>
<p>If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs.</p>
<p>The default for the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command is to have
all the MPI tasks on a given compute node use a single Xeon Phi
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command.</p>
<p>The USER-INTEL package has two modes for deciding which atoms will be
handled by the coprocessor. This choice is controlled with the <em>ghost</em>
keyword of the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the coprocessor when the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a>
setting is “on”. The default is dependent on the style being used,
however, better performance may be achieved by setting this option
explictly.</p>
<p>When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks.</p>
<p>If pair computations are being offloaded to an Intel Xeon Phi
coprocessor, a diagnostic line is printed to the screen (not to the
log file), during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><span class="doc">atom sorting</span></a> is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available coprocessor threads on each Phi will be divided among MPI
tasks, unless the <em>tptask</em> option of the “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> is used to limit the coprocessor
threads per MPI task.</p>
<div class="section" id="restrictions">
<h2>5.3.2.1. Restrictions</h2>
<p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> styles
that require skip lists for neighbor builds cannot be offloaded.
Using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid/overlay</span></a> is allowed. Only one intel
accelerated style may be used with hybrid styles.
<a class="reference internal" href="special_bonds.html"><span class="doc">Special_bonds</span></a> exclusion lists are not currently
supported with offload, however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the USER-INTEL package currently support the
“inner”, “middle”, “outer” options for rRESPA integration via the
<a class="reference internal" href="run_style.html"><span class="doc">run_style respa</span></a> command; only the “pair” option is
supported.</p>
<p><strong>References:</strong></p>
<ul class="simple">
<li>Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann.</li>
<li>Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press.</li>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.