accelerate_intel.html
No OneTemporary
Actions

Subscribers

None

File Metadata

Created: Sat, Nov 2, 16:50

accelerate_intel.html
View Options



	<!DOCTYPE html>
	<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
	<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
	<head>
	<meta charset="utf-8">

	<meta name="viewport" content="width=device-width, initial-scale=1.0">

	<title>5.USER-INTEL package — LAMMPS documentation</title>















	<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />



	<link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />



	<link rel="top" title="LAMMPS documentation" href="index.html"/>


	<script src="_static/js/modernizr.min.js"></script>

	</head>

	<body class="wy-body-for-nav" role="document">

	<div class="wy-grid-for-nav">


	<nav data-toggle="wy-nav-shift" class="wy-nav-side">
	<div class="wy-side-nav-search">



	<a href="Manual.html" class="icon icon-home"> LAMMPS



	</a>


	<div role="search">
	<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
	<input type="text" name="q" placeholder="Search docs" />
	<input type="hidden" name="check_keywords" value="yes" />
	<input type="hidden" name="area" value="default" />
	</form>
	</div>


	</div>

	<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">



	<ul>
	<li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
	<li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
	</ul>



	</div>

	</nav>

	<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">


	<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
	<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
	<a href="Manual.html">LAMMPS</a>
	</nav>



	<div class="wy-nav-content">
	<div class="rst-content">
	<div role="navigation" aria-label="breadcrumbs navigation">
	<ul class="wy-breadcrumbs">
	<li><a href="Manual.html">Docs</a> »</li>

	<li>5.USER-INTEL package</li>
	<li class="wy-breadcrumbs-aside">


	<a href="http://lammps.sandia.gov">Website</a>
	<a href="Section_commands.html#comm">Commands</a>

	</li>
	</ul>
	<hr/>

	</div>
	<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
	<div itemprop="articleBody">

	<p><a class="reference internal" href="Section_accelerate.html"><span class="doc">Return to Section accelerate overview</span></a></p>
	<div class="section" id="user-intel-package">
	<h1>5.USER-INTEL package</h1>
	<p>The USER-INTEL package is maintained by Mike Brown at Intel
	Corporation. It provides two methods for accelerating simulations,
	depending on the hardware you have. The first is acceleration on
	Intel CPUs by running in single, mixed, or double precision with
	vectorization. The second is acceleration on Intel Xeon Phi
	coprocessors via offloading neighbor list and non-bonded force
	calculations to the Phi. The same C++ code is used in both cases.
	When offloading to a coprocessor from a CPU, the same routine is run
	twice, once on the CPU and once with an offload flag. This allows
	LAMMPS to run on the CPU cores and coprocessor cores simulataneously.</p>
	<p><strong>Currently Available USER-INTEL Styles:</strong></p>
	<ul class="simple">
	<li>Angle Styles: charmm, harmonic</li>
	<li>Bond Styles: fene, harmonic</li>
	<li>Dihedral Styles: charmm, harmonic, opls</li>
	<li>Fixes: nve, npt, nvt, nvt/sllod</li>
	<li>Improper Styles: cvff, harmonic</li>
	<li>Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne,
	charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff</li>
	<li>K-Space Styles: pppm</li>
	</ul>
	<p><strong>Speed-ups to expect:</strong></p>
	<p>The speedups will depend on your simulation, the hardware, which
	styles are used, the number of atoms, and the floating-point
	precision mode. Performance improvements are shown compared to
	LAMMPS <em>without using other acceleration packages</em> as these are
	under active development (and subject to performance changes). The
	measurements were performed using the input files available in
	the src/USER-INTEL/TEST directory. These are scalable in size; the
	results given are with 512K particles (524K for Liquid Crystal).
	Most of the simulations are standard LAMMPS benchmarks (indicated
	by the filename extension in parenthesis) with modifications to the
	run length and to add a warmup run (for use with offload
	benchmarks).</p>
	<img alt="_images/user_intel.png" class="align-center" src="_images/user_intel.png" />
	<p>Results are speedups obtained on Intel Xeon E5-2697v4 processors
	(code-named Broadwell) and Intel Xeon Phi 7250 processors
	(code-named Knights Landing) with “18 Jun 2016” LAMMPS built with
	Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
	per physical core. See <em>src/USER-INTEL/TEST/README</em> for the raw
	simulation rates and instructions to reproduce.</p>
	<hr class="docutils" />
	<p><strong>Quick Start for Experienced Users:</strong></p>
	<p>LAMMPS should be built with the USER-INTEL package installed.
	Simulations should be run with 1 MPI task per physical <em>core</em>,
	not <em>hardware thread</em>.</p>
	<p>For Intel Xeon CPUs:</p>
	<ul class="simple">
	<li>Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.</li>
	<li>If using <em>kspace_style pppm</em> in the input script, add “neigh_modify binsize 3” and “kspace_modify diff ad” to the input script for better
	performance.</li>
	<li>“-pk intel 0 omp 2 -sf intel” added to LAMMPS command-line</li>
	</ul>
	<p>For Intel Xeon Phi CPUs for simulations without <em>kspace_style
	pppm</em> in the input script</p>
	<ul class="simple">
	<li>Edit src/MAKE/OPTIONS/Makefile.knl as necessary.</li>
	<li>Runs should be performed using MCDRAM.</li>
	<li>“-pk intel 0 omp 2 -sf intel” <em>or</em> “-pk intel 0 omp 4 -sf intel”
	should be added to the LAMMPS command-line. Choice for best
	performance will depend on the simulation.</li>
	</ul>
	<p>For Intel Xeon Phi CPUs for simulations with <em>kspace_style
	pppm</em> in the input script:</p>
	<ul class="simple">
	<li>Edit src/MAKE/OPTIONS/Makefile.knl as necessary.</li>
	<li>Runs should be performed using MCDRAM.</li>
	<li>Add “neigh_modify binsize 3” to the input script for better
	performance.</li>
	<li>Add “kspace_modify diff ad” to the input script for better
	performance.</li>
	<li>export KMP_AFFINITY=none</li>
	<li>“-pk intel 0 omp 3 lrt yes -sf intel” or “-pk intel 0 omp 1 lrt yes
	-sf intel” added to LAMMPS command-line. Choice for best performance
	will depend on the simulation.</li>
	</ul>
	<p>For Intel Xeon Phi coprocessors (Offload):</p>
	<ul class="simple">
	<li>Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary</li>
	<li>“-pk intel N omp 1” added to command-line where N is the number of
	coprocessors per node.</li>
	</ul>
	<hr class="docutils" />
	<p><strong>Required hardware/software:</strong></p>
	<p>In order to use offload to coprocessors, an Intel Xeon Phi
	coprocessor and an Intel compiler are required. For this, the
	recommended version of the Intel compiler is 14.0.1.106 or
	versions 15.0.2.044 and higher.</p>
	<p>Although any compiler can be used with the USER-INTEL pacakge,
	currently, vectorization directives are disabled by default when
	not using Intel compilers due to lack of standard support and
	observations of decreased performance. The OpenMP standard now
	supports directives for vectorization and we plan to transition the
	code to this standard once it is available in most compilers. We
	expect this to allow improved performance and support with other
	compilers.</p>
	<p>For Intel Xeon Phi x200 series processors (code-named Knights
	Landing), there are multiple configuration options for the hardware.
	For best performance, we recommend that the MCDRAM is configured in
	“Flat” mode and with the cluster mode set to “Quadrant” or “SNC4”.
	“Cache” mode can also be used, although the performance might be
	slightly lower.</p>
	<p><strong>Notes about Simultaneous Multithreading:</strong></p>
	<p>Modern CPUs often support Simultaneous Multithreading (SMT). On
	Intel processors, this is called Hyper-Threading (HT) technology.
	SMT is hardware support for running multiple threads efficiently on
	a single core. <em>Hardware threads</em> or <em>logical cores</em> are often used
	to refer to the number of threads that are supported in hardware.
	For example, the Intel Xeon E5-2697v4 processor is described
	as having 36 cores and 72 threads. This means that 36 MPI processes
	or OpenMP threads can run simultaneously on separate cores, but that
	up to 72 MPI processes or OpenMP threads can be running on the CPU
	without costly operating system context switches.</p>
	<p>Molecular dynamics simulations will often run faster when making use
	of SMT. If a thread becomes stalled, for example because it is
	waiting on data that has not yet arrived from memory, another thread
	can start running so that the CPU pipeline is still being used
	efficiently. Although benefits can be seen by launching a MPI task
	for every hardware thread, for multinode simulations, we recommend
	that OpenMP threads are used for SMT instead, either with the
	USER-INTEL package, <a class="reference external" href="accelerate_omp.html"">USER-OMP package</a>, or
	<a class="reference internal" href="accelerate_kokkos.html"><span class="doc">KOKKOS package</span></a>. In the example above, up
	to 36X speedups can be observed by using all 36 physical cores with
	LAMMPS. By using all 72 hardware threads, an additional 10-30%
	performance gain can be achieved.</p>
	<p>The BIOS on many platforms allows SMT to be disabled, however, we do
	not recommend this on modern processors as there is little to no
	benefit for any software package in most cases. The operating system
	will report every hardware thread as a separate core allowing one to
	determine the number of hardware threads available. On Linux systems,
	this information can normally be obtained with:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cat</span> <span class="o">/</span><span class="n">proc</span><span class="o">/</span><span class="n">cpuinfo</span>
	</pre></div>
	</div>
	<p><strong>Building LAMMPS with the USER-INTEL package:</strong></p>
	<p>The USER-INTEL package must be installed into the source directory:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">make</span> <span class="n">yes</span><span class="o">-</span><span class="n">user</span><span class="o">-</span><span class="n">intel</span>
	</pre></div>
	</div>
	<p>Several example Makefiles for building with the Intel compiler are
	included with LAMMPS in the src/MAKE/OPTIONS/ directory:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">Makefile</span><span class="o">.</span><span class="n">intel_cpu_intelmpi</span> <span class="c1"># Intel Compiler, Intel MPI, No Offload</span>
	<span class="n">Makefile</span><span class="o">.</span><span class="n">knl</span> <span class="c1"># Intel Compiler, Intel MPI, No Offload</span>
	<span class="n">Makefile</span><span class="o">.</span><span class="n">intel_cpu_mpich</span> <span class="c1"># Intel Compiler, MPICH, No Offload</span>
	<span class="n">Makefile</span><span class="o">.</span><span class="n">intel_cpu_openpmi</span> <span class="c1"># Intel Compiler, OpenMPI, No Offload</span>
	<span class="n">Makefile</span><span class="o">.</span><span class="n">intel_coprocessor</span> <span class="c1"># Intel Compiler, Intel MPI, Offload</span>
	</pre></div>
	</div>
	<p>Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that
	it explicitly specifies that vectorization should be for Intel
	Xeon Phi x200 processors making it easier to cross-compile. For
	users with recent installations of Intel Parallel Studio, the
	process can be as simple as:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">make</span> <span class="n">yes</span><span class="o">-</span><span class="n">user</span><span class="o">-</span><span class="n">intel</span>
	<span class="n">source</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">intel</span><span class="o">/</span><span class="n">parallel_studio_xe_2016</span><span class="o">.</span><span class="mf">3.067</span><span class="o">/</span><span class="n">psxevars</span><span class="o">.</span><span class="n">sh</span>
	<span class="c1"># or psxevars.csh for C-shell</span>
	<span class="n">make</span> <span class="n">intel_cpu_intelmpi</span>
	</pre></div>
	</div>
	<p>Alternatively, the build can be accomplished with the src/Make.py
	script, described in <a class="reference internal" href="Section_start.html#start-4"><span class="std std-ref">Section 2.4</span></a> of the
	manual. Type “Make.py -h” for help. For an example:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">Make</span><span class="o">.</span><span class="n">py</span> <span class="o">-</span><span class="n">v</span> <span class="o">-</span><span class="n">p</span> <span class="n">intel</span> <span class="n">omp</span> <span class="o">-</span><span class="n">intel</span> <span class="n">cpu</span> <span class="o">-</span><span class="n">a</span> <span class="n">file</span> <span class="n">intel_cpu_intelmpi</span>
	</pre></div>
	</div>
	<p>Note that if you build with support for a Phi coprocessor, the same
	binary can be used on nodes with or without coprocessors installed.
	However, if you do not have coprocessors on your system, building
	without offload support will produce a smaller binary.</p>
	<p>The general requirements for Makefiles with the USER-INTEL package
	are as follows. “-DLAMMPS_MEMALIGN=64” is required for CCFLAGS. When
	using Intel compilers, “-restrict” is required and “-qopenmp” is
	highly recommended for CCFLAGS and LINKFLAGS. LIB should include
	“-ltbbmalloc”. For builds supporting offload, “-DLMP_INTEL_OFFLOAD”
	is required for CCFLAGS and “-qoffload” is required for LINKFLAGS.
	Other recommended CCFLAG options for best performance are
	“-O2 -fno-alias -ansi-alias -qoverride-limits fp-model fast=2
	-no-prec-div”. The Make.py command will add all of these
	automatically.</p>
	<div class="admonition note">
	<p class="first admonition-title">Note</p>
	<p class="last">The vectorization and math capabilities can differ depending on
	the CPU. For Intel compilers, the “-x” flag specifies the type of
	processor for which to optimize. “-xHost” specifies that the compiler
	should build for the processor used for compiling. For Intel Xeon Phi
	x200 series processors, this option is “-xMIC-AVX512”. For fourth
	generation Intel Xeon (v4/Broadwell) processors, “-xCORE-AVX2” should
	be used. For older Intel Xeon processors, “-xAVX” will perform best
	in general for the different simulations in LAMMPS. The default
	in most of the example Makefiles is to use “-xHost”, however this
	should not be used when cross-compiling.</p>
	</div>
	<p><strong>Running LAMMPS with the USER-INTEL package:</strong></p>
	<p>Running LAMMPS with the USER-INTEL package is similar to normal use
	with the exceptions that one should 1) specify that LAMMPS should use
	the USER-INTEL package, 2) specify the number of OpenMP threads, and
	3) optionally specify the specific LAMMPS styles that should use the
	USER-INTEL package. 1) and 2) can be performed from the command-line
	or by editing the input script. 3) requires editing the input script.
	Advanced performance tuning options are also described below to get
	the best performance.</p>
	<p>When running on a single node (including runs using offload to a
	coprocessor), best performance is normally obtained by using 1 MPI
	task per physical core and additional OpenMP threads with SMT. For
	Intel Xeon processors, 2 OpenMP threads should be used for SMT.
	For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used
	(best choice depends on the simulation). In cases where the user
	specifies that LRT mode is used (described below), 1 or 3 OpenMP
	threads should be used. For multi-node runs, using 1 MPI task per
	physical core will often perform best, however, depending on the
	machine and scale, users might get better performance by decreasing
	the number of MPI tasks and using more OpenMP threads. For
	performance, the product of the number of MPI tasks and OpenMP
	threads should not exceed the number of available hardware threads in
	almost all cases.</p>
	<div class="admonition note">
	<p class="first admonition-title">Note</p>
	<p class="last">Setting core affinity is often used to pin MPI tasks and OpenMP
	threads to a core or group of cores so that memory access can be
	uniform. Unless disabled at build time, affinity for MPI tasks and
	OpenMP threads on the host (CPU) will be set by default on the host
	<em>when using offload to a coprocessor</em>. In this case, it is unnecessary
	to use other methods to control affinity (e.g. taskset, numactl,
	I_MPI_PIN_DOMAIN, etc.). This can be disabled with the <em>no_affinity</em>
	option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command or by disabling the
	option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
	CCFLAGS line of your Makefile). Disabling this option is not
	recommended, especially when running on a machine with Intel
	Hyper-Threading technology disabled.</p>
	</div>
	<p><strong>Run with the USER-INTEL package from the command line:</strong></p>
	<p>To enable USER-INTEL optimizations for all available styles used in
	the input script, the “-sf intel”
	<a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> can be used without
	any requirement for editing the input script. This switch will
	automatically append “intel” to styles that support it. It also
	invokes a default command: <a class="reference internal" href="package.html"><span class="doc">package intel 1</span></a>. This
	package command is used to set options for the USER-INTEL package.
	The default package command will specify that USER-INTEL calculations
	are performed in mixed precision, that the number of OpenMP threads
	is specified by the OMP_NUM_THREADS environment variable, and that
	if coprocessors are present and the binary was built with offload
	support, that 1 coprocessor per node will be used with automatic
	balancing of work between the CPU and the coprocessor.</p>
	<p>You can specify different options for the USER-INTEL package by using
	the “-pk intel Nphi” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a>
	with keyword/value pairs as specified in the documentation. Here,
	Nphi = # of Xeon Phi coprocessors/node (ignored without offload
	support). Common options to the USER-INTEL package include <em>omp</em> to
	override any OMP_NUM_THREADS setting and specify the number of OpenMP
	threads, <em>mode</em> to set the floating-point precision mode, and
	<em>lrt</em> to enable Long-Range Thread mode as described below. See the
	<a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command for details, including the
	default values used for all its options if not specified, and how to
	set the number of OpenMP threads via the OMP_NUM_THREADS environment
	variable if desired.</p>
	<p>Examples (see documentation for your MPI/Machine for differences in
	launching MPI applications):</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">72</span> <span class="o">-</span><span class="n">ppn</span> <span class="mi">36</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="c1"># 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads</span>
	<span class="n">mpirun</span> <span class="o">-</span><span class="n">np</span> <span class="mi">72</span> <span class="o">-</span><span class="n">ppn</span> <span class="mi">36</span> <span class="n">lmp_machine</span> <span class="o">-</span><span class="n">sf</span> <span class="n">intel</span> <span class="o">-</span><span class="ow">in</span> <span class="ow">in</span><span class="o">.</span><span class="n">script</span> <span class="o">-</span><span class="n">pk</span> <span class="n">intel</span> <span class="mi">0</span> <span class="n">omp</span> <span class="mi">2</span> <span class="n">mode</span> <span class="n">double</span> <span class="c1"># Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision</span>
	</pre></div>
	</div>
	<p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p>
	<p>As an alternative to adding command-line arguments, the input script
	can be edited to enable the USER-INTEL package. This requires adding
	the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command to the top of the input
	script. For the second example above, this would be:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">package</span> <span class="n">intel</span> <span class="mi">0</span> <span class="n">omp</span> <span class="mi">2</span> <span class="n">mode</span> <span class="n">double</span>
	</pre></div>
	</div>
	<p>To enable the USER-INTEL package only for individual styles, you can
	add an “intel” suffix to the individual style, e.g.:</p>
	<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">pair_style</span> <span class="n">lj</span><span class="o">/</span><span class="n">cut</span><span class="o">/</span><span class="n">intel</span> <span class="mf">2.5</span>
	</pre></div>
	</div>
	<p>Alternatively, the <a class="reference internal" href="suffix.html"><span class="doc">suffix intel</span></a> command can be added to
	the input script to enable USER-INTEL styles for the commands that
	follow in the input script.</p>
	<p><strong>Tuning for Performance:</strong></p>
	<div class="admonition note">
	<p class="first admonition-title">Note</p>
	<p class="last">The USER-INTEL package will perform better with modifications
	to the input script when <a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> is used:
	<a class="reference internal" href="kspace_modify.html"><span class="doc">kspace_modify diff ad</span></a> and <a class="reference internal" href="neigh_modify.html"><span class="doc">neigh_modify binsize 3</span></a> should be added to the input script.</p>
	</div>
	<p>Long-Range Thread (LRT) mode is an option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command that can improve performance when using
	<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> for long-range electrostatics on processors
	with SMT. It generates an extra pthread for each MPI task. The thread
	is dedicated to performing some of the PPPM calculations and MPI
	communications. On Intel Xeon Phi x200 series CPUs, this will likely
	always improve performance, even on a single node. On Intel Xeon
	processors, using this mode might result in better performance when
	using multiple nodes, depending on the machine. To use this mode,
	specify that the number of OpenMP threads is one less than would
	normally be used for the run and add the “lrt yes” option to the “-pk”
	command-line suffix or “package intel” command. For example, if a run
	would normally perform best with “-pk intel 0 omp 4”, instead use
	“-pk intel 0 omp 3 lrt yes”. When using LRT, you should set the
	environment variable “KMP_AFFINITY=none”. LRT mode is not supported
	when using offload.</p>
	<p>Not all styles are supported in the USER-INTEL package. You can mix
	the USER-INTEL package with styles from the <a class="reference internal" href="accelerate_opt.html"><span class="doc">OPT</span></a>
	package or the <a class="reference external" href="accelerate_omp.html"">USER-OMP package</a>. Of course,
	this requires that these packages were installed at build time. This
	can performed automatically by using “-sf hybrid intel opt” or
	“-sf hybrid intel omp” command-line options. Alternatively, the “opt”
	and “omp” suffixes can be appended manually in the input script. For
	the latter, the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command must be in the
	input script or the “-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> must be used where Nt is the
	number of OpenMP threads. The number of OpenMP threads should not be
	set differently for the different packages. Note that the <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> command can also be used within the
	input script to automatically append the “omp” suffix to styles when
	USER-INTEL styles are not available.</p>
	<p>When running on many nodes, performance might be better when using
	fewer OpenMP threads and more MPI tasks. This will depend on the
	simulation and the machine. Using the <a class="reference internal" href="run_style.html"><span class="doc">verlet/split</span></a>
	run style might also give better performance for simulations with
	<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> electrostatics. Note that this is an
	alternative to LRT mode and the two cannot be used together.</p>
	<p>Currently, when using Intel MPI with Intel Xeon Phi x200 series
	CPUs, better performance might be obtained by setting the
	environment variable “I_MPI_SHM_LMT=shm” for Linux kernels that do
	not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
	series processors will always perform better using MCDRAM. Please
	consult your system documentation for the best approach to specify
	that MPI runs are performed in MCDRAM.</p>
	<p><strong>Tuning for Offload Performance:</strong></p>
	<p>The default settings for offload should give good performance.</p>
	<p>When using LAMMPS with offload to Intel coprocessors, best performance
	will typically be achieved with concurrent calculations performed on
	both the CPU and the coprocessor. This is achieved by offloading only
	a fraction of the neighbor and pair computations to the coprocessor or
	using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> pair styles where only one style uses
	the “intel” suffix. For simulations with long-range electrostatics or
	bond, angle, dihedral, improper calculations, computation and data
	transfer to the coprocessor will run concurrently with computations
	and MPI communications for these calculations on the host CPU. This
	is illustrated in the figure below for the rhodopsin protein benchmark
	running on E5-2697v2 processors with a Intel Xeon Phi 7120p
	coprocessor. In this plot, the vertical access is time and routines
	running at the same time are running concurrently on both the host and
	the coprocessor.</p>
	<img alt="_images/offload_knc.png" class="align-center" src="_images/offload_knc.png" />
	<p>The fraction of the offloaded work is controlled by the <em>balance</em>
	keyword in the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. A balance of 0
	runs all calculations on the CPU. A balance of 1 runs all
	supported calculations on the coprocessor. A balance of 0.5 runs half
	of the calculations on the coprocessor. Setting the balance to -1
	(the default) will enable dynamic load balancing that continously
	adjusts the fraction of offloaded work throughout the simulation.
	Because data transfer cannot be timed, this option typically produces
	results within 5 to 10 percent of the optimal fixed balance.</p>
	<p>If running short benchmark runs with dynamic load balancing, adding a
	short warm-up run (10-20 steps) will allow the load-balancer to find a
	near-optimal setting that will carry over to additional runs.</p>
	<p>The default for the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command is to have
	all the MPI tasks on a given compute node use a single Xeon Phi
	coprocessor. In general, running with a large number of MPI tasks on
	each node will perform best with offload. Each MPI task will
	automatically get affinity to a subset of the hardware threads
	available on the coprocessor. For example, if your card has 61 cores,
	with 60 cores available for offload and 4 hardware threads per core
	(240 total threads), running with 24 MPI tasks per node will cause
	each MPI task to use a subset of 10 threads on the coprocessor. Fine
	tuning of the number of threads to use per MPI task or the number of
	threads to use per core can be accomplished with keyword settings of
	the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command.</p>
	<p>The USER-INTEL package has two modes for deciding which atoms will be
	handled by the coprocessor. This choice is controlled with the <em>ghost</em>
	keyword of the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. When set to 0,
	ghost atoms (atoms at the borders between MPI tasks) are not offloaded
	to the card. This allows for overlap of MPI communication of forces
	with computation on the coprocessor when the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a>
	setting is “on”. The default is dependent on the style being used,
	however, better performance may be achieved by setting this option
	explictly.</p>
	<p>When using offload with CPU Hyper-Threading disabled, it may help
	performance to use fewer MPI tasks and OpenMP threads than available
	cores. This is due to the fact that additional threads are generated
	internally to handle the asynchronous offload tasks.</p>
	<p>If pair computations are being offloaded to an Intel Xeon Phi
	coprocessor, a diagnostic line is printed to the screen (not to the
	log file), during the setup phase of a run, indicating that offload
	mode is being used and indicating the number of coprocessor threads
	per MPI task. Additionally, an offload timing summary is printed at
	the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><span class="doc">atom sorting</span></a> is changed to 1 so that the per-atom data is
	effectively sorted at every rebuild of the neighbor lists. All the
	available coprocessor threads on each Phi will be divided among MPI
	tasks, unless the <em>tptask</em> option of the “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span class="std std-ref">command-line switch</span></a> is used to limit the coprocessor
	threads per MPI task.</p>
	<div class="section" id="restrictions">
	<h2>Restrictions</h2>
	<p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> styles
	that require skip lists for neighbor builds cannot be offloaded.
	Using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid/overlay</span></a> is allowed. Only one intel
	accelerated style may be used with hybrid styles.
	<a class="reference internal" href="special_bonds.html"><span class="doc">Special_bonds</span></a> exclusion lists are not currently
	supported with offload, however, the same effect can often be
	accomplished by setting cutoffs for excluded atom types to 0. None of
	the pair styles in the USER-INTEL package currently support the
	“inner”, “middle”, “outer” options for rRESPA integration via the
	<a class="reference internal" href="run_style.html"><span class="doc">run_style respa</span></a> command; only the “pair” option is
	supported.</p>
	<p><strong>References:</strong></p>
	<ul class="simple">
	<li>Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann.</li>
	<li>Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press.</li>
	<li>Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101.</li>
	</ul>
	</div>
	</div>


	</div>
	</div>
	<footer>


	<hr/>

	<div role="contentinfo">
	<p>
	© Copyright 2013 Sandia Corporation.
	</p>
	</div>
	Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.

	</footer>

	</div>
	</div>

	</section>

	</div>





	<script type="text/javascript">
	var DOCUMENTATION_OPTIONS = {
	URL_ROOT:'./',
	VERSION:'',
	COLLAPSE_INDEX:false,
	FILE_SUFFIX:'.html',
	HAS_SOURCE: true
	};
	</script>
	<script type="text/javascript" src="_static/jquery.js"></script>
	<script type="text/javascript" src="_static/underscore.js"></script>
	<script type="text/javascript" src="_static/doctools.js"></script>
	<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
	<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
	<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
	<script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>





	<script type="text/javascript" src="_static/js/theme.js"></script>




	<script type="text/javascript">
	jQuery(function () {
	SphinxRtdTheme.StickyNav.enable();
	});
	</script>


	</body>
	</html>

accelerate_intel.htmlNo OneTemporaryActions

File Metadata

accelerate_intel.htmlView Options

Event Timeline

accelerate_intel.html
No OneTemporary
Actions

accelerate_intel.html
View Options