diff --git a/doc/accelerate_cuda.html b/doc/accelerate_cuda.html
index f1d45f093..65183cff2 100644
--- a/doc/accelerate_cuda.html
+++ b/doc/accelerate_cuda.html
@@ -1,372 +1,372 @@
 
 
 <!DOCTYPE html>
 <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
 <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
 <head>
   <meta charset="utf-8">
   
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
   <title>5.USER-CUDA package &mdash; LAMMPS 15 May 2015 version documentation</title>
   
 
   
   
 
   
 
   
   
     
 
   
 
   
   
     <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   
 
   
     <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
   
 
   
     <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> 
 
   
   <script src="_static/js/modernizr.min.js"></script>
 
 </head>
 
 <body class="wy-body-for-nav" role="document">
 
   <div class="wy-grid-for-nav">
 
     
     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
       <div class="wy-side-nav-search">
         
 
         
           <a href="Manual.html" class="icon icon-home"> LAMMPS
         
 
         
         </a>
 
         
 <div role="search">
   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
     <input type="text" name="q" placeholder="Search docs" />
     <input type="hidden" name="check_keywords" value="yes" />
     <input type="hidden" name="area" value="default" />
   </form>
 </div>
 
         
       </div>
 
       <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
         
           
           
               <ul>
 <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
 </ul>
 
           
         
       </div>
       &nbsp;
     </nav>
 
     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
 
       
       <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
         <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
         <a href="Manual.html">LAMMPS</a>
       </nav>
 
 
       
       <div class="wy-nav-content">
         <div class="rst-content">
           <div role="navigation" aria-label="breadcrumbs navigation">
   <ul class="wy-breadcrumbs">
     <li><a href="Manual.html">Docs</a> &raquo;</li>
       
     <li>5.USER-CUDA package</li>
       <li class="wy-breadcrumbs-aside">
         
           
             <a href="http://lammps.sandia.gov">Website</a>
             <a href="Section_commands.html#comm">Commands</a>
         
       </li>
   </ul>
   <hr/>
   
 </div>
           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
            <div itemprop="articleBody">
             
   <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
 <div class="section" id="user-cuda-package">
 <h1>5.USER-CUDA package<a class="headerlink" href="#user-cuda-package" title="Permalink to this headline">¶</a></h1>
 <p>The USER-CUDA package was developed by Christian Trott (Sandia) while
 at U Technology Ilmenau in Germany.  It provides NVIDIA GPU versions
 of many pair styles, many fixes, a few computes, and for long-range
 Coulombics via the PPPM command.  It has the following general
 features:</p>
 <ul class="simple">
 <li>The package is designed to allow an entire LAMMPS calculation, for
 many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU.</li>
 <li>The speed-up advantage of this approach is typically better when the
 number of atoms per GPU is large</li>
 <li>Data will stay on the GPU until a timestep where a non-USER-CUDA fix
 or compute is invoked.  Whenever a non-GPU operation occurs (fix,
 compute, output), data automatically moves back to the CPU as needed.
 This may incur a performance penalty, but should otherwise work
 transparently.</li>
 <li>Neighbor lists are constructed on the GPU.</li>
 <li>The package only supports use of a single MPI task, running on a
 single CPU (core), assigned to each GPU.</li>
 </ul>
 <p>Here is a quick overview of how to use the USER-CUDA package:</p>
 <ul class="simple">
 <li>build the library in lib/cuda for your GPU hardware with desired precision</li>
 <li>include the USER-CUDA package and build LAMMPS</li>
 <li>use the mpirun command to specify 1 MPI task per GPU (on each node)</li>
 <li>enable the USER-CUDA package via the &#8220;-c on&#8221; command-line switch</li>
 <li>specify the # of GPUs per node</li>
 <li>use USER-CUDA styles in your input script</li>
 </ul>
 <p>The latter two steps can be done using the &#8220;-pk cuda&#8221; and &#8220;-sf cuda&#8221;
 <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively.  Or
 the effect of the &#8220;-pk&#8221; or &#8220;-sf&#8221; switches can be duplicated by adding
 the <a class="reference internal" href="package.html"><em>package cuda</em></a> or <a class="reference internal" href="suffix.html"><em>suffix cuda</em></a> commands
 respectively to your input script.</p>
 <p><strong>Required hardware/software:</strong></p>
 <p>To use this package, you need to have one or more NVIDIA GPUs and
 install the NVIDIA Cuda software on your system:</p>
 <p>Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This list may
 help you to find out the Compute Capability of your card:</p>
 <p><a class="reference external" href="http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units">http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units</a></p>
 <p>Install the NVIDIA CUDA Toolkit (version 3.2 or higher) and the
 corresponding GPU drivers.  The NVIDIA CUDA SDK is not required, but we
 recommend installing it as well, so you can verify that its sample
 projects compile without problems.</p>
 <p><strong>Building LAMMPS with the USER-CUDA package:</strong></p>
 <p>This requires two steps (a,b): build the USER-CUDA library, then build
 LAMMPS with the USER-CUDA package.</p>
 <p>You can do both these steps in one line, using the src/Make.py script,
 described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual.
 Type &#8220;Make.py -h&#8221; for help.  If run from the src directory, this
 command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the
 starting Makefile.machine:</p>
-<div class="highlight-python"><div class="highlight"><pre>Make.py -p cuda -cuda mode=single arch=20 -o cuda lib-cuda file mpi
+<div class="highlight-python"><div class="highlight"><pre>Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi
 </pre></div>
 </div>
 <p>Or you can follow these two (a,b) steps:</p>
 <ol class="loweralpha simple">
 <li>Build the USER-CUDA library</li>
 </ol>
 <p>The USER-CUDA library is in lammps/lib/cuda.  If your <em>CUDA</em> toolkit
 is not installed in the default system directory <em>/usr/local/cuda</em>, edit
 the file <em>lib/cuda/Makefile.common</em> accordingly.</p>
 <p>To build the library with the settings in lib/cuda/Makefile.default,
 simply type:</p>
 <div class="highlight-python"><div class="highlight"><pre><span class="n">make</span>
 </pre></div>
 </div>
 <p>To set options when the library is built, type &#8220;make OPTIONS&#8221;, where
 <em>OPTIONS</em> are one or more of the following. The settings will be
 written to the <em>lib/cuda/Makefile.defaults</em> before the build.</p>
 <pre class="literal-block">
 <em>precision=N</em> to set the precision level
   N = 1 for single precision (default)
   N = 2 for double precision
   N = 3 for positions in double precision
   N = 4 for positions and velocities in double precision
 <em>arch=M</em> to set GPU compute capability
   M = 35 for Kepler GPUs
   M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
   M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
   M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
 <em>prec_timer=0/1</em> to use high-precision timers
   0 = do not use them (default)
   1 = use them
   this is usually only useful for Mac machines
 <em>dbg=0/1</em> to activate debug mode
   0 = no debug mode (default)
   1 = yes debug mode
   this is only useful for developers
 <em>cufft=0/1</em> for use of the CUDA FFT library
   0 = no CUFFT support (default)
   in the future other CUDA-enabled FFT libraries might be supported
 </pre>
 <p>If the build is successful, it will produce the files liblammpscuda.a and
 Makefile.lammps.</p>
 <p>Note that if you change any of the options (like precision), you need
 to re-build the entire library.  Do a &#8220;make clean&#8221; first, followed by
 &#8220;make&#8221;.</p>
 <ol class="loweralpha simple" start="2">
 <li>Build LAMMPS with the USER-CUDA package</li>
 </ol>
 <div class="highlight-python"><div class="highlight"><pre>cd lammps/src
 make yes-user-cuda
 make machine
 </pre></div>
 </div>
 <p>No additional compile/link flags are needed in Makefile.machine.</p>
 <p>Note that if you change the USER-CUDA library precision (discussed
 above) and rebuild the USER-CUDA library, then you also need to
 re-install the USER-CUDA package and re-build LAMMPS, so that all
 affected files are re-compiled and linked to the new USER-CUDA
 library.</p>
 <p><strong>Run with the USER-CUDA package from the command line:</strong></p>
 <p>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.</p>
 <p>When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.</p>
 <p>You must use the &#8220;-c on&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package.
 The &#8220;-c on&#8221; switch also issues a default <a class="reference internal" href="package.html"><em>package cuda 1</em></a>
 command which sets various USER-CUDA options to default values, as
 discussed on the <a class="reference internal" href="package.html"><em>package</em></a> command doc page.</p>
 <p>Use the &#8220;-sf cuda&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
 which will automatically append &#8220;cuda&#8221; to styles that support it.  Use
 the &#8220;-pk cuda Ng&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to
 set Ng = # of GPUs per node to a different value than the default set
 by the &#8220;-c on&#8221; switch (1 GPU) or change other <a class="reference internal" href="package.html"><em>package cuda</em></a> options.</p>
 <div class="highlight-python"><div class="highlight"><pre>lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
 mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes
 </pre></div>
 </div>
 <p>The syntax for the &#8220;-pk&#8221; switch is the same as the &#8220;package
 cuda&#8221; command.  See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for
 details, including the default values used for all its options if it
 is not specified.</p>
 <p>Note that the default for the <a class="reference internal" href="package.html"><em>package cuda</em></a> command is
 to set the Newton flag to &#8220;off&#8221; for both pairwise and bonded
 interactions.  This typically gives the fastest performance.  If the
 <a class="reference internal" href="newton.html"><em>newton</em></a> command is used in the input script, it can
 override these defaults.</p>
 <p><strong>Or run with the USER-CUDA package by editing an input script:</strong></p>
 <p>The discussion above about the mpirun/mpiexec command and the requirement
 of one MPI task per GPU applies here as well.</p>
 <p>You must still use the &#8220;-c on&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package.</p>
 <p>Use the <a class="reference internal" href="suffix.html"><em>suffix cuda</em></a> command, or you can explicitly add a
 &#8220;cuda&#8221; suffix to individual styles in your input script, e.g.</p>
 <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/cuda 2.5
 </pre></div>
 </div>
 <p>You only need to use the <a class="reference internal" href="package.html"><em>package cuda</em></a> command if you
 wish to change any of its option defaults, including the number of
 GPUs/node (default = 1), as set by the &#8220;-c on&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>.</p>
 <p><strong>Speed-ups to expect:</strong></p>
 <p>The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).</p>
 <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
 LAMMPS web site for performance of the USER-CUDA package on different
 hardware.</p>
 <p><strong>Guidelines for best performance:</strong></p>
 <ul class="simple">
 <li>The USER-CUDA package offers more speed-up relative to CPU performance
 when the number of atoms per GPU is large, e.g. on the order of tens
 or hundreds of thousands.</li>
 <li>As noted above, this package will continue to run a simulation
 entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
 occurs, the faster your simulation will run.</li>
 </ul>
 <div class="section" id="restrictions">
 <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>
 <p>None.</p>
 </div>
 </div>
 
 
            </div>
           </div>
           <footer>
   
 
   <hr/>
 
   <div role="contentinfo">
     <p>
         &copy; Copyright .
     </p>
   </div>
   Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
 
 </footer>
 
         </div>
       </div>
 
     </section>
 
   </div>
   
 
 
   
 
     <script type="text/javascript">
         var DOCUMENTATION_OPTIONS = {
             URL_ROOT:'./',
             VERSION:'15 May 2015 version',
             COLLAPSE_INDEX:false,
             FILE_SUFFIX:'.html',
             HAS_SOURCE:  true
         };
     </script>
       <script type="text/javascript" src="_static/jquery.js"></script>
       <script type="text/javascript" src="_static/underscore.js"></script>
       <script type="text/javascript" src="_static/doctools.js"></script>
       <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
 
   
 
   
   
     <script type="text/javascript" src="_static/js/theme.js"></script>
   
 
   
   
   <script type="text/javascript">
       jQuery(function () {
           SphinxRtdTheme.StickyNav.enable();
       });
   </script>
    
 
 </body>
 </html>
\ No newline at end of file
diff --git a/doc/accelerate_cuda.txt b/doc/accelerate_cuda.txt
index 43b4d660d..5a6ca4925 100644
--- a/doc/accelerate_cuda.txt
+++ b/doc/accelerate_cuda.txt
@@ -1,223 +1,223 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.1 USER-CUDA package :h4
 
 The USER-CUDA package was developed by Christian Trott (Sandia) while
 at U Technology Ilmenau in Germany.  It provides NVIDIA GPU versions
 of many pair styles, many fixes, a few computes, and for long-range
 Coulombics via the PPPM command.  It has the following general
 features:
 
 The package is designed to allow an entire LAMMPS calculation, for
 many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU. :ulb,l
 
 The speed-up advantage of this approach is typically better when the
 number of atoms per GPU is large :l
 
 Data will stay on the GPU until a timestep where a non-USER-CUDA fix
 or compute is invoked.  Whenever a non-GPU operation occurs (fix,
 compute, output), data automatically moves back to the CPU as needed.
 This may incur a performance penalty, but should otherwise work
 transparently. :l
 
 Neighbor lists are constructed on the GPU. :l
 
 The package only supports use of a single MPI task, running on a
 single CPU (core), assigned to each GPU. :l,ule
 
 Here is a quick overview of how to use the USER-CUDA package:
 
 build the library in lib/cuda for your GPU hardware with desired precision
 include the USER-CUDA package and build LAMMPS
 use the mpirun command to specify 1 MPI task per GPU (on each node)
 enable the USER-CUDA package via the "-c on" command-line switch
 specify the # of GPUs per node
 use USER-CUDA styles in your input script :ul
 
 The latter two steps can be done using the "-pk cuda" and "-sf cuda"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package cuda"_package.html or "suffix cuda"_suffix.html commands
 respectively to your input script.
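 
 For example, adding these two lines to an input script has the same
 effect as the "-pk cuda 1" and "-sf cuda" switches:
 
 package cuda 1
 suffix cuda :pre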
 
 [Required hardware/software:]
 
 To use this package, you need to have one or more NVIDIA GPUs and
 install the NVIDIA Cuda software on your system:
 
 Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This list may
 help you to find out the Compute Capability of your card:
 
 http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
 
 Install the NVIDIA CUDA Toolkit (version 3.2 or higher) and the
 corresponding GPU drivers.  The NVIDIA CUDA SDK is not required, but we
 recommend installing it as well, so you can verify that its sample
 projects compile without problems.
 
 [Building LAMMPS with the USER-CUDA package:]
 
 This requires two steps (a,b): build the USER-CUDA library, then build
 LAMMPS with the USER-CUDA package.
 
 You can do both these steps in one line, using the src/Make.py script,
 described in "Section 2.4"_Section_start.html#start_4 of the manual.
 Type "Make.py -h" for help.  If run from the src directory, this
 command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the
 starting Makefile.machine:
 
-Make.py -p cuda -cuda mode=single arch=20 -o cuda lib-cuda file mpi :pre
+Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi :pre
 
 Or you can follow these two (a,b) steps:
 
 (a) Build the USER-CUDA library
 
 The USER-CUDA library is in lammps/lib/cuda.  If your {CUDA} toolkit
 is not installed in the default system directory {/usr/local/cuda}, edit
 the file {lib/cuda/Makefile.common} accordingly.
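 
 For example, if the toolkit were installed under /opt/cuda-7.0, the
 toolkit path variable near the top of Makefile.common would be changed
 to something like the line below (the exact variable name may differ in
 your version of the file):
 
 # hypothetical edit; check lib/cuda/Makefile.common for the real variable name
 CUDA_INSTALL_PATH = /opt/cuda-7.0 :pre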
 
 To build the library with the settings in lib/cuda/Makefile.default,
 simply type:
 
 make :pre
 
 To set options when the library is built, type "make OPTIONS", where
 {OPTIONS} are one or more of the following. The settings will be
 written to the {lib/cuda/Makefile.defaults} before the build.
 
 {precision=N} to set the precision level
   N = 1 for single precision (default)
   N = 2 for double precision
   N = 3 for positions in double precision
   N = 4 for positions and velocities in double precision
 {arch=M} to set GPU compute capability
   M = 35 for Kepler GPUs
   M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
   M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
   M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
 {prec_timer=0/1} to use high-precision timers
   0 = do not use them (default)
   1 = use them
   this is usually only useful for Mac machines 
 {dbg=0/1} to activate debug mode
   0 = no debug mode (default)
   1 = yes debug mode
   this is only useful for developers
 {cufft=0/1} for use of the CUDA FFT library
   0 = no CUFFT support (default)
   in the future other CUDA-enabled FFT libraries might be supported :pre
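 
 For example, a hypothetical build of the library in double precision
 for a Kepler card would combine the options above as:
 
 make precision=2 arch=35 :pre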
 
 If the build is successful, it will produce the files liblammpscuda.a and
 Makefile.lammps.
 
 Note that if you change any of the options (like precision), you need
 to re-build the entire library.  Do a "make clean" first, followed by
 "make".
 
 (b) Build LAMMPS with the USER-CUDA package
 
 cd lammps/src
 make yes-user-cuda
 make machine :pre
 
 No additional compile/link flags are needed in Makefile.machine.
 
 Note that if you change the USER-CUDA library precision (discussed
 above) and rebuild the USER-CUDA library, then you also need to
 re-install the USER-CUDA package and re-build LAMMPS, so that all
 affected files are re-compiled and linked to the new USER-CUDA
 library.
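 
 A minimal sketch of that re-install/re-build sequence, assuming the
 standard LAMMPS make targets:
 
 cd lammps/src
 make clean-all
 make yes-user-cuda
 make machine :pre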
 
 [Run with the USER-CUDA package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
 
 When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
 
 You must use the "-c on" "command-line
 switch"_Section_start.html#start_7 to enable the USER-CUDA package.
 The "-c on" switch also issues a default "package cuda 1"_package.html
 command which sets various USER-CUDA options to default values, as
 discussed on the "package"_package.html command doc page.
 
 Use the "-sf cuda" "command-line switch"_Section_start.html#start_7,
 which will automatically append "cuda" to styles that support it.  Use
 the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to
 set Ng = # of GPUs per node to a different value than the default set
 by the "-c on" switch (1 GPU) or change other "package
 cuda"_package.html options.
 
 lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
 mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes :pre
 
 The syntax for the "-pk" switch is the same as same as the "package
 cuda" command.  See the "package"_package.html command doc page for
 details, including the default values used for all its options if it
 is not specified.
 
 Note that the default for the "package cuda"_package.html command is
 to set the Newton flag to "off" for both pairwise and bonded
 interactions.  This typically gives the fastest performance.  If the
 "newton"_newton.html command is used in the input script, it can
 override these defaults.
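 
 For instance, adding the single line below to the input script turns
 the Newton flag back on for both pairwise and bonded interactions,
 overriding the package cuda default (see the "newton"_newton.html doc
 page for the exact syntax):
 
 newton on :pre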
 
 [Or run with the USER-CUDA package by editing an input script:]
 
 The discussion above about the mpirun/mpiexec command and the requirement
 of one MPI task per GPU applies here as well.
 
 You must still use the "-c on" "command-line
 switch"_Section_start.html#start_7 to enable the USER-CUDA package.
 
 Use the "suffix cuda"_suffix.html command, or you can explicitly add a
 "cuda" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/cuda 2.5 :pre
 
 You only need to use the "package cuda"_package.html command if you
 wish to change any of its option defaults, including the number of
 GPUs/node (default = 1), as set by the "-c on" "command-line
 switch"_Section_start.html#start_7.
 
 [Speed-ups to expect:]
 
 The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the USER-CUDA package on different
 hardware.
 
 [Guidelines for best performance:]
 
 The USER-CUDA package offers more speed-up relative to CPU performance
 when the number of atoms per GPU is large, e.g. on the order of tens
 or hundreds of thousands. :ulb,l
 
 As noted above, this package will continue to run a simulation
 entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
 occurs, the faster your simulation will run. :l,ule
 
 [Restrictions:]
 
 None.
diff --git a/doc/accelerate_gpu.html b/doc/accelerate_gpu.html
index 95edb9fd7..eddf4e064 100644
--- a/doc/accelerate_gpu.html
+++ b/doc/accelerate_gpu.html
@@ -1,401 +1,401 @@
 
 
 <!DOCTYPE html>
 <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
 <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
 <head>
   <meta charset="utf-8">
   
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
   <title>5.GPU package &mdash; LAMMPS 15 May 2015 version documentation</title>
   
 
   
   
 
   
 
   
   
     
 
   
 
   
   
     <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   
 
   
     <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
   
 
   
     <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> 
 
   
   <script src="_static/js/modernizr.min.js"></script>
 
 </head>
 
 <body class="wy-body-for-nav" role="document">
 
   <div class="wy-grid-for-nav">
 
     
     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
       <div class="wy-side-nav-search">
         
 
         
           <a href="Manual.html" class="icon icon-home"> LAMMPS
         
 
         
         </a>
 
         
 <div role="search">
   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
     <input type="text" name="q" placeholder="Search docs" />
     <input type="hidden" name="check_keywords" value="yes" />
     <input type="hidden" name="area" value="default" />
   </form>
 </div>
 
         
       </div>
 
       <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
         
           
           
               <ul>
 <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
 </ul>
 
           
         
       </div>
       &nbsp;
     </nav>
 
     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
 
       
       <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
         <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
         <a href="Manual.html">LAMMPS</a>
       </nav>
 
 
       
       <div class="wy-nav-content">
         <div class="rst-content">
           <div role="navigation" aria-label="breadcrumbs navigation">
   <ul class="wy-breadcrumbs">
     <li><a href="Manual.html">Docs</a> &raquo;</li>
       
     <li>5.GPU package</li>
       <li class="wy-breadcrumbs-aside">
         
           
             <a href="http://lammps.sandia.gov">Website</a>
             <a href="Section_commands.html#comm">Commands</a>
         
       </li>
   </ul>
   <hr/>
   
 </div>
           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
            <div itemprop="articleBody">
             
   <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
 <div class="section" id="gpu-package">
 <h1>5.GPU package<a class="headerlink" href="#gpu-package" title="Permalink to this headline">¶</a></h1>
 <p>The GPU package was developed by Mike Brown at ORNL and his
 collaborators, particularly Trung Nguyen (ORNL).  It provides GPU
 versions of many pair styles, including the 3-body Stillinger-Weber
 pair style, and for <a class="reference internal" href="kspace_style.html"><em>kspace_style pppm</em></a> for
 long-range Coulombics.  It has the following general features:</p>
 <ul class="simple">
 <li>It is designed to exploit common GPU hardware configurations where one
 or more GPUs are coupled to many cores of one or more multi-core CPUs,
 e.g. within a node of a parallel machine.</li>
 <li>Atom-based data (e.g. coordinates, forces) moves back-and-forth
 between the CPU(s) and GPU every timestep.</li>
 <li>Neighbor lists can be built on the CPU or on the GPU</li>
 <li>The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU.</li>
 <li>Asynchronous force computations can be performed simultaneously on the
 CPU(s) and GPU.</li>
 <li>It allows for GPU computations to be performed in single or double
 precision, or in mixed-mode precision, where pairwise forces are
 computed in single precision, but accumulated into double-precision
 force vectors.</li>
 <li>LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
 NVIDIA support as well as more general OpenCL support, so that the
 same functionality can eventually be supported on a variety of GPU
 hardware.</li>
 </ul>
 <p>Here is a quick overview of how to use the GPU package:</p>
 <ul class="simple">
 <li>build the library in lib/gpu for your GPU hardware with desired precision</li>
 <li>include the GPU package and build LAMMPS</li>
 <li>use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU</li>
 <li>specify the # of GPUs per node</li>
 <li>use GPU styles in your input script</li>
 </ul>
 <p>The latter two steps can be done using the &#8220;-pk gpu&#8221; and &#8220;-sf gpu&#8221;
 <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively.  Or
 the effect of the &#8220;-pk&#8221; or &#8220;-sf&#8221; switches can be duplicated by adding
 the <a class="reference internal" href="package.html"><em>package gpu</em></a> or <a class="reference internal" href="suffix.html"><em>suffix gpu</em></a> commands
 respectively to your input script.</p>
 <p><strong>Required hardware/software:</strong></p>
 <p>To use this package, you currently need to have an NVIDIA GPU and
 install the NVIDIA Cuda software on your system:</p>
 <ul class="simple">
 <li>Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information</li>
 <li>Go to <a class="reference external" href="http://www.nvidia.com/object/cuda_get.html">http://www.nvidia.com/object/cuda_get.html</a></li>
 <li>Install a driver and toolkit appropriate for your system (SDK is not necessary)</li>
 <li>Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties</li>
 </ul>
 <p><strong>Building LAMMPS with the GPU package:</strong></p>
 <p>This requires two steps (a,b): build the GPU library, then build
 LAMMPS with the GPU package.</p>
 <p>You can do both these steps in one line, using the src/Make.py script,
 described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual.
 Type &#8220;Make.py -h&#8221; for help.  If run from the src directory, this
 command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the
 starting Makefile.machine:</p>
-<div class="highlight-python"><div class="highlight"><pre>Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi
+<div class="highlight-python"><div class="highlight"><pre>Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi
 </pre></div>
 </div>
 <p>Or you can follow these two (a,b) steps:</p>
 <ol class="loweralpha simple">
 <li>Build the GPU library</li>
 </ol>
 <p>The GPU library is in lammps/lib/gpu.  Select a Makefile.machine (in
 lib/gpu) appropriate for your system.  You should pay special
 attention to 3 settings in this makefile.</p>
 <ul class="simple">
 <li>CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system</li>
 <li>CUDA_ARCH = needs to be appropriate to your GPUs</li>
 <li>CUDA_PREC = precision (double, mixed, single) you desire</li>
 </ul>
 <p>See lib/gpu/Makefile.linux.double for examples of the ARCH settings
 for different GPU choices, e.g. Fermi vs Kepler.  It also lists the
 possible precision settings:</p>
 <div class="highlight-python"><div class="highlight"><pre><span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_SINGLE</span>  <span class="c"># single precision for all calculations</span>
 <span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_DOUBLE_DOUBLE</span>  <span class="c"># double precision for all calculations</span>
 <span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_DOUBLE</span>  <span class="c"># accumulation of forces, etc, in double</span>
 </pre></div>
 </div>
 <p>The last setting is the mixed mode referred to above.  Note that your
 GPU must support double precision to use either the 2nd or 3rd of
 these settings.</p>
 <p>To build the library, type:</p>
 <div class="highlight-python"><div class="highlight"><pre>make -f Makefile.machine
 </pre></div>
 </div>
 <p>If successful, it will produce the files libgpu.a and Makefile.lammps.</p>
 <p>The latter file has 3 settings that need to be appropriate for the
 paths and settings for the CUDA system software on your machine.
 Makefile.lammps is a copy of the file specified by the EXTRAMAKE
 setting in Makefile.machine.  You can change EXTRAMAKE or create your
 own Makefile.lammps.machine if needed.</p>
 <p>Note that to change the precision of the GPU library, you need to
 re-build the entire library.  Do a &#8220;clean&#8221; first, e.g. &#8220;make -f
 Makefile.linux clean&#8221;, followed by the make command above.</p>
 <ol class="loweralpha simple" start="2">
 <li>Build LAMMPS with the GPU package</li>
 </ol>
 <div class="highlight-python"><div class="highlight"><pre>cd lammps/src
 make yes-gpu
 make machine
 </pre></div>
 </div>
 <p>No additional compile/link flags are needed in Makefile.machine.</p>
 <p>Note that if you change the GPU library precision (discussed above)
 and rebuild the GPU library, then you also need to re-install the GPU
 package and re-build LAMMPS, so that all affected files are
 re-compiled and linked to the new GPU library.</p>
 <p><strong>Run with the GPU package from the command line:</strong></p>
 <p>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.</p>
 <p>When using the GPU package, you cannot assign more than one GPU to a
 single MPI task.  However multiple MPI tasks can share the same GPU,
 and in many cases it will be more efficient to run this way.  Likewise
 it may be more efficient to use fewer MPI tasks/node than the available
 # of CPU cores.  Assignment of multiple MPI tasks to a GPU will happen
 automatically if you create more MPI tasks/node than there are
 GPUs/node.  E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
 shared by 4 MPI tasks.</p>
 <p>Use the &#8220;-sf gpu&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
 which will automatically append &#8220;gpu&#8221; to styles that support it.  Use
 the &#8220;-pk gpu Ng&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to
 set Ng = # of GPUs/node to use.</p>
 <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf gpu -pk gpu 1 -in in.script                         # 1 MPI task uses 1 GPU
 mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script           # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script   # ditto on 4 16-core nodes
 </pre></div>
 </div>
 <p>Note that if the &#8220;-sf gpu&#8221; switch is used, it also issues a default
 <a class="reference internal" href="package.html"><em>package gpu 1</em></a> command, which sets the number of
 GPUs/node to 1.</p>
 <p>Using the &#8220;-pk&#8221; switch explicitly allows for setting of the number of
 GPUs/node to use and additional options.  Its syntax is the same as
 the &#8220;package gpu&#8221; command.  See the <a class="reference internal" href="package.html"><em>package</em></a>
 command doc page for details, including the default values used for
 all its options if it is not specified.</p>
 <p>Note that the default for the <a class="reference internal" href="package.html"><em>package gpu</em></a> command is to
 set the Newton flag to &#8220;off&#8221; for pairwise interactions.  It does not
 affect the setting for bonded interactions (LAMMPS default is &#8220;on&#8221;).
 The &#8220;off&#8221; setting for pairwise interactions is currently required for
 GPU package pair styles.</p>
 <p><strong>Or run with the GPU package by editing an input script:</strong></p>
 <p>The discussion above about the mpirun/mpiexec command, MPI tasks/node,
 and use of multiple MPI tasks/GPU applies here as well.</p>
 <p>Use the <a class="reference internal" href="suffix.html"><em>suffix gpu</em></a> command, or you can explicitly add an
 &#8220;gpu&#8221; suffix to individual styles in your input script, e.g.</p>
 <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/gpu 2.5
 </pre></div>
 </div>
 <p>You must also use the <a class="reference internal" href="package.html"><em>package gpu</em></a> command to enable the
 GPU package, unless the &#8220;-sf gpu&#8221; or &#8220;-pk gpu&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used.  It specifies the
 number of GPUs/node to use, as well as other options.</p>
 <p><strong>Speed-ups to expect:</strong></p>
 <p>The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).</p>
 <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
 LAMMPS web site for performance of the GPU package on various
 hardware, including the Titan HPC platform at ORNL.</p>
 <p>You should also experiment with how many MPI tasks per GPU to use to
 give the best performance for your problem and machine.  This is also
 a function of the problem size and the pair style being used.
 Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate
 results, since they will typically be faster.</p>
 <p><strong>Guidelines for best performance:</strong></p>
 <ul class="simple">
 <li>Using multiple MPI tasks per GPU will often give the best performance,
 as allowed by most multi-core CPU/GPU configurations.</li>
 <li>If the number of particles per MPI task is small (e.g. 100s of
 particles), it can be more efficient to run with fewer MPI tasks per
 GPU, even if you do not use all the cores on the compute node.</li>
 <li>The <a class="reference internal" href="package.html"><em>package gpu</em></a> command has several options for tuning
 performance.  Neighbor lists can be built on the GPU or CPU.  Force
 calculations can be dynamically balanced across the CPU cores and
 GPUs.  GPU-specific settings can be made which can be optimized
 for different hardware.  See the <a class="reference internal" href="package.html"><em>package</em></a> command
 doc page for details.</li>
 <li>As described by the <a class="reference internal" href="package.html"><em>package gpu</em></a> command, GPU
 accelerated pair styles can perform computations asynchronously with
 CPU computations. The &#8220;Pair&#8221; time reported by LAMMPS will be the
 maximum of the time required to complete the CPU pair style
 computations and the time required to complete the GPU pair style
 computations. Any time spent for GPU-enabled pair styles for
 computations that run simultaneously with <a class="reference internal" href="bond_style.html"><em>bond</em></a>,
 <a class="reference internal" href="angle_style.html"><em>angle</em></a>, <a class="reference internal" href="dihedral_style.html"><em>dihedral</em></a>,
 <a class="reference internal" href="improper_style.html"><em>improper</em></a>, and <a class="reference internal" href="kspace_style.html"><em>long-range</em></a>
 calculations will not be included in the &#8220;Pair&#8221; time.</li>
 <li>When the <em>mode</em> setting for the package gpu command is force/neigh,
 the time for neighbor list calculations on the GPU will be added into
 the &#8220;Pair&#8221; time, not the &#8220;Neigh&#8221; time.  An additional breakdown of the
 times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc.) is output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
 regardless of asynchronous CPU calculations.</li>
 <li>The output section &#8220;GPU Time Info (average)&#8221; reports &#8220;Max Mem / Proc&#8221;.
 This is the maximum memory used at one time on the GPU for data
 storage by a single MPI process.</li>
 </ul>
 <div class="section" id="restrictions">
 <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>
 <p>None.</p>
 </div>
 </div>
 
 
            </div>
           </div>
           <footer>
   
 
   <hr/>
 
   <div role="contentinfo">
     <p>
         &copy; Copyright .
     </p>
   </div>
   Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
 
 </footer>
 
         </div>
       </div>
 
     </section>
 
   </div>
   
 
 
   
 
     <script type="text/javascript">
         var DOCUMENTATION_OPTIONS = {
             URL_ROOT:'./',
             VERSION:'15 May 2015 version',
             COLLAPSE_INDEX:false,
             FILE_SUFFIX:'.html',
             HAS_SOURCE:  true
         };
     </script>
       <script type="text/javascript" src="_static/jquery.js"></script>
       <script type="text/javascript" src="_static/underscore.js"></script>
       <script type="text/javascript" src="_static/doctools.js"></script>
       <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
 
   
 
   
   
     <script type="text/javascript" src="_static/js/theme.js"></script>
   
 
   
   
   <script type="text/javascript">
       jQuery(function () {
           SphinxRtdTheme.StickyNav.enable();
       });
   </script>
    
 
 </body>
 </html>
\ No newline at end of file
diff --git a/doc/accelerate_gpu.txt b/doc/accelerate_gpu.txt
index bfe9ae7e2..b06e409cd 100644
--- a/doc/accelerate_gpu.txt
+++ b/doc/accelerate_gpu.txt
@@ -1,252 +1,252 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.2 GPU package :h4
 
 The GPU package was developed by Mike Brown at ORNL and his
 collaborators, particularly Trung Nguyen (ORNL).  It provides GPU
 versions of many pair styles, including the 3-body Stillinger-Weber
 pair style, and for "kspace_style pppm"_kspace_style.html for
 long-range Coulombics.  It has the following general features:
 
 It is designed to exploit common GPU hardware configurations where one
 or more GPUs are coupled to many cores of one or more multi-core CPUs,
 e.g. within a node of a parallel machine. :ulb,l
 
 Atom-based data (e.g. coordinates, forces) moves back-and-forth
 between the CPU(s) and GPU every timestep. :l
 
 Neighbor lists can be built on the CPU or on the GPU :l
 
 The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU. :l
 
 Asynchronous force computations can be performed simultaneously on the
 CPU(s) and GPU. :l
 
 It allows for GPU computations to be performed in single or double
 precision, or in mixed-mode precision, where pairwise forces are
 computed in single precision, but accumulated into double-precision
 force vectors. :l
 
 LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
 NVIDIA support as well as more general OpenCL support, so that the
 same functionality can eventually be supported on a variety of GPU
 hardware. :l,ule
 
 Here is a quick overview of how to use the GPU package:
 
 build the library in lib/gpu for your GPU hardware with desired precision
 include the GPU package and build LAMMPS
 use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
 specify the # of GPUs per node
 use GPU styles in your input script :ul
 
 The latter two steps can be done using the "-pk gpu" and "-sf gpu"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package gpu"_package.html or "suffix gpu"_suffix.html commands
 respectively to your input script.
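 
 For example, putting these two lines in the input script has the same
 effect as the "-pk gpu 1" and "-sf gpu" switches:
 
 package gpu 1
 suffix gpu :pre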
 
 [Required hardware/software:]
 
 To use this package, you currently need to have an NVIDIA GPU and
 install the NVIDIA Cuda software on your system:
 
 Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
 Go to http://www.nvidia.com/object/cuda_get.html
 Install a driver and toolkit appropriate for your system (SDK is not necessary)
 Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul
 
 [Building LAMMPS with the GPU package:]
 
 This requires two steps (a,b): build the GPU library, then build
 LAMMPS with the GPU package.
 
 You can do both these steps in one line, using the src/Make.py script,
 described in "Section 2.4"_Section_start.html#start_4 of the manual.
 Type "Make.py -h" for help.  If run from the src directory, this
 command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the
 starting Makefile.machine:
 
-Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi :pre
+Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi :pre
 
 Or you can follow these two (a,b) steps:
 
 (a) Build the GPU library
 
 The GPU library is in lammps/lib/gpu.  Select a Makefile.machine (in
 lib/gpu) appropriate for your system.  You should pay special
 attention to 3 settings in this makefile.
 
 CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
 CUDA_ARCH = needs to be appropriate to your GPUs
 CUDA_PREC = precision (double, mixed, single) you desire :ul
 
 See lib/gpu/Makefile.linux.double for examples of the ARCH settings
 for different GPU choices, e.g. Fermi vs Kepler.  It also lists the
 possible precision settings:
 
 CUDA_PREC = -D_SINGLE_SINGLE  # single precision for all calculations
 CUDA_PREC = -D_DOUBLE_DOUBLE  # double precision for all calculations
 CUDA_PREC = -D_SINGLE_DOUBLE  # accumulation of forces, etc, in double :pre
 
 The last setting is the mixed mode referred to above.  Note that your
 GPU must support double precision to use either the 2nd or 3rd of
 these settings.
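 
 Putting these together, a hypothetical Makefile.machine edited for a
 Kepler card and mixed precision might contain lines like these (the
 exact CUDA_ARCH flag depends on your GPU; check the comments in the
 Makefile you start from):
 
 CUDA_HOME = /usr/local/cuda
 CUDA_ARCH = -arch=sm_35
 CUDA_PREC = -D_SINGLE_DOUBLE :pre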
 
 To build the library, type:
 
 make -f Makefile.machine :pre
 
 If successful, it will produce the files libgpu.a and Makefile.lammps.
 
 The latter file has 3 settings that need to be appropriate for the
 paths and settings for the CUDA system software on your machine.
 Makefile.lammps is a copy of the file specified by the EXTRAMAKE
 setting in Makefile.machine.  You can change EXTRAMAKE or create your
 own Makefile.lammps.machine if needed.
 
 Note that to change the precision of the GPU library, you need to
 re-build the entire library.  Do a "clean" first, e.g. "make -f
 Makefile.linux clean", followed by the make command above.
 
 (b) Build LAMMPS with the GPU package
 
 cd lammps/src
 make yes-gpu
 make machine :pre
 
 No additional compile/link flags are needed in Makefile.machine.
 
 Note that if you change the GPU library precision (discussed above)
 and rebuild the GPU library, then you also need to re-install the GPU
 package and re-build LAMMPS, so that all affected files are
 re-compiled and linked to the new GPU library.
 
 [Run with the GPU package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
 
 When using the GPU package, you cannot assign more than one GPU to a
 single MPI task.  However multiple MPI tasks can share the same GPU,
 and in many cases it will be more efficient to run this way.  Likewise
 it may be more efficient to use fewer MPI tasks/node than the available
 # of CPU cores.  Assignment of multiple MPI tasks to a GPU will happen
 automatically if you create more MPI tasks/node than there are
 GPUs/node.  E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
 shared by 4 MPI tasks.
 
 Use the "-sf gpu" "command-line switch"_Section_start.html#start_7,
 which will automatically append "gpu" to styles that support it.  Use
 the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to
 set Ng = # of GPUs/node to use.
 
 lmp_machine -sf gpu -pk gpu 1 -in in.script                         # 1 MPI task uses 1 GPU
 mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script           # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script   # ditto on 4 16-core nodes :pre
 
 Note that if the "-sf gpu" switch is used, it also issues a default
 "package gpu 1"_package.html command, which sets the number of
 GPUs/node to 1.
 
 Using the "-pk" switch explicitly allows for setting of the number of
 GPUs/node to use and additional options.  Its syntax is the same as
 same as the "package gpu" command.  See the "package"_package.html
 command doc page for details, including the default values used for
 all its options if it is not specified.
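 
 As an illustration, a command line such as the one below would use 2
 GPUs per node and pass an extra option through to the package gpu
 command (here a hypothetical CPU/GPU force split; see the
 "package"_package.html doc page for the keywords that are actually
 available):
 
 mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 split 0.75 -in in.script :pre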
 
 Note that the default for the "package gpu"_package.html command is to
 set the Newton flag to "off" pairwise interactions.  It does not
 affect the setting for bonded interactions (LAMMPS default is "on").
 The "off" setting for pairwise interaction is currently required for
 GPU package pair styles.
 
 [Or run with the GPU package by editing an input script:]
 
 The discussion above about the mpirun/mpiexec command, MPI tasks/node,
 and use of multiple MPI tasks/GPU applies here as well.
 
 Use the "suffix gpu"_suffix.html command, or you can explicitly add an
 "gpu" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/gpu 2.5 :pre
 
 You must also use the "package gpu"_package.html command to enable the
 GPU package, unless the "-sf gpu" or "-pk gpu" "command-line
 switches"_Section_start.html#start_7 were used.  It specifies the
 number of GPUs/node to use, as well as other options.
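 
 For example, this input script line would enable the GPU package and
 assign 2 GPUs per node:
 
 package gpu 2 :pre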
 
 [Speed-ups to expect:]
 
 The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the GPU package on various
 hardware, including the Titan HPC platform at ORNL.
 
 You should also experiment with how many MPI tasks per GPU to use to
 give the best performance for your problem and machine.  This is also
 a function of the problem size and the pair style being used.
 Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate
 results, since they will typically be faster.
 
 [Guidelines for best performance:]
 
 Using multiple MPI tasks per GPU will often give the best performance,
 as allowed by most multi-core CPU/GPU configurations. :ulb,l
 
 If the number of particles per MPI task is small (e.g. 100s of
 particles), it can be more efficient to run with fewer MPI tasks per
 GPU, even if you do not use all the cores on the compute node. :l
 
 The "package gpu"_package.html command has several options for tuning
 performance.  Neighbor lists can be built on the GPU or CPU.  Force
 calculations can be dynamically balanced across the CPU cores and
 GPUs.  GPU-specific settings can be made which can be optimized
 for different hardware.  See the "packakge"_package.html command
 doc page for details. :l
 
 As described by the "package gpu"_package.html command, GPU
 accelerated pair styles can perform computations asynchronously with
 CPU computations. The "Pair" time reported by LAMMPS will be the
 maximum of the time required to complete the CPU pair style
 computations and the time required to complete the GPU pair style
 computations. Any time spent for GPU-enabled pair styles for
 computations that run simultaneously with "bond"_bond_style.html,
 "angle"_angle_style.html, "dihedral"_dihedral_style.html,
 "improper"_improper_style.html, and "long-range"_kspace_style.html
 calculations will not be included in the "Pair" time. :l
 
 When the {mode} setting for the package gpu command is force/neigh,
 the time for neighbor list calculations on the GPU will be added into
 the "Pair" time, not the "Neigh" time.  An additional breakdown of the
 times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc) is output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
 regardless of asynchronous CPU calculations. :l
 
 The output section "GPU Time Info (average)" reports "Max Mem / Proc".
 This is the maximum memory used at one time on the GPU for data
 storage by a single MPI process. :l,ule
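
 As an example of the tuning options mentioned above (the values are
 illustrative, not recommendations; see the "package"_package.html doc
 page for the exact keyword meanings), GPU neighbor builds and a fixed
 CPU/GPU split that assigns 75% of the work to the GPU could be
 requested with:

 package gpu 2 neigh yes split 0.75 :pre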
 
 [Restrictions:]
 
 None.
diff --git a/doc/accelerate_intel.html b/doc/accelerate_intel.html
index 984302118..bc04eacac 100644
--- a/doc/accelerate_intel.html
+++ b/doc/accelerate_intel.html
@@ -1,484 +1,484 @@
 
 
 <!DOCTYPE html>
 <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
 <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
 <head>
   <meta charset="utf-8">
   
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
   <title>5.USER-INTEL package &mdash; LAMMPS 15 May 2015 version documentation</title>
   
 
   
   
 
   
 
   
   
     
 
   
 
   
   
     <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   
 
   
     <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
   
 
   
     <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> 
 
   
   <script src="_static/js/modernizr.min.js"></script>
 
 </head>
 
 <body class="wy-body-for-nav" role="document">
 
   <div class="wy-grid-for-nav">
 
     
     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
       <div class="wy-side-nav-search">
         
 
         
           <a href="Manual.html" class="icon icon-home"> LAMMPS
         
 
         
         </a>
 
         
 <div role="search">
   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
     <input type="text" name="q" placeholder="Search docs" />
     <input type="hidden" name="check_keywords" value="yes" />
     <input type="hidden" name="area" value="default" />
   </form>
 </div>
 
         
       </div>
 
       <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
         
           
           
               <ul>
 <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
 </ul>
 
           
         
       </div>
       &nbsp;
     </nav>
 
     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
 
       
       <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
         <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
         <a href="Manual.html">LAMMPS</a>
       </nav>
 
 
       
       <div class="wy-nav-content">
         <div class="rst-content">
           <div role="navigation" aria-label="breadcrumbs navigation">
   <ul class="wy-breadcrumbs">
     <li><a href="Manual.html">Docs</a> &raquo;</li>
       
     <li>5.USER-INTEL package</li>
       <li class="wy-breadcrumbs-aside">
         
           
             <a href="http://lammps.sandia.gov">Website</a>
             <a href="Section_commands.html#comm">Commands</a>
         
       </li>
   </ul>
   <hr/>
   
 </div>
           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
            <div itemprop="articleBody">
             
   <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
 <div class="section" id="user-intel-package">
 <h1>5.USER-INTEL package<a class="headerlink" href="#user-intel-package" title="Permalink to this headline">¶</a></h1>
 <p>The USER-INTEL package was developed by Mike Brown at Intel
 Corporation.  It provides a capability to accelerate simulations by
 offloading neighbor list and non-bonded force calculations to Intel(R)
 Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
 Additionally, it supports running simulations in single, mixed, or
 double precision with vectorization, even if a coprocessor is not
 present, i.e. on an Intel(R) CPU.  The same C++ code is used for both
 cases.  When offloading to a coprocessor, the routine is run twice,
 once with an offload flag.</p>
 <p>The USER-INTEL package can be used in tandem with the USER-OMP
 package.  This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
 electrostatics, can run simultaneously in threaded mode on the CPU
 cores.  Since fewer MPI tasks than CPU cores will typically be invoked
 when running with coprocessors, this enables the extra CPU cores to be
 used for useful computation.</p>
 <p>If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
 &#8220;-suffix intel&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> or
 the <a class="reference internal" href="suffix.html"><em>suffix intel</em></a> command will both set a second-choice
 suffix to &#8220;omp&#8221; so that styles from the USER-OMP package will be used
 if available, after first testing if a style from the USER-INTEL
 package is available.</p>
 <p>When using the USER-INTEL package, you must choose at build time
 whether you are building for CPU-only acceleration or for using the
 Xeon Phi in offload mode.</p>
 <p>Here is a quick overview of how to use the USER-INTEL package
 for CPU-only acceleration:</p>
 <ul class="simple">
 <li>specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost</li>
 <li>specify -openmp with LINKFLAGS in your Makefile.machine</li>
 <li>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS</li>
 <li>specify how many OpenMP threads per MPI task to use</li>
 <li>use USER-INTEL and (optionally) USER-OMP styles in your input script</li>
 </ul>
 <p>Note that many of these settings can only be used with the Intel
 compiler, as discussed below.</p>
 <p>Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
 steps:</p>
 <ul class="simple">
 <li>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine</li>
 <li>add the flag -offload to LINKFLAGS in your Makefile.machine</li>
 </ul>
 <p>The latter two steps in the first case and the last step in the
 coprocessor case can be done using the &#8220;-pk intel&#8221; and &#8220;-sf intel&#8221;
 <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively.  Or
 the effect of the &#8220;-pk&#8221; or &#8220;-sf&#8221; switches can be duplicated by adding
 the <a class="reference internal" href="package.html"><em>package intel</em></a> or <a class="reference internal" href="suffix.html"><em>suffix intel</em></a>
 commands respectively to your input script.</p>
 <p><strong>Required hardware/software:</strong></p>
 <p>To use the offload option, you must have one or more Intel(R) Xeon
 Phi(TM) coprocessors and use an Intel(R) C++ compiler.</p>
 <p>Optimizations for vectorization have only been tested with the
 Intel(R) compiler.  Use of other compilers may not result in
 vectorization and may give poor performance.</p>
 <p>Use of an Intel C++ compiler is recommended, but not required (though
 g++ will not recognize some of the settings, so they cannot be used).
 The compiler must support the OpenMP interface.</p>
 <p>The recommended version of the Intel(R) compiler is 14.0.1.106.
 Versions 15.0.1.133 and later are also supported. If using Intel(R)
 MPI, versions 15.0.2.044 and later are recommended.</p>
 <p><strong>Building LAMMPS with the USER-INTEL package:</strong></p>
 <p>You can choose to build with or without support for offload to an
 Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
 coprocessor, the same binary can be used on nodes with and without
 coprocessors installed. However, if you do not have coprocessors
 on your system, building without offload support will produce a
 smaller binary.</p>
 <p>You can do either in one line, using the src/Make.py script, described
 in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual.  Type
 &#8220;Make.py -h&#8221; for help.  If run from the src directory, these commands
 will create src/lmp_intel_cpu and lmp_intel_phi using
 src/MAKE/Makefile.mpi as the starting Makefile.machine:</p>
-<div class="highlight-python"><div class="highlight"><pre>Make.py -p intel omp -intel cpu -o intel_cpu -cc icc file mpi
-Make.py -p intel omp -intel phi -o intel_phi -cc icc file mpi
+<div class="highlight-python"><div class="highlight"><pre>Make.py -p intel omp -intel cpu -o intel_cpu -cc icc -a file mpi
+Make.py -p intel omp -intel phi -o intel_phi -cc icc -a file mpi
 </pre></div>
 </div>
 <p>Note that this assumes that your MPI and its mpicxx wrapper
 are using the Intel compiler.  If not, you should
 leave off the &#8220;-cc icc&#8221; switch.</p>
 <p>Or you can follow these steps:</p>
 <div class="highlight-python"><div class="highlight"><pre>cd lammps/src
 make yes-user-intel
 make yes-user-omp (if desired)
 make machine
 </pre></div>
 </div>
 <p>Note that if the USER-OMP package is also installed, you can use
 styles from both packages, as described below.</p>
 <p>The Makefile.machine needs a &#8220;-fopenmp&#8221; flag for OpenMP support in
 both the CCFLAGS and LINKFLAGS variables.  You also need to add
 -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.</p>
 <p>If you are compiling on the same architecture that will be used for
 the runs, adding the flag <em>-xHost</em> to CCFLAGS will enable
 vectorization with the Intel(R) compiler. Otherwise, you must
 provide the correct compute node architecture to the -x option
 (e.g. -xAVX).</p>
 <p>In order to build with support for an Intel(R) Xeon Phi(TM)
 coprocessor, the flag <em>-offload</em> should be added to the LINKFLAGS line
 and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.</p>
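 <p>As a rough sketch (the optimization flags shown are illustrative;
 compare against the example makefiles in src/MAKE/OPTIONS rather than
 copying these lines verbatim), the relevant Makefile.machine variables
 for an offload-capable build might look like:</p>
 <div class="highlight-python"><div class="highlight"><pre>CCFLAGS =   -g -O3 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
 LINKFLAGS = -g -O3 -fopenmp -offload
 </pre></div>
 </div>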
 <p>Example makefiles Makefile.intel_cpu and Makefile.intel_phi are
 included in the src/MAKE/OPTIONS directory with settings that perform
 well with the Intel(R) compiler. The latter file has support for
 offload to coprocessors; the former does not.</p>
 <p><strong>Notes on CPU and core affinity:</strong></p>
 <p>Setting core affinity is often used to pin MPI tasks and OpenMP
 threads to a core or group of cores so that memory access can be
 uniform. Unless disabled at build time, affinity for MPI tasks and
 OpenMP threads on the host will be set by default
 when using offload to a coprocessor. In this case, it is unnecessary
 to use other methods to control affinity (e.g. taskset, numactl,
 I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
 with the <em>no_affinity</em> option to the <a class="reference internal" href="package.html"><em>package intel</em></a>
 command or by disabling the option at build time (by adding
 -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
 Disabling this option is not recommended, especially when running
 on a machine with hyperthreading disabled.</p>
 <p><strong>Running with the USER-INTEL package from the command line:</strong></p>
 <p>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.</p>
 <p>If you plan to compute (any portion of) pairwise interactions using
 USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
 you need to choose how many OpenMP threads per MPI task to use.  Note
 that the product of MPI tasks * OpenMP threads/task should not exceed
 the physical number of cores (on a node), otherwise performance will
 suffer.</p>
 <p>If LAMMPS was built with coprocessor support for the USER-INTEL
 package, you also need to specify the number of coprocessors/node and
 the number of coprocessor threads per MPI task to use.  Note that
 coprocessor threads (which run on the coprocessor) are totally
 independent from OpenMP threads (which run on the CPU).  The default
 values for the settings that affect coprocessor threads are typically
 fine, as discussed below.</p>
 <p>Use the &#8220;-sf intel&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
 which will automatically append &#8220;intel&#8221; to styles that support it.  If
 a style does not support it, an &#8220;omp&#8221; suffix is tried next.  OpenMP
 threads per MPI task can be set via the &#8220;-pk intel Nphi omp Nt&#8221; or
 &#8220;-pk omp Nt&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a>, which
 set Nt = # of OpenMP threads per MPI task to use.  The &#8220;-pk omp&#8221; form
 is only allowed if LAMMPS was also built with the USER-OMP package.</p>
 <p>Use the &#8220;-pk intel Nphi&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to set Nphi = # of Xeon Phi(TM)
 coprocessors/node, if LAMMPS was built with coprocessor support.  All
 the available coprocessor threads on each Phi will be divided among
 MPI tasks, unless the <em>tptask</em> option of the &#8220;-pk intel&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> is used to limit the coprocessor
 threads per MPI task.  See the <a class="reference internal" href="package.html"><em>package intel</em></a> command
 for details.</p>
 <div class="highlight-python"><div class="highlight"><pre>CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script                 # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script   # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
 </pre></div>
 </div>
 <div class="highlight-python"><div class="highlight"><pre>CPU-only with USER-OMP (and Intel vectorization on CPU):
 lmp_machine -sf intel -pk intel 16 0 -in in.script             # 1 MPI task on a 16-core node
 mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script     # 4 MPI tasks each with 4 threads on a single 16-core node
 mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script    # ditto on 8 16-core nodes
 </pre></div>
 </div>
 <div class="highlight-python"><div class="highlight"><pre>CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
 lmp_machine -sf intel -pk intel 1 omp 16 -in in.script                       # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
 lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script             # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
 mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script           # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
 mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script   # ditto on 8 16-core nodes
 mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script           # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task
 </pre></div>
 </div>
 <p>Note that if the &#8220;-sf intel&#8221; switch is used, it also invokes two
 default commands: <a class="reference internal" href="package.html"><em>package intel 1</em></a>, followed by <a class="reference internal" href="package.html"><em>package omp 0</em></a>.  These both set the number of OpenMP threads per
 MPI task via the OMP_NUM_THREADS environment variable.  The first
 command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
 the precision mode to &#8220;mixed&#8221;, as one of its option defaults).  The
 latter command is not invoked if LAMMPS was not built with the
 USER-OMP package.  The Nphi = 1 value for the first command is ignored
 if LAMMPS was not built with coprocessor support.</p>
 <p>Using the &#8220;-pk intel&#8221; or &#8220;-pk omp&#8221; switches explicitly allows for
 direct setting of the number of OpenMP threads per MPI task, and
 additional options for either of the USER-INTEL or USER-OMP packages.
 In particular, the &#8220;-pk intel&#8221; switch sets the number of
 coprocessors/node and can limit the number of coprocessor threads per
 MPI task.  The syntax for these two switches is the same as the
 <a class="reference internal" href="package.html"><em>package omp</em></a> and <a class="reference internal" href="package.html"><em>package intel</em></a> commands.
 See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details, including
 the default values used for all its options if these switches are not
 specified, and how to set the number of OpenMP threads via the
 OMP_NUM_THREADS environment variable if desired.</p>
 <p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p>
 <p>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 OpenMP threads per MPI task, and coprocessor threads per MPI task is
 the same.</p>
 <p>Use the <a class="reference internal" href="suffix.html"><em>suffix intel</em></a> command, or you can explicitly add an
 &#8220;intel&#8221; suffix to individual styles in your input script, e.g.</p>
 <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/intel 2.5
 </pre></div>
 </div>
 <p>You must also use the <a class="reference internal" href="package.html"><em>package intel</em></a> command, unless the
 &#8220;-sf intel&#8221; or &#8220;-pk intel&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used.  It specifies how many
 coprocessors/node to use, as well as other OpenMP threading and
 coprocessor options.  Its doc page explains how to set the number of
 OpenMP threads via an environment variable if desired.</p>
 <p>If LAMMPS was also built with the USER-OMP package, you must also use
 the <a class="reference internal" href="package.html"><em>package omp</em></a> command to enable that package, unless
 the &#8220;-sf intel&#8221; or &#8220;-pk omp&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used.  It specifies how many
 OpenMP threads per MPI task to use, as well as other options.  Its doc
 page explains how to set the number of OpenMP threads via an
 environment variable if desired.</p>
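 <p>As a minimal sketch of this input-script route (the counts shown are
 placeholders for a node with one coprocessor), the top of a script
 might contain:</p>
 <div class="highlight-python"><div class="highlight"><pre>package intel 1 omp 4        # 1 coprocessor/node, 4 OpenMP threads per MPI task
 package omp 4                # also enable USER-OMP styles, if that package is installed
 suffix intel                 # append "intel" (with "omp" as the second choice) to supported styles
 pair_style lj/cut 2.5        # becomes lj/cut/intel via the suffix command
 </pre></div>
 </div>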
 <p><strong>Speed-ups to expect:</strong></p>
 <p>If LAMMPS was not built with coprocessor support when including the
 USER-INTEL package, then accelerated styles will run on the CPU using
 vectorization optimizations and the specified precision.  This may
 give a substantial speed-up for a pair style, particularly if mixed or
 single precision is used.</p>
 <p>If LAMMPS was built with coprocessor support, the pair styles will run
 on one or more Intel(R) Xeon Phi(TM) coprocessors (per node).  The
 performance of a Xeon Phi versus a multi-core CPU is a function of
 your hardware, which pair style is used, the number of
 atoms/coprocessor, and the precision used on the coprocessor (double,
 single, mixed).</p>
 <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the
 LAMMPS web site for performance of the USER-INTEL package on different
 hardware.</p>
 <p><strong>Guidelines for best performance on an Intel(R) Xeon Phi(TM)
 coprocessor:</strong></p>
 <ul class="simple">
 <li>The default for the <a class="reference internal" href="package.html"><em>package intel</em></a> command is to have
 all the MPI tasks on a given compute node use a single Xeon Phi(TM)
 coprocessor.  In general, running with a large number of MPI tasks on
 each node will perform best with offload.  Each MPI task will
 automatically get affinity to a subset of the hardware threads
 available on the coprocessor.  For example, if your card has 61 cores,
 with 60 cores available for offload and 4 hardware threads per core
 (240 total threads), running with 24 MPI tasks per node will cause
 each MPI task to use a subset of 10 threads on the coprocessor.  Fine
 tuning of the number of threads to use per MPI task or the number of
 threads to use per core can be accomplished with keyword settings of
 the <a class="reference internal" href="package.html"><em>package intel</em></a> command.</li>
 <li>If desired, only a fraction of the pair style computation can be
 offloaded to the coprocessors.  This is accomplished by using the
 <em>balance</em> keyword in the <a class="reference internal" href="package.html"><em>package intel</em></a> command.  A
 balance of 0 runs all calculations on the CPU.  A balance of 1 runs
 all calculations on the coprocessor.  A balance of 0.5 runs half of
 the calculations on the coprocessor.  Setting the balance to -1 (the
 default) will enable dynamic load balancing that continuously adjusts
 the fraction of offloaded work throughout the simulation.  This option
 typically produces results within 5 to 10 percent of the optimal fixed
 balance; an illustrative setting is sketched after this list.</li>
 <li>When using offload with CPU hyperthreading disabled, it may help
 performance to use fewer MPI tasks and OpenMP threads than available
 cores.  This is due to the fact that additional threads are generated
 internally to handle the asynchronous offload tasks.</li>
 <li>If running short benchmark runs with dynamic load balancing, adding a
 short warm-up run (10-20 steps) will allow the load-balancer to find a
 near-optimal setting that will carry over to additional runs.</li>
 <li>If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
 coprocessor, a diagnostic line is printed to the screen (not to the
 log file), during the setup phase of a run, indicating that offload
 mode is being used and indicating the number of coprocessor threads
 per MPI task.  Additionally, an offload timing summary is printed at
 the end of each run.  When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><em>atom sorting</em></a> is changed to 1 so that the per-atom data is
 effectively sorted at every rebuild of the neighbor lists.</li>
 <li>For simulations with long-range electrostatics or bond, angle,
 dihedral, improper calculations, computation and data transfer to the
 coprocessor will run concurrently with computations and MPI
 communications for these calculations on the host CPU.  The USER-INTEL
 package has two modes for deciding which atoms will be handled by the
 coprocessor.  This choice is controlled with the <em>ghost</em> keyword of
 the <a class="reference internal" href="package.html"><em>package intel</em></a> command.  When set to 0, ghost atoms
 (atoms at the borders between MPI tasks) are not offloaded to the
 card.  This allows for overlap of MPI communication of forces with
 computation on the coprocessor when the <a class="reference internal" href="newton.html"><em>newton</em></a> setting
 is &#8220;on&#8221;.  The default is dependent on the style being used; however,
 better performance may be achieved by setting this option
 explicitly.</li>
 </ul>
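 <p>For instance (an illustrative setting, not a recommendation), offloading
 a fixed half of the pair computation to one coprocessor while keeping
 ghost atoms on the host could be requested with:</p>
 <div class="highlight-python"><div class="highlight"><pre>package intel 1 balance 0.5 ghost 0
 </pre></div>
 </div>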
 <div class="section" id="restrictions">
 <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>
 <p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><em>hybrid</em></a> styles
 that require skip lists for neighbor builds cannot be offloaded.
 Using <a class="reference internal" href="pair_hybrid.html"><em>hybrid/overlay</em></a> is allowed.  Only one intel
 accelerated style may be used with hybrid styles.
 <a class="reference internal" href="special_bonds.html"><em>Special_bonds</em></a> exclusion lists are not currently
 supported with offload; however, the same effect can often be
 accomplished by setting cutoffs for excluded atom types to 0.  None of
 the pair styles in the USER-INTEL package currently support the
 &#8220;inner&#8221;, &#8220;middle&#8221;, &#8220;outer&#8221; options for rRESPA integration via the
 <a class="reference internal" href="run_style.html"><em>run_style respa</em></a> command; only the &#8220;pair&#8221; option is
 supported.</p>
 </div>
 </div>
 
 
            </div>
           </div>
           <footer>
   
 
   <hr/>
 
   <div role="contentinfo">
     <p>
         &copy; Copyright .
     </p>
   </div>
   Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
 
 </footer>
 
         </div>
       </div>
 
     </section>
 
   </div>
   
 
 
   
 
     <script type="text/javascript">
         var DOCUMENTATION_OPTIONS = {
             URL_ROOT:'./',
             VERSION:'15 May 2015 version',
             COLLAPSE_INDEX:false,
             FILE_SUFFIX:'.html',
             HAS_SOURCE:  true
         };
     </script>
       <script type="text/javascript" src="_static/jquery.js"></script>
       <script type="text/javascript" src="_static/underscore.js"></script>
       <script type="text/javascript" src="_static/doctools.js"></script>
       <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
 
   
 
   
   
     <script type="text/javascript" src="_static/js/theme.js"></script>
   
 
   
   
   <script type="text/javascript">
       jQuery(function () {
           SphinxRtdTheme.StickyNav.enable();
       });
   </script>
    
 
 </body>
 </html>
\ No newline at end of file
diff --git a/doc/accelerate_intel.txt b/doc/accelerate_intel.txt
index c0cbafa44..879413893 100644
--- a/doc/accelerate_intel.txt
+++ b/doc/accelerate_intel.txt
@@ -1,347 +1,347 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.3 USER-INTEL package :h4
 
 The USER-INTEL package was developed by Mike Brown at Intel
 Corporation.  It provides a capability to accelerate simulations by
 offloading neighbor list and non-bonded force calculations to Intel(R)
 Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package).
 Additionally, it supports running simulations in single, mixed, or
 double precision with vectorization, even if a coprocessor is not
 present, i.e. on an Intel(R) CPU.  The same C++ code is used for both
 cases.  When offloading to a coprocessor, the routine is run twice,
 once with an offload flag.
 
 The USER-INTEL package can be used in tandem with the USER-OMP
 package.  This is useful when offloading pair style computations to
 coprocessors, so that other styles not supported by the USER-INTEL
 package, e.g. bond, angle, dihedral, improper, and long-range
 electrostatics, can run simultaneously in threaded mode on the CPU
 cores.  Since fewer MPI tasks than CPU cores will typically be invoked
 when running with coprocessors, this enables the extra CPU cores to be
 used for useful computation.
 
 If LAMMPS is built with both the USER-INTEL and USER-OMP packages
 installed, this mode of operation is made easier to use, because the
 "-suffix intel" "command-line switch"_Section_start.html#start_7 or
 the "suffix intel"_suffix.html command will both set a second-choice
 suffix to "omp" so that styles from the USER-OMP package will be used
 if available, after first testing if a style from the USER-INTEL
 package is available.
 
 When using the USER-INTEL package, you must choose at build time
 whether you are building for CPU-only acceleration or for using the
 Xeon Phi in offload mode.
 
 Here is a quick overview of how to use the USER-INTEL package
 for CPU-only acceleration:
 
 specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost
 specify -openmp with LINKFLAGS in your Makefile.machine
 include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS
 specify how many OpenMP threads per MPI task to use
 use USER-INTEL and (optionally) USER-OMP styles in your input script :ul
 
 Note that many of these settings can only be used with the Intel
 compiler, as discussed below.
 
 Using the USER-INTEL package to offload work to the Intel(R)
 Xeon Phi(TM) coprocessor is the same except for these additional
 steps:
 
 add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
 add the flag -offload to LINKFLAGS in your Makefile.machine :ul
 
 The latter two steps in the first case and the last step in the
 coprocessor case can be done using the "-pk intel" and "-sf intel"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package intel"_package.html or "suffix intel"_suffix.html
 commands respectively to your input script.
 
 [Required hardware/software:]
 
 To use the offload option, you must have one or more Intel(R) Xeon
 Phi(TM) coprocessors and use an Intel(R) C++ compiler.
 
 Optimizations for vectorization have only been tested with the
 Intel(R) compiler.  Use of other compilers may not result in
 vectorization and may give poor performance.
 
 Use of an Intel C++ compiler is recommended, but not required (though
 g++ will not recognize some of the settings, so they cannot be used).
 The compiler must support the OpenMP interface.
 
 The recommended version of the Intel(R) compiler is 14.0.1.106. 
 Versions 15.0.1.133 and later are also supported. If using Intel(R) 
 MPI, versions 15.0.2.044 and later are recommended.
 
 [Building LAMMPS with the USER-INTEL package:]
 
 You can choose to build with or without support for offload to an
 Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
 coprocessor, the same binary can be used on nodes with and without
 coprocessors installed. However, if you do not have coprocessors
 on your system, building without offload support will produce a
 smaller binary.
 
 You can do either in one line, using the src/Make.py script, described
 in "Section 2.4"_Section_start.html#start_4 of the manual.  Type
 "Make.py -h" for help.  If run from the src directory, these commands
 will create src/lmp_intel_cpu and lmp_intel_phi using
 src/MAKE/Makefile.mpi as the starting Makefile.machine:
 
-Make.py -p intel omp -intel cpu -o intel_cpu -cc icc file mpi 
-Make.py -p intel omp -intel phi -o intel_phi -cc icc file mpi :pre
+Make.py -p intel omp -intel cpu -o intel_cpu -cc icc -a file mpi 
+Make.py -p intel omp -intel phi -o intel_phi -cc icc -a file mpi :pre
 
 Note that this assumes that your MPI and its mpicxx wrapper
 are using the Intel compiler.  If not, you should
 leave off the "-cc icc" switch.
 
 Or you can follow these steps:
 
 cd lammps/src
 make yes-user-intel
 make yes-user-omp (if desired)
 make machine :pre
 
 Note that if the USER-OMP package is also installed, you can use
 styles from both packages, as described below.
 
 The Makefile.machine needs a "-fopenmp" flag for OpenMP support in
 both the CCFLAGS and LINKFLAGS variables.  You also need to add
 -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.
 
 If you are compiling on the same architecture that will be used for
 the runs, adding the flag {-xHost} to CCFLAGS will enable
 vectorization with the Intel(R) compiler. Otherwise, you must
 provide the correct compute node architecture to the -x option
 (e.g. -xAVX).
 
 In order to build with support for an Intel(R) Xeon Phi(TM)
 coprocessor, the flag {-offload} should be added to the LINKFLAGS line
 and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
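
 As a rough sketch (the optimization flags shown are illustrative;
 compare against the example makefiles in src/MAKE/OPTIONS rather than
 copying these lines verbatim), the relevant Makefile.machine variables
 for an offload-capable build might look like:

 CCFLAGS =   -g -O3 -fopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
 LINKFLAGS = -g -O3 -fopenmp -offload :pre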
 
 Example makefiles Makefile.intel_cpu and Makefile.intel_phi are
 included in the src/MAKE/OPTIONS directory with settings that perform
 well with the Intel(R) compiler. The latter file has support for
 offload to coprocessors; the former does not.
 
 [Notes on CPU and core affinity:]
 
 Setting core affinity is often used to pin MPI tasks and OpenMP
 threads to a core or group of cores so that memory access can be
 uniform. Unless disabled at build time, affinity for MPI tasks and
 OpenMP threads on the host will be set by default
 when using offload to a coprocessor. In this case, it is unnecessary
 to use other methods to control affinity (e.g. taskset, numactl,
 I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
 with the {no_affinity} option to the "package intel"_package.html 
 command or by disabling the option at build time (by adding
 -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
 Disabling this option is not recommended, especially when running
 on a machine with hyperthreading disabled.
 
 [Running with the USER-INTEL package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
 
 If you plan to compute (any portion of) pairwise interactions using
 USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU,
 you need to choose how many OpenMP threads per MPI task to use.  Note
 that the product of MPI tasks * OpenMP threads/task should not exceed
 the physical number of cores (on a node), otherwise performance will
 suffer.
 
 If LAMMPS was built with coprocessor support for the USER-INTEL
 package, you also need to specify the number of coprocessors/node and
 the number of coprocessor threads per MPI task to use.  Note that
 coprocessor threads (which run on the coprocessor) are totally
 independent from OpenMP threads (which run on the CPU).  The default
 values for the settings that affect coprocessor threads are typically
 fine, as discussed below.
 
 Use the "-sf intel" "command-line switch"_Section_start.html#start_7,
 which will automatically append "intel" to styles that support it.  If
 a style does not support it, an "omp" suffix is tried next.  OpenMP
 threads per MPI task can be set via the "-pk intel Nphi omp Nt" or
 "-pk omp Nt" "command-line switches"_Section_start.html#start_7, which
 set Nt = # of OpenMP threads per MPI task to use.  The "-pk omp" form
 is only allowed if LAMMPS was also built with the USER-OMP package.
 
 Use the "-pk intel Nphi" "command-line
 switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM)
 coprocessors/node, if LAMMPS was built with coprocessor support.  All
 the available coprocessor threads on each Phi will be divided among
 MPI tasks, unless the {tptask} option of the "-pk intel" "command-line
 switch"_Section_start.html#start_7 is used to limit the coprocessor
 threads per MPI task.  See the "package intel"_package.html command
 for details.
 
 CPU-only without USER-OMP (but using Intel vectorization on CPU):
 lmp_machine -sf intel -in in.script                 # 1 MPI task
 mpirun -np 32 lmp_machine -sf intel -in in.script   # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre
 
 CPU-only with USER-OMP (and Intel vectorization on CPU):
 lmp_machine -sf intel -pk intel 16 0 -in in.script             # 1 MPI task on a 16-core node
 mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script     # 4 MPI tasks each with 4 threads on a single 16-core node
 mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script    # ditto on 8 16-core nodes :pre
 
 CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP:
 lmp_machine -sf intel -pk intel 1 omp 16 -in in.script                       # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads
 lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script             # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads
 mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script           # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task
 mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script   # ditto on 8 16-core nodes
 mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script           # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task :pre 
 
 Note that if the "-sf intel" switch is used, it also invokes two
 default commands: "package intel 1"_package.html, followed by "package
 omp 0"_package.html.  These both set the number of OpenMP threads per
 MPI task via the OMP_NUM_THREADS environment variable.  The first
 command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and
 the precision mode to "mixed", as one of its option defaults).  The
 latter command is not invoked if LAMMPS was not built with the
 USER-OMP package.  The Nphi = 1 value for the first command is ignored
 if LAMMPS was not built with coprocessor support.
 
 Using the "-pk intel" or "-pk omp" switches explicitly allows for
 direct setting of the number of OpenMP threads per MPI task, and
 additional options for either of the USER-INTEL or USER-OMP packages.
 In particular, the "-pk intel" switch sets the number of
 coprocessors/node and can limit the number of coprocessor threads per
 MPI task.  The syntax for these two switches is the same as the
 "package omp"_package.html and "package intel"_package.html commands.
 See the "package"_package.html command doc page for details, including
 the default values used for all its options if these switches are not
 specified, and how to set the number of OpenMP threads via the
 OMP_NUM_THREADS environment variable if desired.
 
 [Or run with the USER-INTEL package by editing an input script:]
 
 The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 OpenMP threads per MPI task, and coprocessor threads per MPI task is
 the same.
 
 Use the "suffix intel"_suffix.html command, or you can explicitly add an
 "intel" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/intel 2.5 :pre
 
 You must also use the "package intel"_package.html command, unless the
 "-sf intel" or "-pk intel" "command-line
 switches"_Section_start.html#start_7 were used.  It specifies how many
 coprocessors/node to use, as well as other OpenMP threading and
 coprocessor options.  Its doc page explains how to set the number of
 OpenMP threads via an environment variable if desired.
 
 If LAMMPS was also built with the USER-OMP package, you must also use
 the "package omp"_package.html command to enable that package, unless
 the "-sf intel" or "-pk omp" "command-line
 switches"_Section_start.html#start_7 were used.  It specifies how many
 OpenMP threads per MPI task to use, as well as other options.  Its doc
 page explains how to set the number of OpenMP threads via an
 environment variable if desired.
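
 As a minimal sketch of this input-script route (the counts shown are
 placeholders for a node with one coprocessor), the top of a script
 might contain:

 package intel 1 omp 4        # 1 coprocessor/node, 4 OpenMP threads per MPI task
 package omp 4                # also enable USER-OMP styles, if that package is installed
 suffix intel                 # append "intel" (with "omp" as the second choice) to supported styles
 pair_style lj/cut 2.5        # becomes lj/cut/intel via the suffix command :pre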
 
 [Speed-ups to expect:]
 
 If LAMMPS was not built with coprocessor support when including the
 USER-INTEL package, then accelerated styles will run on the CPU using
 vectorization optimizations and the specified precision.  This may
 give a substantial speed-up for a pair style, particularly if mixed or
 single precision is used.
 
 If LAMMPS was built with coprocessor support, the pair styles will run
 on one or more Intel(R) Xeon Phi(TM) coprocessors (per node).  The
 performance of a Xeon Phi versus a multi-core CPU is a function of
 your hardware, which pair style is used, the number of
 atoms/coprocessor, and the precision used on the coprocessor (double,
 single, mixed).
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the USER-INTEL package on different
 hardware.
 
 [Guidelines for best performance on an Intel(R) Xeon Phi(TM)
 coprocessor:]
 
 The default for the "package intel"_package.html command is to have
 all the MPI tasks on a given compute node use a single Xeon Phi(TM)
 coprocessor.  In general, running with a large number of MPI tasks on
 each node will perform best with offload.  Each MPI task will
 automatically get affinity to a subset of the hardware threads
 available on the coprocessor.  For example, if your card has 61 cores,
 with 60 cores available for offload and 4 hardware threads per core
 (240 total threads), running with 24 MPI tasks per node will cause
 each MPI task to use a subset of 10 threads on the coprocessor.  Fine
 tuning of the number of threads to use per MPI task or the number of
 threads to use per core can be accomplished with keyword settings of
 the "package intel"_package.html command. :ulb,l
 
 If desired, only a fraction of the pair style computation can be
 offloaded to the coprocessors.  This is accomplished by using the
 {balance} keyword in the "package intel"_package.html command.  A
 balance of 0 runs all calculations on the CPU.  A balance of 1 runs
 all calculations on the coprocessor.  A balance of 0.5 runs half of
 the calculations on the coprocessor.  Setting the balance to -1 (the
 default) will enable dynamic load balancing that continuously adjusts
 the fraction of offloaded work throughout the simulation.  This option
 typically produces results within 5 to 10 percent of the optimal fixed
 balance; an illustrative setting is sketched after this list. :l
 
 When using offload with CPU hyperthreading disabled, it may help
 performance to use fewer MPI tasks and OpenMP threads than available
 cores.  This is due to the fact that additional threads are generated
 internally to handle the asynchronous offload tasks. :l
 
 If running short benchmark runs with dynamic load balancing, adding a
 short warm-up run (10-20 steps) will allow the load-balancer to find a
 near-optimal setting that will carry over to additional runs. :l
 
 If pair computations are being offloaded to an Intel(R) Xeon Phi(TM)
 coprocessor, a diagnostic line is printed to the screen (not to the
 log file), during the setup phase of a run, indicating that offload
 mode is being used and indicating the number of coprocessor threads
 per MPI task.  Additionally, an offload timing summary is printed at
 the end of each run.  When offloading, the frequency for "atom
 sorting"_atom_modify.html is changed to 1 so that the per-atom data is
 effectively sorted at every rebuild of the neighbor lists. :l
 
 For simulations with long-range electrostatics or bond, angle,
 dihedral, improper calculations, computation and data transfer to the
 coprocessor will run concurrently with computations and MPI
 communications for these calculations on the host CPU.  The USER-INTEL
 package has two modes for deciding which atoms will be handled by the
 coprocessor.  This choice is controlled with the {ghost} keyword of
 the "package intel"_package.html command.  When set to 0, ghost atoms
 (atoms at the borders between MPI tasks) are not offloaded to the
 card.  This allows for overlap of MPI communication of forces with
 computation on the coprocessor when the "newton"_newton.html setting
 is "on".  The default is dependent on the style being used, however,
 better performance may be achieved by setting this option
 explictly. :l,ule
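
 For instance (an illustrative setting, not a recommendation), offloading
 a fixed half of the pair computation to one coprocessor while keeping
 ghost atoms on the host could be requested with:

 package intel 1 balance 0.5 ghost 0 :pre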
 
 [Restrictions:]
 
 When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
 that require skip lists for neighbor builds cannot be offloaded.
 Using "hybrid/overlay"_pair_hybrid.html is allowed.  Only one intel
 accelerated style may be used with hybrid styles.
 "Special_bonds"_special_bonds.html exclusion lists are not currently
 supported with offload; however, the same effect can often be
 accomplished by setting cutoffs for excluded atom types to 0.  None of
 the pair styles in the USER-INTEL package currently support the
 "inner", "middle", "outer" options for rRESPA integration via the
 "run_style respa"_run_style.html command; only the "pair" option is
 supported.
diff --git a/doc/accelerate_omp.html b/doc/accelerate_omp.html
index 0663e1268..2a78e2f1f 100644
--- a/doc/accelerate_omp.html
+++ b/doc/accelerate_omp.html
@@ -1,355 +1,355 @@
 
 
 <!DOCTYPE html>
 <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
 <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
 <head>
   <meta charset="utf-8">
   
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
   <title>5.USER-OMP package &mdash; LAMMPS 15 May 2015 version documentation</title>
   
 
   
   
 
   
 
   
   
     
 
   
 
   
   
     <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   
 
   
     <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
   
 
   
     <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> 
 
   
   <script src="_static/js/modernizr.min.js"></script>
 
 </head>
 
 <body class="wy-body-for-nav" role="document">
 
   <div class="wy-grid-for-nav">
 
     
     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
       <div class="wy-side-nav-search">
         
 
         
           <a href="Manual.html" class="icon icon-home"> LAMMPS
         
 
         
         </a>
 
         
 <div role="search">
   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
     <input type="text" name="q" placeholder="Search docs" />
     <input type="hidden" name="check_keywords" value="yes" />
     <input type="hidden" name="area" value="default" />
   </form>
 </div>
 
         
       </div>
 
       <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
         
           
           
               <ul>
 <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
 </ul>
 
           
         
       </div>
       &nbsp;
     </nav>
 
     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
 
       
       <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
         <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
         <a href="Manual.html">LAMMPS</a>
       </nav>
 
 
       
       <div class="wy-nav-content">
         <div class="rst-content">
           <div role="navigation" aria-label="breadcrumbs navigation">
   <ul class="wy-breadcrumbs">
     <li><a href="Manual.html">Docs</a> &raquo;</li>
       
     <li>5.USER-OMP package</li>
       <li class="wy-breadcrumbs-aside">
         
           
             <a href="http://lammps.sandia.gov">Website</a>
             <a href="Section_commands.html#comm">Commands</a>
         
       </li>
   </ul>
   <hr/>
   
 </div>
           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
            <div itemprop="articleBody">
             
   <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
 <div class="section" id="user-omp-package">
 <h1>5.USER-OMP package<a class="headerlink" href="#user-omp-package" title="Permalink to this headline">¶</a></h1>
 <p>The USER-OMP package was developed by Axel Kohlmeyer at Temple
 University.  It provides multi-threaded versions of most pair styles,
 nearly all bonded styles (bond, angle, dihedral, improper), several
 Kspace styles, and a few fix styles.  The package currently
 uses the OpenMP interface for multi-threading.</p>
 <p>Here is a quick overview of how to use the USER-OMP package:</p>
 <ul class="simple">
 <li>use the -fopenmp flag for compiling and linking in your Makefile.machine</li>
 <li>include the USER-OMP package and build LAMMPS</li>
 <li>use the mpirun command to set the number of MPI tasks/node</li>
 <li>specify how many threads per MPI task to use</li>
 <li>use USER-OMP styles in your input script</li>
 </ul>
 <p>The latter two steps can be done using the &#8220;-pk omp&#8221; and &#8220;-sf omp&#8221;
 <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively.  Or
 the effect of the &#8220;-pk&#8221; or &#8220;-sf&#8221; switches can be duplicated by adding
 the <a class="reference internal" href="package.html"><em>package omp</em></a> or <a class="reference internal" href="suffix.html"><em>suffix omp</em></a> commands
 respectively to your input script.</p>
 <p><strong>Required hardware/software:</strong></p>
 <p>Your compiler must support the OpenMP interface.  You should have one
 or more multi-core CPUs so that multiple threads can be launched by an
 MPI task running on a CPU.</p>
 <p><strong>Building LAMMPS with the USER-OMP package:</strong></p>
 <p>To do this in one line, use the src/Make.py script, described in
 <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual.  Type &#8220;Make.py
 -h&#8221; for help.  If run from the src directory, this command will create
 src/lmp_omp using src/MAKE/Makefile.mpi as the starting
 Makefile.machine:</p>
-<div class="highlight-python"><div class="highlight"><pre>Make.py -p omp -o omp file mpi
+<div class="highlight-python"><div class="highlight"><pre>Make.py -p omp -o omp -a file mpi
 </pre></div>
 </div>
 <p>Or you can follow these steps:</p>
 <div class="highlight-python"><div class="highlight"><pre>cd lammps/src
 make yes-user-omp
 make machine
 </pre></div>
 </div>
 <p>The CCFLAGS setting in Makefile.machine needs &#8220;-fopenmp&#8221; to add OpenMP
 support.  This works for both the GNU and Intel compilers.  Without
 this flag the USER-OMP styles will still be compiled and work, but
 will not support multi-threading.  For the Intel compilers the CCFLAGS
 setting also needs to include &#8220;-restrict&#8221;.</p>
 <p><strong>Run with the USER-OMP package from the command line:</strong></p>
 <p>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.</p>
 <p>You need to choose how many threads per MPI task will be used by the
 USER-OMP package.  Note that the product of MPI tasks * threads/task
 should not exceed the physical number of cores (on a node), otherwise
 performance will suffer.</p>
 <p>Use the &#8220;-sf omp&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
 which will automatically append &#8220;omp&#8221; to styles that support it.  Use
 the &#8220;-pk omp Nt&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to
 set Nt = # of OpenMP threads per MPI task to use.</p>
 <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf omp -pk omp 16 -in in.script                       # 1 MPI task on a 16-core node
 mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script           # 4 MPI tasks each with 4 threads on a single 16-core node
 mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script   # ditto on 8 16-core nodes
 </pre></div>
 </div>
 <p>Note that if the &#8220;-sf omp&#8221; switch is used, it also issues a default
 <a class="reference internal" href="package.html"><em>package omp 0</em></a> command, which sets the number of threads
 per MPI task via the OMP_NUM_THREADS environment variable.</p>
 <p>Using the &#8220;-pk&#8221; switch explicitly allows for direct setting of the
 number of threads and additional options.  Its syntax is the same as
 the &#8220;package omp&#8221; command.  See the <a class="reference internal" href="package.html"><em>package</em></a> command doc
 page for details, including the default values used for all its
 options if it is not specified, and how to set the number of threads
 via the OMP_NUM_THREADS environment variable if desired.</p>
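 <p>As a concrete illustration, the two invocations below request 4
 OpenMP threads per MPI task in equivalent ways.  This is a sketch:
 the task and thread counts are arbitrary, and whether an exported
 environment variable reaches all compute nodes depends on your MPI
 launcher.</p>
 <div class="highlight-python"><div class="highlight"><pre># thread count set explicitly with the -pk switch
 mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script
 
 # thread count taken from OMP_NUM_THREADS by the default "package omp 0"
 export OMP_NUM_THREADS=4
 mpirun -np 4 lmp_machine -sf omp -in in.script
 </pre></div>
 </div>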
 <p><strong>Or run with the USER-OMP package by editing an input script:</strong></p>
 <p>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 and threads/MPI task is the same.</p>
 <p>Use the <a class="reference internal" href="suffix.html"><em>suffix omp</em></a> command, or you can explicitly add an
 &#8220;omp&#8221; suffix to individual styles in your input script, e.g.</p>
 <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/omp 2.5
 </pre></div>
 </div>
 <p>You must also use the <a class="reference internal" href="package.html"><em>package omp</em></a> command to enable the
 USER-OMP package, unless the &#8220;-sf omp&#8221; or &#8220;-pk omp&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used.  It specifies how many
 threads per MPI task to use, as well as other options.  Its doc page
 explains how to set the number of threads via an environment variable
 if desired.</p>
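 <p>Put together, a minimal input-script fragment that enables the
 package without any command-line switches might look as follows; the
 thread count of 4 and the lj/cut example style are arbitrary choices.</p>
 <div class="highlight-python"><div class="highlight"><pre>package    omp 4        # enable USER-OMP with 4 threads per MPI task
 suffix     omp          # append "omp" to all styles that support it
 pair_style lj/cut 2.5   # becomes lj/cut/omp via the suffix command
 </pre></div>
 </div>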
 <p><strong>Speed-ups to expect:</strong></p>
 <p>Depending on which styles are accelerated, you should look for a
 reduction in the &#8220;Pair time&#8221;, &#8220;Bond time&#8221;, &#8220;KSpace time&#8221;, and &#8220;Loop
 time&#8221; values printed at the end of a run.</p>
 <p>You may see a small performance advantage (5 to 20%) when running a
 USER-OMP style (in serial or parallel) with a single thread per MPI
 task, versus running standard LAMMPS with its standard
 (un-accelerated) styles (in serial or all-MPI parallelization with 1
 task/core).  This is because many of the USER-OMP styles contain
 similar optimizations to those used in the OPT package, as described
 above.</p>
 <p>With multiple threads/task, the optimal choice of MPI tasks/node and
 OpenMP threads/task can vary a lot and should always be tested via
 benchmark runs for a specific simulation running on a specific
 machine, paying attention to guidelines discussed in the next
 sub-section.</p>
 <p>A description of the multi-threading strategy used in the USER-OMP
 package and some performance examples are <a class="reference external" href="http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&amp;d=1">presented here</a>.</p>
 <p><strong>Guidelines for best performance:</strong></p>
 <p>For many problems on current generation CPUs, running the USER-OMP
 package with a single thread/task is faster than running with multiple
 threads/task.  This is because the MPI parallelization in LAMMPS is
 often more efficient than multi-threading as implemented in the
 USER-OMP package.  The parallel efficiency (in a threaded sense) also
 varies for different USER-OMP styles.</p>
 <p>Using multiple threads/task can be more effective under the following
 circumstances:</p>
 <ul class="simple">
 <li>Individual compute nodes have a significant number of CPU cores but
 the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
 (Clovertown) and 54xx (Harpertown) quad core processors. Running one
 MPI task per CPU core will result in significant performance
 degradation, so that running with 4 or even only 2 MPI tasks per node
 is faster.  Running in hybrid MPI+OpenMP mode will reduce the
 inter-node communication bandwidth contention in the same way, but
 offers an additional speedup by utilizing the otherwise idle CPU
 cores.</li>
 <li>The interconnect used for MPI communication does not provide
 sufficient bandwidth for a large number of MPI tasks per node.  For
 example, this applies to running over gigabit ethernet or on Cray XT4
 or XT5 series supercomputers.  As in the aforementioned case, this
 effect worsens when using an increasing number of nodes.</li>
 <li>The system has a spatially inhomogeneous particle density which does
 not map well to the <a class="reference internal" href="processors.html"><em>domain decomposition scheme</em></a> or
 <a class="reference internal" href="balance.html"><em>load-balancing</em></a> options that LAMMPS provides.  This is
 because multi-threading achieves parallelism over the number of
 particles, not via their distribution in space.</li>
 <li>A machine is being used in &#8220;capability mode&#8221;, i.e. near the point
 where MPI parallelism is maxed out.  For example, this can happen when
 using the <a class="reference internal" href="kspace_style.html"><em>PPPM solver</em></a> for long-range
 electrostatics on large numbers of nodes.  The scaling of the KSpace
 calculation (see the <a class="reference internal" href="kspace_style.html"><em>kspace_style</em></a> command) becomes
 the performance-limiting factor.  Using multi-threading allows fewer
 MPI tasks to be invoked and can speed up the long-range solver, while
 increasing overall performance by parallelizing the pairwise and
 bonded calculations via OpenMP.  Likewise, additional speedup can
 sometimes be achieved by increasing the length of the Coulombic cutoff
 and thus reducing the work done by the long-range solver.  Using the
 <a class="reference internal" href="run_style.html"><em>run_style verlet/split</em></a> command, which is compatible
 with the USER-OMP package, is an alternative way to reduce the number
 of MPI tasks assigned to the KSpace calculation (a sketch of such a
 setup follows this list).</li>
 </ul>
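 <p>To make the last point concrete, the fragment below sketches a
 threaded pair plus long-range setup that uses
 <a class="reference internal" href="run_style.html"><em>run_style verlet/split</em></a>.  The cutoff, accuracy, and thread
 values are arbitrary; such a run is typically launched with the
 -partition command-line switch so that a subset of the MPI tasks is
 dedicated to the KSpace calculation (see the run_style doc page for
 the partitioning requirements).</p>
 <div class="highlight-python"><div class="highlight"><pre>package      omp 4
 suffix       omp
 pair_style   lj/cut/coul/long 10.0
 kspace_style pppm 1.0e-4
 run_style    verlet/split
 </pre></div>
 </div>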
 <p>Additional performance tips are as follows:</p>
 <ul class="simple">
 <li>The best parallel efficiency from <em>omp</em> styles is typically achieved
 when there is at least one MPI task per physical processor,
 i.e. socket or die.</li>
 <li>It is usually most efficient to restrict threading to a single
 socket, i.e. use one or more MPI tasks per socket.</li>
 <li>Several current MPI implementations by default use a processor affinity
 setting that restricts each MPI task to a single CPU core.  Using
 multi-threading in this mode will force the threads to share that core
 and thus is likely to be counterproductive.  Instead, binding MPI
 tasks to a (multi-core) socket should solve this issue (see the sketch
 after this list).</li>
 </ul>
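 <p>For the affinity issue above, the sketch below shows one way to bind
 MPI tasks to sockets instead of cores.  The exact option names vary
 between MPI implementations and versions, so treat them as assumptions
 to check against the documentation of your launcher; the task and
 thread counts are arbitrary.</p>
 <div class="highlight-python"><div class="highlight"><pre># OpenMPI (recent versions): one task per socket, threads confined to it
 mpirun -np 4 --map-by socket --bind-to socket lmp_machine -sf omp -pk omp 8 -in in.script
 
 # MPICH (Hydra launcher)
 mpirun -np 4 -bind-to socket lmp_machine -sf omp -pk omp 8 -in in.script
 </pre></div>
 </div>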
 <div class="section" id="restrictions">
 <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>
 <p>None.</p>
 </div>
 </div>
 
 
            </div>
           </div>
           <footer>
   
 
   <hr/>
 
   <div role="contentinfo">
     <p>
         &copy; Copyright .
     </p>
   </div>
   Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
 
 </footer>
 
         </div>
       </div>
 
     </section>
 
   </div>
   
 
 
   
 
     <script type="text/javascript">
         var DOCUMENTATION_OPTIONS = {
             URL_ROOT:'./',
             VERSION:'15 May 2015 version',
             COLLAPSE_INDEX:false,
             FILE_SUFFIX:'.html',
             HAS_SOURCE:  true
         };
     </script>
       <script type="text/javascript" src="_static/jquery.js"></script>
       <script type="text/javascript" src="_static/underscore.js"></script>
       <script type="text/javascript" src="_static/doctools.js"></script>
       <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
 
   
 
   
   
     <script type="text/javascript" src="_static/js/theme.js"></script>
   
 
   
   
   <script type="text/javascript">
       jQuery(function () {
           SphinxRtdTheme.StickyNav.enable();
       });
   </script>
    
 
 </body>
 </html>
\ No newline at end of file
diff --git a/doc/accelerate_omp.txt b/doc/accelerate_omp.txt
index 08b9f3c75..9d461f519 100644
--- a/doc/accelerate_omp.txt
+++ b/doc/accelerate_omp.txt
@@ -1,201 +1,201 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.5 USER-OMP package :h4
 
 The USER-OMP package was developed by Axel Kohlmeyer at Temple
 University.  It provides multi-threaded versions of most pair styles,
 nearly all bonded styles (bond, angle, dihedral, improper), several
 Kspace styles, and a few fix styles.  The package currently
 uses the OpenMP interface for multi-threading.
 
 Here is a quick overview of how to use the USER-OMP package:
 
 use the -fopenmp flag for compiling and linking in your Makefile.machine
 include the USER-OMP package and build LAMMPS
 use the mpirun command to set the number of MPI tasks/node
 specify how many threads per MPI task to use
 use USER-OMP styles in your input script :ul
 
 The latter two steps can be done using the "-pk omp" and "-sf omp"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package omp"_package.html or "suffix omp"_suffix.html commands
 respectively to your input script.
 
 [Required hardware/software:]
 
 Your compiler must support the OpenMP interface.  You should have one
 or more multi-core CPUs so that multiple threads can be launched by an
 MPI task running on a CPU.
 
 [Building LAMMPS with the USER-OMP package:]
 
 To do this in one line, use the src/Make.py script, described in
 "Section 2.4"_Section_start.html#start_4 of the manual.  Type "Make.py
 -h" for help.  If run from the src directory, this command will create
 src/lmp_omp using src/MAKE/Makefile.mpi as the starting
 Makefile.machine:
 
-Make.py -p omp -o omp file mpi :pre
+Make.py -p omp -o omp -a file mpi :pre
 
 Or you can follow these steps:
 
 cd lammps/src
 make yes-user-omp
 make machine :pre
 
 The CCFLAGS setting in Makefile.machine needs "-fopenmp" to add OpenMP
 support.  This works for both the GNU and Intel compilers.  Without
 this flag the USER-OMP styles will still be compiled and work, but
 will not support multi-threading.  For the Intel compilers the CCFLAGS
 setting also needs to include "-restrict".
 
 [Run with the USER-OMP package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command in MPICH does this via
 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
 
 You need to choose how many threads per MPI task will be used by the
 USER-OMP package.  Note that the product of MPI tasks * threads/task
 should not exceed the physical number of cores (on a node), otherwise
 performance will suffer.
 
 Use the "-sf omp" "command-line switch"_Section_start.html#start_7,
 which will automatically append "omp" to styles that support it.  Use
 the "-pk omp Nt" "command-line switch"_Section_start.html#start_7, to
 set Nt = # of OpenMP threads per MPI task to use.
 
 lmp_machine -sf omp -pk omp 16 -in in.script                       # 1 MPI task on a 16-core node
 mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script           # 4 MPI tasks each with 4 threads on a single 16-core node
 mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script   # ditto on 8 16-core nodes :pre
 
 Note that if the "-sf omp" switch is used, it also issues a default
 "package omp 0"_package.html command, which sets the number of threads
 per MPI task via the OMP_NUM_THREADS environment variable.
 
 Using the "-pk" switch explicitly allows for direct setting of the
 number of threads and additional options.  Its syntax is the same as
 the "package omp" command.  See the "package"_package.html command doc
 page for details, including the default values used for all its
 options if it is not specified, and how to set the number of threads
 via the OMP_NUM_THREADS environment variable if desired.
 
 [Or run with the USER-OMP package by editing an input script:]
 
 The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 and threads/MPI task is the same.
 
 Use the "suffix omp"_suffix.html command, or you can explicitly add an
 "omp" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/omp 2.5 :pre
 
 You must also use the "package omp"_package.html command to enable the
 USER-OMP package, unless the "-sf omp" or "-pk omp" "command-line
 switches"_Section_start.html#start_7 were used.  It specifies how many
 threads per MPI task to use, as well as other options.  Its doc page
 explains how to set the number of threads via an environment variable
 if desired.
 
 [Speed-ups to expect:]
 
 Depending on which styles are accelerated, you should look for a
 reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
 time" values printed at the end of a run.  
 
 You may see a small performance advantage (5 to 20%) when running a
 USER-OMP style (in serial or parallel) with a single thread per MPI
 task, versus running standard LAMMPS with its standard
 (un-accelerated) styles (in serial or all-MPI parallelization with 1
 task/core).  This is because many of the USER-OMP styles contain
 similar optimizations to those used in the OPT package, as described
 above.
 
 With multiple threads/task, the optimal choice of MPI tasks/node and
 OpenMP threads/task can vary a lot and should always be tested via
 benchmark runs for a specific simulation running on a specific
 machine, paying attention to guidelines discussed in the next
 sub-section.
 
 A description of the multi-threading strategy used in the USER-OMP
 package and some performance examples are "presented
 here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
 
 [Guidelines for best performance:]
 
 For many problems on current generation CPUs, running the USER-OMP
 package with a single thread/task is faster than running with multiple
 threads/task.  This is because the MPI parallelization in LAMMPS is
 often more efficient than multi-threading as implemented in the
 USER-OMP package.  The parallel efficiency (in a threaded sense) also
 varies for different USER-OMP styles.
 
 Using multiple threads/task can be more effective under the following
 circumstances:
 
 Individual compute nodes have a significant number of CPU cores but
 the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
 (Clovertown) and 54xx (Harpertown) quad core processors. Running one
 MPI task per CPU core will result in significant performance
 degradation, so that running with 4 or even only 2 MPI tasks per node
 is faster.  Running in hybrid MPI+OpenMP mode will reduce the
 inter-node communication bandwidth contention in the same way, but
 offers an additional speedup by utilizing the otherwise idle CPU
 cores. :ulb,l
 
 The interconnect used for MPI communication does not provide
 sufficient bandwidth for a large number of MPI tasks per node.  For
 example, this applies to running over gigabit ethernet or on Cray XT4
 or XT5 series supercomputers.  As in the aforementioned case, this
 effect worsens when using an increasing number of nodes. :l
 
 The system has a spatially inhomogeneous particle density which does
 not map well to the "domain decomposition scheme"_processors.html or
 "load-balancing"_balance.html options that LAMMPS provides.  This is
 because multi-threading achieves parallelism over the number of
 particles, not via their distribution in space. :l
 
 A machine is being used in "capability mode", i.e. near the point
 where MPI parallelism is maxed out.  For example, this can happen when
 using the "PPPM solver"_kspace_style.html for long-range
 electrostatics on large numbers of nodes.  The scaling of the KSpace
 calculation (see the "kspace_style"_kspace_style.html command) becomes
 the performance-limiting factor.  Using multi-threading allows fewer
 MPI tasks to be invoked and can speed up the long-range solver, while
 increasing overall performance by parallelizing the pairwise and
 bonded calculations via OpenMP.  Likewise, additional speedup can
 sometimes be achieved by increasing the length of the Coulombic cutoff
 and thus reducing the work done by the long-range solver.  Using the
 "run_style verlet/split"_run_style.html command, which is compatible
 with the USER-OMP package, is an alternative way to reduce the number
 of MPI tasks assigned to the KSpace calculation. :l,ule
 
 Additional performance tips are as follows:
 
 The best parallel efficiency from {omp} styles is typically achieved
 when there is at least one MPI task per physical processor,
 i.e. socket or die. :ulb,l
 
 It is usually most efficient to restrict threading to a single
 socket, i.e. use one or more MPI tasks per socket. :l
 
 Several current MPI implementations by default use a processor affinity
 setting that restricts each MPI task to a single CPU core.  Using
 multi-threading in this mode will force the threads to share that core
 and thus is likely to be counterproductive.  Instead, binding MPI
 tasks to a (multi-core) socket should solve this issue. :l,ule
 
 [Restrictions:]
 
 None.
diff --git a/doc/accelerate_opt.html b/doc/accelerate_opt.html
index 60f1afb2a..fc5c77952 100644
--- a/doc/accelerate_opt.html
+++ b/doc/accelerate_opt.html
@@ -1,250 +1,250 @@
 
 
 <!DOCTYPE html>
 <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
 <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
 <head>
   <meta charset="utf-8">
   
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
   <title>5.OPT package &mdash; LAMMPS 15 May 2015 version documentation</title>
   
 
   
   
 
   
 
   
   
     
 
   
 
   
   
     <link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
   
 
   
     <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" />
   
 
   
     <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> 
 
   
   <script src="_static/js/modernizr.min.js"></script>
 
 </head>
 
 <body class="wy-body-for-nav" role="document">
 
   <div class="wy-grid-for-nav">
 
     
     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
       <div class="wy-side-nav-search">
         
 
         
           <a href="Manual.html" class="icon icon-home"> LAMMPS
         
 
         
         </a>
 
         
 <div role="search">
   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
     <input type="text" name="q" placeholder="Search docs" />
     <input type="hidden" name="check_keywords" value="yes" />
     <input type="hidden" name="area" value="default" />
   </form>
 </div>
 
         
       </div>
 
       <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
         
           
           
               <ul>
 <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance &amp; scalability</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying &amp; extending LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li>
 <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. Future and history</a></li>
 </ul>
 
           
         
       </div>
       &nbsp;
     </nav>
 
     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
 
       
       <nav class="wy-nav-top" role="navigation" aria-label="top navigation">
         <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
         <a href="Manual.html">LAMMPS</a>
       </nav>
 
 
       
       <div class="wy-nav-content">
         <div class="rst-content">
           <div role="navigation" aria-label="breadcrumbs navigation">
   <ul class="wy-breadcrumbs">
     <li><a href="Manual.html">Docs</a> &raquo;</li>
       
     <li>5.OPT package</li>
       <li class="wy-breadcrumbs-aside">
         
           
             <a href="http://lammps.sandia.gov">Website</a>
             <a href="Section_commands.html#comm">Commands</a>
         
       </li>
   </ul>
   <hr/>
   
 </div>
           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
            <div itemprop="articleBody">
             
   <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p>
 <div class="section" id="opt-package">
 <h1>5.OPT package<a class="headerlink" href="#opt-package" title="Permalink to this headline">¶</a></h1>
 <p>The OPT package was developed by James Fischer (High Performance
 Technologies), David Richie, and Vincent Natoli (Stone Ridge
 Technologies).  It contains a handful of pair styles whose compute()
 methods were rewritten in C++ templated form to reduce the overhead
 due to if tests and other conditional code.</p>
 <p>Here is a quick overview of how to use the OPT package:</p>
 <ul class="simple">
 <li>include the OPT package and build LAMMPS</li>
 <li>use OPT pair styles in your input script</li>
 </ul>
 <p>The last step can be done using the &#8220;-sf opt&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>.  Or the effect of the &#8220;-sf&#8221; switch
 can be duplicated by adding a <a class="reference internal" href="suffix.html"><em>suffix opt</em></a> command to your
 input script.</p>
 <p><strong>Required hardware/software:</strong></p>
 <p>None.</p>
 <p><strong>Building LAMMPS with the OPT package:</strong></p>
 <p>Include the package and build LAMMPS:</p>
 <p>To do this in one line, use the src/Make.py script, described in
 <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual.  Type &#8220;Make.py
 -h&#8221; for help.  If run from the src directory, this command will create
 src/lmp_opt using src/MAKE/Makefile.mpi as the starting
 Makefile.machine:</p>
-<div class="highlight-python"><div class="highlight"><pre>Make.py -p opt -o opt file mpi
+<div class="highlight-python"><div class="highlight"><pre>Make.py -p opt -o opt -a file mpi
 </pre></div>
 </div>
 <p>Or you can follow these steps:</p>
 <div class="highlight-python"><div class="highlight"><pre>cd lammps/src
 make yes-opt
 make machine
 </pre></div>
 </div>
 <p>If you are using Intel compilers, then the CCFLAGS setting in
 Makefile.machine needs to include &#8220;-restrict&#8221;.</p>
 <p><strong>Run with the OPT package from the command line:</strong></p>
 <p>Use the &#8220;-sf opt&#8221; <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>,
 which will automatically append &#8220;opt&#8221; to styles that support it.</p>
 <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf opt -in in.script
 mpirun -np 4 lmp_machine -sf opt -in in.script
 </pre></div>
 </div>
 <p><strong>Or run with the OPT package by editing an input script:</strong></p>
 <p>Use the <a class="reference internal" href="suffix.html"><em>suffix opt</em></a> command, or you can explicitly add an
 &#8220;opt&#8221; suffix to individual styles in your input script, e.g.</p>
 <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/opt 2.5
 </pre></div>
 </div>
 <p><strong>Speed-ups to expect:</strong></p>
 <p>You should see a reduction in the &#8220;Pair time&#8221; value printed at the end
 of a run.  On most machines for reasonable problem sizes, it will be a
 5 to 20% savings.</p>
 <p><strong>Guidelines for best performance:</strong></p>
 <p>None.  Just try out an OPT pair style to see how it performs.</p>
 <div class="section" id="restrictions">
 <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2>
 <p>None.</p>
 </div>
 </div>
 
 
            </div>
           </div>
           <footer>
   
 
   <hr/>
 
   <div role="contentinfo">
     <p>
         &copy; Copyright .
     </p>
   </div>
   Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
 
 </footer>
 
         </div>
       </div>
 
     </section>
 
   </div>
   
 
 
   
 
     <script type="text/javascript">
         var DOCUMENTATION_OPTIONS = {
             URL_ROOT:'./',
             VERSION:'15 May 2015 version',
             COLLAPSE_INDEX:false,
             FILE_SUFFIX:'.html',
             HAS_SOURCE:  true
         };
     </script>
       <script type="text/javascript" src="_static/jquery.js"></script>
       <script type="text/javascript" src="_static/underscore.js"></script>
       <script type="text/javascript" src="_static/doctools.js"></script>
       <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script>
       <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script>
 
   
 
   
   
     <script type="text/javascript" src="_static/js/theme.js"></script>
   
 
   
   
   <script type="text/javascript">
       jQuery(function () {
           SphinxRtdTheme.StickyNav.enable();
       });
   </script>
    
 
 </body>
 </html>
\ No newline at end of file
diff --git a/doc/accelerate_opt.txt b/doc/accelerate_opt.txt
index 726f32686..23e853aec 100644
--- a/doc/accelerate_opt.txt
+++ b/doc/accelerate_opt.txt
@@ -1,82 +1,82 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.6 OPT package :h4
 
 The OPT package was developed by James Fischer (High Performance
 Technologies), David Richie, and Vincent Natoli (Stone Ridge
 Technologies).  It contains a handful of pair styles whose compute()
 methods were rewritten in C++ templated form to reduce the overhead
 due to if tests and other conditional code.
 
 Here is a quick overview of how to use the OPT package:
 
 include the OPT package and build LAMMPS
 use OPT pair styles in your input script :ul
 
 The last step can be done using the "-sf opt" "command-line
 switch"_Section_start.html#start_7.  Or the effect of the "-sf" switch
 can be duplicated by adding a "suffix opt"_suffix.html command to your
 input script.
 
 [Required hardware/software:]
 
 None.
 
 [Building LAMMPS with the OPT package:]
 
 Include the package and build LAMMPS:
 
 To do this in one line, use the src/Make.py script, described in
 "Section 2.4"_Section_start.html#start_4 of the manual.  Type "Make.py
 -h" for help.  If run from the src directory, this command will create
 src/lmp_opt using src/MAKE/Makefile.mpi as the starting
 Makefile.machine:
 
-Make.py -p opt -o opt file mpi :pre
+Make.py -p opt -o opt -a file mpi :pre
 
 Or you can follow these steps:
 
 cd lammps/src
 make yes-opt
 make machine :pre
 
 If you are using Intel compilers, then the CCFLAGS setting in
 Makefile.machine needs to include "-restrict".
 
 [Run with the OPT package from the command line:]
 
 Use the "-sf opt" "command-line switch"_Section_start.html#start_7,
 which will automatically append "opt" to styles that support it.
 
 lmp_machine -sf opt -in in.script
 mpirun -np 4 lmp_machine -sf opt -in in.script :pre
 
 [Or run with the OPT package by editing an input script:]
 
 Use the "suffix opt"_suffix.html command, or you can explicitly add an
 "opt" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/opt 2.5 :pre
 
 [Speed-ups to expect:]
 
 You should see a reduction in the "Pair time" value printed at the end
 of a run.  On most machines for reasonable problem sizes, it will be a
 5 to 20% savings.
 
 [Guidelines for best performance:]
 
 None.  Just try out an OPT pair style to see how it performs.
 
 [Restrictions:]
 
 None.