diff --git a/doc/accelerate_cuda.html b/doc/accelerate_cuda.html
index 4b9cceaaf..4d005e3f9 100644
--- a/doc/accelerate_cuda.html
+++ b/doc/accelerate_cuda.html
@@ -1,212 +1,218 @@
 <HTML>
 <CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
 <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> 
 </CENTER>
 
 
 
 
 
 
 <HR>
 
 <P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
 </P>
 <H4>5.3.1 USER-CUDA package 
 </H4>
 <P>The USER-CUDA package was developed by Christian Trott (Sandia) while
 at U Technology Ilmenau in Germany.  It provides NVIDIA GPU versions
 of many pair styles, many fixes, a few computes, and for long-range
 Coulombics via the PPPM command.  It has the following general
 features:
 </P>
 <UL><LI>The package is designed to allow an entire LAMMPS calculation, for
 many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU. 
 
 <LI>The speed-up advantage of this approach is typically better when the
 number of atoms per GPU is large 
 
 <LI>Data will stay on the GPU until a timestep where a non-USER-CUDA fix
 or compute is invoked.  Whenever a non-GPU operation occurs (fix,
 compute, output), data automatically moves back to the CPU as needed.
 This may incur a performance penalty, but should otherwise work
 transparently. 
 
 <LI>Neighbor lists are constructed on the GPU. 
 
 <LI>The package only supports use of a single MPI task, running on a
 single CPU (core), assigned to each GPU. 
 </UL>
 <P>Here is a quick overview of how to use the USER-CUDA package:
 </P>
 <UL><LI>build the library in lib/cuda for your GPU hardware with desired precision
 <LI>include the USER-CUDA package and build LAMMPS
 <LI>use the mpirun command to specify 1 MPI task per GPU (on each node)
 <LI>enable the USER-CUDA package via the "-c on" command-line switch
 <LI>specify the # of GPUs per node
 <LI>use USER-CUDA styles in your input script 
 </UL>
 <P>The latter two steps can be done using the "-pk cuda" and "-sf cuda"
 <A HREF = "Section_start.html#start_7">command-line switches</A> respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the <A HREF = "package.html">package cuda</A> or <A HREF = "suffix.html">suffix cuda</A> commands
 respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
 <P>To use this package, you need to have one or more NVIDIA GPUs and
 install the NVIDIA Cuda software on your system:
 </P>
 <P>Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
 help you to find out the Compute Capability of your card:
 </P>
 <P>http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
 </P>
 <P>Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
 corresponding GPU drivers.  The Nvidia Cuda SDK is not required, but
 we recommend it also be installed.  You can then make sure its sample
 projects can be compiled without problems.
 </P>
 <P><B>Building LAMMPS with the USER-CUDA package:</B>
 </P>
 <P>This requires two steps (a,b): build the USER-CUDA library, then build
 LAMMPS with the USER-CUDA package.
 </P>
 <P>(a) Build the USER-CUDA library
 </P>
 <P>The USER-CUDA library is in lammps/lib/cuda.  If your <I>CUDA</I> toolkit
 is not installed in the default system directory <I>/usr/local/cuda</I>, edit
 the file <I>lib/cuda/Makefile.common</I> accordingly.
 </P>
 <P>To set options for the library build, type "make OPTIONS", where
 <I>OPTIONS</I> are one or more of the following. The settings will be
 written to <I>lib/cuda/Makefile.defaults</I> and used when
 the library is built.
 </P>
 <PRE><I>precision=N</I> to set the precision level
   N = 1 for single precision (default)
   N = 2 for double precision
   N = 3 for positions in double precision
   N = 4 for positions and velocities in double precision
 <I>arch=M</I> to set GPU compute capability
   M = 35 for Kepler GPUs
   M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
   M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
   M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
 <I>prec_timer=0/1</I> to use hi-precision timers
   0 = do not use them (default)
   1 = use them
   this is usually only useful for Mac machines 
 <I>dbg=0/1</I> to activate debug mode
   0 = no debug mode (default)
   1 = yes debug mode
   this is only useful for developers
 <I>cufft=1</I> for use of the CUDA FFT library
   0 = no CUFFT support (default)
   in the future other CUDA-enabled FFT libraries might be supported 
 </PRE>
 <P>To build the library, simply type:
 </P>
 <PRE>make 
 </PRE>
 <P>If successful, it will produce the files libcuda.a and Makefile.lammps.
 </P>
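+<P>For example, a minimal build sequence for a double-precision library
+targeting a Kepler GPU might look as follows (a sketch; pick the
+precision and arch values appropriate for your hardware from the list
+above):
+</P>
+<PRE>cd lammps/lib/cuda
+make precision=2 arch=35
+make 
+</PRE>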
 <P>Note that if you change any of the options (like precision), you need
 to re-build the entire library.  Do a "make clean" first, followed by
 "make".
 </P>
 <P>(b) Build LAMMPS with the USER-CUDA package
 </P>
 <PRE>cd lammps/src
 make yes-user-cuda
 make machine 
 </PRE>
 <P>No additional compile/link flags are needed in your Makefile.machine
 in src/MAKE.
 </P>
 <P>Note that if you change the USER-CUDA library precision (discussed
 above) and rebuild the USER-CUDA library, then you also need to
 re-install the USER-CUDA package and re-build LAMMPS, so that all
 affected files are re-compiled and linked to the new USER-CUDA
 library.
 </P>
 <P><B>Run with the USER-CUDA package from the command line:</B>
 </P>
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 </P>
 <P>When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
 </P>
 <P>You must use the "-c on" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to enable the USER-CUDA package.
 The "-c on" switch also issues a default <A HREF = "package.html">package cuda 1</A>
 command which sets various USER-CUDA options to default values, as
 discussed on the <A HREF = "package.html">package</A> command doc page.
 </P>
 <P>Use the "-sf cuda" <A HREF = "Section_start.html#start_7">command-line switch</A>,
 which will automatically append "cuda" to styles that support it.  Use
 the "-pk cuda Ng" <A HREF = "Section_start.html#start_7">command-line switch</A> to
 set Ng = # of GPUs per node to a different value than the default set
 by the "-c on" switch (1 GPU) or change other <A HREF = "package.html">package
 cuda</A> options.
 </P>
 <PRE>lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
 mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes 
 </PRE>
 <P>The syntax for the "-pk" switch is the same as same as the "package
 cuda" command.  See the <A HREF = "package.html">package</A> command doc page for
 details, including the default values used for all its options if it
 is not specified.
 </P>
+<P>Note that the default for the <A HREF = "package.html">package cuda</A> command is
+to set the Newton flag to "off" for both pairwise and bonded
+interactions.  This typically gives fastest performance.  If the
+<A HREF = "newton.html">newton</A> command is used in the input script, it can
+override these defaults.
+</P>
 <P><B>Or run with the USER-CUDA package by editing an input script:</B>
 </P>
 <P>The discussion above for the mpirun/mpiexec command and the requirement
 of one MPI task per GPU is the same.
 </P>
 <P>You must still use the "-c on" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to enable the USER-CUDA package.
 </P>
 <P>Use the <A HREF = "suffix.html">suffix cuda</A> command, or you can explicitly add a
 "cuda" suffix to individual styles in your input script, e.g.
 </P>
 <PRE>pair_style lj/cut/cuda 2.5 
 </PRE>
 <P>You only need to use the <A HREF = "package.html">package cuda</A> command if you
 wish to change any of its option defaults, including the number of
 GPUs/node (default = 1), as set by the "-c on" <A HREF = "Section_start.html#start_7">command-line
 switch</A>.
 </P>
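+<P>For example, a short input-script fragment that requests 2 GPUs per
+node and applies the "cuda" suffix to all supported styles could look
+like this (a sketch; the pair style and cutoff are placeholders):
+</P>
+<PRE>package cuda 2
+suffix cuda
+pair_style lj/cut 2.5 
+</PRE>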
 <P><B>Speed-ups to expect:</B>
 </P>
 <P>The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 </P>
 <P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
 LAMMPS web site for performance of the USER-CUDA package on different
 hardware.
 </P>
 <P><B>Guidelines for best performance:</B>
 </P>
 <UL><LI>The USER-CUDA package offers more speed-up relative to CPU performance
 when the number of atoms per GPU is large, e.g. on the order of tens
 or hundreds of 1000s. 
 
 <LI>As noted above, this package will continue to run a simulation
 entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
 occurs, the faster your simulation will run. 
 </UL>
 <P><B>Restrictions:</B>
 </P>
 <P>None.
 </P>
 </HTML>
diff --git a/doc/accelerate_cuda.txt b/doc/accelerate_cuda.txt
index 6b88abd90..d88094ecb 100644
--- a/doc/accelerate_cuda.txt
+++ b/doc/accelerate_cuda.txt
@@ -1,207 +1,213 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.1 USER-CUDA package :h4
 
 The USER-CUDA package was developed by Christian Trott (Sandia) while
 at U Technology Ilmenau in Germany.  It provides NVIDIA GPU versions
 of many pair styles, many fixes, a few computes, and for long-range
 Coulombics via the PPPM command.  It has the following general
 features:
 
 The package is designed to allow an entire LAMMPS calculation, for
 many timesteps, to run entirely on the GPU (except for inter-processor
 MPI communication), so that atom-based data (e.g. coordinates, forces)
 do not have to move back-and-forth between the CPU and GPU. :ulb,l
 
 The speed-up advantage of this approach is typically better when the
 number of atoms per GPU is large :l
 
 Data will stay on the GPU until a timestep where a non-USER-CUDA fix
 or compute is invoked.  Whenever a non-GPU operation occurs (fix,
 compute, output), data automatically moves back to the CPU as needed.
 This may incur a performance penalty, but should otherwise work
 transparently. :l
 
 Neighbor lists are constructed on the GPU. :l
 
 The package only supports use of a single MPI task, running on a
 single CPU (core), assigned to each GPU. :l,ule
 
 Here is a quick overview of how to use the USER-CUDA package:
 
 build the library in lib/cuda for your GPU hardware with desired precision
 include the USER-CUDA package and build LAMMPS
 use the mpirun command to specify 1 MPI task per GPU (on each node)
 enable the USER-CUDA package via the "-c on" command-line switch
 specify the # of GPUs per node
 use USER-CUDA styles in your input script :ul
 
 The latter two steps can be done using the "-pk cuda" and "-sf cuda"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package cuda"_package.html or "suffix cuda"_suffix.html commands
 respectively to your input script.
 
 [Required hardware/software:]
 
 To use this package, you need to have one or more NVIDIA GPUs and
 install the NVIDIA Cuda software on your system:
 
 Your NVIDIA GPU needs to support Compute Capability 1.3. This list may
 help you to find out the Compute Capability of your card:
 
 http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
 
 Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the
 corresponding GPU drivers.  The Nvidia Cuda SDK is not required, but
 we recommend it also be installed.  You can then make sure its sample
 projects can be compiled without problems.
 
 [Building LAMMPS with the USER-CUDA package:]
 
 This requires two steps (a,b): build the USER-CUDA library, then build
 LAMMPS with the USER-CUDA package.
 
 (a) Build the USER-CUDA library
 
 The USER-CUDA library is in lammps/lib/cuda.  If your {CUDA} toolkit
 is not installed in the default system directory {/usr/local/cuda}, edit
 the file {lib/cuda/Makefile.common} accordingly.
 
 To set options for the library build, type "make OPTIONS", where
 {OPTIONS} are one or more of the following. The settings will be
 written to {lib/cuda/Makefile.defaults} and used when
 the library is built.
 
 {precision=N} to set the precision level
   N = 1 for single precision (default)
   N = 2 for double precision
   N = 3 for positions in double precision
   N = 4 for positions and velocities in double precision
 {arch=M} to set GPU compute capability
   M = 35 for Kepler GPUs
   M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
   M = 21 for CC2.1 (GF104/114,  e.g. GTX560, GTX460, GTX450)
   M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
 {prec_timer=0/1} to use hi-precision timers
   0 = do not use them (default)
   1 = use them
   this is usually only useful for Mac machines 
 {dbg=0/1} to activate debug mode
   0 = no debug mode (default)
   1 = yes debug mode
   this is only useful for developers
 {cufft=1} for use of the CUDA FFT library
   0 = no CUFFT support (default)
   in the future other CUDA-enabled FFT libraries might be supported :pre
 
 To build the library, simply type:
 
 make :pre
 
 If successful, it will produce the files libcuda.a and Makefile.lammps.
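+
+For example, a minimal build sequence for a double-precision library
+targeting a Kepler GPU might look as follows (a sketch; pick the
+precision and arch values appropriate for your hardware from the list
+above):
+
+cd lammps/lib/cuda
+make precision=2 arch=35
+make :pre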
 
 Note that if you change any of the options (like precision), you need
 to re-build the entire library.  Do a "make clean" first, followed by
 "make".
 
 (b) Build LAMMPS with the USER-CUDA package
 
 cd lammps/src
 make yes-user-cuda
 make machine :pre
 
 No additional compile/link flags are needed in your Makefile.machine
 in src/MAKE.
 
 Note that if you change the USER-CUDA library precision (discussed
 above) and rebuild the USER-CUDA library, then you also need to
 re-install the USER-CUDA package and re-build LAMMPS, so that all
 affected files are re-compiled and linked to the new USER-CUDA
 library.
 
 [Run with the USER-CUDA package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 
 When using the USER-CUDA package, you must use exactly one MPI task
 per physical GPU.
 
 You must use the "-c on" "command-line
 switch"_Section_start.html#start_7 to enable the USER-CUDA package.
 The "-c on" switch also issues a default "package cuda 1"_package.html
 command which sets various USER-CUDA options to default values, as
 discussed on the "package"_package.html command doc page.
 
 Use the "-sf cuda" "command-line switch"_Section_start.html#start_7,
 which will automatically append "cuda" to styles that support it.  Use
 the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to
 set Ng = # of GPUs per node to a different value than the default set
 by the "-c on" switch (1 GPU) or change other "package
 cuda"_package.html options.
 
 lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
 mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes :pre
 
 The syntax for the "-pk" switch is the same as same as the "package
 cuda" command.  See the "package"_package.html command doc page for
 details, including the default values used for all its options if it
 is not specified.
 
+Note that the default for the "package cuda"_package.html command is
+to set the Newton flag to "off" for both pairwise and bonded
+interactions.  This typically gives fastest performance.  If the
+"newton"_newton.html command is used in the input script, it can
+override these defaults.
+
 [Or run with the USER-CUDA package by editing an input script:]
 
 The discussion above for the mpirun/mpiexec command and the requirement
 of one MPI task per GPU is the same.
 
 You must still use the "-c on" "command-line
 switch"_Section_start.html#start_7 to enable the USER-CUDA package.
 
 Use the "suffix cuda"_suffix.html command, or you can explicitly add a
 "cuda" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/cuda 2.5 :pre
 
 You only need to use the "package cuda"_package.html command if you
 wish to change any of its option defaults, including the number of
 GPUs/node (default = 1), as set by the "-c on" "command-line
 switch"_Section_start.html#start_7.
 
 [Speed-ups to expect:]
 
 The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the USER-CUDA package on different
 hardware.
 
 [Guidelines for best performance:]
 
 The USER-CUDA package offers more speed-up relative to CPU performance
 when the number of atoms per GPU is large, e.g. on the order of tens
 or hundreds of 1000s. :ulb,l
 
 As noted above, this package will continue to run a simulation
 entirely on the GPU(s) (except for inter-processor MPI communication),
 for multiple timesteps, until a CPU calculation is required, either by
 a fix or compute that is non-GPU-ized, or until output is performed
 (thermo or dump snapshot or restart file).  The less often this
 occurs, the faster your simulation will run. :l,ule
 
 [Restrictions:]
 
 None.
diff --git a/doc/accelerate_gpu.html b/doc/accelerate_gpu.html
index 79cae3832..d09eb331c 100644
--- a/doc/accelerate_gpu.html
+++ b/doc/accelerate_gpu.html
@@ -1,242 +1,248 @@
 <HTML>
 <CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
 <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> 
 </CENTER>
 
 
 
 
 
 
 <HR>
 
 <P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
 </P>
 <H4>5.3.2 GPU package 
 </H4>
 <P>The GPU package was developed by Mike Brown at ORNL and his
 collaborators, particularly Trung Nguyen (ORNL).  It provides GPU
 versions of many pair styles, including the 3-body Stillinger-Weber
 pair style, and for <A HREF = "kspace_style.html">kspace_style pppm</A> for
 long-range Coulombics.  It has the following general features:
 </P>
 <UL><LI>It is designed to exploit common GPU hardware configurations where one
 or more GPUs are coupled to many cores of one or more multi-core CPUs,
 e.g. within a node of a parallel machine. 
 
 <LI>Atom-based data (e.g. coordinates, forces) moves back-and-forth
 between the CPU(s) and GPU every timestep. 
 
 <LI>Neighbor lists can be built on the CPU or on the GPU 
 
 <LI>The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU. 
 
 <LI>Asynchronous force computations can be performed simultaneously on the
 CPU(s) and GPU. 
 
 <LI>It allows for GPU computations to be performed in single or double
 precision, or in mixed-mode precision, where pairwise forces are
 computed in single precision, but accumulated into double-precision
 force vectors. 
 
 <LI>LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
 NVIDIA support as well as more general OpenCL support, so that the
 same functionality can eventually be supported on a variety of GPU
 hardware. 
 </UL>
 <P>Here is a quick overview of how to use the GPU package:
 </P>
 <UL><LI>build the library in lib/gpu for your GPU hardware with desired precision
 <LI>include the GPU package and build LAMMPS
 <LI>use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
 <LI>specify the # of GPUs per node
 <LI>use GPU styles in your input script 
 </UL>
 <P>The latter two steps can be done using the "-pk gpu" and "-sf gpu"
 <A HREF = "Section_start.html#start_7">command-line switches</A> respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the <A HREF = "package.html">package gpu</A> or <A HREF = "suffix.html">suffix gpu</A> commands
 respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
 <P>To use this package, you currently need to have an NVIDIA GPU and
 install the NVIDIA Cuda software on your system:
 </P>
 <UL><LI>Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
 <LI>Go to http://www.nvidia.com/object/cuda_get.html
 <LI>Install a driver and toolkit appropriate for your system (SDK is not necessary)
 <LI>Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties 
 </UL>
 <P><B>Building LAMMPS with the GPU package:</B>
 </P>
 <P>This requires two steps (a,b): build the GPU library, then build
 LAMMPS with the GPU package.
 </P>
 <P>(a) Build the GPU library
 </P>
 <P>The GPU library is in lammps/lib/gpu.  Select a Makefile.machine (in
 lib/gpu) appropriate for your system.  You should pay special
 attention to 3 settings in this makefile.
 </P>
 <UL><LI>CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
 <LI>CUDA_ARCH = needs to be appropriate to your GPUs
 <LI>CUDA_PREC = precision (double, mixed, single) you desire 
 </UL>
 <P>See lib/gpu/Makefile.linux.double for examples of the ARCH settings
 for different GPU choices, e.g. Fermi vs Kepler.  It also lists the
 possible precision settings:
 </P>
 <PRE>CUDA_PREC = -D_SINGLE_SINGLE  # single precision for all calculations
 CUDA_PREC = -D_DOUBLE_DOUBLE  # double precision for all calculations
 CUDA_PREC = -D_SINGLE_DOUBLE  # accumulation of forces, etc, in double 
 </PRE>
 <P>The last setting is the mixed mode referred to above.  Note that your
 GPU must support double precision to use either the 2nd or 3rd of
 these settings.
 </P>
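+<P>As a sketch, the three settings together might look like this in your
+Makefile.machine (the install path and the -arch value are assumptions;
+adjust them for your system and GPU):
+</P>
+<PRE>CUDA_HOME = /usr/local/cuda
+CUDA_ARCH = -arch=sm_35
+CUDA_PREC = -D_SINGLE_DOUBLE 
+</PRE>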
 <P>To build the library, type:
 </P>
 <PRE>make -f Makefile.machine 
 </PRE>
 <P>If successful, it will produce the files libgpu.a and Makefile.lammps.
 </P>
 <P>The latter file has 3 settings that need to be appropriate for the
 paths and settings for the CUDA system software on your machine.
 Makefile.lammps is a copy of the file specified by the EXTRAMAKE
 setting in Makefile.machine.  You can change EXTRAMAKE or create your
 own Makefile.lammps.machine if needed.
 </P>
 <P>Note that to change the precision of the GPU library, you need to
 re-build the entire library.  Do a "clean" first, e.g. "make -f
 Makefile.linux clean", followed by the make command above.
 </P>
 <P>(b) Build LAMMPS with the GPU package
 </P>
 <PRE>cd lammps/src
 make yes-gpu
 make machine 
 </PRE>
 <P>No additional compile/link flags are needed in your Makefile.machine
 in src/MAKE.
 </P>
 <P>Note that if you change the GPU library precision (discussed above)
 and rebuild the GPU library, then you also need to re-install the GPU
 package and re-build LAMMPS, so that all affected files are
 re-compiled and linked to the new GPU library.
 </P>
 <P><B>Run with the GPU package from the command line:</B>
 </P>
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 </P>
 <P>When using the GPU package, you cannot assign more than one GPU to a
 single MPI task.  However multiple MPI tasks can share the same GPU,
 and in many cases it will be more efficient to run this way.  Likewise
 it may be more efficient to use fewer MPI tasks/node than the available
 # of CPU cores.  Assignment of multiple MPI tasks to a GPU will happen
 automatically if you create more MPI tasks/node than there are
 GPUs/node.  E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
 shared by 4 MPI tasks.
 </P>
 <P>Use the "-sf gpu" <A HREF = "Section_start.html#start_7">command-line switch</A>,
 which will automatically append "gpu" to styles that support it.  Use
 the "-pk gpu Ng" <A HREF = "Section_start.html#start_7">command-line switch</A> to
 set Ng = # of GPUs/node to use.
 </P>
 <PRE>lmp_machine -sf gpu -pk gpu 1 -in in.script                         # 1 MPI task uses 1 GPU
 mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script           # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script   # ditto on 4 16-core nodes 
 </PRE>
 <P>Note that if the "-sf gpu" switch is used, it also issues a default
 <A HREF = "package.html">package gpu 1</A> command, which sets the number of
 GPUs/node to 1.
 </P>
 <P>Using the "-pk" switch explicitly allows for setting of the number of
 GPUs/node to use and additional options.  Its syntax is the same as
 the "package gpu" command.  See the <A HREF = "package.html">package</A>
 command doc page for details, including the default values used for
 all its options if it is not specified.
 </P>
+<P>Note that the default for the <A HREF = "package.html">package gpu</A> command is to
+set the Newton flag to "off" pairwise interactions.  It does not
+affect the setting for bonded interactions (LAMMPS default is "on").
+The "off" setting for pairwise interaction is currently required for
+GPU package pair styles.
+</P>
 <P><B>Or run with the GPU package by editing an input script:</B>
 </P>
 <P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 and use of multiple MPI tasks/GPU is the same.
 </P>
 <P>Use the <A HREF = "suffix.html">suffix gpu</A> command, or you can explicitly add an
 "gpu" suffix to individual styles in your input script, e.g.
 </P>
 <PRE>pair_style lj/cut/gpu 2.5 
 </PRE>
 <P>You must also use the <A HREF = "package.html">package gpu</A> command to enable the
 GPU package, unless the "-sf gpu" or "-pk gpu" <A HREF = "Section_start.html#start_7">command-line
 switches</A> were used.  It specifies the
 number of GPUs/node to use, as well as other options.
 </P>
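+<P>For example, a short input-script fragment that requests 2 GPUs per
+node and applies the "gpu" suffix to all supported styles could look
+like this (a sketch; the pair style and cutoff are placeholders):
+</P>
+<PRE>package gpu 2
+suffix gpu
+pair_style lj/cut 2.5 
+</PRE>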
 <P><B>Speed-ups to expect:</B>
 </P>
 <P>The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 </P>
 <P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
 LAMMPS web site for performance of the GPU package on various
 hardware, including the Titan HPC platform at ORNL.
 </P>
 <P>You should also experiment with how many MPI tasks per GPU to use to
 give the best performance for your problem and machine.  This is also
 a function of the problem size and the pair style being used.
 Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate
 results, since they will typically be faster.
 </P>
 <P><B>Guidelines for best performance:</B>
 </P>
 <UL><LI>Using multiple MPI tasks per GPU will often give the best performance,
 as allowed by most multi-core CPU/GPU configurations. 
 
 <LI>If the number of particles per MPI task is small (e.g. 100s of
 particles), it can be more efficient to run with fewer MPI tasks per
 GPU, even if you do not use all the cores on the compute node. 
 
 <LI>The <A HREF = "package.html">package gpu</A> command has several options for tuning
 performance.  Neighbor lists can be built on the GPU or CPU.  Force
 calculations can be dynamically balanced across the CPU cores and
 GPUs.  GPU-specific settings can be made which can be optimized
 for different hardware.  See the <A HREF = "package.html">package</A> command
 doc page for details. 
 
 <LI>As described by the <A HREF = "package.html">package gpu</A> command, GPU
 accelerated pair styles can perform computations asynchronously with
 CPU computations. The "Pair" time reported by LAMMPS will be the
 maximum of the time required to complete the CPU pair style
 computations and the time required to complete the GPU pair style
 computations. Any time spent for GPU-enabled pair styles for
 computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
 <A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
 <A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
 calculations will not be included in the "Pair" time. 
 
 <LI>When the <I>mode</I> setting for the package gpu command is force/neigh,
 the time for neighbor list calculations on the GPU will be added into
 the "Pair" time, not the "Neigh" time.  An additional breakdown of the
 times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc) are output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
 regardless of asynchronous CPU calculations. 
 
 <LI>The output section "GPU Time Info (average)" reports "Max Mem / Proc".
 This is the maximum memory used at one time on the GPU for data
 storage by a single MPI process. 
 </UL>
 <P><B>Restrictions:</B>
 </P>
 <P>None.
 </P>
 </HTML>
diff --git a/doc/accelerate_gpu.txt b/doc/accelerate_gpu.txt
index 607147408..e221e2342 100644
--- a/doc/accelerate_gpu.txt
+++ b/doc/accelerate_gpu.txt
@@ -1,237 +1,243 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.2 GPU package :h4
 
 The GPU package was developed by Mike Brown at ORNL and his
 collaborators, particularly Trung Nguyen (ORNL).  It provides GPU
 versions of many pair styles, including the 3-body Stillinger-Weber
 pair style, and for "kspace_style pppm"_kspace_style.html for
 long-range Coulombics.  It has the following general features:
 
 It is designed to exploit common GPU hardware configurations where one
 or more GPUs are coupled to many cores of one or more multi-core CPUs,
 e.g. within a node of a parallel machine. :ulb,l
 
 Atom-based data (e.g. coordinates, forces) moves back-and-forth
 between the CPU(s) and GPU every timestep. :l
 
 Neighbor lists can be built on the CPU or on the GPU :l
 
 The charge assignment and force interpolation portions of PPPM can be
 run on the GPU.  The FFT portion, which requires MPI communication
 between processors, runs on the CPU. :l
 
 Asynchronous force computations can be performed simultaneously on the
 CPU(s) and GPU. :l
 
 It allows for GPU computations to be performed in single or double
 precision, or in mixed-mode precision, where pairwise forces are
 computed in single precision, but accumulated into double-precision
 force vectors. :l
 
 LAMMPS-specific code is in the GPU package.  It makes calls to a
 generic GPU library in the lib/gpu directory.  This library provides
 NVIDIA support as well as more general OpenCL support, so that the
 same functionality can eventually be supported on a variety of GPU
 hardware. :l,ule
 
 Here is a quick overview of how to use the GPU package:
 
 build the library in lib/gpu for your GPU hardware with desired precision
 include the GPU package and build LAMMPS
 use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU
 specify the # of GPUs per node
 use GPU styles in your input script :ul
 
 The latter two steps can be done using the "-pk gpu" and "-sf gpu"
 "command-line switches"_Section_start.html#start_7 respectively.  Or
 the effect of the "-pk" or "-sf" switches can be duplicated by adding
 the "package gpu"_package.html or "suffix gpu"_suffix.html commands
 respectively to your input script.
 
 [Required hardware/software:]
 
 To use this package, you currently need to have an NVIDIA GPU and
 install the NVIDIA Cuda software on your system:
 
 Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information
 Go to http://www.nvidia.com/object/cuda_get.html
 Install a driver and toolkit appropriate for your system (SDK is not necessary)
 Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul
 
 [Building LAMMPS with the GPU package:]
 
 This requires two steps (a,b): build the GPU library, then build
 LAMMPS with the GPU package.
 
 (a) Build the GPU library
 
 The GPU library is in lammps/lib/gpu.  Select a Makefile.machine (in
 lib/gpu) appropriate for your system.  You should pay special
 attention to 3 settings in this makefile.
 
 CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system
 CUDA_ARCH = needs to be appropriate to your GPUs
 CUDA_PREC = precision (double, mixed, single) you desire :ul
 
 See lib/gpu/Makefile.linux.double for examples of the ARCH settings
 for different GPU choices, e.g. Fermi vs Kepler.  It also lists the
 possible precision settings:
 
 CUDA_PREC = -D_SINGLE_SINGLE  # single precision for all calculations
 CUDA_PREC = -D_DOUBLE_DOUBLE  # double precision for all calculations
 CUDA_PREC = -D_SINGLE_DOUBLE  # accumulation of forces, etc, in double :pre
 
 The last setting is the mixed mode referred to above.  Note that your
 GPU must support double precision to use either the 2nd or 3rd of
 these settings.
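+
+As a sketch, the three settings together might look like this in your
+Makefile.machine (the install path and the -arch value are assumptions;
+adjust them for your system and GPU):
+
+CUDA_HOME = /usr/local/cuda
+CUDA_ARCH = -arch=sm_35
+CUDA_PREC = -D_SINGLE_DOUBLE :pre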
 
 To build the library, type:
 
 make -f Makefile.machine :pre
 
 If successful, it will produce the files libgpu.a and Makefile.lammps.
 
 The latter file has 3 settings that need to be appropriate for the
 paths and settings for the CUDA system software on your machine.
 Makefile.lammps is a copy of the file specified by the EXTRAMAKE
 setting in Makefile.machine.  You can change EXTRAMAKE or create your
 own Makefile.lammps.machine if needed.
 
 Note that to change the precision of the GPU library, you need to
 re-build the entire library.  Do a "clean" first, e.g. "make -f
 Makefile.linux clean", followed by the make command above.
 
 (b) Build LAMMPS with the GPU package
 
 cd lammps/src
 make yes-gpu
 make machine :pre
 
 No additional compile/link flags are needed in your Makefile.machine
 in src/MAKE.
 
 Note that if you change the GPU library precision (discussed above)
 and rebuild the GPU library, then you also need to re-install the GPU
 package and re-build LAMMPS, so that all affected files are
 re-compiled and linked to the new GPU library.
 
 [Run with the GPU package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 
 When using the GPU package, you cannot assign more than one GPU to a
 single MPI task.  However multiple MPI tasks can share the same GPU,
 and in many cases it will be more efficient to run this way.  Likewise
 it may be more efficient to use fewer MPI tasks/node than the available
 # of CPU cores.  Assignment of multiple MPI tasks to a GPU will happen
 automatically if you create more MPI tasks/node than there are
 GPUs/node.  E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be
 shared by 4 MPI tasks.
 
 Use the "-sf gpu" "command-line switch"_Section_start.html#start_7,
 which will automatically append "gpu" to styles that support it.  Use
 the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to
 set Ng = # of GPUs/node to use.
 
 lmp_machine -sf gpu -pk gpu 1 -in in.script                         # 1 MPI task uses 1 GPU
 mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script           # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
 mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script   # ditto on 4 16-core nodes :pre
 
 Note that if the "-sf gpu" switch is used, it also issues a default
 "package gpu 1"_package.html command, which sets the number of
 GPUs/node to 1.
 
 Using the "-pk" switch explicitly allows for setting of the number of
 GPUs/node to use and additional options.  Its syntax is the same as
 the "package gpu" command.  See the "package"_package.html
 command doc page for details, including the default values used for
 all its options if it is not specified.
 
+Note that the default for the "package gpu"_package.html command is to
+set the Newton flag to "off" pairwise interactions.  It does not
+affect the setting for bonded interactions (LAMMPS default is "on").
+The "off" setting for pairwise interaction is currently required for
+GPU package pair styles.
+
 [Or run with the GPU package by editing an input script:]
 
 The discussion above for the mpirun/mpiexec command, MPI tasks/node,
 and use of multiple MPI tasks/GPU is the same.
 
 Use the "suffix gpu"_suffix.html command, or you can explicitly add an
 "gpu" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/gpu 2.5 :pre
 
 You must also use the "package gpu"_package.html command to enable the
 GPU package, unless the "-sf gpu" or "-pk gpu" "command-line
 switches"_Section_start.html#start_7 were used.  It specifies the
 number of GPUs/node to use, as well as other options.
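+
+For example, a short input-script fragment that requests 2 GPUs per
+node and applies the "gpu" suffix to all supported styles could look
+like this (a sketch; the pair style and cutoff are placeholders):
+
+package gpu 2
+suffix gpu
+pair_style lj/cut 2.5 :pre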
 
 [Speed-ups to expect:]
 
 The performance of a GPU versus a multi-core CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
 precision used on the GPU (double, single, mixed).
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the GPU package on various
 hardware, including the Titan HPC platform at ORNL.
 
 You should also experiment with how many MPI tasks per GPU to use to
 give the best performance for your problem and machine.  This is also
 a function of the problem size and the pair style being used.
 Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate
 results, since they will typically be faster.
 
 [Guidelines for best performance:]
 
 Using multiple MPI tasks per GPU will often give the best performance,
 as allowed by most multi-core CPU/GPU configurations. :ulb,l
 
 If the number of particles per MPI task is small (e.g. 100s of
 particles), it can be more efficient to run with fewer MPI tasks per
 GPU, even if you do not use all the cores on the compute node. :l
 
 The "package gpu"_package.html command has several options for tuning
 performance.  Neighbor lists can be built on the GPU or CPU.  Force
 calculations can be dynamically balanced across the CPU cores and
 GPUs.  GPU-specific settings can be made which can be optimized
 for different hardware.  See the "packakge"_package.html command
 doc page for details. :l
 
 As described by the "package gpu"_package.html command, GPU
 accelerated pair styles can perform computations asynchronously with
 CPU computations. The "Pair" time reported by LAMMPS will be the
 maximum of the time required to complete the CPU pair style
 computations and the time required to complete the GPU pair style
 computations. Any time spent for GPU-enabled pair styles for
 computations that run simultaneously with "bond"_bond_style.html,
 "angle"_angle_style.html, "dihedral"_dihedral_style.html,
 "improper"_improper_style.html, and "long-range"_kspace_style.html
 calculations will not be included in the "Pair" time. :l
 
 When the {mode} setting for the package gpu command is force/neigh,
 the time for neighbor list calculations on the GPU will be added into
 the "Pair" time, not the "Neigh" time.  An additional breakdown of the
 times required for various tasks on the GPU (data copy, neighbor
 calculations, force computations, etc) are output only with the LAMMPS
 screen output (not in the log file) at the end of each run.  These
 timings represent total time spent on the GPU for each routine,
 regardless of asynchronous CPU calculations. :l
 
 The output section "GPU Time Info (average)" reports "Max Mem / Proc".
 This is the maximum memory used at one time on the GPU for data
 storage by a single MPI process. :l,ule
 
 [Restrictions:]
 
 None.
diff --git a/doc/accelerate_kokkos.html b/doc/accelerate_kokkos.html
index 6e73d407b..4192df77c 100644
--- a/doc/accelerate_kokkos.html
+++ b/doc/accelerate_kokkos.html
@@ -1,426 +1,438 @@
 <HTML>
 <CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
 <A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A> 
 </CENTER>
 
 
 
 
 
 
 <HR>
 
 <P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
 </P>
 <H4>5.3.4 KOKKOS package 
 </H4>
 <P>The KOKKOS package was developed primarily by Christian Trott
 (Sandia) with contributions of various styles by others, including
 Sikandar Mashayak (UIUC).  The underlying Kokkos library was written
 primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
 Sandia).
 </P>
 <P>The KOKKOS package contains versions of pair, fix, and atom styles
 that use data structures and macros provided by the Kokkos library,
 which is included with LAMMPS in lib/kokkos.
 </P>
 <P>The Kokkos library is part of
 <A HREF = "http://trilinos.sandia.gov/packages/kokkos">Trilinos</A> and is a
 templated C++ library that provides two key abstractions for an
 application like LAMMPS.  First, it allows a single implementation of
 an application kernel (e.g. a pair style) to run efficiently on
 different kinds of hardware, such as a GPU, Intel Phi, or many-core
 chip.
 </P>
 <P>The Kokkos library also provides data abstractions to adjust (at
 compile time) the memory layout of basic data structures like 2d and
 3d arrays and allow the transparent utilization of special hardware
 load and store operations.  Such data structures are used in LAMMPS to
 store atom coordinates or forces or neighbor lists.  The layout is
 chosen to optimize performance on different platforms.  Again this
 functionality is hidden from the developer, and does not affect how
 the kernel is coded.
 </P>
 <P>These abstractions are set at build time, when LAMMPS is compiled with
 the KOKKOS package installed.  This is done by selecting a "host" and
 "device" to build for, compatible with the compute nodes in your
 machine (one on a desktop machine or 1000s on a supercomputer).
 </P>
 <P>All Kokkos operations occur within the context of an individual MPI
 task running on a single node of the machine.  The total number of MPI
 tasks used by LAMMPS (one or multiple per compute node) is set in the
 usual manner via the mpirun or mpiexec commands, and is independent of
 Kokkos.
 </P>
 <P>Kokkos provides support for two different modes of execution per MPI
 task.  This means that computational tasks (pairwise interactions,
 neighbor list builds, time integration, etc) can be parallelized for
 one or the other of the two modes.  The first mode is called the
 "host" and is one or more threads running on one or more physical CPUs
 (within the node).  Currently, both multi-core CPUs and an Intel Phi
 processor (running in native mode, not offload mode like the
 USER-INTEL package) are supported.  The second mode is called the
 "device" and is an accelerator chip of some kind.  Currently only an
 NVIDIA GPU is supported.  If your compute node does not have a GPU,
 then there is only one mode of execution, i.e. the host and device are
 the same.
 </P>
 <P>Here is a quick overview of how to use the KOKKOS package
 for GPU acceleration:
 </P>
 <UL><LI>specify variables and settings in your Makefile.machine that enable GPU, Phi, or OpenMP support
 <LI>include the KOKKOS package and build LAMMPS
 <LI>enable the KOKKOS package and its hardware options via the "-k on" command-line switch
 <LI>use KOKKOS styles in your input script 
 </UL>
 <P>The latter two steps can be done using the "-k on", "-pk kokkos" and
 "-sf kk" <A HREF = "Section_start.html#start_7">command-line switches</A>
 respectively.  Or the effect of the "-pk" or "-sf" switches can be
 duplicated by adding the <A HREF = "package.html">package kokkos</A> or <A HREF = "suffix.html">suffix
 kk</A> commands respectively to your input script.
 </P>
 <P><B>Required hardware/software:</B>
 </P>
 <P>The KOKKOS package can be used to build and run LAMMPS on the
 following kinds of hardware:
 </P>
 <UL><LI>CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
 <LI>CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
 <LI>Phi: on one or more Intel Phi coprocessors (per node)
 <LI>GPU: on the GPUs of a node with additional OpenMP threading on the CPUs 
 </UL>
 <P>Note that Intel Xeon Phi coprocessors are supported in "native" mode,
 not "offload" mode like the USER-INTEL package supports.
 </P>
 <P>Only NVIDIA GPUs are currently supported.
 </P>
 <P>IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
 you must have Kepler generation GPUs (or later).  The Kokkos library
 exploits texture cache options not supported by Tesla generation GPUs
 (or older).
 </P>
 <P>To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
 installed on your system.  See the discussion above for the USER-CUDA
 and GPU packages for details of how to check and do this.
 </P>
 <P><B>Building LAMMPS with the KOKKOS package:</B>
 </P>
 <P>Unlike other acceleration packages discussed in this section, the
 Kokkos library in lib/kokkos does not have to be pre-built before
 building LAMMPS itself.  Instead, options for the Kokkos library are
 specified at compile time, when LAMMPS itself is built.  This can be
 done in one of two ways, as discussed below.
 </P>
 <P>Here are examples of how to build LAMMPS for the different compute-node
 configurations listed above.
 </P>
 <P>CPU-only (run all-MPI or with OpenMP threading):
 </P>
 <PRE>cd lammps/src
 make yes-kokkos
 make g++ OMP=yes 
 </PRE>
 <P>Intel Xeon Phi:
 </P>
 <PRE>cd lammps/src
 make yes-kokkos
 make g++ OMP=yes MIC=yes 
 </PRE>
 <P>CPUs and GPUs:
 </P>
 <PRE>cd lammps/src
 make yes-kokkos
 make cuda CUDA=yes 
 </PRE>
 <P>These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
 make command line which requires a GNU-compatible make command.  Try
 "gmake" if your system's standard make complains.  
 </P>
 <P>IMPORTANT NOTE: If you build using make line variables and re-build
 LAMMPS twice with different KOKKOS options and the *same* target,
 e.g. g++ in the first two examples above, then you *must* perform a
 "make clean-all" or "make clean-machine" before each build.  This is
 to force all the KOKKOS-dependent files to be re-compiled with the new
 options.
 </P>
 <P>You can also hardwire these make variables in the specified machine
 makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
 with a line like:
 </P>
 <PRE>MIC = yes 
 </PRE>
 <P>Note that if you build LAMMPS multiple times in this manner, using
 different KOKKOS options (defined in different machine makefiles), you
 do not have to worry about doing a "clean" in between.  This is
 because the targets will be different.
 </P>
 <P>IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different
 machine makefile, in this case src/MAKE/Makefile.cuda, which is
 included in the LAMMPS distribution.  To build the KOKKOS package for
 a GPU, this makefile must use the NVIDIA "nvcc" compiler.  And it must
 have a CCFLAGS -arch setting that is appropriate for your NVIDIA
 hardware and installed software.  Typical values for -arch are given
 in <A HREF = "Section_start.html#start_3_4">Section 2.3.4</A> of the manual, as well
 as other settings that must be included in the machine makefile, if
 you create your own.
 </P>
 <P>There are other allowed options when building with the KOKKOS package.
 As above, they can be set either as variables on the make command line
 or in the machine makefile in the src/MAKE directory.  See <A HREF = "Section_start.html#start_3_4">Section
 2.3.4</A> of the manual for details.
 </P>
 <P>IMPORTANT NOTE: Currently, there are no precision options with the
 KOKKOS package.  All compilation and computation is performed in
 double precision.
 </P>
 <P><B>Run with the KOKKOS package from the command line:</B>
 </P>
 <P>The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 </P>
 <P>When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
 switch discussed below).  Note that the product of MPI tasks * OpenMP
 threads/task should not exceed the physical number of cores (on a
 node), otherwise performance will suffer.
 </P>
 <P>When using the KOKKOS package built with device=CUDA, you must use
 exactly one MPI task per physical GPU.
 </P>
 <P>When using the KOKKOS package built with host=MIC for Intel Xeon Phi
 coprocessor support, you need to ensure there are one or more MPI tasks
 per coprocessor, and choose the number of coprocessor threads to use
 per MPI task (via the "-k" command-line switch discussed below).  The
 product of MPI tasks * coprocessor threads/task should not exceed the
 maximum number of threads the coprocessor is designed to run,
 otherwise performance will suffer.  This value is 240 for current
 generation Xeon Phi(TM) chips, which is 60 physical cores * 4
 threads/core.  Note that with the KOKKOS package you do not need to
 specify how many Phi coprocessors there are per node; each
 coprocessor is simply treated as running some number of MPI tasks.
 </P>
 <P>You must use the "-k on" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to enable the KOKKOS package.  It
 takes additional arguments for hardware settings appropriate to your
 system.  Those arguments are <A HREF = "Section_start.html#start_7">documented
 here</A>.  The two most commonly used
 options are:
 </P>
 <PRE>-k on t Nt g Ng 
 </PRE>
 <P>The "t Nt" option applies to host=OMP (even if device=CUDA) and
 host=MIC.  For host=OMP, it specifies how many OpenMP threads per MPI
 task to use within a node.  For host=MIC, it specifies how many Xeon Phi
 threads per MPI task to use within a node.  The default is Nt = 1.
 Note that for host=OMP this is effectively MPI-only mode which may be
 fine.  But for host=MIC you will typically end up using far less than
 all the 240 available threads, which could give very poor performance.
 </P>
 <P>The "g Ng" option applies to device=CUDA.  It specifies how many GPUs
 per compute node to use.  The default is 1, so this only needs to be
 specified if you have 2 or more GPUs per compute node.
 </P>
 <P>The "-k on" switch also issues a "package kokkos" command (with no
 additional arguments) which sets various KOKKOS options to default
 values, as discussed on the <A HREF = "package.html">package</A> command doc page.
 </P>
 <P>Use the "-sf kk" <A HREF = "Section_start.html#start_7">command-line switch</A>,
 which will automatically append "kk" to styles that support it.  Use
 the "-pk kokkos" <A HREF = "Section_start.html#start_7">command-line switch</A> if
 you wish to change any of the default <A HREF = "package.html">package kokkos</A>
 options set by the "-k on" <A HREF = "Section_start.html#start_7">command-line
 switch</A>.
 </P>
 <PRE>host=OMP, dual hex-core nodes (12 threads/node):
 mpirun -np 12 lmp_g++ -in in.lj                           # MPI-only mode with no Kokkos
 mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj              # MPI-only mode with Kokkos
 mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj          # one MPI task, 12 threads
 mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj           # two MPI tasks, 6 threads/task 
 mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj   # ditto on 16 nodes 
 </PRE>
 <PRE>host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
 mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
 mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
 mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
 mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis 
 </PRE>
 <PRE>host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
 mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
 mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes 
 </PRE>
 <PRE>host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
 mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
 mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes 
 </PRE>
+<P>Note that the default for the <A HREF = "package.html">package kokkos</A> command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions.  This typically gives fastest
+performance.  If the <A HREF = "newton.html">newton</A> command is used in the input
+script, it can override the Newton flag defaults.
+</P>
+<P>However, when running in MPI-only mode with 1 thread per MPI task, it
+will typically be faster to use "half" neighbor lists and set the
+Newton flag to "on", just as is the case for non-accelerated pair
+styles.  You can do this with the "-pk" <A HREF = "Section_start.html#start_7">command-line
+switch</A>.
+</P>
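+<P>For example, assuming the same dual hex-core nodes as above and that
+the relevant <A HREF = "package.html">package kokkos</A> keywords are <I>neigh</I> and
+<I>newton</I>, a sketch of such an MPI-only run would be:
+</P>
+<PRE>mpirun -np 12 lmp_g++ -k on -sf kk -pk kokkos neigh half newton on -in in.lj 
+</PRE>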
 <P><B>Or run with the KOKKOS package by editing an input script:</B>
 </P>
 <P>The discussion above for the mpirun/mpiexec command and setting
 appropriate thread and GPU values for host=OMP or host=MIC or
 device=CUDA are the same.
 </P>
 <P>You must still use the "-k on" <A HREF = "Section_start.html#start_7">command-line
 switch</A> to enable the KOKKOS package, and
 specify its additional arguments for hardware options appropriate to
 your system, as documented above.
 </P>
 <P>Use the <A HREF = "suffix.html">suffix kk</A> command, or you can explicitly add a
 "kk" suffix to individual styles in your input script, e.g.
 </P>
 <PRE>pair_style lj/cut/kk 2.5 
 </PRE>
 <P>You only need to use the <A HREF = "package.html">package kokkos</A> command if you
 wish to change any of its option defaults, as set by the "-k on"
 <A HREF = "Section_start.html#start_7">command-line switch</A>.
 </P>
 <P><B>Speed-ups to expect:</B>
 </P>
 <P>The performance of KOKKOS running in different modes is a function of
 your hardware, which KOKKOS-enabled styles are used, and the problem
 size.
 </P>
 <P>Generally speaking, the following rules of thumb apply:
 </P>
 <UL><LI>When running on CPUs only, with a single thread per MPI task,
 performance of a KOKKOS style is somewhere between the standard
 (un-accelerated) styles (MPI-only mode), and those provided by the
 USER-OMP package.  However the difference between all 3 is small (less
 than 20%). 
 
 <LI>When running on CPUs only, with multiple threads per MPI task,
 performance of a KOKKOS style is a bit slower than the USER-OMP
 package. 
 
 <LI>When running on GPUs, KOKKOS is typically faster than the USER-CUDA
 and GPU packages. 
 
 <LI>When running on Intel Xeon Phi, KOKKOS is not as fast as
 the USER-INTEL package, which is optimized for that hardware. 
 </UL>
 <P>See the <A HREF = "http://lammps.sandia.gov/bench.html">Benchmark page</A> of the
 LAMMPS web site for performance of the KOKKOS package on different
 hardware.
 </P>
 <P><B>Guidelines for best performance:</B>
 </P>
 <P>Here are guidelines for using the KOKKOS package on the different
 hardware configurations listed above.
 </P>
 <P>Many of the guidelines use the <A HREF = "package.html">package kokkos</A> command.
 See its doc page for details and default settings.  Experimenting with
 its options can provide a speed-up for specific calculations.
 </P>
 <P><B>Running on a multi-core CPU:</B>
 </P>
 <P>If N is the number of physical cores/node, then the number of MPI
 tasks/node * number of threads/task should not exceed N, and should
 typically equal N.  Note that the default threads/task is 1, as set by
 the "t" keyword of the "-k" <A HREF = "Section_start.html#start_7">command-line
 switch</A>.  If you do not change this, no
 additional parallelism (beyond MPI) will be invoked on the host
 CPU(s).
 </P>
 <P>You can compare the performance running in different modes:
 </P>
 <UL><LI>run with 1 MPI task/node and N threads/task
 <LI>run with N MPI tasks/node and 1 thread/task
 <LI>run with settings in between these extremes 
 </UL>
 <P>Examples of mpirun commands in these modes are shown above.
 </P>
 <P>When using KOKKOS to perform multi-threading, it is important for
 performance to bind both MPI tasks to physical cores, and threads to
 physical cores, so they do not migrate during a simulation.
 </P>
 <P>If you are not certain MPI tasks are being bound (check the defaults
 for your MPI installation), binding can be forced with these flags:
 </P>
 <PRE>OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
 Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... 
 </PRE>
 <P>For binding threads with the KOKKOS OMP option, use thread affinity
 environment variables to force binding.  With OpenMP 3.1 (gcc 4.7 or
 later, Intel 12 or later) setting the environment variable
 OMP_PROC_BIND=true should be sufficient.  For binding threads with the
 KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option, as
 discussed in <A HREF = "Section_start.html#start_3_4">Section 2.3.4</A> of the
 manual.
 </P>
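+<P>For example, combining task and thread binding on a dual-socket node
+might look as follows; the flags and thread count are only illustrative
+and should be adjusted for your MPI installation and node layout.
+</P>
+<PRE>export OMP_PROC_BIND=true
+mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj 
+</PRE>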
 <P><B>Running on GPUs:</B>
 </P>
 <P>Ensure the -arch setting in the machine makefile you are using,
 e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
 (see <A HREF = "Section_start.html#start_3_4">this section</A> of the manual for
 details).
 </P>
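+<P>For example, a Kepler-class card might use a CCFLAGS line similar to
+the following in the machine makefile; the actual -arch value depends
+on your GPU and installed CUDA version.
+</P>
+<PRE>CCFLAGS = -O3 -arch=sm_35 
+</PRE>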
 <P>The -np setting of the mpirun command should set the number of MPI
 tasks/node to be equal to the # of physical GPUs on the node. 
 </P>
 <P>Use the "-k" <A HREF = "Section_start.html#start_7">command-line switch</A> to
 specify the number of GPUs per node, and the number of threads per MPI
 task.  As above for multi-core CPUs (and no GPU), if N is the number
 of physical cores/node, then the number of MPI tasks/node * number of
 threads/task should not exceed N.  With one GPU (and one MPI task) it
 may be faster to use less than all the available cores, by setting
 threads/task to a smaller value.  This is because using all the cores
 on a dual-socket node will incur extra cost to copy memory from the
 2nd socket to the GPU.
 </P>
 <P>Examples of mpirun commands that follow these rules are shown above.
 </P>
 <P>IMPORTANT NOTE: When using a GPU, you will achieve the best
 performance if your input script does not use any fix or compute
 styles which are not yet Kokkos-enabled.  This allows data to stay on
 the GPU for multiple timesteps, without being copied back to the host
 CPU.  Invoking a non-Kokkos fix or compute, or performing I/O for
 <A HREF = "thermo_style.html">thermo</A> or <A HREF = "dump.html">dump</A> output will cause data
 to be copied back to the CPU.
 </P>
 <P>You cannot yet assign multiple MPI tasks to the same GPU with the
 KOKKOS package.  We plan to support this in the future, similar to the
 GPU package in LAMMPS.
 </P>
 <P>You cannot yet use both the host (multi-threaded) and device (GPU)
 together to compute pairwise interactions with the KOKKOS package.  We
 hope to support this in the future, similar to the GPU package in
 LAMMPS.
 </P>
 <P><B>Running on an Intel Phi:</B>
 </P>
 <P>Kokkos only uses Intel Phi processors in their "native" mode, i.e.
 not hosted by a CPU.
 </P>
 <P>As illustrated above, build LAMMPS with OMP=yes (the default) and
 MIC=yes.  The latter ensures code is correctly compiled for the Intel
 Phi.  The OMP setting means OpenMP will be used for parallelization on
 the Phi, which is currently the best option within Kokkos.  In the
 future, other options may be added.
 </P>
 <P>Current-generation Intel Phi chips have either 61 or 57 cores.  One
 core should be excluded for running the OS, leaving 60 or 56 cores.
 Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
 N = 224 (4*56) cores to run on.
 </P>
 <P>The -np setting of the mpirun command sets the number of MPI
 tasks/node.  The "-k on t Nt" command-line switch sets the number of
 threads/task as Nt.  The product of these 2 values should be N, i.e.
 240 or 224.  Also, the number of threads/task should be a multiple of
 4 so that logical threads from more than one MPI task do not run on
 the same physical core.
 </P>
 <P>Examples of mpirun commands that follow these rules are shown above.
 </P>
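+<P>Other combinations that satisfy both rules on a 61-core Phi (product
+= 240, threads/task a multiple of 4) include, for illustration:
+</P>
+<PRE>mpirun -np 60 lmp_g++ -k on t 4 -sf kk -in in.lj    # 60*4 = 240
+mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj   # 15*16 = 240 
+</PRE>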
 <P><B>Restrictions:</B>
 </P>
 <P>As noted above, if using GPUs, the number of MPI tasks per compute
 node should equal the number of GPUs per compute node.  In the
 future Kokkos will support assigning multiple MPI tasks to a single
 GPU.
 </P>
 <P>Currently Kokkos does not support AMD GPUs due to limits in the
 available backend programming models.  Specifically, Kokkos requires
 extensive C++ support from the backend's kernel language.  This is expected to
 change in the future.
 </P>
 </HTML>
diff --git a/doc/accelerate_kokkos.txt b/doc/accelerate_kokkos.txt
index b01ed1366..b8dbcd0e0 100644
--- a/doc/accelerate_kokkos.txt
+++ b/doc/accelerate_kokkos.txt
@@ -1,421 +1,433 @@
 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
 
 :link(lws,http://lammps.sandia.gov)
 :link(ld,Manual.html)
 :link(lc,Section_commands.html#comm)
 
 :line
 
 "Return to Section accelerate overview"_Section_accelerate.html
 
 5.3.4 KOKKOS package :h4
 
 The KOKKOS package was developed primarily by Christian Trott
 (Sandia) with contributions of various styles by others, including
 Sikandar Mashayak (UIUC).  The underlying Kokkos library was written
 primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all
 Sandia).
 
 The KOKKOS package contains versions of pair, fix, and atom styles
 that use data structures and macros provided by the Kokkos library,
 which is included with LAMMPS in lib/kokkos.
 
 The Kokkos library is part of
 "Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a
 templated C++ library that provides two key abstractions for an
 application like LAMMPS.  First, it allows a single implementation of
 an application kernel (e.g. a pair style) to run efficiently on
 different kinds of hardware, such as a GPU, Intel Phi, or many-core
 chip.
 
 The Kokkos library also provides data abstractions to adjust (at
 compile time) the memory layout of basic data structures like 2d and
 3d arrays and allow the transparent utilization of special hardware
 load and store operations.  Such data structures are used in LAMMPS to
 store atom coordinates or forces or neighbor lists.  The layout is
 chosen to optimize performance on different platforms.  Again this
 functionality is hidden from the developer, and does not affect how
 the kernel is coded.
 
 These abstractions are set at build time, when LAMMPS is compiled with
 the KOKKOS package installed.  This is done by selecting a "host" and
 "device" to build for, compatible with the compute nodes in your
 machine (one on a desktop machine or 1000s on a supercomputer).
 
 All Kokkos operations occur within the context of an individual MPI
 task running on a single node of the machine.  The total number of MPI
 tasks used by LAMMPS (one or multiple per compute node) is set in the
 usual manner via the mpirun or mpiexec commands, and is independent of
 Kokkos.
 
 Kokkos provides support for two different modes of execution per MPI
 task.  This means that computational tasks (pairwise interactions,
 neighbor list builds, time integration, etc) can be parallelized for
 one or the other of the two modes.  The first mode is called the
 "host" and is one or more threads running on one or more physical CPUs
 (within the node).  Currently, both multi-core CPUs and an Intel Phi
 processor (running in native mode, not offload mode like the
 USER-INTEL package) are supported.  The second mode is called the
 "device" and is an accelerator chip of some kind.  Currently only an
 NVIDIA GPU is supported.  If your compute node does not have a GPU,
 then there is only one mode of execution, i.e. the host and device are
 the same.
 
 Here is a quick overview of how to use the KOKKOS package
 for GPU acceleration:
 
 specify variables and settings in your Makefile.machine that enable GPU, Phi, or OpenMP support
 include the KOKKOS package and build LAMMPS
 enable the KOKKOS package and its hardware options via the "-k on" command-line switch
 use KOKKOS styles in your input script :ul
 
 The latter two steps can be done using the "-k on", "-pk kokkos" and
 "-sf kk" "command-line switches"_Section_start.html#start_7
 respectively.  Or the effect of the "-pk" or "-sf" switches can be
 duplicated by adding the "package kokkos"_package.html or "suffix
 kk"_suffix.html commands respectively to your input script.
 
 [Required hardware/software:]
 
 The KOKKOS package can be used to build and run LAMMPS on the
 following kinds of hardware:
 
 CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles)
 CPU-only: one or a few MPI tasks per node with additional threading via OpenMP
 Phi: on one or more Intel Phi coprocessors (per node)
 GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul
 
 Note that Intel Xeon Phi coprocessors are supported in "native" mode,
 not "offload" mode like the USER-INTEL package supports.
 
 Only NVIDIA GPUs are currently supported.
 
 IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs,
 you must have Kepler generation GPUs (or later).  The Kokkos library
 exploits texture cache options not supported by Tesla generation GPUs
 (or older).
 
 To build the KOKKOS package for GPUs, NVIDIA Cuda software must be
 installed on your system.  See the discussion above for the USER-CUDA
 and GPU packages for details of how to check and do this.
 
 [Building LAMMPS with the KOKKOS package:]
 
 Unlike other acceleration packages discussed in this section, the
 Kokkos library in lib/kokkos does not have to be pre-built before
 building LAMMPS itself.  Instead, options for the Kokkos library are
 specified at compile time, when LAMMPS itself is built.  This can be
 done in one of two ways, as discussed below.
 
 Here are examples of how to build LAMMPS for the different compute-node
 configurations listed above.
 
 CPU-only (run all-MPI or with OpenMP threading):
 
 cd lammps/src
 make yes-kokkos
 make g++ OMP=yes :pre
 
 Intel Xeon Phi:
 
 cd lammps/src
 make yes-kokkos
 make g++ OMP=yes MIC=yes :pre
 
 CPUs and GPUs:
 
 cd lammps/src
 make yes-kokkos
 make cuda CUDA=yes :pre
 
 These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
 make command line, which requires a GNU-compatible make command.  Try
 "gmake" if your system's standard make complains.  
 
 IMPORTANT NOTE: If you build using make line variables and re-build
 LAMMPS twice with different KOKKOS options and the *same* target,
 e.g. g++ in the first two examples above, then you *must* perform a
 "make clean-all" or "make clean-machine" before each build.  This is
 to force all the KOKKOS-dependent files to be re-compiled with the new
 options.
 
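+For example, switching the same g++ target from an OpenMP-only build to
+an OpenMP plus Phi build might look like this:
+
+cd lammps/src
+make g++ OMP=yes                # first build, OpenMP only
+make clean-all                  # wipe KOKKOS-dependent objects before changing options
+make g++ OMP=yes MIC=yes :pre
+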
 You can also hardwire these make variables in the specified machine
 makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above,
 with a line like:
 
 MIC = yes :pre
 
 Note that if you build LAMMPS multiple times in this manner, using
 different KOKKOS options (defined in different machine makefiles), you
 do not have to worry about doing a "clean" in between.  This is
 because the targets will be different.
 
 IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different
 machine makefile, in this case src/MAKE/Makefile.cuda, which is
 included in the LAMMPS distribution.  To build the KOKKOS package for
 a GPU, this makefile must use the NVIDIA "nvcc" compiler.  And it must
 have a CCFLAGS -arch setting that is appropriate for your NVIDIA
 hardware and installed software.  Typical values for -arch are given
 in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well
 as other settings that must be included in the machine makefile, if
 you create your own.
 
 There are other allowed options when building with the KOKKOS package.
 As above, they can be set either as variables on the make command line
 or in the machine makefile in the src/MAKE directory.  See "Section
 2.3.4"_Section_start.html#start_3_4 of the manual for details.
 
 IMPORTANT NOTE: Currently, there are no precision options with the
 KOKKOS package.  All compilation and computation is performed in
 double precision.
 
 [Run with the KOKKOS package from the command line:]
 
 The mpirun or mpiexec command sets the total number of MPI tasks used
 by LAMMPS (one or multiple per compute node) and the number of MPI
 tasks used per node.  E.g. the mpirun command does this via its -np
 and -ppn switches.
 
 When using KOKKOS built with host=OMP, you need to choose how many
 OpenMP threads per MPI task will be used (via the "-k" command-line
 switch discussed below).  Note that the product of MPI tasks * OpenMP
 threads/task should not exceed the physical number of cores (on a
 node), otherwise performance will suffer.
 
 When using the KOKKOS package built with device=CUDA, you must use
 exactly one MPI task per physical GPU.
 
 When using the KOKKOS package built with host=MIC for Intel Xeon Phi
 coprocessor support, you need to ensure there are one or more MPI tasks
 per coprocessor, and choose the number of coprocessor threads to use
 per MPI task (via the "-k" command-line switch discussed below).  The
 product of MPI tasks * coprocessor threads/task should not exceed the
 maximum number of threads the coprocessor is designed to run,
 otherwise performance will suffer.  This value is 240 for current
 generation Xeon Phi(TM) chips, which is 60 physical cores * 4
 threads/core.  Note that with the KOKKOS package you do not need to
 specify how many Phi coprocessors there are per node; each
 coprocessor is simply treated as running some number of MPI tasks.
 
 You must use the "-k on" "command-line
 switch"_Section_start.html#start_7 to enable the KOKKOS package.  It
 takes additional arguments for hardware settings appropriate to your
 system.  Those arguments are "documented
 here"_Section_start.html#start_7.  The two most commonly used
 options are:
 
 -k on t Nt g Ng :pre
 
 The "t Nt" option applies to host=OMP (even if device=CUDA) and
 host=MIC.  For host=OMP, it specifies how many OpenMP threads per MPI
 task to use within a node.  For host=MIC, it specifies how many Xeon Phi
 threads per MPI task to use within a node.  The default is Nt = 1.
 Note that for host=OMP this is effectively MPI-only mode, which may be
 fine.  But for host=MIC you will typically end up using far less than
 all the 240 available threads, which could give very poor performance.
 
 The "g Ng" option applies to device=CUDA.  It specifies how many GPUs
 per compute node to use.  The default is 1, so this only needs to be
 specified if you have 2 or more GPUs per compute node.
 
 The "-k on" switch also issues a "package kokkos" command (with no
 additional arguments) which sets various KOKKOS options to default
 values, as discussed on the "package"_package.html command doc page.
 
 Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
 which will automatically append "kk" to styles that support it.  Use
 the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
 you wish to change any of the default "package kokkos"_package.html
 options set by the "-k on" "command-line
 switch"_Section_start.html#start_7.
 
 host=OMP, dual hex-core nodes (12 threads/node):
 mpirun -np 12 lmp_g++ -in in.lj                           # MPI-only mode with no Kokkos
 mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj              # MPI-only mode with Kokkos
 mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj          # one MPI task, 12 threads
 mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj           # two MPI tasks, 6 threads/task 
 mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj   # ditto on 16 nodes :pre
 
 host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
 mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
 mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
 mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
 mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre
 
 host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
 mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
 mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre
 
 host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
 mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
 mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre
 
+Note that the default for the "package kokkos"_package.html command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions.  This typically gives the fastest
+performance.  If the "newton"_newton.html command is used in the input
+script, it can override the Newton flag defaults.
+
+However, when running in MPI-only mode with 1 thread per MPI task, it
+will typically be faster to use "half" neighbor lists and set the
+Newton flag to "on", just as is the case for non-accelerated pair
+styles.  You can do this with the "-pk" "command-line
+switch"_Section_start.html#start_7.
+
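+For example, on the dual hex-core node from the examples above, the
+switch might look as follows.  The keyword/value pairs shown here are
+only meant as an illustration; see the "package"_package.html doc page
+for the exact keywords and allowed values.
+
+mpirun -np 12 lmp_g++ -k on -sf kk -pk kokkos newton on neigh half -in in.lj :pre
+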
 [Or run with the KOKKOS package by editing an input script:]
 
 The discussion above for the mpirun/mpiexec command and setting
 appropriate thread and GPU values for host=OMP or host=MIC or
 device=CUDA is the same.
 
 You must still use the "-k on" "command-line
 switch"_Section_start.html#start_7 to enable the KOKKOS package, and
 specify its additional arguments for hardware options appropriate to
 your system, as documented above.
 
 Use the "suffix kk"_suffix.html command, or you can explicitly add a
 "kk" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/kk 2.5 :pre
 
 You only need to use the "package kokkos"_package.html command if you
 wish to change any of its option defaults, as set by the "-k on"
 "command-line switch"_Section_start.html#start_7.
 
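+For example, a minimal input-script fragment that overrides those
+defaults and appends the "kk" suffix could look like the lines below.
+The keyword/value pairs are only illustrative; see the
+"package"_package.html doc page for the full list and defaults.
+
+package kokkos newton on neigh half
+suffix kk
+pair_style lj/cut 2.5 :pre
+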
 [Speed-ups to expect:]
 
 The performance of KOKKOS running in different modes is a function of
 your hardware, which KOKKOS-enabled styles are used, and the problem
 size.
 
 Generally speaking, the following rules of thumb apply:
 
 When running on CPUs only, with a single thread per MPI task,
 performance of a KOKKOS style is somewhere between the standard
 (un-accelerated) styles (MPI-only mode), and those provided by the
 USER-OMP package.  However, the difference between all 3 is small (less
 than 20%). :ulb,l
 
 When running on CPUs only, with multiple threads per MPI task,
 performance of a KOKKOS style is a bit slower than the USER-OMP
 package. :l
 
 When running on GPUs, KOKKOS is typically faster than the USER-CUDA
 and GPU packages. :l
 
 When running on Intel Xeon Phi, KOKKOS is not as fast as
 the USER-INTEL package, which is optimized for that hardware. :l,ule
 
 See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the KOKKOS package on different
 hardware.
 
 [Guidelines for best performance:]
 
 Here are guidelines for using the KOKKOS package on the different
 hardware configurations listed above.
 
 Many of the guidelines use the "package kokkos"_package.html command.
 See its doc page for details and default settings.  Experimenting with
 its options can provide a speed-up for specific calculations.
 
 [Running on a multi-core CPU:]
 
 If N is the number of physical cores/node, then the number of MPI
 tasks/node * number of threads/task should not exceed N, and should
 typically equal N.  Note that the default threads/task is 1, as set by
 the "t" keyword of the "-k" "command-line
 switch"_Section_start.html#start_7.  If you do not change this, no
 additional parallelism (beyond MPI) will be invoked on the host
 CPU(s).
 
 You can compare the performance running in different modes:
   
 run with 1 MPI task/node and N threads/task
 run with N MPI tasks/node and 1 thread/task
 run with settings in between these extremes :ul
 
 Examples of mpirun commands in these modes are shown above.
 
 When using KOKKOS to perform multi-threading, it is important for
 performance to bind both MPI tasks to physical cores, and threads to
 physical cores, so they do not migrate during a simulation.
 
 If you are not certain MPI tasks are being bound (check the defaults
 for your MPI installation), binding can be forced with these flags:
 
 OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
 Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
 
 For binding threads with the KOKKOS OMP option, use thread affinity
 environment variables to force binding.  With OpenMP 3.1 (gcc 4.7 or
 later, Intel 12 or later) setting the environment variable
 OMP_PROC_BIND=true should be sufficient.  For binding threads with the
 KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option, as
 discussed in "Section 2.3.4"_Section_start.html#start_3_4 of the
 manual.
 
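+For example, combining task and thread binding on a dual-socket node
+might look as follows; the flags and thread count are only illustrative
+and should be adjusted for your MPI installation and node layout.
+
+export OMP_PROC_BIND=true
+mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj :pre
+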
 [Running on GPUs:]
 
 Ensure the -arch setting in the machine makefile you are using,
 e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software
 (see "this section"_Section_start.html#start_3_4 of the manual for
 details).
 
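+For example, a Kepler-class card might use a CCFLAGS line similar to
+the following in the machine makefile; the actual -arch value depends
+on your GPU and installed CUDA version.
+
+CCFLAGS = -O3 -arch=sm_35 :pre
+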
 The -np setting of the mpirun command should set the number of MPI
 tasks/node to be equal to the # of physical GPUs on the node. 
 
 Use the "-k" "command-line switch"_Section_start.html#start_7 to
 specify the number of GPUs per node, and the number of threads per MPI
 task.  As above for multi-core CPUs (and no GPU), if N is the number
 of physical cores/node, then the number of MPI tasks/node * number of
 threads/task should not exceed N.  With one GPU (and one MPI task) it
 may be faster to use less than all the available cores, by setting
 threads/task to a smaller value.  This is because using all the cores
 on a dual-socket node will incur extra cost to copy memory from the
 2nd socket to the GPU.
 
 Examples of mpirun commands that follow these rules are shown above.
 
 IMPORTANT NOTE: When using a GPU, you will achieve the best
 performance if your input script does not use any fix or compute
 styles which are not yet Kokkos-enabled.  This allows data to stay on
 the GPU for multiple timesteps, without being copied back to the host
 CPU.  Invoking a non-Kokkos fix or compute, or performing I/O for
 "thermo"_thermo_style.html or "dump"_dump.html output will cause data
 to be copied back to the CPU.
 
 You cannot yet assign multiple MPI tasks to the same GPU with the
 KOKKOS package.  We plan to support this in the future, similar to the
 GPU package in LAMMPS.
 
 You cannot yet use both the host (multi-threaded) and device (GPU)
 together to compute pairwise interactions with the KOKKOS package.  We
 hope to support this in the future, similar to the GPU package in
 LAMMPS.
 
 [Running on an Intel Phi:]
 
 Kokkos only uses Intel Phi processors in their "native" mode, i.e.
 not hosted by a CPU.
 
 As illustrated above, build LAMMPS with OMP=yes (the default) and
 MIC=yes.  The latter ensures code is correctly compiled for the Intel
 Phi.  The OMP setting means OpenMP will be used for parallelization on
 the Phi, which is currently the best option within Kokkos.  In the
 future, other options may be added.
 
 Current-generation Intel Phi chips have either 61 or 57 cores.  One
 core should be excluded for running the OS, leaving 60 or 56 cores.
 Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
 N = 224 (4*56) cores to run on.
 
 The -np setting of the mpirun command sets the number of MPI
 tasks/node.  The "-k on t Nt" command-line switch sets the number of
 threads/task as Nt.  The product of these 2 values should be N, i.e.
 240 or 224.  Also, the number of threads/task should be a multiple of
 4 so that logical threads from more than one MPI task do not run on
 the same physical core.
 
 Examples of mpirun commands that follow these rules are shown above.
 
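+Other combinations that satisfy both rules on a 61-core Phi (product
+= 240, threads/task a multiple of 4) include, for illustration:
+
+mpirun -np 60 lmp_g++ -k on t 4 -sf kk -in in.lj    # 60*4 = 240
+mpirun -np 15 lmp_g++ -k on t 16 -sf kk -in in.lj   # 15*16 = 240 :pre
+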
 [Restrictions:]
 
 As noted above, if using GPUs, the number of MPI tasks per compute
 node should equal the number of GPUs per compute node.  In the
 future Kokkos will support assigning multiple MPI tasks to a single
 GPU.
 
 Currently Kokkos does not support AMD GPUs due to limits in the
 available backend programming models.  Specifically, Kokkos requires
 extensive C++ support from the backend's kernel language.  This is expected to
 change in the future.