diff --git a/doc/src/Section_start.txt b/doc/src/Section_start.txt
index 5f9ac36a4..3711342f7 100644
--- a/doc/src/Section_start.txt
+++ b/doc/src/Section_start.txt
@@ -1,1819 +1,1819 @@
"Previous Section"_Section_intro.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc - "Next Section"_Section_commands.html :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

2. Getting Started :h3

This section describes how to build and run LAMMPS, for both new and experienced users.

2.1 "What's in the LAMMPS distribution"_#start_1
2.2 "Making LAMMPS"_#start_2
2.3 "Making LAMMPS with optional packages"_#start_3
2.4 "Building LAMMPS as a library"_#start_4
2.5 "Running LAMMPS"_#start_5
2.6 "Command-line options"_#start_6
2.7 "Screen output"_#start_7
2.8 "Tips for users of previous versions"_#start_8 :all(b)

:line

2.1 What's in the LAMMPS distribution :h4,link(start_1)

When you download a LAMMPS tarball you will need to unzip and untar the downloaded file with the following commands, after placing the tarball in an appropriate directory.

tar -xzvf lammps*.tar.gz :pre

This will create a LAMMPS directory containing two files and several sub-directories:

README: text file
LICENSE: the GNU General Public License (GPL)
bench: benchmark problems
doc: documentation
examples: simple test problems
potentials: embedded atom method (EAM) potential files
src: source files
tools: pre- and post-processing tools :tb(s=:)

Note that the "download page"_download also has links to download pre-built Windows installers, as well as pre-built packages for several widely used Linux distributions. It also has instructions for how to download/install LAMMPS for Macs (via Homebrew), and how to download and update LAMMPS from SVN and Git repositories, which gives you access to the up-to-date sources that are used by the LAMMPS core developers.

:link(download,http://lammps.sandia.gov/download.html)

The Windows and Linux packages for serial or parallel include only selected packages and bug-fixes/upgrades listed on "this page"_http://lammps.sandia.gov/bug.html up to a certain date, as stated on the download page. If you want an executable with non-included packages or that is more current, then you'll need to build LAMMPS yourself, as discussed in the next section.

Skip to the "Running LAMMPS"_#start_5 section for info on how to launch a LAMMPS Windows executable on a Windows box.

:line

2.2 Making LAMMPS :h4,link(start_2)

This section has the following sub-sections:

2.2.1 "Read this first"_#start_2_1
2.2.2 "Steps to build a LAMMPS executable"_#start_2_2
2.2.3 "Common errors that can occur when making LAMMPS"_#start_2_3
2.2.4 "Additional build tips"_#start_2_4
2.2.5 "Building for a Mac"_#start_2_5
2.2.6 "Building for Windows"_#start_2_6 :all(b)

:line

Read this first :h5,link(start_2_1)

If you want to avoid building LAMMPS yourself, read the preceding section about options available for downloading and installing executables. Details are discussed on the "download"_download page.

Building LAMMPS can be simple or not-so-simple.
If all you need are the default packages installed in LAMMPS, and MPI is already installed on your machine, or you just want to run LAMMPS in serial, then you can typically use the Makefile.mpi or Makefile.serial files in src/MAKE by typing one of these lines (from the src dir):

make mpi
make serial :pre

Note that on a facility supercomputer, there are often "modules" loaded in your environment that provide the compilers and MPI you should use. In this case, the "mpicxx" compile/link command in Makefile.mpi should simply work by accessing those modules.

It may be the case that one of the other Makefile.machine files in the src/MAKE sub-directories is a better match to your system (type "make" to see a list). If so, you can use it as-is by typing (for example):

make stampede :pre

If any of these builds (with an existing Makefile.machine) works on your system, then you're done!

If you need to install an optional package with a LAMMPS command you want to use, and the package does not depend on an extra library, you can simply type

make yes-name :pre

before invoking (or re-invoking) the above steps. "Name" is the lower-case name of the package, e.g. replica or user-misc.

If you want to do one of the following:

use a LAMMPS command that requires an extra library (e.g. "dump image"_dump_image.html)
build with a package that requires an extra library
build with an accelerator package that requires special compiler/linker settings
run on a machine that has its own compilers, settings, or libraries :ul

then building LAMMPS is more complicated. You may need to find where extra libraries exist on your machine or install them if they don't. You may need to build extra libraries that are included in the LAMMPS distribution, before building LAMMPS itself. You may need to edit a Makefile.machine file to make it compatible with your system.

Please read the following sections carefully. If you are not comfortable with makefiles, or building codes on a Unix platform, or running an MPI job on your machine, please find a local expert to help you. Most compilation, linking, and run problems that users experience are not LAMMPS issues - they are peculiar to the user's system, compilers, libraries, etc. Such questions are better answered by a local expert.

If you have a build problem that you are convinced is a LAMMPS issue (e.g. the compiler complains about a line of LAMMPS source code), then please post the issue to the "LAMMPS mail list"_http://lammps.sandia.gov/mail.html.

If you succeed in building LAMMPS on a new kind of machine, for which there isn't a similar machine Makefile included in the src/MAKE/MACHINES directory, then send it to the developers and we can include it in the LAMMPS distribution.

:line

Steps to build a LAMMPS executable :h5,link(start_2_2)

Step 0 :h6

The src directory contains the C++ source and header files for LAMMPS. It also contains a top-level Makefile and a MAKE sub-directory with low-level Makefile.* files for many systems and machines. See the src/MAKE/README file for a quick overview of what files are available and what sub-directories they are in.

The src/MAKE dir has a few files that should work as-is on many platforms. The src/MAKE/OPTIONS dir has more that invoke additional compiler, MPI, and other setting options commonly used by LAMMPS, to illustrate their syntax. The src/MAKE/MACHINES dir has many more that have been tweaked or optimized for specific machines. These files are all good starting points if you find you need to change them for your machine.
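For example, you might copy one of them into src/MAKE/MINE under a new name and edit that copy (the file names here are only placeholders):

cp src/MAKE/MACHINES/Makefile.stampede src/MAKE/MINE/Makefile.mymachine :pre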
Put any file you edit into the src/MAKE/MINE directory and it will never be touched by any LAMMPS updates.

From within the src directory, type "make" or "gmake". You should see a list of available choices from src/MAKE and all of its sub-directories. If one of those has the options you want or is the machine you want, you can type a command like:

make mpi :pre

or

make serial :pre

or

gmake mac :pre

Note that the corresponding Makefile.machine can exist in src/MAKE or any of its sub-directories. If a file with the same name appears in multiple places (not a good idea), the order they are used is as follows: src/MAKE/MINE, src/MAKE, src/MAKE/OPTIONS, src/MAKE/MACHINES. This gives preference to a file you have created/edited and put in src/MAKE/MINE.

Note that on a multi-processor or multi-core platform you can launch a parallel make, by using the "-j" switch with the make command, which will build LAMMPS more quickly.

If you get no errors and an executable like [lmp_mpi] or [lmp_serial] or [lmp_mac] is produced, then you're done; it's your lucky day.

Note that by default only a few of LAMMPS optional packages are installed. To build LAMMPS with optional packages, see "this section"_#start_3 below.

Step 1 :h6

If Step 0 did not work, you will need to create a low-level Makefile for your machine, like Makefile.foo. You should make a copy of an existing Makefile.* in src/MAKE or one of its sub-directories as a starting point. The only portions of the file you need to edit are the first line, the "compiler/linker settings" section, and the "LAMMPS-specific settings" section. When it works, put the edited file in src/MAKE/MINE and it will not be altered by any future LAMMPS updates.

Step 2 :h6

Change the first line of Makefile.foo to list the word "foo" after the "#", and whatever other options it will set. This is the line you will see if you just type "make".

Step 3 :h6

The "compiler/linker settings" section lists compiler and linker settings for your C++ compiler, including optimization flags. You can use g++, the open-source GNU compiler, which is available on all Unix systems. You can also use mpicxx, which will typically be available if MPI is installed on your system, though you should check which actual compiler it wraps. Vendor compilers often produce faster code. On boxes with Intel CPUs, we suggest using the Intel icc compiler, which can be downloaded from "Intel's compiler site"_intel.

:link(intel,http://www.intel.com/software/products/noncom)

If building a C++ code on your machine requires additional libraries, then you should list them as part of the LIB variable. You should not need to do this if you use mpicxx.

The DEPFLAGS setting is what triggers the C++ compiler to create a dependency list for a source file. This speeds re-compilation when source (*.cpp) or header (*.h) files are edited. Some compilers do not support dependency file creation, or may use a different switch than -D. GNU g++ and Intel icc work with -D. If your compiler can't create dependency files, then you'll need to create a Makefile.foo patterned after Makefile.storm, which uses different rules that do not involve dependency files. Note that when you build LAMMPS for the first time on a new platform, a long list of *.d files will be printed out rapidly. This is not an error; it is the Makefile doing its normal creation of dependencies.

Step 4 :h6

The "system-specific settings" section has several parts.
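As a rough sketch, this part of a Makefile.foo contains groups of variables like the following (the values shown here are purely illustrative; the steps below explain each group):

LMP_INC = -DLAMMPS_GZIP -DLAMMPS_SMALLBIG
MPI_INC = -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpich
FFT_INC = -DFFT_FFTW3
FFT_PATH =
FFT_LIB = -lfftw3
JPG_INC =
JPG_PATH =
JPG_LIB = -ljpeg :pre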
Note that if you change any -D setting in this section, you should do a full re-compile, after typing "make clean" (which will describe different clean options).

The LMP_INC variable is used to include options that turn on ifdefs within the LAMMPS code. The options that are currently recognized are:

-DLAMMPS_GZIP
-DLAMMPS_JPEG
-DLAMMPS_PNG
-DLAMMPS_FFMPEG
-DLAMMPS_MEMALIGN
-DLAMMPS_XDR
-DLAMMPS_SMALLBIG
-DLAMMPS_BIGBIG
-DLAMMPS_SMALLSMALL
-DLAMMPS_LONGLONG_TO_LONG
-DLAMMPS_EXCEPTIONS
-DPACK_ARRAY
-DPACK_POINTER
-DPACK_MEMCPY :ul

The read_data and dump commands will read/write gzipped files if you compile with -DLAMMPS_GZIP. It requires that your machine supports the "popen()" function in the standard runtime library and that a gzip executable can be found by LAMMPS during a run.

NOTE: on some clusters with high-speed networks, using the fork() library calls (required by popen()) can interfere with the fast communication library and cause simulations using compressed output or input to hang or crash. For selected operations, compressed file I/O is also available using a compression library instead; this is provided in the COMPRESS package. For more details about compiling LAMMPS with packages, please see below.

If you use -DLAMMPS_JPEG, the "dump image"_dump_image.html command will be able to write out JPEG image files. For JPEG files, you must also link LAMMPS with a JPEG library, as described below. If you use -DLAMMPS_PNG, the "dump image"_dump_image.html command will be able to write out PNG image files. For PNG files, you must also link LAMMPS with a PNG library, as described below. If neither of those two defines is used, LAMMPS will only be able to write out uncompressed PPM image files.

If you use -DLAMMPS_FFMPEG, the "dump movie"_dump_image.html command will be available to support on-the-fly generation of rendered movies without the need to store intermediate image files. It requires that your machine supports the "popen" function in the standard runtime library and that an FFmpeg executable can be found by LAMMPS during the run.

NOTE: Similar to the note above, this option can conflict with high-speed networks, because it uses popen().

Using -DLAMMPS_MEMALIGN= enables the use of the posix_memalign() call instead of malloc() when large chunks of memory are allocated by LAMMPS. This can help to make more efficient use of vector instructions of modern CPUs, since dynamically allocated memory has to be aligned on larger than default byte boundaries (e.g. 16 bytes instead of 8 bytes on x86-type platforms) for optimal performance.

If you use -DLAMMPS_XDR, the build will include XDR compatibility files for doing particle dumps in XTC format. This is only necessary if your platform does not have its own XDR files available. See the Restrictions section of the "dump"_dump.html command for details.

Use at most one of the -DLAMMPS_SMALLBIG, -DLAMMPS_BIGBIG, -DLAMMPS_SMALLSMALL settings. The default is -DLAMMPS_SMALLBIG. These settings refer to use of 4-byte (small) vs 8-byte (big) integers within LAMMPS, as specified in src/lmptype.h. The only reason to use the BIGBIG setting is to enable simulation of huge molecular systems (which store bond topology info) with more than 2 billion atoms, or to track the image flags of moving atoms that wrap around a periodic box more than 512 times.
Normally, the only reason to use SMALLSMALL is if your machine does not support 64-bit integers, though you can use the SMALLSMALL setting if you are running in serial or on a desktop machine or small cluster where you will never run large systems or for long times (more than 2 billion atoms, more than 2 billion timesteps). See the "Additional build tips"_#start_2_4 section below for more details on these settings.

Note that the USER-ATC package is not currently compatible with -DLAMMPS_BIGBIG. Also the GPU package requires the lib/gpu library to be compiled with the same setting, or the link will fail.

The -DLAMMPS_LONGLONG_TO_LONG setting may be needed if your system or MPI version does not recognize "long long" data types. In this case a "long" data type is likely already 64-bits, in which case this setting will convert to that data type.

The -DLAMMPS_EXCEPTIONS setting can be used to activate alternative versions of error handling inside of LAMMPS. This is useful when external codes drive LAMMPS as a library. Using this option, LAMMPS errors do not kill the caller. Instead, the call stack is unwound and control returns to the caller. The library interface provides the lammps_has_error() and lammps_get_last_error_message() functions to detect and find out more about a LAMMPS error.

Using one of the -DPACK_ARRAY, -DPACK_POINTER, and -DPACK_MEMCPY options can make for faster parallel FFTs (in the PPPM solver) on some platforms. The -DPACK_ARRAY setting is the default. See the "kspace_style"_kspace_style.html command for info about PPPM. See Step 6 below for info about building LAMMPS with an FFT library.

Step 5 :h6

The 3 MPI variables are used to specify an MPI library to build LAMMPS with. Note that you do not need to set these if you use the MPI compiler mpicxx for your CC and LINK settings in the section above. The MPI wrapper knows where to find the needed files.

If you want LAMMPS to run in parallel, you must have an MPI library installed on your platform. If MPI is installed on your system in the usual place (under /usr/local), you also may not need to specify these 3 variables, assuming /usr/local is in your path. On some large parallel machines which use "modules" for their compile/link environments, you may simply need to include the correct module in your build environment, before building LAMMPS. Or the parallel machine may have a vendor-provided MPI which the compiler has no trouble finding. Failing this, these 3 variables can be used to specify where the mpi.h file (MPI_INC) and the MPI library file (MPI_PATH) are found and the name of the library file (MPI_LIB).

If you are installing MPI yourself, we recommend Argonne's MPICH2 or OpenMPI. MPICH can be downloaded from the "Argonne MPI site"_http://www.mcs.anl.gov/research/projects/mpich2/. OpenMPI can be downloaded from the "OpenMPI site"_http://www.open-mpi.org. Other MPI packages should also work. If you are running on a big parallel platform, your system people or the vendor should have already installed a version of MPI, which is likely to be faster than a self-installed MPICH or OpenMPI, so find out how to build and link with it.

If you use MPICH or OpenMPI, you will have to configure and build it for your platform. The MPI configure script should have compiler options to enable you to use the same compiler you are using for the LAMMPS build, which can avoid problems that can arise when linking LAMMPS to the MPI library.
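For illustration, if you had installed MPICH yourself under /usr/local, the 3 variables might look like this (the exact paths and library names are assumptions; check your own installation):

MPI_INC = -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB = -lmpich :pre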
If you just want to run LAMMPS on a single processor, you can use the dummy MPI library provided in src/STUBS, since you don't need a true MPI library installed on your system. See src/MAKE/Makefile.serial for how to specify the 3 MPI variables in this case. You will also need to build the STUBS library for your platform before making LAMMPS itself. Note that if you are building with src/MAKE/Makefile.serial, e.g. by typing "make serial", then the STUBS library is built for you.

To build the STUBS library from the src directory, type "make mpi-stubs", or from the src/STUBS dir, type "make". This should create a libmpi_stubs.a file suitable for linking to LAMMPS. If the build fails, you will need to edit the STUBS/Makefile for your platform.

The file STUBS/mpi.c provides a CPU timer function called MPI_Wtime() that calls gettimeofday(). If your system doesn't support gettimeofday(), you'll need to insert code to call another timer. Note that the ANSI-standard function clock() rolls over after an hour or so, and is therefore insufficient for timing long LAMMPS simulations.

Step 6 :h6

The 3 FFT variables allow you to specify an FFT library which LAMMPS uses (for performing 1d FFTs) when running the particle-particle particle-mesh (PPPM) option for long-range Coulombics via the "kspace_style"_kspace_style.html command.

LAMMPS supports common open-source or vendor-supplied FFT libraries for this purpose. If you leave these 3 variables blank, LAMMPS will use the open-source "KISS FFT library"_http://kissfft.sf.net, which is included in the LAMMPS distribution. This library is portable to all platforms and for typical LAMMPS simulations is almost as fast as FFTW or vendor-optimized libraries. If you are not including the KSPACE package in your build, you can also leave the 3 variables blank.

Otherwise, select which kinds of FFTs to use as part of the FFT_INC setting by a switch of the form -DFFT_XXX. Recommended values for XXX are: MKL or FFTW3. FFTW2 and NONE are supported as legacy options. Selecting -DFFT_FFTW will use the FFTW3 library and -DFFT_NONE will use the KISS library described above.

You may also need to set the FFT_INC, FFT_PATH, and FFT_LIB variables, so the compiler and linker can find the needed FFT header and library files. Note that on some large parallel machines which use "modules" for their compile/link environments, you may simply need to include the correct module in your build environment. Or the parallel machine may have a vendor-provided FFT library which the compiler has no trouble finding. See the src/MAKE/OPTIONS/Makefile.fftw file for an example of how to specify these variables to use the FFTW3 library.

FFTW is a fast, portable library that should also work on any platform and will typically be faster than KISS FFT. You can download it from "www.fftw.org"_http://www.fftw.org. Both the legacy version 2.1.X and the newer 3.X versions are supported, as -DFFT_FFTW2 or -DFFT_FFTW3. Building FFTW for your box should be as simple as ./configure; make; make install. The install command typically requires root privileges (e.g. invoke it via sudo), unless you specify a local directory with the "--prefix" option of configure. Type "./configure --help" to see various options.

If you wish to have FFTW support for single-precision FFTs (see below about -DFFT_SINGLE) in addition to the default double-precision FFTs, you will need to build FFTW a second time for single-precision.
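For reference, Makefile settings for the default double-precision FFTW3 might look like this (illustrative values, assuming an install under /usr/local):

FFT_INC = -DFFT_FFTW3 -I/usr/local/include
FFT_PATH = -L/usr/local/lib
FFT_LIB = -lfftw3 :pre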
To build FFTW3 a second time for single precision, do this:

make clean
./configure --enable-single; make; make install :pre

which should produce the additional library libfftw3f.a. For FFTW2, do this:

make clean
./configure --enable-float --enable-type-prefix; make; make install :pre

which should produce the additional library libsfftw.a and the additional include file sfftw.h. Note that on some platforms FFTW2 has been pre-installed for both single- and double-precision, and may already have these files as well as libdfftw.a and dfftw.h for double precision.

The FFT_INC variable also allows for a -DFFT_SINGLE setting that will use single-precision FFTs with PPPM, which can speed up long-range calculations, particularly in parallel or on GPUs. Fourier transform and related PPPM operations are somewhat insensitive to floating point truncation errors and thus do not always need to be performed in double precision. Using the -DFFT_SINGLE setting trades off a little accuracy for reduced memory use and parallel communication costs for transposing 3d FFT data. Note that single-precision FFTs have only been tested with the FFTW3, FFTW2, MKL, and KISS FFT options.

When using -DFFT_SINGLE with FFTW3 or FFTW2, you need to build FFTW with support for single-precision, as explained above. For FFTW3 you also need to include -lfftw3f with the FFT_LIB setting, in addition to -lfftw3. For FFTW2, you also need to specify -DFFT_SIZE with the FFT_INC setting and -lsfftw with the FFT_LIB setting (in place of -lfftw). Similarly, if FFTW2 has been pre-installed with an explicit double-precision library (libdfftw.a and not the default libfftw.a), then you can specify -DFFT_SIZE (and not -DFFT_SINGLE), and specify -ldfftw to use double-precision FFTs.

Step 7 :h6

The 3 JPG variables allow you to specify a JPEG and/or PNG library which LAMMPS uses when writing out JPEG or PNG files via the "dump image"_dump_image.html command. These can be left blank if you do not use the -DLAMMPS_JPEG or -DLAMMPS_PNG switches discussed above in Step 4, since in that case JPEG/PNG output will be disabled.

A standard JPEG library usually goes by the name libjpeg.a or libjpeg.so and has an associated header file jpeglib.h. Whichever JPEG library you have on your platform, you'll need to set the appropriate JPG_INC, JPG_PATH, and JPG_LIB variables, so that the compiler and linker can find it.

A standard PNG library usually goes by the name libpng.a or libpng.so and has an associated header file png.h. Whichever PNG library you have on your platform, you'll need to set the appropriate JPG_INC, JPG_PATH, and JPG_LIB variables, so that the compiler and linker can find it.

As before, if these header and library files are in the usual place on your machine, you may not need to set these variables.

Step 8 :h6

Note that by default only a few of LAMMPS optional packages are installed. To build LAMMPS with optional packages, see "this section"_#start_3 below, before proceeding to Step 9.

Step 9 :h6

That's it. Once you have a correct Makefile.foo, and you have pre-built any other needed libraries (e.g. MPI, FFT, etc), all you need to do from the src directory is type something like this:

make foo
make -j N foo
gmake foo
gmake -j N foo :pre

The -j or -j N switches perform a parallel build, which can be much faster, depending on how many cores your compilation machine has. N is the number of cores the build runs on.

You should get the executable lmp_foo when the build is complete.
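Putting the steps together, a typical session for a hypothetical machine file src/MAKE/MINE/Makefile.foo might look like this (the package choice and core count are only examples):

cd src
make yes-colloid
make -j 4 foo :pre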
:line

Errors that can occur when making LAMMPS :h5,link(start_2_3)

If an error occurs when building LAMMPS, the compiler or linker will state very explicitly what the problem is. The error message should give you a hint as to which of the steps above has failed, and what you need to do in order to fix it. Building a code with a Makefile is a very logical process. The compiler and linker need to find the appropriate files, and those files need to be compatible with LAMMPS settings and source files. When a make fails, there is usually a very simple reason, which you or a local expert will need to fix.

Here are two non-obvious errors that can occur:

(1) If the make command breaks immediately with errors that indicate it can't find files with a "*" in their names, this can be because your machine's native make doesn't support wildcard expansion in a makefile. Try gmake instead of make. If that doesn't work, try using a -f switch with your make command to use a pre-generated Makefile.list which explicitly lists all the needed files, e.g.

make makelist
make -f Makefile.list linux
gmake -f Makefile.list mac :pre

The first "make" command will create a current Makefile.list with all the file names in your src dir. The 2nd "make" command (make or gmake) will use it to build LAMMPS. Note that you should include/exclude any desired optional packages before using the "make makelist" command.

(2) If you get an error that says something like 'identifier "atoll" is undefined', then your machine does not support "long long" integers. Try using the -DLAMMPS_LONGLONG_TO_LONG setting described above in Step 4.

:line

Additional build tips :h5,link(start_2_4)

Building LAMMPS for multiple platforms. :h6

You can make LAMMPS for multiple platforms from the same src directory. Each target creates its own object sub-directory called Obj_target where it stores the system-specific *.o files.

Cleaning up. :h6

Typing "make clean-all" or "make clean-machine" will delete *.o object files created when LAMMPS is built, for either all builds or for a particular machine.

Changing the LAMMPS size limits via -DLAMMPS_SMALLBIG or -DLAMMPS_BIGBIG or -DLAMMPS_SMALLSMALL :h6

As explained above, any of these 3 settings can be specified on the LMP_INC line in your low-level src/MAKE/Makefile.foo.

The default is -DLAMMPS_SMALLBIG, which allows for systems with up to 2^63 atoms and 2^63 timesteps (about 9e18). The atom limit is for atomic systems which do not store bond topology info and thus do not require atom IDs. If you use atom IDs for atomic systems (which is the default) or if you use a molecular model, which stores bond topology info and thus requires atom IDs, the limit is 2^31 atoms (about 2 billion). This is because the IDs are stored in 32-bit integers.

Likewise, with this setting, the 3 image flags for each atom (see the "dump"_dump.html doc page for a discussion) are stored in a 32-bit integer, which means the atoms can only wrap around a periodic box (in each dimension) at most 512 times. If atoms move through the periodic box more than this many times, the image flags will "roll over", e.g. from 511 to -512, which can cause diagnostics like the mean-squared displacement, as calculated by the "compute msd"_compute_msd.html command, to be faulty.

To allow for larger atomic systems with atom IDs or larger molecular systems or larger image flags, compile with -DLAMMPS_BIGBIG. This stores atom IDs and image flags in 64-bit integers. This enables atomic or molecular systems with atom IDs of up to 2^63 atoms (about 9e18).
And image flags will not "roll over" until they reach 2^20 = 1048576.

If your system does not support 8-byte integers, you will need to compile with the -DLAMMPS_SMALLSMALL setting. This will restrict the total number of atoms (for atomic or molecular systems) and timesteps to 2^31 (about 2 billion). Image flags will roll over at 2^9 = 512.

Note that in src/lmptype.h there are definitions of all these data types as well as the MPI data types associated with them. The MPI types need to be consistent with the associated C data types, or else LAMMPS will generate a run-time error. As far as we know, the settings defined in src/lmptype.h are portable and work on every current system.

In all cases, the size of problem that can be run on a per-processor basis is limited by 4-byte integer storage to 2^31 atoms per processor (about 2 billion). This should not normally be a limitation, since such a problem would have a huge per-processor memory footprint due to neighbor lists and would run very slowly in terms of CPU secs/timestep.

:line

Building for a Mac :h5,link(start_2_5)

OS X is a derivative of BSD Unix, so it should just work. See the src/MAKE/MACHINES/Makefile.mac and Makefile.mac_mpi files.

:line

Building for Windows :h5,link(start_2_6)

If you want to build a Windows version of LAMMPS, you can build it yourself, but it may require some effort. LAMMPS expects a Unix-like build environment for the default build procedure. This can be provided by either Cygwin or MinGW; the latter also exists as a ready-to-use Linux-to-Windows cross-compiler in several Linux distributions. In these cases, you can do the installation after installing several unix-style commands like make, grep, sed, and bash, along with some shell utilities. For Cygwin and the MinGW cross-compilers, suitable makefiles are provided in src/MAKE/MACHINES. When using other compilers, like Visual C++ or Intel compilers for Windows, you may have to implement your own build system.

Due to differences between the Windows OS and Windows system libraries on the one hand, and Unix-like environments like Linux or MacOS on the other, a few adjustments may be needed when compiling for Windows:

Do [not] set the -DLAMMPS_MEMALIGN define (see LMP_INC makefile variable)
Add -lwsock32 -lpsapi to the linker flags (see LIB makefile variable)
Try adding -static-libgcc or -static or both to the linker flags when your LAMMPS executable complains about missing .dll files :ul

Since none of the current LAMMPS core developers has significant experience building executables on Windows, we are happy to distribute contributed instructions and modifications to improve the situation, but we cannot provide support for those.

With the so-called "Anniversary Update" to Windows 10, there is an Ubuntu Linux subsystem available for Windows, which can be installed and then used to compile/install LAMMPS as if you were running on an Ubuntu Linux system instead of Windows.

As an alternative, you can download pre-compiled installer packages from "packages.lammps.org/windows.html"_http://packages.lammps.org/windows.html. These executables are built with most optional packages included, and the download includes documentation, potential files, some tools, and many examples, but no source code.
:line

2.3 Making LAMMPS with optional packages :h4,link(start_3)

This section has the following sub-sections:

2.3.1 "Package basics"_#start_3_1
2.3.2 "Including/excluding packages"_#start_3_2
2.3.3 "Packages that require extra libraries"_#start_3_3 :all(b)

:line

Package basics: :h5,link(start_3_1)

The source code for LAMMPS is structured as a set of core files which are always included, plus optional packages. Packages are groups of files that enable a specific set of features. For example, force fields for molecular systems or granular systems are in packages.

"Section 4"_Section_packages.html in the manual has details about all the packages, which come in two flavors: [standard] and [user] packages. It also has specific instructions for building LAMMPS with any package which requires an extra library. General instructions are below.

You can see the list of all packages by typing "make package" from within the src directory of the LAMMPS distribution. It will also list various make commands that can be used to manage packages.

If you use a command in a LAMMPS input script that is part of a package, you must have built LAMMPS with that package, else you will get an error that the style is invalid or the command is unknown. Every command's doc page specifies if it is part of a package. You can type

lmp_machine -h :pre

to run your executable with the optional "-h command-line switch"_#start_6 for "help", which will list the styles and commands known to your executable, and immediately exit.

:line

Including/excluding packages :h5,link(start_3_2)

To use (or not use) a package you must install it (or un-install it) before building LAMMPS. From the src directory, this is as simple as:

make yes-colloid
make mpi :pre

or

make no-user-omp
make mpi :pre

NOTE: You should NOT install/un-install packages and build LAMMPS in a single make command using multiple targets, e.g. make yes-colloid mpi. This is because the make procedure creates a list of source files that will be out-of-date for the build if the package configuration changes within the same command.

Any package can be installed or not in a LAMMPS build, independent of all other packages. However, some packages include files derived from files in other packages. LAMMPS checks for this and does the right thing, i.e. individual files are only included if their dependencies are already included. Likewise, if a package is excluded, other files dependent on that package are also excluded.

NOTE: The one exception is that we do not recommend building with both the KOKKOS package installed and any of the other acceleration packages (GPU, OPT, USER-INTEL, USER-OMP) also installed. This is because Kokkos sometimes builds using a wrapper compiler, which can make it difficult to invoke all the compile/link flags correctly for both Kokkos and non-Kokkos files.

If you will never run simulations that use the features in a particular package, there is no reason to include it in your build. For some packages, this will keep you from having to build extra libraries, and will also produce a smaller executable which may run a bit faster.

When you download a LAMMPS tarball, three packages are pre-installed in the src directory -- KSPACE, MANYBODY, MOLECULE -- because they are so commonly used. When you download LAMMPS source files from the SVN or Git repositories, no packages are pre-installed.

Packages are installed or un-installed by typing

make yes-name
make no-name :pre

where "name" is the name of the package in lower-case, e.g.
name = kspace for the KSPACE package or name = user-atc for the USER-ATC package. You can also type any of these commands:

make yes-all | install all packages
make no-all | un-install all packages
make yes-standard or make yes-std | install standard packages
make no-standard or make no-std | un-install standard packages
make yes-user | install user packages
make no-user | un-install user packages
make yes-lib | install packages that require extra libraries
make no-lib | un-install packages that require extra libraries
make yes-ext | install packages that require external libraries
make no-ext | un-install packages that require external libraries :tb(s=|)

which install/un-install various sets of packages. Typing "make package" will list all of these commands.

NOTE: Installing or un-installing a package works by simply moving files back and forth between the main src directory and sub-directories with the package name (e.g. src/KSPACE, src/USER-ATC), so that the files are included or excluded when LAMMPS is built. After you have installed or un-installed a package, you must re-build LAMMPS for the action to take effect.

The following make commands help manage files that exist in both the src directory and in package sub-directories. You do not normally need to use these commands unless you are editing LAMMPS files or have downloaded a patch from the LAMMPS web site.

Typing "make package-status" or "make ps" will show which packages are currently installed. For those that are installed, it will list any files that are different in the src directory and package sub-directory.

Typing "make package-update" or "make pu" will overwrite src files with files from the package sub-directories if the package is installed. It should be used after a patch has been applied, since patches only update the files in the package sub-directory, but not the src files.

Typing "make package-overwrite" will overwrite files in the package sub-directories with src files.

Typing "make package-diff" lists all differences between these files.

Again, just type "make package" to see all of the package-related make options.

:line

Packages that require extra libraries :h5,link(start_3_3)

A few of the standard and user packages require extra libraries. See "Section 4"_Section_packages.html for two tables of packages which indicate which ones require libraries. For each such package, the Section 4 doc page gives details on how to build the extra library, including how to download it if necessary. The basic ideas are summarized here.

[System libraries:]

Packages in the tables of "Section 4"_Section_packages.html with a "sys" in the last column link to system libraries that typically already exist on your machine. E.g. the python package links to a system Python library. If your machine does not have the required library, you will have to download and install it on your machine, in either the system or user space.

[Internal libraries:]

Packages in the tables of "Section 4"_Section_packages.html with an "int" in the last column link to internal libraries whose source code is included with LAMMPS, in the lib/name directory where name is the package name. You must first build the library in that directory before building LAMMPS with that package installed. E.g. the gpu package links to a library you build in the lib/gpu dir. You can often do the build in one step by typing "make lib-name args=..." from the src dir, with appropriate arguments. You can leave off the args to see a help message.
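For example, typing

make lib-gpu :pre

from the src dir (with no args) prints the help message for the lib/gpu build, listing the arguments it accepts.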
See "Section 4"_Section_packages.html for details for each package. [External libraries:] Packages in the tables "Section 4"_Section_packages.html with an "ext" in the last column link to exernal libraries whose source code is not included with LAMMPS. You must first download and install the library before building LAMMPS with that package installed. E.g. the voronoi package links to the freely available "Voro++ library"_voro_home2. You can often do the download/build in one step by typing "make lib-name args=..." from the src dir, with appropriate arguments. You can leave off the args to see a help message. See "Section 4"_Section_packages.html for details for each package. :link(voro_home2,http://math.lbl.gov/voro++) [Possible errors:] There are various common errors which can occur when building extra libraries or when building LAMMPS with packages that require the extra libraries. If you cannot build the extra library itself successfully, you may need to edit or create an appropriate Makefile for your machine, e.g. with appropriate compiler or system settings. Provided makefiles are typically in the lib/name directory. E.g. see the Makefile.* files in lib/gpu. The LAMMPS build often uses settings in a lib/name/Makefile.lammps file which either exists in the LAMMPS distribution or is created or copied from a lib/name/Makefile.lammps.* file when the library is built. If those settings are not correct for your machine you will need to edit or create an appropriate Makefile.lammps file. Package-specific details for these steps are given in "Section 4"_Section_packages.html an in README files in the lib/name directories. [Compiler options needed for accelerator packages:] Several packages contain code that is optimized for specific hardware, e.g. CPU, KNL, or GPU. These are the OPT, GPU, KOKKOS, USER-INTEL, and USER-OMP packages. Compiling and linking the source files in these accelerator packages for optimal performance requires specific settings in the Makefile.machine file you use. A summary of the Makefile.machine settings needed for each of these packages is given in "Section 4"_Section_packages.html. More info is given on the doc pages that describe each package in detail: 5.3.1 "USER-INTEL package"_accelerate_intel.html 5.3.2 "GPU package"_accelerate_intel.html 5.3.3 "KOKKOS package"_accelerate_kokkos.html 5.3.4 "USER-OMP package"_accelerate_omp.html 5.3.5 "OPT package"_accelerate_opt.html :all(b) You can also use or examine the following machine Makefiles in src/MAKE/OPTIONS, which include the settings. Note that the USER-INTEL and KOKKOS packages can use settings that build LAMMPS for different hardware. The USER-INTEL package can be compiled for Intel CPUs and KNLs; the KOKKOS package builds for CPUs (OpenMP), GPUs (CUDA), and Intel KNLs. Makefile.intel_cpu Makefile.intel_phi Makefile.kokkos_omp -Makefile.kokkos_cuda +Makefile.kokkos_cuda_mpi Makefile.kokkos_phi Makefile.omp Makefile.opt :ul :line 2.4 Building LAMMPS as a library :h4,link(start_4) LAMMPS can be built as either a static or shared library, which can then be called from another application or a scripting language. See "this section"_Section_howto.html#howto_10 for more info on coupling LAMMPS to other codes. See "this section"_Section_python.html for more info on wrapping and running LAMMPS from Python. Static library :h5 To build LAMMPS as a static library (*.a file on Linux), type make foo mode=lib :pre where foo is the machine name. 
This kind of library is typically used to statically link a driver application to LAMMPS, so that you can insure all dependencies are satisfied at compile time. This will use the ARCHIVE and ARFLAGS settings in src/MAKE/Makefile.foo. The build will create the file liblammps_foo.a, which another application can link to. It will also create a soft link liblammps.a, which will point to the most recently built static library.

Shared library :h5

To build LAMMPS as a shared library (*.so file on Linux), which can be dynamically loaded, e.g. from Python, type

make foo mode=shlib :pre

where foo is the machine name. This kind of library is required when wrapping LAMMPS with Python; see "Section 11"_Section_python.html for details.

This will use the SHFLAGS and SHLIBFLAGS settings in src/MAKE/Makefile.foo and perform the build in the directory Obj_shared_foo. This is so that each file can be compiled with the -fPIC flag, which is required for inclusion in a shared library. The build will create the file liblammps_foo.so, which another application can link to dynamically. It will also create a soft link liblammps.so, which will point to the most recently built shared library. This is the file the Python wrapper loads by default.

Note that for a shared library to be usable by a calling program, all the auxiliary libraries it depends on must also exist as shared libraries. This will be the case for libraries included with LAMMPS, such as the dummy MPI library in src/STUBS or any package libraries in lib/packages, since they are always built as shared libraries using the -fPIC switch. However, if a library like MPI or FFTW does not exist as a shared library, the shared library build will generate an error. This means you will need to install a shared library version of the auxiliary library. The build instructions for the library should tell you how to do this. Here is an example of such errors when the system FFTW or provided lib/colvars library have not been built as shared libraries:

/usr/bin/ld: /usr/local/lib/libfftw3.a(mapflags.o): relocation R_X86_64_32 against '.rodata' can not be used when making a shared object; recompile with -fPIC
/usr/local/lib/libfftw3.a: could not read symbols: Bad value :pre

/usr/bin/ld: ../../lib/colvars/libcolvars.a(colvarmodule.o): relocation R_X86_64_32 against '__pthread_key_create' can not be used when making a shared object; recompile with -fPIC
../../lib/colvars/libcolvars.a: error adding symbols: Bad value :pre

As an example, here is how to build and install the "MPICH library"_mpich, a popular open-source version of MPI, distributed by Argonne National Labs, as a shared library in the default /usr/local/lib location:

:link(mpich,http://www-unix.mcs.anl.gov/mpi)

./configure --enable-shared
make
make install :pre

You may need to use "sudo make install" in place of the last line if you do not have write privileges for /usr/local/lib. The end result should be the file /usr/local/lib/libmpich.so.

[Additional requirement for using a shared library:] :h5

The operating system finds shared libraries to load at run-time using the environment variable LD_LIBRARY_PATH. So you may wish to copy the file src/liblammps.so or src/liblammps_g++.so (for example) to a place the system can find it by default, such as /usr/local/lib, or you may wish to add the LAMMPS src directory to LD_LIBRARY_PATH, so that the current version of the shared library is always available to programs that use it.
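For the bash shell, you could add something like this to your ~/.bashrc file (the path is an example; use your own LAMMPS src directory):

export LD_LIBRARY_PATH=$\{LD_LIBRARY_PATH\}:/home/sjplimp/lammps/src :pre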
For the csh or tcsh shells, you would add something like this to your ~/.cshrc file:

setenv LD_LIBRARY_PATH $\{LD_LIBRARY_PATH\}:/home/sjplimp/lammps/src :pre

Calling the LAMMPS library :h5

Either flavor of library (static or shared) allows one or more LAMMPS objects to be instantiated from the calling program.

When used from a C++ program, all of LAMMPS is wrapped in a LAMMPS_NS namespace; you can safely use any of its classes and methods from within the calling code, as needed.

When used from a C or Fortran program or a scripting language like Python, the library has a simple function-style interface, provided in src/library.cpp and src/library.h.

See the sample codes in examples/COUPLE/simple for examples of C++ and C and Fortran codes that invoke LAMMPS through its library interface. There are other examples as well in the COUPLE directory which are discussed in "Section 6.10"_Section_howto.html#howto_10 of the manual. See "Section 11"_Section_python.html of the manual for a description of the Python wrapper provided with LAMMPS that operates through the LAMMPS library interface.

The files src/library.cpp and library.h define the C-style API for using LAMMPS as a library. See "Section 6.19"_Section_howto.html#howto_19 of the manual for a description of the interface and how to extend it for your needs.

:line

2.5 Running LAMMPS :h4,link(start_5)

By default, LAMMPS runs by reading commands from standard input. Thus if you run the LAMMPS executable by itself, e.g.

lmp_linux :pre

it will simply wait, expecting commands from the keyboard. Typically you should put commands in an input script and use I/O redirection, e.g.

lmp_linux < in.file :pre

For parallel environments this should also work. If it does not, use the '-in' command-line switch, e.g.

lmp_linux -in in.file :pre

"This section"_Section_commands.html describes how input scripts are structured and what commands they contain.

You can test LAMMPS on any of the sample inputs provided in the examples or bench directory. Input scripts are named in.* and sample outputs are named log.*.name.P where name is a machine and P is the number of processors it was run on.

Here is how you might run a standard Lennard-Jones benchmark on a Linux box, using mpirun to launch a parallel job:

cd src
make linux
cp lmp_linux ../bench
cd ../bench
mpirun -np 4 lmp_linux -in in.lj :pre

See "this page"_bench for timings for this and the other benchmarks on various platforms. Note that some of the example scripts require LAMMPS to be built with one or more of its optional packages.

:link(bench,http://lammps.sandia.gov/bench.html)

:line

On a Windows box, you can skip making LAMMPS and simply download an installer package from "here"_http://packages.lammps.org/windows.html

For running the non-MPI executable, follow these steps:

Get a command prompt by going to Start->Run... , then typing "cmd". :ulb,l

Move to the directory where you have your input, e.g. a copy of the [in.lj] input from the bench folder (e.g. by typing: cd "Documents"). :l

At the command prompt, type "lmp_serial -in in.lj", replacing [in.lj] with the name of your LAMMPS input script. :l

The serial executable includes support for multi-threading parallelization from the styles in the USER-OMP package. To run with, e.g.
4 threads, type "lmp_serial -in in.lj -pk omp 4 -sf omp" :ule

For the MPI version, which allows you to run LAMMPS under Windows with the more general message-passing parallel library (LAMMPS has been designed from the ground up to use MPI efficiently), follow these steps:

Download and install a compatible MPI library binary package: for 32-bit Windows "mpich2-1.4.1p1-win-ia32.msi"_download.lammps.org/thirdparty/mpich2-1.4.1p1-win-ia32.msi and for 64-bit Windows "mpich2-1.4.1p1-win-x86-64.msi"_download.lammps.org/thirdparty/mpich2-1.4.1p1-win-x86-64.msi :ulb,l

The LAMMPS Windows installer packages will automatically adjust your path for the default location of this MPI package. After the installation of the MPICH2 software, it needs to be integrated into the system. For this you need to start a Command Prompt in {Administrator Mode} (right click on the icon and select it). Change into the MPICH2 installation directory, then into the subdirectory [bin] and execute [smpd.exe -install]. Exit the command window. Get a new, regular command prompt by going to Start->Run... , then typing "cmd". :l

Move to the directory where you have your input file (e.g. by typing: cd "Documents"). :l

Then type something like this:

mpiexec -localonly 4 lmp_mpi -in in.lj :pre

or

mpiexec -np 4 lmp_mpi -in in.lj :pre

replacing [in.lj] with the name of your LAMMPS input script. For the latter case, you may be prompted to enter your password. :l

In this mode, output may not immediately show up on the screen, so if your input script takes a long time to execute, you may need to be patient before the output shows up. :l

The parallel executable can also run on a single processor by typing something like:

lmp_mpi -in in.lj :pre

And the parallel executable also includes OpenMP multi-threading, which can be combined with MPI using something like:

mpiexec -localonly 2 lmp_mpi -in in.lj -pk omp 2 -sf omp :pre
:ule

:line

The screen output from LAMMPS is described in a section below. As it runs, LAMMPS also writes a log.lammps file with the same information.

Note that this sequence of commands copies the LAMMPS executable (lmp_linux) to the directory with the input files. This may not be necessary, but some versions of MPI reset the working directory to where the executable is, rather than leave it as the directory where you launch mpirun from (if you launch lmp_linux on its own and not under mpirun). If that happens, LAMMPS will look for additional input files and write its output files to the executable directory, rather than your working directory, which is probably not what you want.

If LAMMPS encounters errors in the input script or while running a simulation it will print an ERROR message and stop or a WARNING message and continue. See "Section 12"_Section_errors.html for a discussion of the various kinds of errors LAMMPS can or can't detect, a list of all ERROR and WARNING messages, and what to do about them.

LAMMPS can run a problem on any number of processors, including a single processor. In theory you should get identical answers on any number of processors and on any machine. In practice, numerical round-off can cause slight differences and eventual divergence of molecular dynamics phase space trajectories.

LAMMPS can run as large a problem as will fit in the physical memory of one or more processors. If you run out of memory, you must run on more processors or set up a smaller problem.
:line

2.6 Command-line options :h4,link(start_6)

At run time, LAMMPS recognizes several optional command-line switches which may be used in any order. Either the full word or a one-or-two letter abbreviation can be used:

-e or -echo
-h or -help
-i or -in
-k or -kokkos
-l or -log
-nc or -nocite
-pk or -package
-p or -partition
-pl or -plog
-ps or -pscreen
-r or -restart
-ro or -reorder
-sc or -screen
-sf or -suffix
-v or -var :ul

For example, lmp_ibm might be launched as follows:

mpirun -np 16 lmp_ibm -v f tmp.out -l my.log -sc none -in in.alloy
mpirun -np 16 lmp_ibm -var f tmp.out -log my.log -screen none -in in.alloy :pre

Here are the details on the options:

-echo style :pre

Set the style of command echoing. The style can be {none} or {screen} or {log} or {both}. Depending on the style, each command read from the input script will be echoed to the screen and/or logfile. This can be useful to figure out which line of your script is causing an input error. The default value is {log}. The echo style can also be set by using the "echo"_echo.html command in the input script itself.

-help :pre

Print a brief help summary and a list of options compiled into this executable for each LAMMPS style (atom_style, fix, compute, pair_style, bond_style, etc). This can tell you if the command you want to use was included via the appropriate package at compile time. LAMMPS will print the info and immediately exit if this switch is used.

-in file :pre

Specify a file to use as an input script. This is an optional switch when running LAMMPS in one-partition mode. If it is not specified, LAMMPS reads its script from standard input, typically from a script via I/O redirection; e.g. lmp_linux < in.run. I/O redirection should also work in parallel, but if it does not (in the unlikely case that an MPI implementation does not support it), then use the -in flag. Note that this is a required switch when running LAMMPS in multi-partition mode, since multiple processors cannot all read from stdin.

-kokkos on/off keyword/value ... :pre

Explicitly enable or disable KOKKOS support, as provided by the KOKKOS package. Even if LAMMPS is built with this package, as described above in "Section 2.3"_#start_3, this switch must be set to enable running with the KOKKOS-enabled styles the package provides. If the switch is not set (the default), LAMMPS will operate as if the KOKKOS package were not installed; i.e. you can run standard LAMMPS or with the GPU or USER-OMP packages, for testing or benchmarking purposes.

Additional optional keyword/value pairs can be specified which determine how Kokkos will use the underlying hardware on your platform. These settings apply to each MPI task you launch via the "mpirun" or "mpiexec" command. You may choose to run one or more MPI tasks per physical node. Note that if you are running on a desktop machine, you typically have one physical node. On a cluster or supercomputer there may be dozens or 1000s of physical nodes.

Either the full word or an abbreviation can be used for the keywords. Note that the keywords do not use a leading minus sign. I.e. the keyword is "t", not "-t". Also note that each of the keywords has a default setting. Examples of when to use these options and what settings to use on different platforms are given in "Section 5.3"_Section_accelerate.html#acc_3.

d or device
g or gpus
t or threads
n or numa :ul

device Nd :pre

This option is only relevant if you built LAMMPS with CUDA=yes, you have more than one GPU per node, and if you are running with only one MPI task per node.
The Nd setting is the ID of the GPU on the node to run on. By default Nd = 0. If you have multiple GPUs per node, they have consecutive IDs numbered as 0,1,2,etc. This setting allows you to launch multiple independent jobs on the node, each with a single MPI task per node, and assign each job to run on a different GPU.

gpus Ng Ns :pre

This option is only relevant if you built LAMMPS with CUDA=yes, you have more than one GPU per node, and you are running with multiple MPI tasks per node (up to one per GPU). The Ng setting is how many GPUs you will use. The Ns setting is optional. If set, it is the ID of a GPU to skip when assigning MPI tasks to GPUs. This may be useful if your desktop system reserves one GPU to drive the screen and the rest are intended for computational work like running LAMMPS. By default Ng = 1 and Ns is not set.

Depending on which flavor of MPI you are running, LAMMPS will look for one of these 3 environment variables

SLURM_LOCALID (various MPI variants compiled with SLURM support)
MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre

which are initialized by the "srun", "mpirun" or "mpiexec" commands. The environment variable setting for each MPI rank is used to assign a unique GPU ID to the MPI task.

threads Nt :pre

This option assigns Nt threads to each MPI task for performing work when Kokkos is executing in OpenMP or pthreads mode. The default is Nt = 1, which essentially runs in MPI-only mode. If there are Np MPI tasks per physical node, you generally want Np*Nt = the number of physical cores per node, to use your available hardware optimally. This also sets the number of threads used by the host when LAMMPS is compiled with CUDA=yes.

numa Nm :pre

This option is only relevant when using pthreads with hwloc support. In this case Nm defines the number of NUMA regions (typically sockets) on a node which will be utilized by a single MPI rank. By default Nm = 1. If this option is used the total number of worker-threads per MPI rank is threads*numa. Currently it is almost always better to assign at least one MPI rank per NUMA region, and leave numa set to its default value of 1. This is because letting a single process span multiple NUMA regions induces a significant amount of cross-NUMA data traffic, which is slow.

-log file :pre

Specify a log file for LAMMPS to write status information to. In one-partition mode, if the switch is not used, LAMMPS writes to the file log.lammps. If this switch is used, LAMMPS writes to the specified file. In multi-partition mode, if the switch is not used, a log.lammps file is created with high-level status information. Each partition also writes to a log.lammps.N file where N is the partition ID. If the switch is specified in multi-partition mode, the high-level logfile is named "file" and each partition also logs information to a file.N. For both one-partition and multi-partition mode, if the specified file is "none", then no log files are created. Using a "log"_log.html command in the input script will override this setting. Option -plog will override the name of the partition log files file.N.

-nocite :pre

Disable writing the log.cite file, which is normally written to list references for specific cite-able features used during a LAMMPS run. See the "citation page"_http://lammps.sandia.gov/cite.html for more details.

-package style args .... :pre

Invoke the "package"_package.html command with style and args. The syntax is the same as if the command appeared at the top of the input script.
For example "-package gpu 2" or "-pk gpu 2" is the same as "package gpu 2"_package.html in the input script. The possible styles and args are documented on the "package"_package.html doc page. This switch can be used multiple times, e.g. to set options for the USER-INTEL and USER-OMP packages which can be used together. Along with the "-suffix" command-line switch, this is a convenient mechanism for invoking accelerator packages and their options without having to edit an input script. -partition 8x2 4 5 ... :pre Invoke LAMMPS in multi-partition mode. When LAMMPS is run on P processors and this switch is not used, LAMMPS runs in one partition, i.e. all P processors run a single simulation. If this switch is used, the P processors are split into separate partitions and each partition runs its own simulation. The arguments to the switch specify the number of processors in each partition. Arguments of the form MxN mean M partitions, each with N processors. Arguments of the form N mean a single partition with N processors. The sum of processors in all partitions must equal P. Thus the command "-partition 8x2 4 5" has 10 partitions and runs on a total of 25 processors. Running with multiple partitions can e useful for running "multi-replica simulations"_Section_howto.html#howto_5, where each replica runs on on one or a few processors. Note that with MPI installed on a machine (e.g. your desktop), you can run on more (virtual) processors than you have physical processors. To run multiple independent simulatoins from one input script, using multiple partitions, see "Section 6.4"_Section_howto.html#howto_4 of the manual. World- and universe-style "variables"_variable.html are useful in this context. -plog file :pre Specify the base name for the partition log files, so partition N writes log information to file.N. If file is none, then no partition log files are created. This overrides the filename specified in the -log command-line option. This option is useful when working with large numbers of partitions, allowing the partition log files to be suppressed (-plog none) or placed in a sub-directory (-plog replica_files/log.lammps) If this option is not used the log file for partition N is log.lammps.N or whatever is specified by the -log command-line option. -pscreen file :pre Specify the base name for the partition screen file, so partition N writes screen information to file.N. If file is none, then no partition screen files are created. This overrides the filename specified in the -screen command-line option. This option is useful when working with large numbers of partitions, allowing the partition screen files to be suppressed (-pscreen none) or placed in a sub-directory (-pscreen replica_files/screen). If this option is not used the screen file for partition N is screen.N or whatever is specified by the -screen command-line option. -restart restartfile {remap} datafile keyword value ... :pre Convert the restart file into a data file and immediately exit. This is the same operation as if the following 2-line input script were run: read_restart restartfile {remap} write_data datafile keyword value ... :pre Note that the specified restartfile and datafile can have wild-card characters ("*",%") as described by the "read_restart"_read_restart.html and "write_data"_write_data.html commands. But a filename such as file.* will need to be enclosed in quotes to avoid shell expansion of the "*" character. Note that following restartfile, the optional flag {remap} can be used. 
This has the same effect as adding it to the "read_restart"_read_restart.html command, as explained on its doc page.  This is only useful if the reading of the restart file triggers an error that atoms have been lost.  In that case, use of the remap flag should allow the data file to still be produced.

Also note that following datafile, the same optional keyword/value pairs can be listed as used by the "write_data"_write_data.html command.

-reorder nth N
-reorder custom filename :pre

Reorder the processors in the MPI communicator used to instantiate LAMMPS, in one of several ways.  The original MPI communicator ranks all P processors from 0 to P-1.  The mapping of these ranks to physical processors is done by MPI before LAMMPS begins.  It may be useful in some cases to alter the rank order.  E.g. to insure that cores within each node are ranked in a desired order.  Or when using the "run_style verlet/split"_run_style.html command with 2 partitions to insure that a specific Kspace processor (in the 2nd partition) is matched up with a specific set of processors in the 1st partition.  See the "Section 5"_Section_accelerate.html doc pages for more details.

If the keyword {nth} is used with a setting {N}, then it means every Nth processor will be moved to the end of the ranking.  This is useful when using the "run_style verlet/split"_run_style.html command with 2 partitions via the -partition command-line switch.  The first set of processors will be in the first partition, the 2nd set in the 2nd partition.  The -reorder command-line switch can alter this so that the 1st N procs in the 1st partition and one proc in the 2nd partition will be ordered consecutively, e.g. as the cores on one physical node.  This can boost performance.  For example, if you use "-reorder nth 4" and "-partition 9 3" and you are running on 12 processors, the processors will be reordered from

0 1 2 3 4 5 6 7 8 9 10 11 :pre

to

0 1 2 4 5 6 8 9 10 3 7 11 :pre

so that the processors in each partition will be

0 1 2 4 5 6 8 9 10
3 7 11 :pre

See the "processors"_processors.html command for how to insure processors from each partition could then be grouped optimally for quad-core nodes.

If the keyword is {custom}, then a file that specifies a permutation of the processor ranks is also specified.  The format of the reorder file is as follows.  Any number of initial blank or comment lines (starting with a "#" character) can be present.  These should be followed by P lines of the form:

I J :pre

where P is the number of processors LAMMPS was launched with.  Note that if running in multi-partition mode (see the -partition switch above) P is the total number of processors in all partitions.  The I and J values describe a permutation of the P processors.  Every I and J should be values from 0 to P-1 inclusive.  In the set of P I values, every proc ID should appear exactly once.  Ditto for the set of P J values.  A single I,J pairing means that the physical processor with rank I in the original MPI communicator will have rank J in the reordered communicator.  A small example file is shown below.

Note that rank ordering can also be specified by many MPI implementations, either by environment variables that specify how to order physical processors, or by config files that specify what physical processors to assign to each MPI rank.  The -reorder switch simply gives you a portable way to do this without relying on MPI itself.  See the "processors out"_processors.html command for how to output info on the final assignment of physical processors to the LAMMPS simulation domain.
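As an illustration of the {custom} file format described above, here is a hypothetical reorder file for a run with P = 4 processors that simply reverses the default rank order; the particular I J pairs are made up for illustration:

# reorder file for P = 4: reverse the rank order
0 3
1 2
2 1
3 0 :pre

With this file, the physical processor with rank 0 in the original communicator becomes rank 3 in the reordered one, rank 1 becomes rank 2, and so on.  Note that each I value and each J value from 0 to 3 appears exactly once, as required.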
-screen file :pre

Specify a file for LAMMPS to write its screen information to.  In one-partition mode, if the switch is not used, LAMMPS writes to the screen.  If this switch is used, LAMMPS writes to the specified file instead and you will see no screen output.  In multi-partition mode, if the switch is not used, high-level status information is written to the screen.  Each partition also writes to a screen.N file where N is the partition ID.  If the switch is specified in multi-partition mode, the high-level screen dump is named "file" and each partition also writes screen information to a file.N.  For both one-partition and multi-partition mode, if the specified file is "none", then no screen output is performed.  Option -pscreen will override the name of the partition screen files file.N.

-suffix style args :pre

Use variants of various styles if they exist.  The specified style can be {cuda}, {gpu}, {intel}, {kk}, {omp}, {opt}, or {hybrid}.  These refer to optional packages that LAMMPS can be built with, as described above in "Section 2.3"_#start_3.  The "gpu" style corresponds to the GPU package, the "intel" style to the USER-INTEL package, the "kk" style to the KOKKOS package, the "opt" style to the OPT package, and the "omp" style to the USER-OMP package.

The hybrid style is the only style that accepts arguments.  It allows for two packages to be specified.  The first package specified is the default and will be used if it is available.  If no style is available for the first package, the style for the second package will be used if available.  For example, "-suffix hybrid intel omp" will use styles from the USER-INTEL package if they are installed and available, but styles from the USER-OMP package otherwise.

Along with the "-package" command-line switch, this is a convenient mechanism for invoking accelerator packages and their options without having to edit an input script.

As an example, all of the packages provide a "pair_style lj/cut"_pair_lj.html variant, with style names lj/cut/gpu, lj/cut/intel, lj/cut/kk, lj/cut/omp, and lj/cut/opt.  A variant style can be specified explicitly in your input script, e.g. pair_style lj/cut/gpu.  If the -suffix switch is used the specified suffix (gpu,intel,kk,omp,opt) is automatically appended whenever your input script command creates a new "atom"_atom_style.html, "pair"_pair_style.html, "fix"_fix.html, "compute"_compute.html, or "run"_run_style.html style.  If the variant version does not exist, the standard version is created.

For the GPU package, using this command-line switch also invokes the default GPU settings, as if the command "package gpu 1" were used at the top of your input script.  These settings can be changed by using the "-package gpu" command-line switch or the "package gpu"_package.html command in your script.

For the USER-INTEL package, using this command-line switch also invokes the default USER-INTEL settings, as if the command "package intel 1" were used at the top of your input script.  These settings can be changed by using the "-package intel" command-line switch or the "package intel"_package.html command in your script.  If the USER-OMP package is also installed, the hybrid style with "intel omp" arguments can be used to make the omp suffix a second choice, if a requested style is not available in the USER-INTEL package.  It will also invoke the default USER-OMP settings, as if the command "package omp 0" were used at the top of your input script.
These settings can be changed by using the "-package omp" command-line switch or the "package omp"_package.html command in your script.

For the KOKKOS package, using this command-line switch also invokes the default KOKKOS settings, as if the command "package kokkos" were used at the top of your input script.  These settings can be changed by using the "-package kokkos" command-line switch or the "package kokkos"_package.html command in your script.

For the USER-OMP package, using this command-line switch also invokes the default USER-OMP settings, as if the command "package omp 0" were used at the top of your input script.  These settings can be changed by using the "-package omp" command-line switch or the "package omp"_package.html command in your script.

The "suffix"_suffix.html command can also be used within an input script to set a suffix, or to turn off or back on any suffix setting made via the command line.

-var name value1 value2 ... :pre

Specify a variable that will be defined for substitution purposes when the input script is read.  This switch can be used multiple times to define multiple variables.  "Name" is the variable name which can be a single character (referenced as $x in the input script) or a full string (referenced as $\{abc\}).  An "index-style variable"_variable.html will be created and populated with the subsequent values, e.g. a set of filenames.  Using this command-line option is equivalent to putting the line "variable name index value1 value2 ..." at the beginning of the input script.  Defining an index variable as a command-line argument overrides any setting for the same index variable in the input script, since index variables cannot be re-defined.  See the "variable"_variable.html command for more info on defining index and other kinds of variables and "this section"_Section_commands.html#cmd_2 for more info on using variables in input scripts.

NOTE: Currently, the command-line parser looks for arguments that start with "-" to indicate new switches.  Thus you cannot specify multiple variable values if any of them start with a "-", e.g. a negative numeric value.  It is OK if the first value1 starts with a "-", since it is automatically skipped.

:line

2.7 LAMMPS screen output :h4,link(start_7)

As LAMMPS reads an input script, it prints information to both the screen and a log file about significant actions it takes to setup a simulation.  When the simulation is ready to begin, LAMMPS performs various initializations and prints the amount of memory (in MBytes per processor) that the simulation requires.  It also prints details of the initial thermodynamic state of the system.  During the run itself, thermodynamic information is printed periodically, every few timesteps.  When the run concludes, LAMMPS prints the final thermodynamic state and a total run time for the simulation.  It then appends statistics about the CPU time and storage requirements for the simulation.
An example set of statistics is shown here:

Loop time of 2.81192 on 4 procs for 300 steps with 2004 atoms :pre

Performance: 18.436 ns/day  1.302 hours/ns  106.689 timesteps/s
97.0% CPU use with 4 MPI tasks x no OpenMP threads :pre

MPI task timings breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 1.9808     | 2.0134     | 2.0318     |   1.4 | 71.60
Bond    | 0.0021894  | 0.0060319  | 0.010058   |   4.7 |  0.21
Kspace  | 0.3207     | 0.3366     | 0.36616    |   3.1 | 11.97
Neigh   | 0.28411    | 0.28464    | 0.28516    |   0.1 | 10.12
Comm    | 0.075732   | 0.077018   | 0.07883    |   0.4 |  2.74
Output  | 0.00030518 | 0.00042665 | 0.00078821 |   1.0 |  0.02
Modify  | 0.086606   | 0.086631   | 0.086668   |   0.0 |  3.08
Other   |            | 0.007178   |            |       |  0.26 :pre

Nlocal:    501 ave 508 max 490 min
Histogram: 1 0 0 0 0 0 1 1 0 1
Nghost:    6586.25 ave 6628 max 6548 min
Histogram: 1 0 1 0 0 0 1 0 0 1
Neighs:    177007 ave 180562 max 170212 min
Histogram: 1 0 0 0 0 0 0 1 1 1 :pre

Total # of neighbors = 708028
Ave neighs/atom = 353.307
Ave special neighs/atom = 2.34032
Neighbor list builds = 26
Dangerous builds = 0 :pre

The first section provides a global loop timing summary.  The {loop time} is the total wall time for the section.  The {Performance} line is provided for convenience, to help predict the number of loop continuations required and to compare performance with other, similar MD codes.  The {CPU use} line provides the CPU utilization per MPI task; it should be close to 100% times the number of OpenMP threads (or 1 if no OpenMP).  Lower numbers correspond to delays due to file I/O or insufficient thread utilization.

The MPI task section gives the breakdown of the CPU run time (in seconds) into major categories:

{Pair} stands for all non-bonded force computation
{Bond} stands for bonded interactions: bonds, angles, dihedrals, impropers
{Kspace} stands for reciprocal space interactions: Ewald, PPPM, MSM
{Neigh} stands for neighbor list construction
{Comm} stands for communicating atoms and their properties
{Output} stands for writing dumps and thermo output
{Modify} stands for fixes and computes called by them
{Other} is the remaining time :ul

For each category, there is a breakdown of the minimum, average, and maximum wall time a processor spent on this section, as well as the variation from the average time.  Together these numbers allow you to gauge the amount of load imbalance in this segment of the calculation.  Ideally the difference between minimum, maximum and average is small, and thus the variation from the average is close to zero.  The final column shows the percentage of the total loop time that is spent in this section.

When using the "timer full"_timer.html setting, an additional column is present that also prints the CPU utilization in percent.  In addition, when both {timer full} and the "package omp"_package.html command are active, a similar timing summary of time spent in threaded regions is provided, to monitor thread utilization and load balance.  A new entry is the {Reduce} section, which lists the time spent in reducing the per-thread data elements to the storage for non-threaded computation.  These thread timings are taken from the first MPI rank only; since the breakdown for MPI tasks can change from MPI rank to MPI rank, this breakdown can be very different for individual ranks.
Here is an example output for this section:

Thread timings breakdown (MPI rank 0):
Total threaded time 0.6846 / 90.6%
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.5127     | 0.5147     | 0.5167     |   0.3 | 75.18
Bond    | 0.0043139  | 0.0046779  | 0.0050418  |   0.5 |  0.68
Kspace  | 0.070572   | 0.074541   | 0.07851    |   1.5 | 10.89
Neigh   | 0.084778   | 0.086969   | 0.089161   |   0.7 | 12.70
Reduce  | 0.0036485  | 0.003737   | 0.0038254  |   0.1 |  0.55 :pre

The third section lists the number of owned atoms (Nlocal), ghost atoms (Nghost), and pair-wise neighbors stored per processor.  The max and min values give the spread of these values across processors with a 10-bin histogram showing the distribution.  The total number of histogram counts is equal to the number of processors.

The last section gives aggregate statistics for pair-wise neighbors and special neighbors that LAMMPS keeps track of (see the "special_bonds"_special_bonds.html command).  The number of times neighbor lists were rebuilt during the run is given as well as the number of potentially "dangerous" rebuilds.  If atom movement triggered neighbor list rebuilding (see the "neigh_modify"_neigh_modify.html command), then dangerous reneighborings are those that were triggered on the first timestep atom movement was checked for.  If this count is non-zero you may wish to reduce the delay factor to insure no force interactions are missed by atoms moving beyond the neighbor skin distance before a rebuild takes place.

If an energy minimization was performed via the "minimize"_minimize.html command, additional information is printed, e.g.

Minimization stats:
  Stopping criterion = linesearch alpha is zero
  Energy initial, next-to-last, final =
       -6372.3765206 -8328.46998942 -8328.46998942
  Force two-norm initial, final = 1059.36 5.36874
  Force max component initial, final = 58.6026 1.46872
  Final line search alpha, max atom move = 2.7842e-10 4.0892e-10
  Iterations, force evaluations = 701 1516 :pre

The first line prints the criterion that determined the minimization to be completed.  The third line lists the initial and final energy, as well as the energy on the next-to-last iteration.  The next 2 lines give a measure of the gradient of the energy (force on all atoms).  The 2-norm is the "length" of this force vector; the inf-norm is the largest component.  Then comes some information about the line search, and statistics on how many iterations and force evaluations the minimizer required.  Multiple force evaluations are typically done at each iteration to perform a 1d line minimization in the search direction.

If a "kspace_style"_kspace_style.html long-range Coulombics solve was performed during the run (PPPM, Ewald), then additional information is printed, e.g.

FFT time (% of Kspce) = 0.200313 (8.34477)
FFT Gflps 3d 1d-only = 2.31074 9.19989 :pre

The first line gives the time spent doing 3d FFTs (4 per timestep) and the fraction it represents of the total KSpace time (listed above).  Each 3d FFT requires computation (3 sets of 1d FFTs) and communication (transposes).  The total flops performed is 5Nlog_2(N), where N is the number of points in the 3d grid.  The FFTs are timed with and without the communication and a Gflop rate is computed.  The 3d rate is with communication; the 1d rate is without (just the 1d FFTs).  Thus you can estimate what fraction of your FFT time was spent in communication, roughly 75% in the example above.
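To make the arithmetic behind that estimate explicit: the same 1d FFT computations are performed in both timings, so the communication fraction can be estimated from the ratio of the two Gflop rates reported above:

1 - (3d rate / 1d-only rate) = 1 - 2.31074/9.19989 = ~0.75 :pre

I.e. roughly 75% of the FFT wall time in this example went into the communication (transposes) rather than into the 1d FFT computation itself.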
:line

2.8 Tips for users of previous LAMMPS versions :h4,link(start_8)

The current C++ version began with a complete rewrite of LAMMPS 2001, which was written in F90.  Features of earlier versions of LAMMPS are listed in "Section 13"_Section_history.html.  The F90 and F77 versions (2001 and 99) are also freely distributed as open-source codes; check the "LAMMPS WWW Site"_lws for distribution information if you prefer those versions.  The 99 and 2001 versions are no longer under active development; they do not have all the features of C++ LAMMPS.

If you are a previous user of LAMMPS 2001, these are the most significant changes you will notice in C++ LAMMPS:

(1) The names and arguments of many input script commands have changed.  All commands are now a single word (e.g. read_data instead of read data).

(2) All the functionality of LAMMPS 2001 is included in C++ LAMMPS, but you may need to specify the relevant commands in different ways.

(3) The format of the data file can be streamlined for some problems.  See the "read_data"_read_data.html command for details.  The data file section "Nonbond Coeff" has been renamed to "Pair Coeff" in C++ LAMMPS.

(4) Binary restart files written by LAMMPS 2001 cannot be read by C++ LAMMPS with a "read_restart"_read_restart.html command.  This is because they were output by F90, which writes in a different binary format than C or C++ writes or reads.  Use the {restart2data} tool provided with LAMMPS 2001 to convert the 2001 restart file to a text data file.  Then edit the data file as necessary before using the C++ LAMMPS "read_data"_read_data.html command to read it in.

(5) There are numerous small numerical changes in C++ LAMMPS that mean you will not get identical answers when comparing to a 2001 run.  However, your initial thermodynamic energy and MD trajectory should be close if you have set up the problem the same for both codes.

diff --git a/doc/src/accelerate_kokkos.txt b/doc/src/accelerate_kokkos.txt
index 8d87751f9..2b07ed035 100644
--- a/doc/src/accelerate_kokkos.txt
+++ b/doc/src/accelerate_kokkos.txt
@@ -1,493 +1,493 @@

"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.3 KOKKOS package :h5

The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia).  The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.  The Kokkos library is part of "Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be downloaded from "Github"_https://github.com/kokkos/kokkos.

Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS.  First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core CPU.
The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations.  Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists.  The layout is chosen to optimize performance on different platforms.  Again this functionality is hidden from the developer, and does not affect how the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed.  All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine.  The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos.

Kokkos currently provides support for 3 modes of execution (per MPI task).  These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs), and OpenMP (for Intel Phi).  Note that the KOKKOS package supports running on the Phi in native mode, not offload mode like the USER-INTEL package supports.  You choose the mode at build time to produce an executable compatible with specific hardware.

Here is a quick overview of how to use the KOKKOS package for CPU acceleration, assuming one or more 16-core nodes.  More details follow.

use a C++11 compatible compiler
make yes-kokkos
make mpi KOKKOS_DEVICES=OpenMP    # build with the KOKKOS package
make kokkos_omp                   # or Makefile.kokkos_omp already has variable set :pre

mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj             # 1 node, 16 MPI tasks/node, no threads
mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj  # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj          # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj  # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
use KOKKOS styles in your input script :ul

Here is a quick overview of how to use the KOKKOS package for GPUs, assuming one or more nodes, each with 16 cores and a GPU.  More details follow.
use a C++11 compatible compiler
KOKKOS_DEVICES = Cuda, OpenMP
KOKKOS_ARCH = Kepler35
make yes-kokkos
make machine :pre

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre

Here is a quick overview of how to use the KOKKOS package for the Intel Phi:

use a C++11 compatible compiler
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC
make yes-kokkos
make machine :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):

mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre

[Required hardware/software:]

Kokkos support within LAMMPS must be built with a C++11 compatible compiler.  If using gcc, version 4.7.2 or later is required.

To build with Kokkos support for CPUs, your compiler must support the OpenMP interface.  You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.

To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software version 7.5 or later must be installed on your system.  See the discussion for the "GPU"_accelerate_gpu.html package for details of how to check and do this.

NOTE: For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later).  The Kokkos library exploits texture cache options not supported by Tesla generation GPUs (or older).

To build with Kokkos support for Intel Xeon Phi coprocessors, your system must be configured to use them in "native" mode, not "offload" mode like the USER-INTEL package supports.

[Building LAMMPS with the KOKKOS package:]

You must choose at build time whether to build for CPUs (OpenMP), GPUs, or Phi.

You can do any of these in one line, using the suitable make command line flags as described in "Section 4"_Section_packages.html of the manual.  If run from the src directory, these commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda_mpi, and lmp_kokkos_phi.  Note that the OMP and PHI options use src/MAKE/Makefile.mpi as the starting Makefile.machine.  The CUDA option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi.

The latter two steps can be done using the "-k on", "-pk kokkos" and "-sf kk" "command-line switches"_Section_start.html#start_6 respectively.  Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package kokkos"_package.html or "suffix kk"_suffix.html commands respectively to your input script.
Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

CPU-only (only MPI, no threading):

cd lammps/src
make yes-kokkos
make kokkos_mpi_only :pre

Intel Xeon Phi (Intel Compiler, Intel MPI):

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

CPUs and GPUs (with MPICH or OpenMPI):

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpi :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the make command line, which requires a GNU-compatible make command.  Try "gmake" if your system's standard make complains.

NOTE: If you build using make line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a "make clean-all" or "make clean-machine" before each build.  This is to force all the KOKKOS-dependent files to be re-compiled with the new options.

NOTE: Currently, there are no precision options with the KOKKOS package.  All compilation and computation is performed in double precision.

There are other allowed options when building with the KOKKOS package.  As above, they can be set either as variables on the make command line or in Makefile.machine.  This is the full list of options, including those discussed above.  Each takes a value shown below.  The default value is listed, which is set in the lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul

KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within LAMMPS).  KOKKOS_DEVICES=OpenMP means that OpenMP will be used.  KOKKOS_DEVICES=Pthreads means that pthreads will be used.  KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

If KOKKOS_DEVICES=Cuda, then the low-level Makefile in the src/MAKE directory must use "nvcc" as its compiler, via its CC setting.  For best performance its CCFLAGS setting should use -O3 and have a KOKKOS_ARCH setting that matches the compute capability of your NVIDIA hardware and software installation, e.g. KOKKOS_ARCH=Kepler30.  Note the minimal required compute capability is 2.0, but this will give significantly reduced performance compared to Kepler generation GPUs with compute capability 3.x.  For the LINK setting, "nvcc" should not be used; instead use g++ or another compiler suitable for linking C++ applications.  Often you will want to use your MPI compiler wrapper for this setting (i.e. mpicxx).  Finally, the low-level Makefile must also have a "Compilation rule" for creating *.o files from *.cu files.  See src/Makefile.cuda for an example of a low-level Makefile with all of these settings.

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not migrate during a simulation.  KOKKOS_USE_TPLS=hwloc should always be used if running with KOKKOS_DEVICES=Pthreads for pthreads.
It is not necessary with KOKKOS_DEVICES=OpenMP, because OpenMP provides alternative methods via environment variables for binding threads to hardware cores.  More info on binding threads to cores is given in "Section 5.3"_Section_accelerate.html#acc_3.

KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an Intel Phi processor.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism on most Unix platforms.  This library is not available on all platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style within LAMMPS.  KOKKOS_DEBUG=yes enables printing of run-time debugging information that can be useful.  It also enables runtime bounds checking on Kokkos data structures.

KOKKOS_CUDA_OPTIONS are additional options for CUDA.

For more information on Kokkos see the Kokkos programmers' guide here: /lib/kokkos/doc/Kokkos_PG.pdf.

[Run with the KOKKOS package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node.  E.g. the mpirun command in MPICH does this via its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many OpenMP threads per MPI task will be used (via the "-k" command-line switch discussed below).  Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi coprocessor support you need to insure there are one or more MPI tasks per coprocessor, and choose the number of coprocessor threads to use per MPI task (via the "-k" command-line switch discussed below).  The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coprocessor is designed to run, otherwise performance will suffer.  This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core.  Note that with the KOKKOS package you do not need to specify how many Phi coprocessors there are per node; each coprocessor is simply treated as running some number of MPI tasks.

You must use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package.  It takes additional arguments for hardware settings appropriate to your system.  Those arguments are "documented here"_Section_start.html#start_6.  The two most commonly used options are:

-k on t Nt g Ng :pre

The "t Nt" option applies to host=OMP (even if device=CUDA) and host=MIC.  For host=OMP, it specifies how many OpenMP threads per MPI task to use within a node.  For host=MIC, it specifies how many Xeon Phi threads per MPI task to use within a node.  The default is Nt = 1.  Note that for host=OMP this is effectively MPI-only mode which may be fine.  But for host=MIC you will typically end up using far less than all the 240 available threads, which could give very poor performance.

The "g Ng" option applies to device=CUDA.  It specifies how many GPUs per compute node to use.  The default is 1, so this only needs to be specified if you have 2 or more GPUs per compute node.

The "-k on" switch also issues a "package kokkos" command (with no additional arguments) which sets various KOKKOS options to default values, as discussed on the "package"_package.html command doc page.
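For example, on a hypothetical node with 16 physical cores and 2 GPUs, these options could be combined as follows; the executable names match those produced by the build commands above, and in.lj is a placeholder input script:

mpirun -np 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj            # 4 MPI tasks x 4 OpenMP threads = 16 cores
mpirun -np 2 lmp_kokkos_cuda_mpi -k on t 8 g 2 -sf kk -in in.lj   # 2 MPI tasks, one per GPU, 8 threads each :pre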
Use the "-sf kk" "command-line switch"_Section_start.html#start_6, which will automatically append "kk" to styles that support it. Use the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if you wish to change any of the default "package kokkos"_package.html optionns set by the "-k on" "command-line switch"_Section_start.html#start_6. Note that the default for the "package kokkos"_package.html command is to use "full" neighbor lists and set the Newton flag to "off" for both pairwise and bonded interactions. This typically gives fastest performance. If the "newton"_newton.html command is used in the input script, it can override the Newton flag defaults. However, when running in MPI-only mode with 1 thread per MPI task, it will typically be faster to use "half" neighbor lists and set the Newton flag to "on", just as is the case for non-accelerated pair styles. You can do this with the "-pk" "command-line switch"_Section_start.html#start_6. [Or run with the KOKKOS package by editing an input script:] The discussion above for the mpirun/mpiexec command and setting appropriate thread and GPU values for host=OMP or host=MIC or device=CUDA are the same. You must still use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package, and specify its additional arguments for hardware options appropriate to your system, as documented above. Use the "suffix kk"_suffix.html command, or you can explicitly add a "kk" suffix to individual styles in your input script, e.g. pair_style lj/cut/kk 2.5 :pre You only need to use the "package kokkos"_package.html command if you wish to change any of its option defaults, as set by the "-k on" "command-line switch"_Section_start.html#start_6. [Speed-ups to expect:] The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enable styles are used, and the problem size. Generally speaking, the following rules of thumb apply: When running on CPUs only, with a single thread per MPI task, performance of a KOKKOS style is somewhere between the standard (un-accelerated) styles (MPI-only mode), and those provided by the USER-OMP package. However the difference between all 3 is small (less than 20%). :ulb,l When running on CPUs only, with multiple threads per MPI task, performance of a KOKKOS style is a bit slower than the USER-OMP package. :l When running large number of atoms per GPU, KOKKOS is typically faster than the GPU package. :l When running on Intel Xeon Phi, KOKKOS is not as fast as the USER-INTEL package, which is optimized for that hardware. :l :ule See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the KOKKOS package on different hardware. [Guidelines for best performance:] Here are guidline for using the KOKKOS package on the different hardware configurations listed above. Many of the guidelines use the "package kokkos"_package.html command See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations. [Running on a multi-core CPU:] If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the "-k" "command-line switch"_Section_start.html#start_6. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s). 
You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8:  mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding.  With OpenMP 3.1 (gcc 4.7 or later, intel 12 or later) setting the environment variable OMP_PROC_BIND=true should be sufficient.  For binding threads with the KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option (see "this section"_Section_packages.html#KOKKOS of the manual for details).

[Running on GPUs:]

Insure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see "this section"_Section_packages.html#KOKKOS of the manual for details).

The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_6 to specify the number of GPUs per node, and the number of threads per MPI task.  As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N.  With one GPU (and one MPI task) it may be faster to use less than all the available cores, by setting threads/task to a smaller value.  This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

NOTE: When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled.  This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU.  Invoking a non-Kokkos fix or compute, or performing I/O for "thermo"_thermo_style.html or "dump"_dump.html output will cause data to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package.  We plan to support this in the future, similar to the GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package.  We hope to support this in the future, similar to the GPU package in LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes.  The latter insures code is correctly compiled for the Intel Phi.  The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos.  In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores.  One core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI tasks/node.  The "-k on t Nt" command-line switch sets the number of threads/task as Nt.  The product of these 2 values should be N, i.e. 240 or 224.  Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.

Examples of mpirun commands that follow these rules are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute node should equal the number of GPUs per compute node.  In the future Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models.  Specifically, Kokkos requires extensive C++ support from the kernel language.  This is expected to change in the future.