diff --git a/doc/src/Section_accelerate.txt b/doc/src/Section_accelerate.txt index 881235888..bb0c93b8a 100644 --- a/doc/src/Section_accelerate.txt +++ b/doc/src/Section_accelerate.txt @@ -1,391 +1,391 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc - "Next Section"_Section_howto.html :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line 5. Accelerating LAMMPS performance :h3 This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines. There are two thrusts to the discussion that follows. The first is using code options that implement alternate algorithms that can speed-up a simulation. The second is to use one of the several accelerator packages provided with LAMMPS that contain code optimized for certain kinds of hardware, including multi-core CPUs, GPUs, and Intel Xeon Phi coprocessors. 5.1 "Measuring performance"_#acc_1 :ulb,l 5.2 "Algorithms and code options to boost performance"_#acc_2 :l 5.3 "Accelerator packages with optimized styles"_#acc_3 :l 5.3.1 "GPU package"_accelerate_gpu.html :l 5.3.2 "USER-INTEL package"_accelerate_intel.html :l 5.3.3 "KOKKOS package"_accelerate_kokkos.html :l 5.3.4 "USER-OMP package"_accelerate_omp.html :l 5.3.5 "OPT package"_accelerate_opt.html :l 5.4 "Comparison of various accelerator packages"_#acc_4 :l :ule The "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site gives performance results for the various accelerator packages discussed in Section 5.3, for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms. :line :line 5.1 Measuring performance :h4,link(acc_1) Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are. The best way to do this is to run your system (actual number of atoms) for a modest number of timesteps (say 100 steps) on several different processor counts, including a single processor if possible. Do this for an equilibrium version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for 1000s of timesteps to get accurate timings; you can simply extrapolate from short runs. For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. "This section"_Section_start.html#start_7 of the manual has an overview. Running on one (or a few) processors should give a good estimate of the serial performance and what portions of the timestep are taking the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability. I.e. if the simulation runs 16x faster on 16 processors, it's 100% parallel efficient; if it runs 8x faster on 16 processors, it's 50% efficient. The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling.
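For example, a minimal sketch of such a scaling study, assuming an executable named lmp_machine and an input script in.script that already runs for roughly 100 steps, is:

mpirun -np 1 lmp_machine -in in.script -log log.1
mpirun -np 4 lmp_machine -in in.script -log log.4
mpirun -np 16 lmp_machine -in in.script -log log.16 :pre

Dividing the 1-processor loop time by the 16-processor loop time, and then by 16, gives the parallel efficiency discussed above; the timing breakdown at the end of each log file shows which operations stop scaling first.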
Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive. Other important details in the timing info are the histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait when communication occurs for other processors to catch up. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load-imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile LAMMPS, to obtain synchronized timings. :line 5.2 General strategies :h4,link(acc_2) NOTE: this section 5.2 is still a work in progress Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.

rRESPA
2-FFT PPPM
Staggered PPPM
single vs double PPPM
partial charge PPPM
verlet/split run style
processor command for proc layout and numa layout
load-balancing: balance and fix balance :ul

2-FFT PPPM, also called {analytic differentiation} or {ad} PPPM, uses 2 FFTs instead of the 4 FFTs used by the default {ik differentiation} PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM. Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to halve the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful when using a relative accuracy of slightly greater than 1e-5 and above. :c,image(JPG/rhodo_staggered.jpg) NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat. :line 5.3 Packages with optimized styles :h4,link(acc_3) Accelerated versions of various "pair styles"_pair_style.html, "fixes"_fix.html, "computes"_compute.html, and other commands have been added to LAMMPS, which will typically run faster than the standard non-accelerated versions. Some require appropriate hardware to be present on your system, e.g. GPUs or Intel Xeon Phi coprocessors. All of these commands are in packages provided with LAMMPS.
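A quick way to see which accelerator packages and accelerated styles were actually compiled into your executable is the "-h" "command-line switch"_Section_start.html#start_6, which in recent LAMMPS versions prints the installed packages and the available styles. This is only a sketch, assuming an executable named lmp_machine; the exact format of the help output varies between versions:

lmp_machine -h | grep gpu :pre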
An overview of packages is given in "Section packages"_Section_packages.html. These are the accelerator packages currently in LAMMPS, either as standard or user packages: "GPU Package"_accelerate_gpu.html : for NVIDIA GPUs as well as OpenCL support "USER-INTEL Package"_accelerate_intel.html : for Intel CPUs and Intel Xeon Phi "KOKKOS Package"_accelerate_kokkos.html : for NVIDIA GPUs, Intel Xeon Phi, and OpenMP threading "USER-OMP Package"_accelerate_omp.html : for OpenMP threading and generic CPU optimizations "OPT Package"_accelerate_opt.html : generic CPU optimizations :tb(s=:) <!-- RST .. toctree:: :maxdepth: 1 :hidden: accelerate_gpu accelerate_intel accelerate_kokkos accelerate_omp accelerate_opt END_RST --> Inverting this list, LAMMPS currently has acceleration support for three kinds of hardware, via the listed packages: Many-core CPUs : "USER-INTEL"_accelerate_intel.html, "KOKKOS"_accelerate_kokkos.html, "USER-OMP"_accelerate_omp.html, "OPT"_accelerate_opt.html packages NVIDIA GPUs : "GPU"_accelerate_gpu.html, "KOKKOS"_accelerate_kokkos.html packages Intel Phi : "USER-INTEL"_accelerate_intel.html, "KOKKOS"_accelerate_kokkos.html packages :tb(s=:) Which package is fastest for your hardware may depend on the size of problem you are running and what commands (accelerated and non-accelerated) are invoked by your input script. While these doc pages include performance guidelines, there is no substitute for trying out the different packages appropriate to your hardware. Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that uses the style is identical, the functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects. For example, all of these styles are accelerated variants of the Lennard-Jones "pair_style lj/cut"_pair_lj.html: "pair_style lj/cut/gpu"_pair_lj.html "pair_style lj/cut/intel"_pair_lj.html "pair_style lj/cut/kk"_pair_lj.html "pair_style lj/cut/omp"_pair_lj.html "pair_style lj/cut/opt"_pair_lj.html :ul To see what accelerated styles are currently available, see "Section 3.5"_Section_commands.html#cmd_5 of the manual. The doc pages for individual commands (e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also list any accelerated variants available for that style. To use an accelerator package in LAMMPS, and one or more of the styles it provides, follow these general steps.
Details vary from package to package and are explained in the individual accelerator doc pages, listed above: build the accelerator library | only for GPU package | install the accelerator package | make yes-opt, make yes-user-intel, etc | add compile/link flags to Makefile.machine in src/MAKE | only for USER-INTEL, KOKKOS, USER-OMP, OPT packages | re-build LAMMPS | make machine | prepare and test a regular LAMMPS simulation | lmp_machine -in in.script; mpirun -np 32 lmp_machine -in in.script | enable specific accelerator support via '-k on' "command-line switch"_Section_start.html#start_6, | only needed for KOKKOS package | set any needed options for the package via "-pk" "command-line switch"_Section_start.html#start_6 or "package"_package.html command, | only if defaults need to be changed | use accelerated styles in your input via "-sf" "command-line switch"_Section_start.html#start_6 or "suffix"_suffix.html command | lmp_machine -in in.script -sf gpu :tb(c=2,s=|) -Note that the first 4 steps can be done as a single command, using the -src/Make.py tool. This tool is discussed in "Section +Note that the first 4 steps can be done with +suitable make command invocations. This is discussed in "Section 4"_Section_packages.html of the manual, and their use is illustrated in the individual accelerator sections. Typically these steps only need to be done once, to create an executable that uses one or more accelerator packages. The last 4 steps can all be done from the command-line when LAMMPS is launched, without changing your input script, as illustrated in the individual accelerator sections. Or you can add "package"_package.html and "suffix"_suffix.html commands to your input script. NOTE: With a few exceptions, you can build a single LAMMPS executable with all its accelerator packages installed. Note however that the USER-INTEL and KOKKOS packages require you to choose one of their hardware options when building for a specific platform. I.e. CPU or Phi option for the USER-INTEL package. Or the OpenMP, Cuda, or Phi option for the KOKKOS package. These are the exceptions. You cannot build a single executable with: both the USER-INTEL Phi and KOKKOS Phi options the USER-INTEL Phi or KOKKOS Phi option, and the GPU package :ul See the examples/accelerate/README and make.list files for sample Make.py commands that build LAMMPS with any or all of the accelerator packages. As an example, here is a command that builds with all the GPU-related packages installed (GPU, KOKKOS with Cuda), including settings to build the needed auxiliary GPU libraries for Kepler GPUs: Make.py -j 16 -p omp gpu kokkos -cc nvcc wrap=mpi \ -gpu mode=double arch=35 -kokkos cuda arch=35 lib-all file mpi :pre The examples/accelerate directory also has input scripts that can be used with all of the accelerator packages. See its README file for details. Likewise, the bench directory has FERMI and KEPLER and PHI sub-directories with Make.py commands and input scripts for using all the accelerator packages on various machines. See the README files in those dirs. As mentioned above, the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site gives performance results for the various accelerator packages for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms. Here is a brief summary of what the various packages provide. Details are in the individual accelerator sections.
Styles with a "gpu" suffix are part of the GPU package, and can be run on NVIDIA GPUs. The speed-up on a GPU depends on a variety of factors, discussed in the accelerator sections. :ulb,l Styles with an "intel" suffix are part of the USER-INTEL package. These styles support vectorized single and mixed precision calculations, in addition to full double precision. In extreme cases, this can provide speedups over 3.5x on CPUs. The package also supports acceleration in "offload" mode to Intel(R) Xeon Phi(TM) coprocessors. This can result in additional speedup over 2x depending on the hardware configuration. :l Styles with a "kk" suffix are part of the KOKKOS package, and can be run using OpenMP on multicore CPUs, on an NVIDIA GPU, or on an Intel Xeon Phi in "native" mode. The speed-up depends on a variety of factors, as discussed on the KOKKOS accelerator page. :l Styles with an "omp" suffix are part of the USER-OMP package and allow a pair-style to be run in multi-threaded mode using OpenMP. This can be useful on nodes with high-core counts when using less MPI processes than cores is advantageous, e.g. when running with PPPM so that FFTs are run on fewer MPI processors or when the many MPI tasks would overload the available bandwidth for communication. :l Styles with an "opt" suffix are part of the OPT package and typically speed-up the pairwise calculations of your simulation by 5-25% on a CPU. :l :ule The individual accelerator package doc pages explain: what hardware and software the accelerated package requires how to build LAMMPS with the accelerated package how to run with the accelerated package either via command-line switches or modifying the input script speed-ups to expect guidelines for best performance restrictions :ul :line 5.4 Comparison of various accelerator packages :h4,link(acc_4) NOTE: this section still needs to be re-worked with additional KOKKOS and USER-INTEL information. The next section compares and contrasts the various accelerator options, since there are multiple ways to perform OpenMP threading, run on GPUs, and run on Intel Xeon Phi coprocessors. All 3 of these packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways. As a consequence, for a particular simulation on specific hardware, one package may be faster than the other. We give guidelines below, but the best way to determine which package is faster for your input script is to try both of them on your machine. See the benchmarking section below for examples where this has been done. [Guidelines for using each package optimally:] The GPU package allows you to assign multiple CPUs (cores) to a single GPU (a common configuration for "hybrid" nodes that contain multicore CPU(s) and GPU(s)) and works effectively in this mode. :ulb,l The GPU package moves per-atom data (coordinates, forces) back-and-forth between the CPU and GPU every timestep. The KOKKOS/CUDA package only does this on timesteps when a CPU calculation is required (e.g. to invoke a fix or compute that is non-GPU-ized). Hence, if you can formulate your input script to only use GPU-ized fixes and computes, and avoid doing I/O too often (thermo output, dump file snapshots, restart files), then the data transfer cost of the KOKKOS/CUDA package can be very low, causing it to run faster than the GPU package. :l The GPU package is often faster than the KOKKOS/CUDA package, if the number of atoms per GPU is smaller. 
The crossover point, in terms of atoms/GPU at which the KOKKOS/CUDA package becomes faster depends strongly on the pair style. For example, for a simple Lennard Jones system the crossover (in single precision) is often about 50K-100K atoms per GPU. When performing double precision calculations the crossover point can be significantly smaller. :l Both packages compute bonded interactions (bonds, angles, etc) on the CPU. If the GPU package is running with several MPI processes assigned to one GPU, the cost of computing the bonded interactions is spread across more CPUs and hence the GPU package can run faster. :l When using the GPU package with multiple CPUs assigned to one GPU, its performance depends to some extent on high bandwidth between the CPUs and the GPU. Hence its performance is affected if full 16 PCIe lanes are not available for each GPU. In HPC environments this can be the case if S2050/70 servers are used, where two devices generally share one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide full 16 lanes to each of the PCIe 2.0 16x slots. :l :ule [Differences between the two packages:] The GPU package accelerates only pair force, neighbor list, and PPPM calculations. :ulb,l The GPU package requires neighbor lists to be built on the CPU when using exclusion lists, hybrid pair styles, or a triclinic simulation box. :l :ule diff --git a/doc/src/accelerate_gpu.txt b/doc/src/accelerate_gpu.txt index 68e9fa477..2723b6e97 100644 --- a/doc/src/accelerate_gpu.txt +++ b/doc/src/accelerate_gpu.txt @@ -1,254 +1,249 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.1 GPU package :h5 The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and for "kspace_style pppm"_kspace_style.html for long-range Coulombics. It has the following general features: It is designed to exploit common GPU hardware configurations where one or more GPUs are coupled to many cores of one or more multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l Atom-based data (e.g. coordinates, forces) moves back-and-forth between the CPU(s) and GPU every timestep. :l Neighbor lists can be built on the CPU or on the GPU :l The charge assignment and force interpolation portions of PPPM can be run on the GPU. The FFT portion, which requires MPI communication between processors, runs on the CPU. :l Asynchronous force computations can be performed simultaneously on the CPU(s) and GPU. :l It allows for GPU computations to be performed in single or double precision, or in mixed-mode precision, where pairwise forces are computed in single precision, but accumulated into double-precision force vectors. :l LAMMPS-specific code is in the GPU package. It makes calls to a generic GPU library in the lib/gpu directory. This library provides NVIDIA support as well as more general OpenCL support, so that the same functionality can eventually be supported on a variety of GPU hardware. 
:l :ule Here is a quick overview of how to enable and use the GPU package: build the library in lib/gpu for your GPU hardware with the desired precision settings install the GPU package and build LAMMPS as usual use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU specify the # of GPUs per node use GPU styles in your input script :ul The latter two steps can be done using the "-pk gpu" and "-sf gpu" "command-line switches"_Section_start.html#start_6 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package gpu"_package.html or "suffix gpu"_suffix.html commands respectively to your input script. [Required hardware/software:] To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system: Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information Go to http://www.nvidia.com/object/cuda_get.html Install a driver and toolkit appropriate for your system (SDK is not necessary) Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul [Building LAMMPS with the GPU package:] This requires two steps (a,b): build the GPU library, then build LAMMPS with the GPU package. -You can do both these steps in one line, using the src/Make.py script, -described in "Section 4"_Section_packages.html of the manual. -Type "Make.py -h" for help. If run from the src directory, this -command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the -starting Makefile.machine: - -Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi :pre +You can do both these steps in one line as described in +"Section 4"_Section_packages.html of the manual. Or you can follow these two (a,b) steps: (a) Build the GPU library The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile. CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system CUDA_ARCH = needs to be appropriate to your GPUs CUDA_PREC = precision (double, mixed, single) you desire :ul See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings: CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double :pre The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings. To build the library, type: make -f Makefile.machine :pre If successful, it will produce the files libgpu.a and Makefile.lammps. The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed. Note that to change the precision of the GPU library, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", followed by the make command above. (b) Build LAMMPS with the GPU package cd lammps/src make yes-gpu make machine :pre No additional compile/link flags are needed in Makefile.machine. 
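Putting steps (a) and (b) together, a complete build might look like the sketch below. The Makefile name in lib/gpu and the "mpi" machine target are assumptions; substitute the ones that match your GPU hardware, CUDA installation, and the Makefile.machine you normally use:

cd lammps/lib/gpu
make -f Makefile.linux.double    # build the GPU library (here in double precision)
cd ../../src
make yes-gpu                     # install the GPU package
make mpi                         # re-build LAMMPS :pre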
Note that if you change the GPU library precision (discussed above) and rebuild the GPU library, then you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library. [Run with the GPU package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. When using the GPU package, you cannot assign more than one GPU to a single MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way. Likewise it may be more efficient to use fewer MPI tasks/node than the available # of CPU cores. Assignment of multiple MPI tasks to a GPU will happen automatically if you create more MPI tasks/node than there are GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be shared by 4 MPI tasks. Use the "-sf gpu" "command-line switch"_Section_start.html#start_6, which will automatically append "gpu" to styles that support it. Use the "-pk gpu Ng" "command-line switch"_Section_start.html#start_6 to set Ng = # of GPUs/node to use.

lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes :pre

Note that if the "-sf gpu" switch is used, it also issues a default "package gpu 1"_package.html command, which sets the number of GPUs/node to 1. Using the "-pk" switch explicitly allows for setting the number of GPUs/node to use and additional options. Its syntax is the same as the "package gpu" command. See the "package"_package.html command doc page for details, including the default values used for all its options if it is not specified. Note that the default for the "package gpu"_package.html command is to set the Newton flag to "off" for pairwise interactions. It does not affect the setting for bonded interactions (LAMMPS default is "on"). The "off" setting for pairwise interactions is currently required for GPU package pair styles. [Or run with the GPU package by editing an input script:] The discussion above for the mpirun/mpiexec command, MPI tasks/node, and use of multiple MPI tasks/GPU is the same. Use the "suffix gpu"_suffix.html command, or you can explicitly add a "gpu" suffix to individual styles in your input script, e.g.

pair_style lj/cut/gpu 2.5 :pre

You must also use the "package gpu"_package.html command to enable the GPU package, unless the "-sf gpu" or "-pk gpu" "command-line switches"_Section_start.html#start_6 were used. It specifies the number of GPUs/node to use, as well as other options. [Speed-ups to expect:] The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed). See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL. You should also experiment with how many MPI tasks per GPU to use to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used.
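One way to do this experiment is to time the same short run while varying how many MPI tasks share the GPUs. This is only a sketch, assuming a 16-core node with 2 GPUs and the executable and input script names used above:

mpirun -np 4 lmp_machine -sf gpu -pk gpu 2 -in in.script    # 2 MPI tasks per GPU
mpirun -np 8 lmp_machine -sf gpu -pk gpu 2 -in in.script    # 4 MPI tasks per GPU
mpirun -np 16 lmp_machine -sf gpu -pk gpu 2 -in in.script   # 8 MPI tasks per GPU :pre

Compare the loop times of the runs and use the ratio that is fastest for your pair style and problem size.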
Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster. [Guidelines for best performance:] Using multiple MPI tasks per GPU will often give the best performance, as allowed by most multi-core CPU/GPU configurations. :ulb,l If the number of particles per MPI task is small (e.g. 100s of particles), it can be more efficient to run with fewer MPI tasks per GPU, even if you do not use all the cores on the compute node. :l The "package gpu"_package.html command has several options for tuning performance. Neighbor lists can be built on the GPU or CPU. Force calculations can be dynamically balanced across the CPU cores and GPUs. GPU-specific settings can be made which can be optimized for different hardware. See the "package"_package.html command doc page for details. :l As described by the "package gpu"_package.html command, GPU-accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with "bond"_bond_style.html, "angle"_angle_style.html, "dihedral"_dihedral_style.html, "improper"_improper_style.html, and "long-range"_kspace_style.html calculations will not be included in the "Pair" time. :l When the {mode} setting for the package gpu command is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc) is output only with the LAMMPS screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations. :l The output section "GPU Time Info (average)" reports "Max Mem / Proc". This is the maximum memory used at one time on the GPU for data storage by a single MPI process. :l :ule [Restrictions:] None. diff --git a/doc/src/accelerate_intel.txt b/doc/src/accelerate_intel.txt index 74ae9d9a4..9eb295e0d 100644 --- a/doc/src/accelerate_intel.txt +++ b/doc/src/accelerate_intel.txt @@ -1,517 +1,514 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.2 USER-INTEL package :h5 The USER-INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on Intel CPUs by running in single, mixed, or double precision with vectorization. The second is acceleration on Intel Xeon Phi coprocessors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a coprocessor from a CPU, the same routine is run twice, once on the CPU and once with an offload flag. This allows LAMMPS to run on the CPU cores and coprocessor cores simultaneously.
[Currently Available USER-INTEL Styles:] Angle Styles: charmm, harmonic :ulb,l Bond Styles: fene, harmonic :l Dihedral Styles: charmm, harmonic, opls :l Fixes: nve, npt, nvt, nvt/sllod :l Improper Styles: cvff, harmonic :l Pair Styles: buck/coul/cut, buck/coul/long, buck, eam, gayberne, charmm/coul/long, lj/cut, lj/cut/coul/long, lj/long/coul/long, sw, tersoff :l K-Space Styles: pppm, pppm/disp :l :ule [Speed-ups to expect:] The speedups will depend on your simulation, the hardware, which styles are used, the number of atoms, and the floating-point precision mode. Performance improvements are shown compared to LAMMPS {without using other acceleration packages} as these are under active development (and subject to performance changes). The measurements were performed using the input files available in the src/USER-INTEL/TEST directory with the provided run script. These are scalable in size; the results given are with 512K particles (524K for Liquid Crystal). Most of the simulations are standard LAMMPS benchmarks (indicated by the filename extension in parentheses) with modifications to the run length and to add a warmup run (for use with offload benchmarks). :c,image(JPG/user_intel.png) Results are speedups obtained on Intel Xeon E5-2697v4 processors (code-named Broadwell) and Intel Xeon Phi 7250 processors (code-named Knights Landing) with "June 2017" LAMMPS built with Intel Parallel Studio 2017 update 2. Results are with 1 MPI task per physical core. See {src/USER-INTEL/TEST/README} for the raw simulation rates and instructions to reproduce. :line [Accuracy and order of operations:] In most molecular dynamics software, parallelization parameters (# of MPI, OpenMP, and vectorization) can change the results due to changing the order of operations with finite-precision calculations. The USER-INTEL package is deterministic. This means that the results should be reproducible from run to run with the {same} parallel configurations and when using deterministic libraries or library settings (MPI, OpenMP, FFT). However, there are differences in the USER-INTEL package that can change the order of operations compared to LAMMPS without acceleration: Neighbor lists can be created in a different order :ulb,l Bins used for sorting atoms can be oriented differently :l The default stencil order for PPPM is 7. By default, LAMMPS will calculate other PPPM parameters to fit the desired accuracy with this order :l The {newton} setting applies to all atoms, not just atoms shared between MPI tasks :l Vectorization can change the order for adding pairwise forces :l :ule The precision mode (described below) used with the USER-INTEL package can change the {accuracy} of the calculations. For the default {mixed} precision option, calculations between pairs or triplets of atoms are performed in single precision, intended to be within the inherent error of MD simulations. All accumulation is performed in double precision to prevent the error from growing with the number of atoms in the simulation. {Single} precision mode should not be used without appropriate validation. :line [Quick Start for Experienced Users:] LAMMPS should be built with the USER-INTEL package installed. Simulations should be run with 1 MPI task per physical {core}, not {hardware thread}. Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.
:ulb,l Set the environment variable KMP_BLOCKTIME=0 :l "-pk intel 0 omp $t -sf intel" added to LAMMPS command-line :l $t should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi :l For some of the simple 2-body potentials without long-range electrostatics, performance and scalability can be better with the "newton off" setting added to the input script :l For simulations on higher node counts, add "processors * * * grid numa" to the beginning of the input script for better scalability :l If using {kspace_style pppm} in the input script, add "kspace_modify diff ad" for better performance :l :ule For Intel Xeon Phi CPUs: Runs should be performed using MCDRAM. :ulb,l :ule For simulations using {kspace_style pppm} on Intel CPUs supporting AVX-512: Add "kspace_modify diff ad" to the input script :ulb,l The command-line option should be changed to "-pk intel 0 omp $r lrt yes -sf intel" where $r is the number of threads minus 1. :l Do not use thread affinity (set KMP_AFFINITY=none) :l The "newton off" setting may provide better scalability :l :ule For Intel Xeon Phi coprocessors (Offload): Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary :ulb,l "-pk intel N omp 1" added to command-line where N is the number of coprocessors per node. :l :ule :line [Required hardware/software:] In order to use offload to coprocessors, an Intel Xeon Phi coprocessor and an Intel compiler are required. For this, the recommended version of the Intel compiler is 14.0.1.106 or versions 15.0.2.044 and higher. Although any compiler can be used with the USER-INTEL package, currently, vectorization directives are disabled by default when not using Intel compilers due to lack of standard support and observations of decreased performance. The OpenMP standard now supports directives for vectorization and we plan to transition the code to this standard once it is available in most compilers. We expect this to allow improved performance and support with other compilers. For Intel Xeon Phi x200 series processors (code-named Knights Landing), there are multiple configuration options for the hardware. For best performance, we recommend that the MCDRAM is configured in "Flat" mode and with the cluster mode set to "Quadrant" or "SNC4". "Cache" mode can also be used, although the performance might be slightly lower. [Notes about Simultaneous Multithreading:] Modern CPUs often support Simultaneous Multithreading (SMT). On Intel processors, this is called Hyper-Threading (HT) technology. SMT is hardware support for running multiple threads efficiently on a single core. {Hardware threads} or {logical cores} are often used to refer to the number of threads that are supported in hardware. For example, the Intel Xeon E5-2697v4 processor is described as having 36 cores and 72 threads. This means that 36 MPI processes or OpenMP threads can run simultaneously on separate cores, but that up to 72 MPI processes or OpenMP threads can be running on the CPU without costly operating system context switches. Molecular dynamics simulations will often run faster when making use of SMT. If a thread becomes stalled, for example because it is waiting on data that has not yet arrived from memory, another thread can start running so that the CPU pipeline is still being used efficiently. 
Although benefits can be seen by launching an MPI task for every hardware thread, for multinode simulations, we recommend that OpenMP threads are used for SMT instead, either with the USER-INTEL package, "USER-OMP package"_accelerate_omp.html, or "KOKKOS package"_accelerate_kokkos.html. In the example above, up to 36X speedups can be observed by using all 36 physical cores with LAMMPS. By using all 72 hardware threads, an additional 10-30% performance gain can be achieved. The BIOS on many platforms allows SMT to be disabled; however, we do not recommend this on modern processors as there is little to no benefit for any software package in most cases. The operating system will report every hardware thread as a separate core allowing one to determine the number of hardware threads available. On Linux systems, this information can normally be obtained with:

cat /proc/cpuinfo :pre

[Building LAMMPS with the USER-INTEL package:] NOTE: See the src/USER-INTEL/README file for additional flags that might be needed for best performance on Intel server processors code-named "Skylake". The USER-INTEL package must be installed into the source directory:

make yes-user-intel :pre

Several example Makefiles for building with the Intel compiler are included with LAMMPS in the src/MAKE/OPTIONS/ directory:

Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload
Makefile.knl # Intel Compiler, Intel MPI, No Offload
Makefile.intel_cpu_mpich # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi # Intel Compiler, OpenMPI, No Offload
Makefile.intel_coprocessor # Intel Compiler, Intel MPI, Offload :pre

Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that it explicitly specifies that vectorization should be for Intel Xeon Phi x200 processors, making it easier to cross-compile. For users with recent installations of Intel Parallel Studio, the process can be as simple as:

make yes-user-intel
source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh # or psxevars.csh for C-shell
make intel_cpu_intelmpi :pre

-Alternatively, the build can be accomplished with the src/Make.py -script, described in "Section 4"_Section_packages.html of the -manual. Type "Make.py -h" for help. For an example: - -Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre +Alternatively, this can be done with +suitable make command invocations, as discussed in "Section +4"_Section_packages.html of the manual. Note that if you build with support for a Phi coprocessor, the same binary can be used on nodes with or without coprocessors installed. However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary. The general requirements for Makefiles with the USER-INTEL package are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When using Intel compilers, "-restrict" is required and "-qopenmp" is highly recommended for CCFLAGS and LINKFLAGS. LIB should include "-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD" is required for CCFLAGS and "-qoffload" is required for LINKFLAGS. Other recommended CCFLAG options for best performance are "-O2 -fno-alias -ansi-alias -qoverride-limits fp-model fast=2 --no-prec-div". The Make.py command will add all of these -automatically. +-no-prec-div". NOTE: The vectorization and math capabilities can differ depending on the CPU. For Intel compilers, the "-x" flag specifies the type of processor for which to optimize.
"-xHost" specifies that the compiler should build for the processor used for compiling. For Intel Xeon Phi x200 series processors, this option is "-xMIC-AVX512". For fourth generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should be used. For older Intel Xeon processors, "-xAVX" will perform best in general for the different simulations in LAMMPS. The default in most of the example Makefiles is to use "-xHost", however this should not be used when cross-compiling. [Running LAMMPS with the USER-INTEL package:] Running LAMMPS with the USER-INTEL package is similar to normal use with the exceptions that one should 1) specify that LAMMPS should use the USER-INTEL package, 2) specify the number of OpenMP threads, and 3) optionally specify the specific LAMMPS styles that should use the USER-INTEL package. 1) and 2) can be performed from the command-line or by editing the input script. 3) requires editing the input script. Advanced performance tuning options are also described below to get the best performance. When running on a single node (including runs using offload to a coprocessor), best performance is normally obtained by using 1 MPI task per physical core and additional OpenMP threads with SMT. For Intel Xeon processors, 2 OpenMP threads should be used for SMT. For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used (best choice depends on the simulation). In cases where the user specifies that LRT mode is used (described below), 1 or 3 OpenMP threads should be used. For multi-node runs, using 1 MPI task per physical core will often perform best, however, depending on the machine and scale, users might get better performance by decreasing the number of MPI tasks and using more OpenMP threads. For performance, the product of the number of MPI tasks and OpenMP threads should not exceed the number of available hardware threads in almost all cases. NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host {when using offload to a coprocessor}. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity} option to the "package intel"_package.html command or by disabling the option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). Disabling this option is not recommended, especially when running on a machine with Intel Hyper-Threading technology disabled. [Run with the USER-INTEL package from the command line:] To enable USER-INTEL optimizations for all available styles used in the input script, the "-sf intel" "command-line switch"_Section_start.html#start_6 can be used without any requirement for editing the input script. This switch will automatically append "intel" to styles that support it. It also invokes a default command: "package intel 1"_package.html. This package command is used to set options for the USER-INTEL package. The default package command will specify that USER-INTEL calculations are performed in mixed precision, that the number of OpenMP threads is specified by the OMP_NUM_THREADS environment variable, and that if coprocessors are present and the binary was built with offload support, that 1 coprocessor per node will be used with automatic balancing of work between the CPU and the coprocessor. 
You can specify different options for the USER-INTEL package by using the "-pk intel Nphi" "command-line switch"_Section_start.html#start_6 with keyword/value pairs as specified in the documentation. Here, Nphi = # of Xeon Phi coprocessors/node (ignored without offload support). Common options to the USER-INTEL package include {omp} to override any OMP_NUM_THREADS setting and specify the number of OpenMP threads, {mode} to set the floating-point precision mode, and {lrt} to enable Long-Range Thread mode as described below. See the "package intel"_package.html command for details, including the default values used for all its options if not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired. Examples (see documentation for your MPI/Machine for differences in launching MPI applications): mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre [Or run with the USER-INTEL package by editing an input script:] As an alternative to adding command-line arguments, the input script can be edited to enable the USER-INTEL package. This requires adding the "package intel"_package.html command to the top of the input script. For the second example above, this would be: package intel 0 omp 2 mode double :pre To enable the USER-INTEL package only for individual styles, you can add an "intel" suffix to the individual style, e.g.: pair_style lj/cut/intel 2.5 :pre Alternatively, the "suffix intel"_suffix.html command can be added to the input script to enable USER-INTEL styles for the commands that follow in the input script. [Tuning for Performance:] NOTE: The USER-INTEL package will perform better with modifications to the input script when "PPPM"_kspace_style.html is used: "kspace_modify diff ad"_kspace_modify.html should be added to the input script. Long-Range Thread (LRT) mode is an option to the "package intel"_package.html command that can improve performance when using "PPPM"_kspace_style.html for long-range electrostatics on processors with SMT. It generates an extra pthread for each MPI task. The thread is dedicated to performing some of the PPPM calculations and MPI communications. On Intel Xeon Phi x200 series CPUs, this will likely always improve performance, even on a single node. On Intel Xeon processors, using this mode might result in better performance when using multiple nodes, depending on the machine. To use this mode, specify that the number of OpenMP threads is one less than would normally be used for the run and add the "lrt yes" option to the "-pk" command-line suffix or "package intel" command. For example, if a run would normally perform best with "-pk intel 0 omp 4", instead use "-pk intel 0 omp 3 lrt yes". When using LRT, you should set the environment variable "KMP_AFFINITY=none". LRT mode is not supported when using offload. NOTE: Changing the "newton"_newton.html setting to off can improve performance and/or scalability for simple 2-body potentials such as lj/cut or when using LRT mode on processors supporting AVX-512. Not all styles are supported in the USER-INTEL package. You can mix the USER-INTEL package with styles from the "OPT"_accelerate_opt.html package or the "USER-OMP package"_accelerate_omp.html. 
Of course, this requires that these packages were installed at build time. This can be performed automatically by using "-sf hybrid intel opt" or "-sf hybrid intel omp" command-line options. Alternatively, the "opt" and "omp" suffixes can be appended manually in the input script. For the latter, the "package omp"_package.html command must be in the input script or the "-pk omp Nt" "command-line switch"_Section_start.html#start_6 must be used where Nt is the number of OpenMP threads. The number of OpenMP threads should not be set differently for the different packages. Note that the "suffix hybrid intel omp"_suffix.html command can also be used within the input script to automatically append the "omp" suffix to styles when USER-INTEL styles are not available. NOTE: For simulations on higher node counts, add "processors * * * grid numa"_processors.html to the beginning of the input script for better scalability. When running on many nodes, performance might be better when using fewer OpenMP threads and more MPI tasks. This will depend on the simulation and the machine. Using the "verlet/split"_run_style.html run style might also give better performance for simulations with "PPPM"_kspace_style.html electrostatics. Note that this is an alternative to LRT mode and the two cannot be used together. Currently, when using Intel MPI with Intel Xeon Phi x200 series CPUs, better performance might be obtained by setting the environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do not yet have full support for AVX-512. Runs on Intel Xeon Phi x200 series processors will always perform better using MCDRAM. Please consult your system documentation for the best approach to specify that MPI runs are performed in MCDRAM. [Tuning for Offload Performance:] The default settings for offload should give good performance. When using LAMMPS with offload to Intel coprocessors, best performance will typically be achieved with concurrent calculations performed on both the CPU and the coprocessor. This is achieved by offloading only a fraction of the neighbor and pair computations to the coprocessor or using "hybrid"_pair_hybrid.html pair styles where only one style uses the "intel" suffix. For simulations with long-range electrostatics or bond, angle, dihedral, improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI communications for these calculations on the host CPU. This is illustrated in the figure below for the rhodopsin protein benchmark running on E5-2697v2 processors with an Intel Xeon Phi 7120p coprocessor. In this plot, the vertical axis is time and routines running at the same time are running concurrently on both the host and the coprocessor. :c,image(JPG/offload_knc.png) The fraction of the offloaded work is controlled by the {balance} keyword in the "package intel"_package.html command. A balance of 0 runs all calculations on the CPU. A balance of 1 runs all supported calculations on the coprocessor. A balance of 0.5 runs half of the calculations on the coprocessor. Setting the balance to -1 (the default) will enable dynamic load balancing that continuously adjusts the fraction of offloaded work throughout the simulation. Because data transfer cannot be timed, this option typically produces results within 5 to 10 percent of the optimal fixed balance.
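As a sketch of how these balance values are specified on the command line (the same keywords can instead be given on a "package intel"_package.html command in the input script; adjust the task count, thread count, and executable/input names to your system):

mpirun -np 36 lmp_machine -sf intel -pk intel 1 omp 2 balance 0.5 -in in.script   # fixed 50/50 split with the coprocessor
mpirun -np 36 lmp_machine -sf intel -pk intel 1 omp 2 balance -1 -in in.script    # dynamic load balancing (the default) :pre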
If running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer to find a near-optimal setting that will carry over to additional runs. The default for the "package intel"_package.html command is to have all the MPI tasks on a given compute node use a single Xeon Phi coprocessor. In general, running with a large number of MPI tasks on each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads available on the coprocessor. For example, if your card has 61 cores, with 60 cores available for offload and 4 hardware threads per core (240 total threads), running with 24 MPI tasks per node will cause each MPI task to use a subset of 10 threads on the coprocessor. Fine tuning of the number of threads to use per MPI task or the number of threads to use per core can be accomplished with keyword settings of the "package intel"_package.html command. The USER-INTEL package has two modes for deciding which atoms will be handled by the coprocessor. This choice is controlled with the {ghost} keyword of the "package intel"_package.html command. When set to 0, ghost atoms (atoms at the borders between MPI tasks) are not offloaded to the card. This allows for overlap of MPI communication of forces with computation on the coprocessor when the "newton"_newton.html setting is "on". The default is dependent on the style being used, however, better performance may be achieved by setting this option explicitly. When using offload with CPU Hyper-Threading disabled, it may help performance to use fewer MPI tasks and OpenMP threads than available cores. This is due to the fact that additional threads are generated internally to handle the asynchronous offload tasks. If pair computations are being offloaded to an Intel Xeon Phi coprocessor, a diagnostic line is printed to the screen (not to the log file), during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for "atom sorting"_atom_modify.html is changed to 1 so that the per-atom data is effectively sorted at every rebuild of the neighbor lists. All the available coprocessor threads on each Phi will be divided among MPI tasks, unless the {tptask} option of the "-pk intel" "command-line switch"_Section_start.html#start_6 is used to limit the coprocessor threads per MPI task. [Restrictions:] When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles that require skip lists for neighbor builds cannot be offloaded. Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel accelerated style may be used with hybrid styles. "Special_bonds"_special_bonds.html exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the "run_style respa"_run_style.html command; only the "pair" option is supported. [References:] Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., "Optimizing Classical Molecular Dynamics in LAMMPS," in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. 
Morgan Kaufmann. :ulb,l Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. "Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency."_http://dl.acm.org/citation.cfm?id=3014915 2016 High Performance Computing, Networking, Storage and Analysis, SC16: International Conference (pp. 82-95). :l Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l :ule diff --git a/doc/src/accelerate_kokkos.txt b/doc/src/accelerate_kokkos.txt index 6ccd69584..712a05300 100644 --- a/doc/src/accelerate_kokkos.txt +++ b/doc/src/accelerate_kokkos.txt @@ -1,496 +1,493 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.3 KOKKOS package :h5 The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia). The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos. The Kokkos library is part of "Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core CPU. The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this functionality is hidden from the developer, and does not affect how the kernel is coded. These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos. Kokkos currently provides support for 3 modes of execution (per MPI task). These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs), and OpenMP (for Intel Phi). Note that the KOKKOS package supports running on the Phi in native mode, not offload mode like the USER-INTEL package supports. You choose the mode at build time to produce an executable compatible with specific hardware. Here is a quick overview of how to use the KOKKOS package for CPU acceleration, assuming one or more 16-core nodes. More details follow. 
use a C++11 compatible compiler
make yes-kokkos
make mpi KOKKOS_DEVICES=OpenMP   # build with the KOKKOS package
-make kokkos_omp                  # or Makefile.kokkos_omp already has variable set
-Make.py -v -p kokkos -kokkos omp -o mpi -a file mpi   # or one-line build via Make.py :pre
+make kokkos_omp                  # or Makefile.kokkos_omp already has variable set :pre

mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj              # 1 node, 16 MPI tasks/node, no threads
mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj   # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
use KOKKOS styles in your input script :ul

Here is a quick overview of how to use the KOKKOS package for GPUs, assuming one or more nodes, each with 16 cores and a GPU. More details follow.

(The use of NVCC and which Makefiles to examine are discussed below.)

use a C++11 compatible compiler
KOKKOS_DEVICES = Cuda, OpenMP
KOKKOS_ARCH = Kepler35
make yes-kokkos
-make machine
-Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda -a file kokkos_cuda :pre
+make machine :pre

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre

Here is a quick overview of how to use the KOKKOS package for the Intel Phi:

use a C++11 compatible compiler
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC
make yes-kokkos
-make machine
-Make.py -p kokkos -kokkos phi -o kokkos_phi -a file mpi :pre
+make machine :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):

mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre

[Required hardware/software:]

Kokkos support within LAMMPS must be built with a C++11 compatible compiler. If using gcc, version 4.7.2 or later is required.

To build with Kokkos support for CPUs, your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.

To build with Kokkos support for NVIDIA GPUs, NVIDIA Cuda software version 7.5 or later must be installed on your system. See the discussion for the "GPU"_accelerate_gpu.html package for details of how to check and do this.

NOTE: For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later). The Kokkos library exploits texture cache options not supported by Tesla generation GPUs (or older).

To build with Kokkos support for Intel Xeon Phi coprocessors, your system must be configured to use them in "native" mode, not "offload" mode like the USER-INTEL package supports.

[Building LAMMPS with the KOKKOS package:]

You must choose at build time whether to build for CPUs (OpenMP), GPUs, or Phi.
-You can do any of these in one line, using the src/Make.py script,
-described in "Section 4"_Section_packages.html of the manual.
-Type "Make.py -h" for help.  If run from the src directory, these
+You can do any of these in one line, using the suitable make command
+line flags as described in "Section 4"_Section_packages.html of the
+manual.  If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and lmp_kokkos_phi. Note that the OMP and PHI options use src/MAKE/Makefile.mpi as the starting Makefile.machine. The CUDA option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda.

Enabling the package and using its styles (the last two steps in the overview above) can be done using the "-k on", "-pk kokkos", and "-sf kk" "command-line switches"_Section_start.html#start_6. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package kokkos"_package.html or "suffix kk"_suffix.html commands respectively to your input script.

Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

CPU-only (only MPI, no threading):

cd lammps/src
make yes-kokkos
make kokkos_mpi :pre

Intel Xeon Phi (Intel Compiler, Intel MPI):

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

CPUs and GPUs (with MPICH):

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpich :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the make command line, which requires a GNU-compatible make command. Try "gmake" if your system's standard make complains.

NOTE: If you build using make command-line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a "make clean-all" or "make clean-machine" before each build. This is to force all the KOKKOS-dependent files to be re-compiled with the new options.

NOTE: Currently, there are no precision options with the KOKKOS package. All compilation and computation is performed in double precision.

There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in Makefile.machine. This is the full list of options, including those discussed above. Each takes a value shown below. The default value is listed, which is set in the lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul

KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE directory must use "nvcc" as its compiler, via its CC setting. For best performance its CCFLAGS setting should use -O3 and have a KOKKOS_ARCH setting that matches the compute capability of your NVIDIA hardware and software installation, e.g. KOKKOS_ARCH=Kepler30.
Note that the minimal required compute capability is 2.0, but this will give significantly reduced performance compared to Kepler generation GPUs with compute capability 3.x. For the LINK setting, "nvcc" should not be used; instead use g++ or another compiler suitable for linking C++ applications. Often you will want to use your MPI compiler wrapper for this setting (e.g. mpicxx). Finally, the lo-level Makefile must also have a "Compilation rule" for creating *.o files from *.cu files. See src/Makefile.cuda for an example of a lo-level Makefile with all of these settings.

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be used if running with KOKKOS_DEVICES=Pthreads. It is not necessary for KOKKOS_DEVICES=OpenMP, because OpenMP provides alternative methods via environment variables for binding threads to hardware cores. More info on binding threads to cores is given in "Section 5.3"_Section_accelerate.html#acc_3.

KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an Intel Phi processor.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism on most Unix platforms. This library is not available on all platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time debugging information and runtime bounds checking on Kokkos data structures.

KOKKOS_CUDA_OPTIONS are additional options for CUDA. For more information on Kokkos see the Kokkos programmers' guide here: /lib/kokkos/doc/Kokkos_PG.pdf.

[Run with the KOKKOS package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many OpenMP threads per MPI task will be used (via the "-k" command-line switch discussed below). Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi coprocessor support, you need to ensure there are one or more MPI tasks per coprocessor, and choose the number of coprocessor threads to use per MPI task (via the "-k" command-line switch discussed below). The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coprocessor is designed to run, otherwise performance will suffer. This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. Note that with the KOKKOS package you do not need to specify how many Phi coprocessors there are per node; each coprocessor is simply treated as running some number of MPI tasks.

You must use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package. It takes additional arguments for hardware settings appropriate to your system. Those arguments are "documented here"_Section_start.html#start_6. The two most commonly used options are:

-k on t Nt g Ng :pre

The "t Nt" option applies to host=OMP (even if device=CUDA) and host=MIC.
For host=OMP, it specifies how many OpenMP threads per MPI task to use within a node. For host=MIC, it specifies how many Xeon Phi threads per MPI task to use within a node. The default is Nt = 1. Note that for host=OMP this is effectively MPI-only mode, which may be fine. But for host=MIC you will typically end up using far fewer than all the 240 available threads, which could give very poor performance.

The "g Ng" option applies to device=CUDA. It specifies how many GPUs per compute node to use. The default is 1, so this only needs to be specified if you have 2 or more GPUs per compute node.

The "-k on" switch also issues a "package kokkos" command (with no additional arguments) which sets various KOKKOS options to default values, as discussed on the "package"_package.html command doc page.

Use the "-sf kk" "command-line switch"_Section_start.html#start_6, which will automatically append "kk" to styles that support it. Use the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if you wish to change any of the default "package kokkos"_package.html options set by the "-k on" "command-line switch"_Section_start.html#start_6.

Note that the default for the "package kokkos"_package.html command is to use "full" neighbor lists and set the Newton flag to "off" for both pairwise and bonded interactions. This typically gives the fastest performance. If the "newton"_newton.html command is used in the input script, it can override the Newton flag defaults.

However, when running in MPI-only mode with 1 thread per MPI task, it will typically be faster to use "half" neighbor lists and set the Newton flag to "on", just as is the case for non-accelerated pair styles. You can do this with the "-pk" "command-line switch"_Section_start.html#start_6.

[Or run with the KOKKOS package by editing an input script:]

The discussion above for the mpirun/mpiexec command and setting appropriate thread and GPU values for host=OMP or host=MIC or device=CUDA is the same.

You must still use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package, and specify its additional arguments for hardware options appropriate to your system, as documented above.

Use the "suffix kk"_suffix.html command, or you can explicitly add a "kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you wish to change any of its option defaults, as set by the "-k on" "command-line switch"_Section_start.html#start_6. A minimal input-script sketch is given after the speed-up notes below.

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enabled styles are used, and the problem size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task, performance of a KOKKOS style is somewhere between the standard (un-accelerated) styles (MPI-only mode) and those provided by the USER-OMP package. However, the difference among all 3 is small (less than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task, performance of a KOKKOS style is a bit slower than the USER-OMP package. :l

When running a large number of atoms per GPU, KOKKOS is typically faster than the GPU package. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as the USER-INTEL package, which is optimized for that hardware. :l
:ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the KOKKOS package on different hardware.
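Before turning to hardware-specific guidelines, here is a minimal input-script sketch combining the commands discussed above for an MPI-only run. It is illustrative only: the {neigh} and {newton} keyword spellings are assumptions to check against the "package kokkos"_package.html doc page, and the "-k on" "command-line switch"_Section_start.html#start_6 is still required on the command line.

package kokkos neigh half newton on   # half lists + Newton on, per the MPI-only advice above (assumed keywords)
suffix kk                             # append "kk" to all styles that support it
pair_style lj/cut 2.5                 # runs as lj/cut/kk because of the suffix command :pre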
[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command. See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the "-k" "command-line switch"_Section_start.html#start_6. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, Intel 12 or later), setting the environment variable OMP_PROC_BIND=true should be sufficient. For binding threads with the KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option (see "this section"_Section_packages.html#KOKKOS of the manual for details).

[Running on GPUs:]

Ensure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see "this section"_Section_packages.html#KOKKOS of the manual for details).

The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_6 to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use fewer than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

NOTE: When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for "thermo"_thermo_style.html or "dump"_dump.html output will cause data to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.
You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes. The latter ensures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded for running the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.

Examples of mpirun commands that follow these rules are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute node should be equal to the number of GPUs per compute node. In the future Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models. Specifically, Kokkos requires extensive C++ support from the backend kernel language. This is expected to change in the future.

diff --git a/doc/src/accelerate_omp.txt b/doc/src/accelerate_omp.txt
index 81b7a5adc..fa7bef1a5 100644
--- a/doc/src/accelerate_omp.txt
+++ b/doc/src/accelerate_omp.txt
@@ -1,187 +1,183 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section 5 overview"_Section_accelerate.html

5.3.4 USER-OMP package :h5

The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.

Here is a quick overview of how to use the USER-OMP package, assuming one or more 16-core nodes. More details follow.

use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine
make yes-user-omp
make mpi    # build with USER-OMP package, if settings added to Makefile.mpi
-make omp    # or Makefile.omp already has settings
-Make.py -v -p omp -o mpi -a file mpi   # or one-line build via Make.py :pre
+make omp    # or Makefile.omp already has settings :pre

lmp_mpi -sf omp -pk omp 16 < in.script                         # 1 MPI task, 16 threads
mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script           # 4 MPI tasks, 4 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

[Required hardware/software:]

Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:]

The lines above illustrate how to include/build with the USER-OMP
package in two steps, using the "make" command. Or how to do it with
-one command via the src/Make.py script, described in "Section
-4"_Section_packages.html of the manual.  Type "Make.py -h" for
-help.
+one command as described in "Section 4"_Section_packages.html of the manual.

Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
include "-fopenmp". Likewise, if you use an Intel compiler, the
-CCFLAGS setting must include "-restrict".  The Make.py command will
-add these automatically.
+CCFLAGS setting must include "-restrict".

[Run with the USER-OMP package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

You need to choose how many OpenMP threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

As in the lines above, use the "-sf omp" "command-line switch"_Section_start.html#start_6, which will automatically append "omp" to styles that support it. The "-sf omp" switch also issues a default "package omp 0"_package.html command, which will set the number of threads per MPI task via the OMP_NUM_THREADS environment variable.

You can also use the "-pk omp Nt" "command-line switch"_Section_start.html#start_6 to explicitly set Nt = # of OpenMP threads per MPI task to use, as well as additional options. Its syntax is the same as the "package omp"_package.html command whose doc page gives details, including the default values used if it is not specified. It also gives more details on how to set the number of threads via the OMP_NUM_THREADS environment variable.

[Or run with the USER-OMP package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same.

Use the "suffix omp"_suffix.html command, or you can explicitly add an "omp" suffix to individual styles in your input script, e.g.

pair_style lj/cut/omp 2.5 :pre

You must also use the "package omp"_package.html command to enable the USER-OMP package. When you do this you also specify how many threads per MPI task to use. The command doc page explains other options and how to set the number of threads via the OMP_NUM_THREADS environment variable. A minimal input-script sketch is given after the speed-up notes below.

[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its un-accelerated styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, described in "Section 5.3.5"_accelerate_opt.html.

With multiple threads/task, the optimal choice of number of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.
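Tying these pieces together, here is a minimal input-script sketch for the USER-OMP package. The thread count of 4 is an arbitrary example and should be matched to your hardware, following the guidelines in the next sub-section.

package omp 4           # 4 OpenMP threads per MPI task (example value)
suffix omp              # append "omp" to all styles that support it
pair_style lj/cut 2.5   # runs as lj/cut/omp because of the suffix command :pre

Alternatively, "package omp 0" takes the thread count from the OMP_NUM_THREADS environment variable, as described on the "package"_package.html command doc page.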
A description of the multi-threading strategy used in the USER-OMP package and some performance examples are "presented here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following circumstances:

Individual compute nodes have a significant number of CPU cores but the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx (Clovertown) and 54xx (Harpertown) quad-core processors. Running one MPI task per CPU core will result in significant performance degradation, so that running with 4 or even only 2 MPI tasks per node is faster. Running in hybrid MPI+OpenMP mode will reduce the inter-node communication bandwidth contention in the same way, but offers an additional speedup by utilizing the otherwise idle CPU cores. :ulb,l

The interconnect used for MPI communication does not provide sufficient bandwidth for a large number of MPI tasks per node. For example, this applies to running over gigabit ethernet or on Cray XT4 or XT5 series supercomputers. As in the aforementioned case, this effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does not map well to the "domain decomposition scheme"_processors.html or "load-balancing"_balance.html options that LAMMPS provides. This is because multi-threading achieves parallelism over the number of particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point where MPI parallelism is maxed out. For example, this can happen when using the "PPPM solver"_kspace_style.html for long-range electrostatics on large numbers of nodes. The scaling of the KSpace calculation (see the "kspace_style"_kspace_style.html command) becomes the performance-limiting factor. Using multi-threading allows fewer MPI tasks to be invoked and can speed-up the long-range solver, while increasing overall performance by parallelizing the pairwise and bonded calculations via OpenMP. Likewise, additional speedup can sometimes be achieved by increasing the length of the Coulombic cutoff and thus reducing the work done by the long-range solver. Using the "run_style verlet/split"_run_style.html command, which is compatible with the USER-OMP package, is an alternative way to reduce the number of MPI tasks assigned to the KSpace calculation. :l
:ule

Additional performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved when there is at least one MPI task per physical CPU chip, i.e. socket or die. :ulb,l

It is usually most efficient to restrict threading to a single socket, i.e. use one or more MPI tasks per socket. :l

NOTE: By default, several current MPI implementations use a processor affinity setting that restricts each MPI task to a single CPU core. Using multi-threading in this mode will force all threads to share the one core and thus is likely to be counterproductive. Instead, binding MPI tasks to a (multi-core) socket should solve this issue. An example launch line is sketched after the restrictions below. :l
:ule

[Restrictions:]

None.
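As a closing illustration of the socket-binding guidance above, here is one possible way to launch a hybrid MPI+OpenMP run on a dual-socket node with 8 cores per socket. The flag spellings follow the OpenMPI 1.8 example given in the KOKKOS section and are assumptions to verify against your MPI installation and shell.

export OMP_PROC_BIND=true        # keep OpenMP threads from migrating (bash syntax assumed)
mpirun -np 2 -bind-to socket -map-by socket lmp_mpi -sf omp -pk omp 8 -in in.script   # 1 MPI task per socket, 8 threads/task :pre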
diff --git a/doc/src/accelerate_opt.txt b/doc/src/accelerate_opt.txt
index 5a2a5eac0..845264b52 100644
--- a/doc/src/accelerate_opt.txt
+++ b/doc/src/accelerate_opt.txt
@@ -1,71 +1,67 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.5 OPT package :h5

The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.

Here is a quick overview of how to use the OPT package. More details follow.

make yes-opt
-make mpi    # build with the OPT package
-Make.py -v -p opt -o mpi -a file mpi   # or one-line build via Make.py :pre
+make mpi    # build with the OPT package :pre

lmp_mpi -sf opt -in in.script                # run in serial
mpirun -np 4 lmp_mpi -sf opt -in in.script   # run in parallel :pre

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

The lines above illustrate how to build LAMMPS with the OPT package in
two steps, using the "make" command. Or how to do it with one command
-via the src/Make.py script, described in "Section
-4"_Section_packages.html of the manual.  Type "Make.py -h" for
-help.
+as described in "Section 4"_Section_packages.html of the manual.

Note that if you use an Intel compiler to build with the OPT package,
the CCFLAGS setting in your Makefile.machine must include "-restrict".
-The Make.py command will add this automatically.

[Run with the OPT package from the command line:]

As in the lines above, use the "-sf opt" "command-line switch"_Section_start.html#start_6, which will automatically append "opt" to styles that support it.

[Or run with the OPT package by editing an input script:]

Use the "suffix opt"_suffix.html command, or you can explicitly add an "opt" suffix to individual styles in your input script, e.g.

pair_style lj/cut/opt 2.5 :pre

[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end of a run. On most machines for reasonable problem sizes, it will be a 5 to 20% savings.

[Guidelines for best performance:]

Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.