diff --git a/doc/src/Section_accelerate.txt b/doc/src/Section_accelerate.txt index 881235888..bb0c93b8a 100644 --- a/doc/src/Section_accelerate.txt +++ b/doc/src/Section_accelerate.txt @@ -1,391 +1,391 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc - "Next Section"_Section_howto.html :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line 5. Accelerating LAMMPS performance :h3 This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines. There are two thrusts to the discussion that follows. The first is using code options that implement alternate algorithms that can speed-up a simulation. The second is to use one of the several accelerator packages provided with LAMMPS that contain code optimized for certain kinds of hardware, including multi-core CPUs, GPUs, and Intel Xeon Phi coprocessors. 5.1 "Measuring performance"_#acc_1 :ulb,l 5.2 "Algorithms and code options to boost performance"_#acc_2 :l 5.3 "Accelerator packages with optimized styles"_#acc_3 :l 5.3.1 "GPU package"_accelerate_gpu.html :l 5.3.2 "USER-INTEL package"_accelerate_intel.html :l 5.3.3 "KOKKOS package"_accelerate_kokkos.html :l 5.3.4 "USER-OMP package"_accelerate_omp.html :l 5.3.5 "OPT package"_accelerate_opt.html :l 5.4 "Comparison of various accelerator packages"_#acc_4 :l :ule The "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site gives performance results for the various accelerator packages discussed in Section 5.3, for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms. :line :line 5.1 Measuring performance :h4,link(acc_1) Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are. The best way to do this is to run your system (actual number of atoms) for a modest number of timesteps (say 100 steps) on several different processor counts, including a single processor if possible. Do this for an equilibrium version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for 1000s of timesteps to get accurate timings; you can simply extrapolate from short runs. For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. "This section"_Section_start.html#start_7 of the manual has an overview. Running on one (or a few) processors should give a good estimate of the serial performance and what portions of the timestep are taking the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability. I.e. if the simulation runs 16x faster on 16 processors, it's 100% parallel efficient; if it runs 8x faster on 16 processors, it's 50% efficient. The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling.
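For example, a minimal sketch of such a scaling study, assuming an executable named lmp_machine and an input script in.script that already runs for roughly 100 steps, is:

mpirun -np 1 lmp_machine -in in.script -log log.1
mpirun -np 4 lmp_machine -in in.script -log log.4
mpirun -np 16 lmp_machine -in in.script -log log.16 :pre

Dividing the 1-processor loop time by the 16-processor loop time, and then by 16, gives the parallel efficiency discussed above; the timing breakdown at the end of each log file shows which operations stop scaling first.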
Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive. Other important details in the timing info are the histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait when communication occurs for other processors to catch up. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load-imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile LAMMPS, to obtain synchronized timings. :line 5.2 General strategies :h4,link(acc_2) NOTE: this section 5.2 is still a work in progress Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.

rRESPA
2-FFT PPPM
Staggered PPPM
single vs double PPPM
partial charge PPPM
verlet/split run style
processor command for proc layout and numa layout
load-balancing: balance and fix balance :ul

2-FFT PPPM, also called {analytic differentiation} or {ad} PPPM, uses 2 FFTs instead of the 4 FFTs used by the default {ik differentiation} PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM. Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to halve the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful when using a relative accuracy of slightly greater than 1e-5 and above. :c,image(JPG/rhodo_staggered.jpg) NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat. :line 5.3 Packages with optimized styles :h4,link(acc_3) Accelerated versions of various "pair styles"_pair_style.html, "fixes"_fix.html, "computes"_compute.html, and other commands have been added to LAMMPS, which will typically run faster than the standard non-accelerated versions. Some require appropriate hardware to be present on your system, e.g. GPUs or Intel Xeon Phi coprocessors. All of these commands are in packages provided with LAMMPS.
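A quick way to see which accelerator packages and accelerated styles were actually compiled into your executable is the "-h" "command-line switch"_Section_start.html#start_6, which in recent LAMMPS versions prints the installed packages and the available styles. This is only a sketch, assuming an executable named lmp_machine; the exact format of the help output varies between versions:

lmp_machine -h | grep gpu :pre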
An overview of packages is given in "Section packages"_Section_packages.html. These are the accelerator packages currently in LAMMPS, either as standard or user packages: "GPU Package"_accelerate_gpu.html : for NVIDIA GPUs as well as OpenCL support "USER-INTEL Package"_accelerate_intel.html : for Intel CPUs and Intel Xeon Phi "KOKKOS Package"_accelerate_kokkos.html : for NVIDIA GPUs, Intel Xeon Phi, and OpenMP threading "USER-OMP Package"_accelerate_omp.html : for OpenMP threading and generic CPU optimizations "OPT Package"_accelerate_opt.html : generic CPU optimizations :tb(s=:) <!-- RST .. toctree:: :maxdepth: 1 :hidden: accelerate_gpu accelerate_intel accelerate_kokkos accelerate_omp accelerate_opt END_RST --> Inverting this list, LAMMPS currently has acceleration support for three kinds of hardware, via the listed packages: Many-core CPUs : "USER-INTEL"_accelerate_intel.html, "KOKKOS"_accelerate_kokkos.html, "USER-OMP"_accelerate_omp.html, "OPT"_accelerate_opt.html packages NVIDIA GPUs : "GPU"_accelerate_gpu.html, "KOKKOS"_accelerate_kokkos.html packages Intel Phi : "USER-INTEL"_accelerate_intel.html, "KOKKOS"_accelerate_kokkos.html packages :tb(s=:) Which package is fastest for your hardware may depend on the size of problem you are running and what commands (accelerated and non-accelerated) are invoked by your input script. While these doc pages include performance guidelines, there is no substitute for trying out the different packages appropriate to your hardware. Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that uses the style is identical, the functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects. For example, all of these styles are accelerated variants of the Lennard-Jones "pair_style lj/cut"_pair_lj.html: "pair_style lj/cut/gpu"_pair_lj.html "pair_style lj/cut/intel"_pair_lj.html "pair_style lj/cut/kk"_pair_lj.html "pair_style lj/cut/omp"_pair_lj.html "pair_style lj/cut/opt"_pair_lj.html :ul To see what accelerated styles are currently available, see "Section 3.5"_Section_commands.html#cmd_5 of the manual. The doc pages for individual commands (e.g. "pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) also list any accelerated variants available for that style. To use an accelerator package in LAMMPS, and one or more of the styles it provides, follow these general steps.
Details vary from package to package and are explained in the individual accelerator doc pages, listed above: build the accelerator library | only for GPU package | install the accelerator package | make yes-opt, make yes-user-intel, etc | add compile/link flags to Makefile.machine in src/MAKE | only for USER-INTEL, KOKKOS, USER-OMP, OPT packages | re-build LAMMPS | make machine | prepare and test a regular LAMMPS simulation | lmp_machine -in in.script; mpirun -np 32 lmp_machine -in in.script | enable specific accelerator support via '-k on' "command-line switch"_Section_start.html#start_6, | only needed for KOKKOS package | set any needed options for the package via "-pk" "command-line switch"_Section_start.html#start_6 or "package"_package.html command, | only if defaults need to be changed | use accelerated styles in your input via "-sf" "command-line switch"_Section_start.html#start_6 or "suffix"_suffix.html command | lmp_machine -in in.script -sf gpu :tb(c=2,s=|) -Note that the first 4 steps can be done as a single command, using the -src/Make.py tool. This tool is discussed in "Section +Note that the first 4 steps can be done with +suitable make command invocations. This is discussed in "Section 4"_Section_packages.html of the manual, and their use is illustrated in the individual accelerator sections. Typically these steps only need to be done once, to create an executable that uses one or more accelerator packages. The last 4 steps can all be done from the command-line when LAMMPS is launched, without changing your input script, as illustrated in the individual accelerator sections. Or you can add "package"_package.html and "suffix"_suffix.html commands to your input script. NOTE: With a few exceptions, you can build a single LAMMPS executable with all its accelerator packages installed. Note however that the USER-INTEL and KOKKOS packages require you to choose one of their hardware options when building for a specific platform. I.e. CPU or Phi option for the USER-INTEL package. Or the OpenMP, Cuda, or Phi option for the KOKKOS package. These are the exceptions. You cannot build a single executable with: both the USER-INTEL Phi and KOKKOS Phi options the USER-INTEL Phi or KOKKOS Phi option, and the GPU package :ul See the examples/accelerate/README and make.list files for sample Make.py commands that build LAMMPS with any or all of the accelerator packages. As an example, here is a command that builds with all the GPU-related packages installed (GPU, KOKKOS with Cuda), including settings to build the needed auxiliary GPU libraries for Kepler GPUs: Make.py -j 16 -p omp gpu kokkos -cc nvcc wrap=mpi \ -gpu mode=double arch=35 -kokkos cuda arch=35 lib-all file mpi :pre The examples/accelerate directory also has input scripts that can be used with all of the accelerator packages. See its README file for details. Likewise, the bench directory has FERMI and KEPLER and PHI sub-directories with Make.py commands and input scripts for using all the accelerator packages on various machines. See the README files in those dirs. As mentioned above, the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site gives performance results for the various accelerator packages for several of the standard LAMMPS benchmark problems, as a function of problem size and number of compute nodes, on different hardware platforms. Here is a brief summary of what the various packages provide. Details are in the individual accelerator sections.
Styles with a "gpu" suffix are part of the GPU package, and can be run on NVIDIA GPUs. The speed-up on a GPU depends on a variety of factors, discussed in the accelerator sections. :ulb,l Styles with an "intel" suffix are part of the USER-INTEL package. These styles support vectorized single and mixed precision calculations, in addition to full double precision. In extreme cases, this can provide speedups over 3.5x on CPUs. The package also supports acceleration in "offload" mode to Intel(R) Xeon Phi(TM) coprocessors. This can result in additional speedup over 2x depending on the hardware configuration. :l Styles with a "kk" suffix are part of the KOKKOS package, and can be run using OpenMP on multicore CPUs, on an NVIDIA GPU, or on an Intel Xeon Phi in "native" mode. The speed-up depends on a variety of factors, as discussed on the KOKKOS accelerator page. :l Styles with an "omp" suffix are part of the USER-OMP package and allow a pair-style to be run in multi-threaded mode using OpenMP. This can be useful on nodes with high-core counts when using less MPI processes than cores is advantageous, e.g. when running with PPPM so that FFTs are run on fewer MPI processors or when the many MPI tasks would overload the available bandwidth for communication. :l Styles with an "opt" suffix are part of the OPT package and typically speed-up the pairwise calculations of your simulation by 5-25% on a CPU. :l :ule The individual accelerator package doc pages explain: what hardware and software the accelerated package requires how to build LAMMPS with the accelerated package how to run with the accelerated package either via command-line switches or modifying the input script speed-ups to expect guidelines for best performance restrictions :ul :line 5.4 Comparison of various accelerator packages :h4,link(acc_4) NOTE: this section still needs to be re-worked with additional KOKKOS and USER-INTEL information. The next section compares and contrasts the various accelerator options, since there are multiple ways to perform OpenMP threading, run on GPUs, and run on Intel Xeon Phi coprocessors. All 3 of these packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways. As a consequence, for a particular simulation on specific hardware, one package may be faster than the other. We give guidelines below, but the best way to determine which package is faster for your input script is to try both of them on your machine. See the benchmarking section below for examples where this has been done. [Guidelines for using each package optimally:] The GPU package allows you to assign multiple CPUs (cores) to a single GPU (a common configuration for "hybrid" nodes that contain multicore CPU(s) and GPU(s)) and works effectively in this mode. :ulb,l The GPU package moves per-atom data (coordinates, forces) back-and-forth between the CPU and GPU every timestep. The KOKKOS/CUDA package only does this on timesteps when a CPU calculation is required (e.g. to invoke a fix or compute that is non-GPU-ized). Hence, if you can formulate your input script to only use GPU-ized fixes and computes, and avoid doing I/O too often (thermo output, dump file snapshots, restart files), then the data transfer cost of the KOKKOS/CUDA package can be very low, causing it to run faster than the GPU package. :l The GPU package is often faster than the KOKKOS/CUDA package, if the number of atoms per GPU is smaller. 
The crossover point, in terms of atoms/GPU at which the KOKKOS/CUDA package becomes faster depends strongly on the pair style. For example, for a simple Lennard Jones system the crossover (in single precision) is often about 50K-100K atoms per GPU. When performing double precision calculations the crossover point can be significantly smaller. :l Both packages compute bonded interactions (bonds, angles, etc) on the CPU. If the GPU package is running with several MPI processes assigned to one GPU, the cost of computing the bonded interactions is spread across more CPUs and hence the GPU package can run faster. :l When using the GPU package with multiple CPUs assigned to one GPU, its performance depends to some extent on high bandwidth between the CPUs and the GPU. Hence its performance is affected if full 16 PCIe lanes are not available for each GPU. In HPC environments this can be the case if S2050/70 servers are used, where two devices generally share one PCIe 2.0 16x slot. Also many multi-GPU mainboards do not provide full 16 lanes to each of the PCIe 2.0 16x slots. :l :ule [Differences between the two packages:] The GPU package accelerates only pair force, neighbor list, and PPPM calculations. :ulb,l The GPU package requires neighbor lists to be built on the CPU when using exclusion lists, hybrid pair styles, or a triclinic simulation box. :l :ule diff --git a/doc/src/accelerate_gpu.txt b/doc/src/accelerate_gpu.txt index 68e9fa477..2723b6e97 100644 --- a/doc/src/accelerate_gpu.txt +++ b/doc/src/accelerate_gpu.txt @@ -1,254 +1,249 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.1 GPU package :h5 The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and for "kspace_style pppm"_kspace_style.html for long-range Coulombics. It has the following general features: It is designed to exploit common GPU hardware configurations where one or more GPUs are coupled to many cores of one or more multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l Atom-based data (e.g. coordinates, forces) moves back-and-forth between the CPU(s) and GPU every timestep. :l Neighbor lists can be built on the CPU or on the GPU :l The charge assignment and force interpolation portions of PPPM can be run on the GPU. The FFT portion, which requires MPI communication between processors, runs on the CPU. :l Asynchronous force computations can be performed simultaneously on the CPU(s) and GPU. :l It allows for GPU computations to be performed in single or double precision, or in mixed-mode precision, where pairwise forces are computed in single precision, but accumulated into double-precision force vectors. :l LAMMPS-specific code is in the GPU package. It makes calls to a generic GPU library in the lib/gpu directory. This library provides NVIDIA support as well as more general OpenCL support, so that the same functionality can eventually be supported on a variety of GPU hardware. 
:l :ule Here is a quick overview of how to enable and use the GPU package: build the library in lib/gpu for your GPU hardware with the desired precision settings install the GPU package and build LAMMPS as usual use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU specify the # of GPUs per node use GPU styles in your input script :ul The latter two steps can be done using the "-pk gpu" and "-sf gpu" "command-line switches"_Section_start.html#start_6 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package gpu"_package.html or "suffix gpu"_suffix.html commands respectively to your input script. [Required hardware/software:] To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system: Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information Go to http://www.nvidia.com/object/cuda_get.html Install a driver and toolkit appropriate for your system (SDK is not necessary) Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul [Building LAMMPS with the GPU package:] This requires two steps (a,b): build the GPU library, then build LAMMPS with the GPU package. -You can do both these steps in one line, using the src/Make.py script, -described in "Section 4"_Section_packages.html of the manual. -Type "Make.py -h" for help. If run from the src directory, this -command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the -starting Makefile.machine: - -Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi :pre +You can do both these steps in one line as described in +"Section 4"_Section_packages.html of the manual. Or you can follow these two (a,b) steps: (a) Build the GPU library The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile. CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system CUDA_ARCH = needs to be appropriate to your GPUs CUDA_PREC = precision (double, mixed, single) you desire :ul See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings: CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double :pre The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings. To build the library, type: make -f Makefile.machine :pre If successful, it will produce the files libgpu.a and Makefile.lammps. The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed. Note that to change the precision of the GPU library, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", followed by the make command above. (b) Build LAMMPS with the GPU package cd lammps/src make yes-gpu make machine :pre No additional compile/link flags are needed in Makefile.machine. 
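Putting steps (a) and (b) together, a complete build might look like the sketch below. The Makefile name in lib/gpu and the "mpi" machine target are assumptions; substitute the ones that match your GPU hardware, CUDA installation, and the Makefile.machine you normally use:

cd lammps/lib/gpu
make -f Makefile.linux.double    # build the GPU library (here in double precision)
cd ../../src
make yes-gpu                     # install the GPU package
make mpi                         # re-build LAMMPS :pre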
Note that if you change the GPU library precision (discussed above) and rebuild the GPU library, then you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library. [Run with the GPU package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. When using the GPU package, you cannot assign more than one GPU to a single MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way. Likewise it may be more efficient to use fewer MPI tasks/node than the available # of CPU cores. Assignment of multiple MPI tasks to a GPU will happen automatically if you create more MPI tasks/node than there are GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be shared by 4 MPI tasks. Use the "-sf gpu" "command-line switch"_Section_start.html#start_6, which will automatically append "gpu" to styles that support it. Use the "-pk gpu Ng" "command-line switch"_Section_start.html#start_6 to set Ng = # of GPUs/node to use.

lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes :pre

Note that if the "-sf gpu" switch is used, it also issues a default "package gpu 1"_package.html command, which sets the number of GPUs/node to 1. Using the "-pk" switch explicitly allows for setting the number of GPUs/node to use and additional options. Its syntax is the same as the "package gpu" command. See the "package"_package.html command doc page for details, including the default values used for all its options if it is not specified. Note that the default for the "package gpu"_package.html command is to set the Newton flag to "off" for pairwise interactions. It does not affect the setting for bonded interactions (LAMMPS default is "on"). The "off" setting for pairwise interactions is currently required for GPU package pair styles. [Or run with the GPU package by editing an input script:] The discussion above for the mpirun/mpiexec command, MPI tasks/node, and use of multiple MPI tasks/GPU is the same. Use the "suffix gpu"_suffix.html command, or you can explicitly add a "gpu" suffix to individual styles in your input script, e.g.

pair_style lj/cut/gpu 2.5 :pre

You must also use the "package gpu"_package.html command to enable the GPU package, unless the "-sf gpu" or "-pk gpu" "command-line switches"_Section_start.html#start_6 were used. It specifies the number of GPUs/node to use, as well as other options. [Speed-ups to expect:] The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed). See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL. You should also experiment with how many MPI tasks per GPU to use to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used.
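One way to do this experiment is to time the same short run while varying how many MPI tasks share the GPUs. This is only a sketch, assuming a 16-core node with 2 GPUs and the executable and input script names used above:

mpirun -np 4 lmp_machine -sf gpu -pk gpu 2 -in in.script    # 2 MPI tasks per GPU
mpirun -np 8 lmp_machine -sf gpu -pk gpu 2 -in in.script    # 4 MPI tasks per GPU
mpirun -np 16 lmp_machine -sf gpu -pk gpu 2 -in in.script   # 8 MPI tasks per GPU :pre

Compare the loop times of the runs and use the ratio that is fastest for your pair style and problem size.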
Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster. [Guidelines for best performance:] Using multiple MPI tasks per GPU will often give the best performance, as allowed by most multi-core CPU/GPU configurations. :ulb,l If the number of particles per MPI task is small (e.g. 100s of particles), it can be more efficient to run with fewer MPI tasks per GPU, even if you do not use all the cores on the compute node. :l The "package gpu"_package.html command has several options for tuning performance. Neighbor lists can be built on the GPU or CPU. Force calculations can be dynamically balanced across the CPU cores and GPUs. GPU-specific settings can be made which can be optimized for different hardware. See the "package"_package.html command doc page for details. :l As described by the "package gpu"_package.html command, GPU-accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with "bond"_bond_style.html, "angle"_angle_style.html, "dihedral"_dihedral_style.html, "improper"_improper_style.html, and "long-range"_kspace_style.html calculations will not be included in the "Pair" time. :l When the {mode} setting for the package gpu command is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc) is output only with the LAMMPS screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations. :l The output section "GPU Time Info (average)" reports "Max Mem / Proc". This is the maximum memory used at one time on the GPU for data storage by a single MPI process. :l :ule [Restrictions:] None. diff --git a/doc/src/accelerate_intel.txt b/doc/src/accelerate_intel.txt index 74ae9d9a4..9eb295e0d 100644 --- a/doc/src/accelerate_intel.txt +++ b/doc/src/accelerate_intel.txt @@ -1,517 +1,514 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.2 USER-INTEL package :h5 The USER-INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on Intel CPUs by running in single, mixed, or double precision with vectorization. The second is acceleration on Intel Xeon Phi coprocessors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a coprocessor from a CPU, the same routine is run twice, once on the CPU and once with an offload flag. This allows LAMMPS to run on the CPU cores and coprocessor cores simultaneously.
[Currently Available USER-INTEL Styles:] Angle Styles: charmm, harmonic :ulb,l Bond Styles: fene, harmonic :l Dihedral Styles: charmm, harmonic, opls :l Fixes: nve, npt, nvt, nvt/sllod :l Improper Styles: cvff, harmonic :l Pair Styles: buck/coul/cut, buck/coul/long, buck, eam, gayberne, charmm/coul/long, lj/cut, lj/cut/coul/long, lj/long/coul/long, sw, tersoff :l K-Space Styles: pppm, pppm/disp :l :ule [Speed-ups to expect:] The speedups will depend on your simulation, the hardware, which styles are used, the number of atoms, and the floating-point precision mode. Performance improvements are shown compared to LAMMPS {without using other acceleration packages} as these are under active development (and subject to performance changes). The measurements were performed using the input files available in the src/USER-INTEL/TEST directory with the provided run script. These are scalable in size; the results given are with 512K particles (524K for Liquid Crystal). Most of the simulations are standard LAMMPS benchmarks (indicated by the filename extension in parentheses) with modifications to the run length and to add a warmup run (for use with offload benchmarks). :c,image(JPG/user_intel.png) Results are speedups obtained on Intel Xeon E5-2697v4 processors (code-named Broadwell) and Intel Xeon Phi 7250 processors (code-named Knights Landing) with "June 2017" LAMMPS built with Intel Parallel Studio 2017 update 2. Results are with 1 MPI task per physical core. See {src/USER-INTEL/TEST/README} for the raw simulation rates and instructions to reproduce. :line [Accuracy and order of operations:] In most molecular dynamics software, parallelization parameters (# of MPI, OpenMP, and vectorization) can change the results due to changing the order of operations with finite-precision calculations. The USER-INTEL package is deterministic. This means that the results should be reproducible from run to run with the {same} parallel configurations and when using deterministic libraries or library settings (MPI, OpenMP, FFT). However, there are differences in the USER-INTEL package that can change the order of operations compared to LAMMPS without acceleration: Neighbor lists can be created in a different order :ulb,l Bins used for sorting atoms can be oriented differently :l The default stencil order for PPPM is 7. By default, LAMMPS will calculate other PPPM parameters to fit the desired accuracy with this order :l The {newton} setting applies to all atoms, not just atoms shared between MPI tasks :l Vectorization can change the order for adding pairwise forces :l :ule The precision mode (described below) used with the USER-INTEL package can change the {accuracy} of the calculations. For the default {mixed} precision option, calculations between pairs or triplets of atoms are performed in single precision, intended to be within the inherent error of MD simulations. All accumulation is performed in double precision to prevent the error from growing with the number of atoms in the simulation. {Single} precision mode should not be used without appropriate validation. :line [Quick Start for Experienced Users:] LAMMPS should be built with the USER-INTEL package installed. Simulations should be run with 1 MPI task per physical {core}, not {hardware thread}. Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.
:ulb,l Set the environment variable KMP_BLOCKTIME=0 :l "-pk intel 0 omp $t -sf intel" added to LAMMPS command-line :l $t should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi :l For some of the simple 2-body potentials without long-range electrostatics, performance and scalability can be better with the "newton off" setting added to the input script :l For simulations on higher node counts, add "processors * * * grid numa" to the beginning of the input script for better scalability :l If using {kspace_style pppm} in the input script, add "kspace_modify diff ad" for better performance :l :ule For Intel Xeon Phi CPUs: Runs should be performed using MCDRAM. :ulb,l :ule For simulations using {kspace_style pppm} on Intel CPUs supporting AVX-512: Add "kspace_modify diff ad" to the input script :ulb,l The command-line option should be changed to "-pk intel 0 omp $r lrt yes -sf intel" where $r is the number of threads minus 1. :l Do not use thread affinity (set KMP_AFFINITY=none) :l The "newton off" setting may provide better scalability :l :ule For Intel Xeon Phi coprocessors (Offload): Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary :ulb,l "-pk intel N omp 1" added to command-line where N is the number of coprocessors per node. :l :ule :line [Required hardware/software:] In order to use offload to coprocessors, an Intel Xeon Phi coprocessor and an Intel compiler are required. For this, the recommended version of the Intel compiler is 14.0.1.106 or versions 15.0.2.044 and higher. Although any compiler can be used with the USER-INTEL package, currently, vectorization directives are disabled by default when not using Intel compilers due to lack of standard support and observations of decreased performance. The OpenMP standard now supports directives for vectorization and we plan to transition the code to this standard once it is available in most compilers. We expect this to allow improved performance and support with other compilers. For Intel Xeon Phi x200 series processors (code-named Knights Landing), there are multiple configuration options for the hardware. For best performance, we recommend that the MCDRAM is configured in "Flat" mode and with the cluster mode set to "Quadrant" or "SNC4". "Cache" mode can also be used, although the performance might be slightly lower. [Notes about Simultaneous Multithreading:] Modern CPUs often support Simultaneous Multithreading (SMT). On Intel processors, this is called Hyper-Threading (HT) technology. SMT is hardware support for running multiple threads efficiently on a single core. {Hardware threads} or {logical cores} are often used to refer to the number of threads that are supported in hardware. For example, the Intel Xeon E5-2697v4 processor is described as having 36 cores and 72 threads. This means that 36 MPI processes or OpenMP threads can run simultaneously on separate cores, but that up to 72 MPI processes or OpenMP threads can be running on the CPU without costly operating system context switches. Molecular dynamics simulations will often run faster when making use of SMT. If a thread becomes stalled, for example because it is waiting on data that has not yet arrived from memory, another thread can start running so that the CPU pipeline is still being used efficiently. 
Although benefits can be seen by launching an MPI task for every hardware thread, for multinode simulations, we recommend that OpenMP threads are used for SMT instead, either with the USER-INTEL package, "USER-OMP package"_accelerate_omp.html, or "KOKKOS package"_accelerate_kokkos.html. In the example above, up to 36X speedups can be observed by using all 36 physical cores with LAMMPS. By using all 72 hardware threads, an additional 10-30% performance gain can be achieved. The BIOS on many platforms allows SMT to be disabled; however, we do not recommend this on modern processors as there is little to no benefit for any software package in most cases. The operating system will report every hardware thread as a separate core allowing one to determine the number of hardware threads available. On Linux systems, this information can normally be obtained with:

cat /proc/cpuinfo :pre

[Building LAMMPS with the USER-INTEL package:] NOTE: See the src/USER-INTEL/README file for additional flags that might be needed for best performance on Intel server processors code-named "Skylake". The USER-INTEL package must be installed into the source directory:

make yes-user-intel :pre

Several example Makefiles for building with the Intel compiler are included with LAMMPS in the src/MAKE/OPTIONS/ directory:

Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload
Makefile.knl # Intel Compiler, Intel MPI, No Offload
Makefile.intel_cpu_mpich # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi # Intel Compiler, OpenMPI, No Offload
Makefile.intel_coprocessor # Intel Compiler, Intel MPI, Offload :pre

Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that it explicitly specifies that vectorization should be for Intel Xeon Phi x200 processors, making it easier to cross-compile. For users with recent installations of Intel Parallel Studio, the process can be as simple as:

make yes-user-intel
source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh # or psxevars.csh for C-shell
make intel_cpu_intelmpi :pre

-Alternatively, the build can be accomplished with the src/Make.py -script, described in "Section 4"_Section_packages.html of the -manual. Type "Make.py -h" for help. For an example: - -Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre +Alternatively, this can be done with +suitable make command invocations, as discussed in "Section +4"_Section_packages.html of the manual. Note that if you build with support for a Phi coprocessor, the same binary can be used on nodes with or without coprocessors installed. However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary. The general requirements for Makefiles with the USER-INTEL package are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When using Intel compilers, "-restrict" is required and "-qopenmp" is highly recommended for CCFLAGS and LINKFLAGS. LIB should include "-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD" is required for CCFLAGS and "-qoffload" is required for LINKFLAGS. Other recommended CCFLAG options for best performance are "-O2 -fno-alias -ansi-alias -qoverride-limits fp-model fast=2 --no-prec-div". The Make.py command will add all of these -automatically. +-no-prec-div". NOTE: The vectorization and math capabilities can differ depending on the CPU. For Intel compilers, the "-x" flag specifies the type of processor for which to optimize.
"-xHost" specifies that the compiler should build for the processor used for compiling. For Intel Xeon Phi x200 series processors, this option is "-xMIC-AVX512". For fourth generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should be used. For older Intel Xeon processors, "-xAVX" will perform best in general for the different simulations in LAMMPS. The default in most of the example Makefiles is to use "-xHost", however this should not be used when cross-compiling. [Running LAMMPS with the USER-INTEL package:] Running LAMMPS with the USER-INTEL package is similar to normal use with the exceptions that one should 1) specify that LAMMPS should use the USER-INTEL package, 2) specify the number of OpenMP threads, and 3) optionally specify the specific LAMMPS styles that should use the USER-INTEL package. 1) and 2) can be performed from the command-line or by editing the input script. 3) requires editing the input script. Advanced performance tuning options are also described below to get the best performance. When running on a single node (including runs using offload to a coprocessor), best performance is normally obtained by using 1 MPI task per physical core and additional OpenMP threads with SMT. For Intel Xeon processors, 2 OpenMP threads should be used for SMT. For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used (best choice depends on the simulation). In cases where the user specifies that LRT mode is used (described below), 1 or 3 OpenMP threads should be used. For multi-node runs, using 1 MPI task per physical core will often perform best, however, depending on the machine and scale, users might get better performance by decreasing the number of MPI tasks and using more OpenMP threads. For performance, the product of the number of MPI tasks and OpenMP threads should not exceed the number of available hardware threads in almost all cases. NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host {when using offload to a coprocessor}. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity} option to the "package intel"_package.html command or by disabling the option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). Disabling this option is not recommended, especially when running on a machine with Intel Hyper-Threading technology disabled. [Run with the USER-INTEL package from the command line:] To enable USER-INTEL optimizations for all available styles used in the input script, the "-sf intel" "command-line switch"_Section_start.html#start_6 can be used without any requirement for editing the input script. This switch will automatically append "intel" to styles that support it. It also invokes a default command: "package intel 1"_package.html. This package command is used to set options for the USER-INTEL package. The default package command will specify that USER-INTEL calculations are performed in mixed precision, that the number of OpenMP threads is specified by the OMP_NUM_THREADS environment variable, and that if coprocessors are present and the binary was built with offload support, that 1 coprocessor per node will be used with automatic balancing of work between the CPU and the coprocessor. 
You can specify different options for the USER-INTEL package by using the "-pk intel Nphi" "command-line switch"_Section_start.html#start_6 with keyword/value pairs as specified in the documentation. Here, Nphi = # of Xeon Phi coprocessors/node (ignored without offload support). Common options to the USER-INTEL package include {omp} to override any OMP_NUM_THREADS setting and specify the number of OpenMP threads, {mode} to set the floating-point precision mode, and {lrt} to enable Long-Range Thread mode as described below. See the "package intel"_package.html command for details, including the default values used for all its options if not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired. Examples (see documentation for your MPI/Machine for differences in launching MPI applications): mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre [Or run with the USER-INTEL package by editing an input script:] As an alternative to adding command-line arguments, the input script can be edited to enable the USER-INTEL package. This requires adding the "package intel"_package.html command to the top of the input script. For the second example above, this would be: package intel 0 omp 2 mode double :pre To enable the USER-INTEL package only for individual styles, you can add an "intel" suffix to the individual style, e.g.: pair_style lj/cut/intel 2.5 :pre Alternatively, the "suffix intel"_suffix.html command can be added to the input script to enable USER-INTEL styles for the commands that follow in the input script. [Tuning for Performance:] NOTE: The USER-INTEL package will perform better with modifications to the input script when "PPPM"_kspace_style.html is used: "kspace_modify diff ad"_kspace_modify.html should be added to the input script. Long-Range Thread (LRT) mode is an option to the "package intel"_package.html command that can improve performance when using "PPPM"_kspace_style.html for long-range electrostatics on processors with SMT. It generates an extra pthread for each MPI task. The thread is dedicated to performing some of the PPPM calculations and MPI communications. On Intel Xeon Phi x200 series CPUs, this will likely always improve performance, even on a single node. On Intel Xeon processors, using this mode might result in better performance when using multiple nodes, depending on the machine. To use this mode, specify that the number of OpenMP threads is one less than would normally be used for the run and add the "lrt yes" option to the "-pk" command-line suffix or "package intel" command. For example, if a run would normally perform best with "-pk intel 0 omp 4", instead use "-pk intel 0 omp 3 lrt yes". When using LRT, you should set the environment variable "KMP_AFFINITY=none". LRT mode is not supported when using offload. NOTE: Changing the "newton"_newton.html setting to off can improve performance and/or scalability for simple 2-body potentials such as lj/cut or when using LRT mode on processors supporting AVX-512. Not all styles are supported in the USER-INTEL package. You can mix the USER-INTEL package with styles from the "OPT"_accelerate_opt.html package or the "USER-OMP package"_accelerate_omp.html. 
Of course, this requires that these packages were installed at build time. This can be performed automatically by using "-sf hybrid intel opt" or "-sf hybrid intel omp" command-line options. Alternatively, the "opt" and "omp" suffixes can be appended manually in the input script. For the latter, the "package omp"_package.html command must be in the input script or the "-pk omp Nt" "command-line switch"_Section_start.html#start_6 must be used where Nt is the number of OpenMP threads. The number of OpenMP threads should not be set differently for the different packages. Note that the "suffix hybrid intel omp"_suffix.html command can also be used within the input script to automatically append the "omp" suffix to styles when USER-INTEL styles are not available. NOTE: For simulations on higher node counts, add "processors * * * grid numa"_processors.html to the beginning of the input script for better scalability. When running on many nodes, performance might be better when using fewer OpenMP threads and more MPI tasks. This will depend on the simulation and the machine. Using the "verlet/split"_run_style.html run style might also give better performance for simulations with "PPPM"_kspace_style.html electrostatics. Note that this is an alternative to LRT mode and the two cannot be used together. Currently, when using Intel MPI with Intel Xeon Phi x200 series CPUs, better performance might be obtained by setting the environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do not yet have full support for AVX-512. Runs on Intel Xeon Phi x200 series processors will always perform better using MCDRAM. Please consult your system documentation for the best approach to specify that MPI runs are performed in MCDRAM. [Tuning for Offload Performance:] The default settings for offload should give good performance. When using LAMMPS with offload to Intel coprocessors, best performance will typically be achieved with concurrent calculations performed on both the CPU and the coprocessor. This is achieved by offloading only a fraction of the neighbor and pair computations to the coprocessor or using "hybrid"_pair_hybrid.html pair styles where only one style uses the "intel" suffix. For simulations with long-range electrostatics or bond, angle, dihedral, improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI communications for these calculations on the host CPU. This is illustrated in the figure below for the rhodopsin protein benchmark running on E5-2697v2 processors with an Intel Xeon Phi 7120p coprocessor. In this plot, the vertical axis is time and routines running at the same time are running concurrently on both the host and the coprocessor. :c,image(JPG/offload_knc.png) The fraction of the offloaded work is controlled by the {balance} keyword in the "package intel"_package.html command. A balance of 0 runs all calculations on the CPU. A balance of 1 runs all supported calculations on the coprocessor. A balance of 0.5 runs half of the calculations on the coprocessor. Setting the balance to -1 (the default) will enable dynamic load balancing that continuously adjusts the fraction of offloaded work throughout the simulation. Because data transfer cannot be timed, this option typically produces results within 5 to 10 percent of the optimal fixed balance.
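As a sketch of how these balance values are specified on the command line (the same keywords can instead be given on a "package intel"_package.html command in the input script; adjust the task count, thread count, and executable/input names to your system):

mpirun -np 36 lmp_machine -sf intel -pk intel 1 omp 2 balance 0.5 -in in.script   # fixed 50/50 split with the coprocessor
mpirun -np 36 lmp_machine -sf intel -pk intel 1 omp 2 balance -1 -in in.script    # dynamic load balancing (the default) :pre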
If running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer to find a near-optimal setting that will carry over to additional runs. The default for the "package intel"_package.html command is to have all the MPI tasks on a given compute node use a single Xeon Phi coprocessor. In general, running with a large number of MPI tasks on each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads available on the coprocessor. For example, if your card has 61 cores, with 60 cores available for offload and 4 hardware threads per core (240 total threads), running with 24 MPI tasks per node will cause each MPI task to use a subset of 10 threads on the coprocessor. Fine tuning of the number of threads to use per MPI task or the number of threads to use per core can be accomplished with keyword settings of the "package intel"_package.html command. The USER-INTEL package has two modes for deciding which atoms will be handled by the coprocessor. This choice is controlled with the {ghost} keyword of the "package intel"_package.html command. When set to 0, ghost atoms (atoms at the borders between MPI tasks) are not offloaded to the card. This allows for overlap of MPI communication of forces with computation on the coprocessor when the "newton"_newton.html setting is "on". The default is dependent on the style being used, however, better performance may be achieved by setting this option explicitly. When using offload with CPU Hyper-Threading disabled, it may help performance to use fewer MPI tasks and OpenMP threads than available cores. This is due to the fact that additional threads are generated internally to handle the asynchronous offload tasks. If pair computations are being offloaded to an Intel Xeon Phi coprocessor, a diagnostic line is printed to the screen (not to the log file), during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for "atom sorting"_atom_modify.html is changed to 1 so that the per-atom data is effectively sorted at every rebuild of the neighbor lists. All the available coprocessor threads on each Phi will be divided among MPI tasks, unless the {tptask} option of the "-pk intel" "command-line switch"_Section_start.html#start_6 is used to limit the coprocessor threads per MPI task. [Restrictions:] When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles that require skip lists for neighbor builds cannot be offloaded. Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel accelerated style may be used with hybrid styles. "Special_bonds"_special_bonds.html exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the "run_style respa"_run_style.html command; only the "pair" option is supported. [References:] Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., "Optimizing Classical Molecular Dynamics in LAMMPS," in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. 
Morgan Kaufmann. :ulb,l Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. "Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency."_http://dl.acm.org/citation.cfm?id=3014915 2016 High Performance Computing, Networking, Storage and Analysis, SC16: International Conference (pp. 82-95). :l Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l :ule diff --git a/doc/src/accelerate_kokkos.txt b/doc/src/accelerate_kokkos.txt index 6ccd69584..712a05300 100644 --- a/doc/src/accelerate_kokkos.txt +++ b/doc/src/accelerate_kokkos.txt @@ -1,496 +1,493 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.3 KOKKOS package :h5 The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia). The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos. The Kokkos library is part of "Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core CPU. The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this functionality is hidden from the developer, and does not affect how the kernel is coded. These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos. Kokkos currently provides support for 3 modes of execution (per MPI task). These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs), and OpenMP (for Intel Phi). Note that the KOKKOS package supports running on the Phi in native mode, not offload mode like the USER-INTEL package supports. You choose the mode at build time to produce an executable compatible with specific hardware. Here is a quick overview of how to use the KOKKOS package for CPU acceleration, assuming one or more 16-core nodes. More details follow. 
use a C++11 compatible compiler
make yes-kokkos
make mpi KOKKOS_DEVICES=OpenMP   # build with the KOKKOS package
-make kokkos_omp                  # or Makefile.kokkos_omp already has variable set
-Make.py -v -p kokkos -kokkos omp -o mpi -a file mpi   # or one-line build via Make.py :pre
+make kokkos_omp                  # or Makefile.kokkos_omp already has variable set :pre

mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj              # 1 node, 16 MPI tasks/node, no threads
mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj   # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
use KOKKOS styles in your input script :ul

Here is a quick overview of how to use the KOKKOS package for GPUs, assuming one or more nodes, each with 16 cores and a GPU. More details follow.

(The use of NVCC and which Makefiles to examine are discussed below.)

use a C++11 compatible compiler
KOKKOS_DEVICES = Cuda, OpenMP
KOKKOS_ARCH = Kepler35
make yes-kokkos
-make machine
-Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda -a file kokkos_cuda :pre
+make machine :pre

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre

Here is a quick overview of how to use the KOKKOS package for the Intel Phi:

use a C++11 compatible compiler
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC
make yes-kokkos
-make machine
-Make.py -p kokkos -kokkos phi -o kokkos_phi -a file mpi :pre
+make machine :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):

mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre

[Required hardware/software:]

Kokkos support within LAMMPS must be built with a C++11 compatible compiler. If using gcc, version 4.7.2 or later is required.

To build with Kokkos support for CPUs, your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.

To build with Kokkos support for NVIDIA GPUs, NVIDIA Cuda software version 7.5 or later must be installed on your system. See the discussion for the "GPU"_accelerate_gpu.html package for details of how to check and do this.

NOTE: For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later). The Kokkos library exploits texture cache options not supported by Tesla generation GPUs (or older).

To build with Kokkos support for Intel Xeon Phi coprocessors, your system must be configured to use them in "native" mode, not "offload" mode like the USER-INTEL package supports.

[Building LAMMPS with the KOKKOS package:]

You must choose at build time whether to build for CPUs (OpenMP), GPUs, or Phi.
-You can do any of these in one line, using the src/Make.py script,
-described in "Section 4"_Section_packages.html of the manual.
-Type "Make.py -h" for help.  If run from the src directory, these
+You can do any of these in one line, using the suitable make command
+line flags as described in "Section 4"_Section_packages.html of the
+manual.  If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and lmp_kokkos_phi. Note that the OMP and PHI options use src/MAKE/Makefile.mpi as the starting Makefile.machine. The CUDA option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda.

Enabling the package and using its styles (the last two steps in the overview above) can be done using the "-k on", "-pk kokkos", and "-sf kk" "command-line switches"_Section_start.html#start_6. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package kokkos"_package.html or "suffix kk"_suffix.html commands respectively to your input script.

Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

CPU-only (only MPI, no threading):

cd lammps/src
make yes-kokkos
make kokkos_mpi :pre

Intel Xeon Phi (Intel Compiler, Intel MPI):

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

CPUs and GPUs (with MPICH):

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpich :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the make command line, which requires a GNU-compatible make command. Try "gmake" if your system's standard make complains.

NOTE: If you build using make command-line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a "make clean-all" or "make clean-machine" before each build. This is to force all the KOKKOS-dependent files to be re-compiled with the new options.

NOTE: Currently, there are no precision options with the KOKKOS package. All compilation and computation is performed in double precision.

There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in Makefile.machine. This is the full list of options, including those discussed above. Each takes a value shown below. The default value is listed, which is set in the lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul

KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE directory must use "nvcc" as its compiler, via its CC setting. For best performance its CCFLAGS setting should use -O3 and have a KOKKOS_ARCH setting that matches the compute capability of your NVIDIA hardware and software installation, e.g. KOKKOS_ARCH=Kepler30.
Note that the minimal required compute capability is 2.0, but this will give significantly reduced performance compared to Kepler generation GPUs with compute capability 3.x. For the LINK setting, "nvcc" should not be used; instead use g++ or another compiler suitable for linking C++ applications. Often you will want to use your MPI compiler wrapper for this setting (e.g. mpicxx). Finally, the lo-level Makefile must also have a "Compilation rule" for creating *.o files from *.cu files. See src/Makefile.cuda for an example of a lo-level Makefile with all of these settings.

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be used if running with KOKKOS_DEVICES=Pthreads. It is not necessary for KOKKOS_DEVICES=OpenMP, because OpenMP provides alternative methods via environment variables for binding threads to hardware cores. More info on binding threads to cores is given in "Section 5.3"_Section_accelerate.html#acc_3.

KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an Intel Phi processor.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism on most Unix platforms. This library is not available on all platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time debugging information and runtime bounds checking on Kokkos data structures.

KOKKOS_CUDA_OPTIONS are additional options for CUDA. For more information on Kokkos see the Kokkos programmers' guide here: /lib/kokkos/doc/Kokkos_PG.pdf.

[Run with the KOKKOS package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many OpenMP threads per MPI task will be used (via the "-k" command-line switch discussed below). Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi coprocessor support, you need to ensure there are one or more MPI tasks per coprocessor, and choose the number of coprocessor threads to use per MPI task (via the "-k" command-line switch discussed below). The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coprocessor is designed to run, otherwise performance will suffer. This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. Note that with the KOKKOS package you do not need to specify how many Phi coprocessors there are per node; each coprocessor is simply treated as running some number of MPI tasks.

You must use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package. It takes additional arguments for hardware settings appropriate to your system. Those arguments are "documented here"_Section_start.html#start_6. The two most commonly used options are:

-k on t Nt g Ng :pre

The "t Nt" option applies to host=OMP (even if device=CUDA) and host=MIC.
For host=OMP, it specifies how many OpenMP threads per MPI task to use within a node. For host=MIC, it specifies how many Xeon Phi threads per MPI task to use within a node. The default is Nt = 1. Note that for host=OMP this is effectively MPI-only mode, which may be fine. But for host=MIC you will typically end up using far fewer than all the 240 available threads, which could give very poor performance.

The "g Ng" option applies to device=CUDA. It specifies how many GPUs per compute node to use. The default is 1, so this only needs to be specified if you have 2 or more GPUs per compute node.

The "-k on" switch also issues a "package kokkos" command (with no additional arguments) which sets various KOKKOS options to default values, as discussed on the "package"_package.html command doc page.

Use the "-sf kk" "command-line switch"_Section_start.html#start_6, which will automatically append "kk" to styles that support it. Use the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if you wish to change any of the default "package kokkos"_package.html options set by the "-k on" "command-line switch"_Section_start.html#start_6.

Note that the default for the "package kokkos"_package.html command is to use "full" neighbor lists and set the Newton flag to "off" for both pairwise and bonded interactions. This typically gives the fastest performance. If the "newton"_newton.html command is used in the input script, it can override the Newton flag defaults.

However, when running in MPI-only mode with 1 thread per MPI task, it will typically be faster to use "half" neighbor lists and set the Newton flag to "on", just as is the case for non-accelerated pair styles. You can do this with the "-pk" "command-line switch"_Section_start.html#start_6.

[Or run with the KOKKOS package by editing an input script:]

The discussion above for the mpirun/mpiexec command and setting appropriate thread and GPU values for host=OMP or host=MIC or device=CUDA is the same.

You must still use the "-k on" "command-line switch"_Section_start.html#start_6 to enable the KOKKOS package, and specify its additional arguments for hardware options appropriate to your system, as documented above.

Use the "suffix kk"_suffix.html command, or you can explicitly add a "kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you wish to change any of its option defaults, as set by the "-k on" "command-line switch"_Section_start.html#start_6. A minimal input-script sketch is given after the speed-up notes below.

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enabled styles are used, and the problem size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task, performance of a KOKKOS style is somewhere between the standard (un-accelerated) styles (MPI-only mode) and those provided by the USER-OMP package. However, the difference among all 3 is small (less than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task, performance of a KOKKOS style is a bit slower than the USER-OMP package. :l

When running a large number of atoms per GPU, KOKKOS is typically faster than the GPU package. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as the USER-INTEL package, which is optimized for that hardware. :l
:ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the KOKKOS package on different hardware.
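Before turning to hardware-specific guidelines, here is a minimal input-script sketch combining the commands discussed above for an MPI-only run. It is illustrative only: the {neigh} and {newton} keyword spellings are assumptions to check against the "package kokkos"_package.html doc page, and the "-k on" "command-line switch"_Section_start.html#start_6 is still required on the command line.

package kokkos neigh half newton on   # half lists + Newton on, per the MPI-only advice above (assumed keywords)
suffix kk                             # append "kk" to all styles that support it
pair_style lj/cut 2.5                 # runs as lj/cut/kk because of the suffix command :pre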
[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command. See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the "-k" "command-line switch"_Section_start.html#start_6. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, Intel 12 or later), setting the environment variable OMP_PROC_BIND=true should be sufficient. For binding threads with the KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option (see "this section"_Section_packages.html#KOKKOS of the manual for details).

[Running on GPUs:]

Ensure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see "this section"_Section_packages.html#KOKKOS of the manual for details).

The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_6 to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use fewer than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

NOTE: When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for "thermo"_thermo_style.html or "dump"_dump.html output will cause data to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.
You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes. The latter ensures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded for running the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.

Examples of mpirun commands that follow these rules are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute node should be equal to the number of GPUs per compute node. In the future Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models. Specifically, Kokkos requires extensive C++ support from the backend kernel language. This is expected to change in the future.

diff --git a/doc/src/accelerate_omp.txt b/doc/src/accelerate_omp.txt
index 81b7a5adc..fa7bef1a5 100644
--- a/doc/src/accelerate_omp.txt
+++ b/doc/src/accelerate_omp.txt
@@ -1,187 +1,183 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section 5 overview"_Section_accelerate.html

5.3.4 USER-OMP package :h5

The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.

Here is a quick overview of how to use the USER-OMP package, assuming one or more 16-core nodes. More details follow.

use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine
make yes-user-omp
make mpi    # build with USER-OMP package, if settings added to Makefile.mpi
-make omp    # or Makefile.omp already has settings
-Make.py -v -p omp -o mpi -a file mpi   # or one-line build via Make.py :pre
+make omp    # or Makefile.omp already has settings :pre

lmp_mpi -sf omp -pk omp 16 < in.script                         # 1 MPI task, 16 threads
mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script           # 4 MPI tasks, 4 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

[Required hardware/software:]

Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.
[Building LAMMPS with the USER-OMP package:]

The lines above illustrate how to include/build with the USER-OMP
package in two steps, using the "make" command. Or how to do it with
-one command via the src/Make.py script, described in "Section
-4"_Section_packages.html of the manual.  Type "Make.py -h" for
-help.
+one command as described in "Section 4"_Section_packages.html of the manual.

Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
include "-fopenmp". Likewise, if you use an Intel compiler, the
-CCFLAGS setting must include "-restrict".  The Make.py command will
-add these automatically.
+CCFLAGS setting must include "-restrict".

[Run with the USER-OMP package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

You need to choose how many OpenMP threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

As in the lines above, use the "-sf omp" "command-line switch"_Section_start.html#start_6, which will automatically append "omp" to styles that support it. The "-sf omp" switch also issues a default "package omp 0"_package.html command, which will set the number of threads per MPI task via the OMP_NUM_THREADS environment variable.

You can also use the "-pk omp Nt" "command-line switch"_Section_start.html#start_6 to explicitly set Nt = # of OpenMP threads per MPI task to use, as well as additional options. Its syntax is the same as the "package omp"_package.html command whose doc page gives details, including the default values used if it is not specified. It also gives more details on how to set the number of threads via the OMP_NUM_THREADS environment variable.

[Or run with the USER-OMP package by editing an input script:]

The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same.

Use the "suffix omp"_suffix.html command, or you can explicitly add an "omp" suffix to individual styles in your input script, e.g.

pair_style lj/cut/omp 2.5 :pre

You must also use the "package omp"_package.html command to enable the USER-OMP package. When you do this you also specify how many threads per MPI task to use. The command doc page explains other options and how to set the number of threads via the OMP_NUM_THREADS environment variable. A minimal input-script sketch is given after the speed-up notes below.

[Speed-ups to expect:]

Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run.

You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its un-accelerated styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, described in "Section 5.3.5"_accelerate_opt.html.

With multiple threads/task, the optimal choice of number of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.
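Tying these pieces together, here is a minimal input-script sketch for the USER-OMP package. The thread count of 4 is an arbitrary example and should be matched to your hardware, following the guidelines in the next sub-section.

package omp 4           # 4 OpenMP threads per MPI task (example value)
suffix omp              # append "omp" to all styles that support it
pair_style lj/cut 2.5   # runs as lj/cut/omp because of the suffix command :pre

Alternatively, "package omp 0" takes the thread count from the OMP_NUM_THREADS environment variable, as described on the "package"_package.html command doc page.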
A description of the multi-threading strategy used in the USER-OMP package and some performance examples are "presented here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1

[Guidelines for best performance:]

For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.

Using multiple threads/task can be more effective under the following circumstances:

Individual compute nodes have a significant number of CPU cores but the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx (Clovertown) and 54xx (Harpertown) quad-core processors. Running one MPI task per CPU core will result in significant performance degradation, so that running with 4 or even only 2 MPI tasks per node is faster. Running in hybrid MPI+OpenMP mode will reduce the inter-node communication bandwidth contention in the same way, but offers an additional speedup by utilizing the otherwise idle CPU cores. :ulb,l

The interconnect used for MPI communication does not provide sufficient bandwidth for a large number of MPI tasks per node. For example, this applies to running over gigabit ethernet or on Cray XT4 or XT5 series supercomputers. As in the aforementioned case, this effect worsens when using an increasing number of nodes. :l

The system has a spatially inhomogeneous particle density which does not map well to the "domain decomposition scheme"_processors.html or "load-balancing"_balance.html options that LAMMPS provides. This is because multi-threading achieves parallelism over the number of particles, not via their distribution in space. :l

A machine is being used in "capability mode", i.e. near the point where MPI parallelism is maxed out. For example, this can happen when using the "PPPM solver"_kspace_style.html for long-range electrostatics on large numbers of nodes. The scaling of the KSpace calculation (see the "kspace_style"_kspace_style.html command) becomes the performance-limiting factor. Using multi-threading allows fewer MPI tasks to be invoked and can speed-up the long-range solver, while increasing overall performance by parallelizing the pairwise and bonded calculations via OpenMP. Likewise, additional speedup can sometimes be achieved by increasing the length of the Coulombic cutoff and thus reducing the work done by the long-range solver. Using the "run_style verlet/split"_run_style.html command, which is compatible with the USER-OMP package, is an alternative way to reduce the number of MPI tasks assigned to the KSpace calculation. :l
:ule

Additional performance tips are as follows:

The best parallel efficiency from {omp} styles is typically achieved when there is at least one MPI task per physical CPU chip, i.e. socket or die. :ulb,l

It is usually most efficient to restrict threading to a single socket, i.e. use one or more MPI tasks per socket. :l

NOTE: By default, several current MPI implementations use a processor affinity setting that restricts each MPI task to a single CPU core. Using multi-threading in this mode will force all threads to share the one core and thus is likely to be counterproductive. Instead, binding MPI tasks to a (multi-core) socket should solve this issue. An example launch line is sketched after the restrictions below. :l
:ule

[Restrictions:]

None.
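As a closing illustration of the socket-binding guidance above, here is one possible way to launch a hybrid MPI+OpenMP run on a dual-socket node with 8 cores per socket. The flag spellings follow the OpenMPI 1.8 example given in the KOKKOS section and are assumptions to verify against your MPI installation and shell.

export OMP_PROC_BIND=true        # keep OpenMP threads from migrating (bash syntax assumed)
mpirun -np 2 -bind-to socket -map-by socket lmp_mpi -sf omp -pk omp 8 -in in.script   # 1 MPI task per socket, 8 threads/task :pre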
diff --git a/doc/src/accelerate_opt.txt b/doc/src/accelerate_opt.txt
index 5a2a5eac0..845264b52 100644
--- a/doc/src/accelerate_opt.txt
+++ b/doc/src/accelerate_opt.txt
@@ -1,71 +1,67 @@
"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.5 OPT package :h5

The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.

Here is a quick overview of how to use the OPT package. More details follow.

make yes-opt
-make mpi    # build with the OPT package
-Make.py -v -p opt -o mpi -a file mpi   # or one-line build via Make.py :pre
+make mpi    # build with the OPT package :pre

lmp_mpi -sf opt -in in.script                # run in serial
mpirun -np 4 lmp_mpi -sf opt -in in.script   # run in parallel :pre

[Required hardware/software:]

None.

[Building LAMMPS with the OPT package:]

The lines above illustrate how to build LAMMPS with the OPT package in
two steps, using the "make" command. Or how to do it with one command
-via the src/Make.py script, described in "Section
-4"_Section_packages.html of the manual.  Type "Make.py -h" for
-help.
+as described in "Section 4"_Section_packages.html of the manual.

Note that if you use an Intel compiler to build with the OPT package,
the CCFLAGS setting in your Makefile.machine must include "-restrict".
-The Make.py command will add this automatically.

[Run with the OPT package from the command line:]

As in the lines above, use the "-sf opt" "command-line switch"_Section_start.html#start_6, which will automatically append "opt" to styles that support it.

[Or run with the OPT package by editing an input script:]

Use the "suffix opt"_suffix.html command, or you can explicitly add an "opt" suffix to individual styles in your input script, e.g.

pair_style lj/cut/opt 2.5 :pre

[Speed-ups to expect:]

You should see a reduction in the "Pair time" value printed at the end of a run. On most machines for reasonable problem sizes, it will be a 5 to 20% savings.

[Guidelines for best performance:]

Just try out an OPT pair style to see how it performs.

[Restrictions:]

None.