diff --git a/doc/accelerate_cuda.html b/doc/accelerate_cuda.html index f1d45f093..65183cff2 100644 --- a/doc/accelerate_cuda.html +++ b/doc/accelerate_cuda.html @@ -1,372 +1,372 @@ <!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>5.USER-CUDA package — LAMMPS 15 May 2015 version documentation</title> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" /> <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> <script src="_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-nav-search"> <a href="Manual.html" class="icon icon-home"> LAMMPS </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. 
Future and history</a></li> </ul> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="Manual.html">LAMMPS</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="Manual.html">Docs</a> »</li> <li>5.USER-CUDA package</li> <li class="wy-breadcrumbs-aside"> <a href="http://lammps.sandia.gov">Website</a> <a href="Section_commands.html#comm">Commands</a> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p> <div class="section" id="user-cuda-package"> <h1>5.USER-CUDA package<a class="headerlink" href="#user-cuda-package" title="Permalink to this headline">¶</a></h1> <p>The USER-CUDA package was developed by Christian Trott (Sandia) while at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and for long-range Coulombics via the PPPM command. It has the following general features:</p> <ul class="simple"> <li>The package is designed to allow an entire LAMMPS calculation, for many timesteps, to run entirely on the GPU (except for inter-processor MPI communication), so that atom-based data (e.g. coordinates, forces) do not have to move back-and-forth between the CPU and GPU.</li> <li>The speed-up advantage of this approach is typically better when the number of atoms per GPU is large</li> <li>Data will stay on the GPU until a timestep where a non-USER-CUDA fix or compute is invoked. Whenever a non-GPU operation occurs (fix, compute, output), data automatically moves back to the CPU as needed. This may incur a performance penalty, but should otherwise work transparently.</li> <li>Neighbor lists are constructed on the GPU.</li> <li>The package only supports use of a single MPI task, running on a single CPU (core), assigned to each GPU.</li> </ul> <p>Here is a quick overview of how to use the USER-CUDA package:</p> <ul class="simple"> <li>build the library in lib/cuda for your GPU hardware with desired precision</li> <li>include the USER-CUDA package and build LAMMPS</li> <li>use the mpirun command to specify 1 MPI task per GPU (on each node)</li> <li>enable the USER-CUDA package via the “-c on” command-line switch</li> <li>specify the # of GPUs per node</li> <li>use USER-CUDA styles in your input script</li> </ul> <p>The latter two steps can be done using the “-pk cuda” and “-sf cuda” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively. Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the <a class="reference internal" href="package.html"><em>package cuda</em></a> or <a class="reference internal" href="suffix.html"><em>suffix cuda</em></a> commands respectively to your input script.</p> <p><strong>Required hardware/software:</strong></p> <p>To use this package, you need to have one or more NVIDIA GPUs and install the NVIDIA Cuda software on your system:</p> <p>Your NVIDIA GPU needs to support Compute Capability 1.3. 
This list may help you to find out the Compute Capability of your card:</p> <p><a class="reference external" href="http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units">http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units</a></p> <p>Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the corresponding GPU drivers. The Nvidia Cuda SDK is not required, but we recommend it also be installed. You can then make sure its sample projects can be compiled without problems.</p> <p><strong>Building LAMMPS with the USER-CUDA package:</strong></p> <p>This requires two steps (a,b): build the USER-CUDA library, then build LAMMPS with the USER-CUDA package.</p> <p>You can do both these steps in one line, using the src/Make.py script, described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual. Type “Make.py -h” for help. If run from the src directory, this command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the starting Makefile.machine:</p> -<div class="highlight-python"><div class="highlight"><pre>Make.py -p cuda -cuda mode=single arch=20 -o cuda lib-cuda file mpi +<div class="highlight-python"><div class="highlight"><pre>Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi </pre></div> </div> <p>Or you can follow these two (a,b) steps:</p> <ol class="loweralpha simple"> <li>Build the USER-CUDA library</li> </ol> <p>The USER-CUDA library is in lammps/lib/cuda. If your <em>CUDA</em> toolkit is not installed in the default system directoy <em>/usr/local/cuda</em> edit the file <em>lib/cuda/Makefile.common</em> accordingly.</p> <p>To build the library with the settings in lib/cuda/Makefile.default, simply type:</p> <div class="highlight-python"><div class="highlight"><pre><span class="n">make</span> </pre></div> </div> <p>To set options when the library is built, type “make OPTIONS”, where <em>OPTIONS</em> are one or more of the following. The settings will be written to the <em>lib/cuda/Makefile.defaults</em> before the build.</p> <pre class="literal-block"> <em>precision=N</em> to set the precision level N = 1 for single precision (default) N = 2 for double precision N = 3 for positions in double precision N = 4 for positions and velocities in double precision <em>arch=M</em> to set GPU compute capability M = 35 for Kepler GPUs M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default) M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450) M = 13 for CC1.3 (GF200, e.g. C1060, GTX285) <em>prec_timer=0/1</em> to use hi-precision timers 0 = do not use them (default) 1 = use them this is usually only useful for Mac machines <em>dbg=0/1</em> to activate debug mode 0 = no debug mode (default) 1 = yes debug mode this is only useful for developers <em>cufft=1</em> for use of the CUDA FFT library 0 = no CUFFT support (default) in the future other CUDA-enabled FFT libraries might be supported </pre> <p>If the build is successful, it will produce the files liblammpscuda.a and Makefile.lammps.</p> <p>Note that if you change any of the options (like precision), you need to re-build the entire library. 
Do a “make clean” first, followed by “make”.</p> <ol class="loweralpha simple" start="2"> <li>Build LAMMPS with the USER-CUDA package</li> </ol> <div class="highlight-python"><div class="highlight"><pre>cd lammps/src make yes-user-cuda make machine </pre></div> </div> <p>No additional compile/link flags are needed in Makefile.machine.</p> <p>Note that if you change the USER-CUDA library precision (discussed above) and rebuild the USER-CUDA library, then you also need to re-install the USER-CUDA package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new USER-CUDA library.</p> <p><strong>Run with the USER-CUDA package from the command line:</strong></p> <p>The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.</p> <p>When using the USER-CUDA package, you must use exactly one MPI task per physical GPU.</p> <p>You must use the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package. The “-c on” switch also issues a default <a class="reference internal" href="package.html"><em>package cuda 1</em></a> command which sets various USER-CUDA options to default values, as discussed on the <a class="reference internal" href="package.html"><em>package</em></a> command doc page.</p> <p>Use the “-sf cuda” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, which will automatically append “cuda” to styles that support it. Use the “-pk cuda Ng” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to set Ng = # of GPUs per node to a different value than the default set by the “-c on” switch (1 GPU) or change other <a class="reference internal" href="package.html"><em>package cuda</em></a> options.</p> <div class="highlight-python"><div class="highlight"><pre>lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes </pre></div> </div> <p>The syntax for the “-pk” switch is the same as same as the “package cuda” command. See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details, including the default values used for all its options if it is not specified.</p> <p>Note that the default for the <a class="reference internal" href="package.html"><em>package cuda</em></a> command is to set the Newton flag to “off” for both pairwise and bonded interactions. This typically gives fastest performance. 
If the <a class="reference internal" href="newton.html"><em>newton</em></a> command is used in the input script, it can override these defaults.</p> <p><strong>Or run with the USER-CUDA package by editing an input script:</strong></p> <p>The discussion above for the mpirun/mpiexec command and the requirement of one MPI task per GPU is the same.</p> <p>You must still use the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to enable the USER-CUDA package.</p> <p>Use the <a class="reference internal" href="suffix.html"><em>suffix cuda</em></a> command, or you can explicitly add a “cuda” suffix to individual styles in your input script, e.g.</p> <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/cuda 2.5 </pre></div> </div> <p>You only need to use the <a class="reference internal" href="package.html"><em>package cuda</em></a> command if you wish to change any of its option defaults, including the number of GPUs/node (default = 1), as set by the “-c on” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>.</p> <p><strong>Speed-ups to expect:</strong></p> <p>The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).</p> <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the LAMMPS web site for performance of the USER-CUDA package on different hardware.</p> <p><strong>Guidelines for best performance:</strong></p> <ul class="simple"> <li>The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of 1000s.</li> <li>As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation will run.</li> </ul> <div class="section" id="restrictions"> <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2> <p>None.</p> </div> </div> </div> </div> <footer> <hr/> <div role="contentinfo"> <p> © Copyright . </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
</footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'./', VERSION:'15 May 2015 version', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script> <script type="text/javascript" src="_static/js/theme.js"></script> <script type="text/javascript"> jQuery(function () { SphinxRtdTheme.StickyNav.enable(); }); </script> </body> </html> \ No newline at end of file diff --git a/doc/accelerate_cuda.txt b/doc/accelerate_cuda.txt index 43b4d660d..5a6ca4925 100644 --- a/doc/accelerate_cuda.txt +++ b/doc/accelerate_cuda.txt @@ -1,223 +1,223 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.1 USER-CUDA package :h4 The USER-CUDA package was developed by Christian Trott (Sandia) while at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and for long-range Coulombics via the PPPM command. It has the following general features: The package is designed to allow an entire LAMMPS calculation, for many timesteps, to run entirely on the GPU (except for inter-processor MPI communication), so that atom-based data (e.g. coordinates, forces) do not have to move back-and-forth between the CPU and GPU. :ulb,l The speed-up advantage of this approach is typically better when the number of atoms per GPU is large :l Data will stay on the GPU until a timestep where a non-USER-CUDA fix or compute is invoked. Whenever a non-GPU operation occurs (fix, compute, output), data automatically moves back to the CPU as needed. This may incur a performance penalty, but should otherwise work transparently. :l Neighbor lists are constructed on the GPU. :l The package only supports use of a single MPI task, running on a single CPU (core), assigned to each GPU. :l,ule Here is a quick overview of how to use the USER-CUDA package: build the library in lib/cuda for your GPU hardware with desired precision include the USER-CUDA package and build LAMMPS use the mpirun command to specify 1 MPI task per GPU (on each node) enable the USER-CUDA package via the "-c on" command-line switch specify the # of GPUs per node use USER-CUDA styles in your input script :ul The latter two steps can be done using the "-pk cuda" and "-sf cuda" "command-line switches"_Section_start.html#start_7 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package cuda"_package.html or "suffix cuda"_suffix.html commands respectively to your input script. 
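For example (this is only an illustrative sketch; the script name and GPU count are placeholders), launching with

mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script :pre

has roughly the same effect as keeping only the "-c on" switch on the command line and beginning in.script with

package cuda 2
suffix cuda :pre

In both cases exactly one MPI task must be assigned to each of the 2 GPUs, as noted above.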
[Required hardware/software:] To use this package, you need to have one or more NVIDIA GPUs and install the NVIDIA Cuda software on your system: Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card: http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the corresponding GPU drivers. The Nvidia Cuda SDK is not required, but we recommend it also be installed. You can then make sure its sample projects can be compiled without problems. [Building LAMMPS with the USER-CUDA package:] This requires two steps (a,b): build the USER-CUDA library, then build LAMMPS with the USER-CUDA package. You can do both these steps in one line, using the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for help. If run from the src directory, this command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the starting Makefile.machine: -Make.py -p cuda -cuda mode=single arch=20 -o cuda lib-cuda file mpi :pre +Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi :pre Or you can follow these two (a,b) steps: (a) Build the USER-CUDA library The USER-CUDA library is in lammps/lib/cuda. If your {CUDA} toolkit is not installed in the default system directoy {/usr/local/cuda} edit the file {lib/cuda/Makefile.common} accordingly. To build the library with the settings in lib/cuda/Makefile.default, simply type: make :pre To set options when the library is built, type "make OPTIONS", where {OPTIONS} are one or more of the following. The settings will be written to the {lib/cuda/Makefile.defaults} before the build. {precision=N} to set the precision level N = 1 for single precision (default) N = 2 for double precision N = 3 for positions in double precision N = 4 for positions and velocities in double precision {arch=M} to set GPU compute capability M = 35 for Kepler GPUs M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default) M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450) M = 13 for CC1.3 (GF200, e.g. C1060, GTX285) {prec_timer=0/1} to use hi-precision timers 0 = do not use them (default) 1 = use them this is usually only useful for Mac machines {dbg=0/1} to activate debug mode 0 = no debug mode (default) 1 = yes debug mode this is only useful for developers {cufft=1} for use of the CUDA FFT library 0 = no CUFFT support (default) in the future other CUDA-enabled FFT libraries might be supported :pre If the build is successful, it will produce the files liblammpscuda.a and Makefile.lammps. Note that if you change any of the options (like precision), you need to re-build the entire library. Do a "make clean" first, followed by "make". (b) Build LAMMPS with the USER-CUDA package cd lammps/src make yes-user-cuda make machine :pre No additional compile/link flags are needed in Makefile.machine. Note that if you change the USER-CUDA library precision (discussed above) and rebuild the USER-CUDA library, then you also need to re-install the USER-CUDA package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new USER-CUDA library. [Run with the USER-CUDA package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. 
Ditto for OpenMPI via -np and -npernode. When using the USER-CUDA package, you must use exactly one MPI task per physical GPU. You must use the "-c on" "command-line switch"_Section_start.html#start_7 to enable the USER-CUDA package. The "-c on" switch also issues a default "package cuda 1"_package.html command which sets various USER-CUDA options to default values, as discussed on the "package"_package.html command doc page. Use the "-sf cuda" "command-line switch"_Section_start.html#start_7, which will automatically append "cuda" to styles that support it. Use the "-pk cuda Ng" "command-line switch"_Section_start.html#start_7 to set Ng = # of GPUs per node to a different value than the default set by the "-c on" switch (1 GPU) or change other "package cuda"_package.html options. lmp_machine -c on -sf cuda -pk cuda 1 -in in.script # 1 MPI task uses 1 GPU mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes :pre The syntax for the "-pk" switch is the same as same as the "package cuda" command. See the "package"_package.html command doc page for details, including the default values used for all its options if it is not specified. Note that the default for the "package cuda"_package.html command is to set the Newton flag to "off" for both pairwise and bonded interactions. This typically gives fastest performance. If the "newton"_newton.html command is used in the input script, it can override these defaults. [Or run with the USER-CUDA package by editing an input script:] The discussion above for the mpirun/mpiexec command and the requirement of one MPI task per GPU is the same. You must still use the "-c on" "command-line switch"_Section_start.html#start_7 to enable the USER-CUDA package. Use the "suffix cuda"_suffix.html command, or you can explicitly add a "cuda" suffix to individual styles in your input script, e.g. pair_style lj/cut/cuda 2.5 :pre You only need to use the "package cuda"_package.html command if you wish to change any of its option defaults, including the number of GPUs/node (default = 1), as set by the "-c on" "command-line switch"_Section_start.html#start_7. [Speed-ups to expect:] The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed). See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the USER-CUDA package on different hardware. [Guidelines for best performance:] The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of 1000s. :ulb,l As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation will run. :l,ule [Restrictions:] None. 
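As a hypothetical illustration of the output guideline in the performance section above (the style, group, file name, and intervals below are placeholders, not recommendations), an input-script fragment that keeps atom data on the GPU for long stretches simply requests relatively infrequent output:

suffix cuda
thermo 1000
dump 1 all atom 5000 dump.lammpstrj
run 50000 :pre

Each thermo step, dump snapshot, or restart file forces atom data back to the CPU, so larger output intervals mean fewer CPU-GPU transfers.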
diff --git a/doc/accelerate_gpu.html b/doc/accelerate_gpu.html index 95edb9fd7..eddf4e064 100644 --- a/doc/accelerate_gpu.html +++ b/doc/accelerate_gpu.html @@ -1,401 +1,401 @@ <!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>5.GPU package — LAMMPS 15 May 2015 version documentation</title> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" /> <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> <script src="_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-nav-search"> <a href="Manual.html" class="icon icon-home"> LAMMPS </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. 
Future and history</a></li> </ul> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="Manual.html">LAMMPS</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="Manual.html">Docs</a> »</li> <li>5.GPU package</li> <li class="wy-breadcrumbs-aside"> <a href="http://lammps.sandia.gov">Website</a> <a href="Section_commands.html#comm">Commands</a> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p> <div class="section" id="gpu-package"> <h1>5.GPU package<a class="headerlink" href="#gpu-package" title="Permalink to this headline">¶</a></h1> <p>The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and for <a class="reference internal" href="kspace_style.html"><em>kspace_style pppm</em></a> for long-range Coulombics. It has the following general features:</p> <ul class="simple"> <li>It is designed to exploit common GPU hardware configurations where one or more GPUs are coupled to many cores of one or more multi-core CPUs, e.g. within a node of a parallel machine.</li> <li>Atom-based data (e.g. coordinates, forces) moves back-and-forth between the CPU(s) and GPU every timestep.</li> <li>Neighbor lists can be built on the CPU or on the GPU</li> <li>The charge assignement and force interpolation portions of PPPM can be run on the GPU. The FFT portion, which requires MPI communication between processors, runs on the CPU.</li> <li>Asynchronous force computations can be performed simultaneously on the CPU(s) and GPU.</li> <li>It allows for GPU computations to be performed in single or double precision, or in mixed-mode precision, where pairwise forces are computed in single precision, but accumulated into double-precision force vectors.</li> <li>LAMMPS-specific code is in the GPU package. It makes calls to a generic GPU library in the lib/gpu directory. This library provides NVIDIA support as well as more general OpenCL support, so that the same functionality can eventually be supported on a variety of GPU hardware.</li> </ul> <p>Here is a quick overview of how to use the GPU package:</p> <ul class="simple"> <li>build the library in lib/gpu for your GPU hardware wity desired precision</li> <li>include the GPU package and build LAMMPS</li> <li>use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU</li> <li>specify the # of GPUs per node</li> <li>use GPU styles in your input script</li> </ul> <p>The latter two steps can be done using the “-pk gpu” and “-sf gpu” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively. 
Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the <a class="reference internal" href="package.html"><em>package gpu</em></a> or <a class="reference internal" href="suffix.html"><em>suffix gpu</em></a> commands respectively to your input script.</p> <p><strong>Required hardware/software:</strong></p> <p>To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system:</p> <ul class="simple"> <li>Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information</li> <li>Go to <a class="reference external" href="http://www.nvidia.com/object/cuda_get.html">http://www.nvidia.com/object/cuda_get.html</a></li> <li>Install a driver and toolkit appropriate for your system (SDK is not necessary)</li> <li>Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties</li> </ul> <p><strong>Building LAMMPS with the GPU package:</strong></p> <p>This requires two steps (a,b): build the GPU library, then build LAMMPS with the GPU package.</p> <p>You can do both these steps in one line, using the src/Make.py script, described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual. Type “Make.py -h” for help. If run from the src directory, this command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the starting Makefile.machine:</p> -<div class="highlight-python"><div class="highlight"><pre>Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi +<div class="highlight-python"><div class="highlight"><pre>Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi </pre></div> </div> <p>Or you can follow these two (a,b) steps:</p> <ol class="loweralpha simple"> <li>Build the GPU library</li> </ol> <p>The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile.</p> <ul class="simple"> <li>CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system</li> <li>CUDA_ARCH = needs to be appropriate to your GPUs</li> <li>CUDA_PREC = precision (double, mixed, single) you desire</li> </ul> <p>See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings:</p> <div class="highlight-python"><div class="highlight"><pre><span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_SINGLE</span> <span class="c"># single precision for all calculations</span> <span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_DOUBLE_DOUBLE</span> <span class="c"># double precision for all calculations</span> <span class="n">CUDA_PREC</span> <span class="o">=</span> <span class="o">-</span><span class="n">D_SINGLE_DOUBLE</span> <span class="c"># accumulation of forces, etc, in double</span> </pre></div> </div> <p>The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings.</p> <p>To build the library, type:</p> <div class="highlight-python"><div class="highlight"><pre>make -f Makefile.machine </pre></div> </div> <p>If successful, it will produce the files libgpu.a and Makefile.lammps.</p> <p>The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. 
Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed.</p> <p>Note that to change the precision of the GPU library, you need to re-build the entire library. Do a “clean” first, e.g. “make -f Makefile.linux clean”, followed by the make command above.</p> <ol class="loweralpha simple" start="2"> <li>Build LAMMPS with the GPU package</li> </ol> <div class="highlight-python"><div class="highlight"><pre>cd lammps/src make yes-gpu make machine </pre></div> </div> <p>No additional compile/link flags are needed in Makefile.machine.</p> <p>Note that if you change the GPU library precision (discussed above) and rebuild the GPU library, then you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library.</p> <p><strong>Run with the GPU package from the command line:</strong></p> <p>The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.</p> <p>When using the GPU package, you cannot assign more than one GPU to a single MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way. Likewise it may be more efficient to use fewer MPI tasks/node than the available # of CPU cores. Assignment of multiple MPI tasks to a GPU will happen automatically if you create more MPI tasks/node than there are GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be shared by 4 MPI tasks.</p> <p>Use the “-sf gpu” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, which will automatically append “gpu” to styles that support it. Use the “-pk gpu Ng” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to set Ng = # of GPUs/node to use.</p> <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes </pre></div> </div> <p>Note that if the “-sf gpu” switch is used, it also issues a default <a class="reference internal" href="package.html"><em>package gpu 1</em></a> command, which sets the number of GPUs/node to 1.</p> <p>Using the “-pk” switch explicitly allows for setting of the number of GPUs/node to use and additional options. Its syntax is the same as the “package gpu” command. See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details, including the default values used for all its options if it is not specified.</p> <p>Note that the default for the <a class="reference internal" href="package.html"><em>package gpu</em></a> command is to set the Newton flag to “off” for pairwise interactions. It does not affect the setting for bonded interactions (LAMMPS default is “on”). 
The “off” setting for pairwise interaction is currently required for GPU package pair styles.</p> <p><strong>Or run with the GPU package by editing an input script:</strong></p> <p>The discussion above for the mpirun/mpiexec command, MPI tasks/node, and use of multiple MPI tasks/GPU is the same.</p> <p>Use the <a class="reference internal" href="suffix.html"><em>suffix gpu</em></a> command, or you can explicitly add a “gpu” suffix to individual styles in your input script, e.g.</p> <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/gpu 2.5 </pre></div> </div> <p>You must also use the <a class="reference internal" href="package.html"><em>package gpu</em></a> command to enable the GPU package, unless the “-sf gpu” or “-pk gpu” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used. It specifies the number of GPUs/node to use, as well as other options.</p> <p><strong>Speed-ups to expect:</strong></p> <p>The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).</p> <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL.</p> <p>You should also experiment with how many MPI tasks per GPU to use to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used. Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster.</p> <p><strong>Guidelines for best performance:</strong></p> <ul class="simple"> <li>Using multiple MPI tasks per GPU will often give the best performance, as allowed by most multi-core CPU/GPU configurations.</li> <li>If the number of particles per MPI task is small (e.g. 100s of particles), it can be more efficient to run with fewer MPI tasks per GPU, even if you do not use all the cores on the compute node.</li> <li>The <a class="reference internal" href="package.html"><em>package gpu</em></a> command has several options for tuning performance. Neighbor lists can be built on the GPU or CPU. Force calculations can be dynamically balanced across the CPU cores and GPUs. GPU-specific settings can be made which can be optimized for different hardware. See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details.</li> <li>As described by the <a class="reference internal" href="package.html"><em>package gpu</em></a> command, GPU accelerated pair styles can perform computations asynchronously with CPU computations. The “Pair” time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. 
Any time spent for GPU-enabled pair styles for computations that run simultaneously with <a class="reference internal" href="bond_style.html"><em>bond</em></a>, <a class="reference internal" href="angle_style.html"><em>angle</em></a>, <a class="reference internal" href="dihedral_style.html"><em>dihedral</em></a>, <a class="reference internal" href="improper_style.html"><em>improper</em></a>, and <a class="reference internal" href="kspace_style.html"><em>long-range</em></a> calculations will not be included in the “Pair” time.</li> <li>When the <em>mode</em> setting for the package gpu command is force/neigh, the time for neighbor list calculations on the GPU will be added into the “Pair” time, not the “Neigh” time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc) are output only with the LAMMPS screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations.</li> <li>The output section “GPU Time Info (average)” reports “Max Mem / Proc”. This is the maximum memory used at one time on the GPU for data storage by a single MPI process.</li> </ul> <div class="section" id="restrictions"> <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2> <p>None.</p> </div> </div> </div> </div> <footer> <hr/> <div role="contentinfo"> <p> © Copyright . </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. </footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'./', VERSION:'15 May 2015 version', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script> <script type="text/javascript" src="_static/js/theme.js"></script> <script type="text/javascript"> jQuery(function () { SphinxRtdTheme.StickyNav.enable(); }); </script> </body> </html> \ No newline at end of file diff --git a/doc/accelerate_gpu.txt b/doc/accelerate_gpu.txt index bfe9ae7e2..b06e409cd 100644 --- a/doc/accelerate_gpu.txt +++ b/doc/accelerate_gpu.txt @@ -1,252 +1,252 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.2 GPU package :h4 The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). 
It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and for "kspace_style pppm"_kspace_style.html for long-range Coulombics. It has the following general features: It is designed to exploit common GPU hardware configurations where one or more GPUs are coupled to many cores of one or more multi-core CPUs, e.g. within a node of a parallel machine. :ulb,l Atom-based data (e.g. coordinates, forces) moves back-and-forth between the CPU(s) and GPU every timestep. :l Neighbor lists can be built on the CPU or on the GPU :l The charge assignement and force interpolation portions of PPPM can be run on the GPU. The FFT portion, which requires MPI communication between processors, runs on the CPU. :l Asynchronous force computations can be performed simultaneously on the CPU(s) and GPU. :l It allows for GPU computations to be performed in single or double precision, or in mixed-mode precision, where pairwise forces are computed in single precision, but accumulated into double-precision force vectors. :l LAMMPS-specific code is in the GPU package. It makes calls to a generic GPU library in the lib/gpu directory. This library provides NVIDIA support as well as more general OpenCL support, so that the same functionality can eventually be supported on a variety of GPU hardware. :l,ule Here is a quick overview of how to use the GPU package: build the library in lib/gpu for your GPU hardware wity desired precision include the GPU package and build LAMMPS use the mpirun command to set the number of MPI tasks/node which determines the number of MPI tasks/GPU specify the # of GPUs per node use GPU styles in your input script :ul The latter two steps can be done using the "-pk gpu" and "-sf gpu" "command-line switches"_Section_start.html#start_7 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package gpu"_package.html or "suffix gpu"_suffix.html commands respectively to your input script. [Required hardware/software:] To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system: Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/0/information Go to http://www.nvidia.com/object/cuda_get.html Install a driver and toolkit appropriate for your system (SDK is not necessary) Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to list supported devices and properties :ul [Building LAMMPS with the GPU package:] This requires two steps (a,b): build the GPU library, then build LAMMPS with the GPU package. You can do both these steps in one line, using the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for help. If run from the src directory, this command will create src/lmp_gpu using src/MAKE/Makefile.mpi as the starting Makefile.machine: -Make.py -p gpu -gpu mode=single arch=31 -o gpu lib-gpu file mpi :pre +Make.py -p gpu -gpu mode=single arch=31 -o gpu -a lib-gpu file mpi :pre Or you can follow these two (a,b) steps: (a) Build the GPU library The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile. 
CUDA_HOME = needs to be where NVIDIA Cuda software is installed on your system CUDA_ARCH = needs to be appropriate to your GPUs CUDA_PREC = precision (double, mixed, single) you desire :ul See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings: CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations CUDA_PREC = -D_DOUBLE_DOUBLE # double precision for all calculations CUDA_PREC = -D_SINGLE_DOUBLE # accumulation of forces, etc, in double :pre The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings. To build the library, type: make -f Makefile.machine :pre If successful, it will produce the files libgpu.a and Makefile.lammps. The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed. Note that to change the precision of the GPU library, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", followed by the make command above. (b) Build LAMMPS with the GPU package cd lammps/src make yes-gpu make machine :pre No additional compile/link flags are needed in Makefile.machine. Note that if you change the GPU library precision (discussed above) and rebuild the GPU library, then you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library. [Run with the GPU package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. When using the GPU package, you cannot assign more than one GPU to a single MPI task. However multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way. Likewise it may be more efficient to use fewer MPI tasks/node than the available # of CPU cores. Assignment of multiple MPI tasks to a GPU will happen automatically if you create more MPI tasks/node than there are GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be shared by 4 MPI tasks. Use the "-sf gpu" "command-line switch"_Section_start.html#start_7, which will automatically append "gpu" to styles that support it. Use the "-pk gpu Ng" "command-line switch"_Section_start.html#start_7 to set Ng = # of GPUs/node to use. lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes :pre Note that if the "-sf gpu" switch is used, it also issues a default "package gpu 1"_package.html command, which sets the number of GPUs/node to 1. Using the "-pk" switch explicitly allows for setting of the number of GPUs/node to use and additional options. Its syntax is the same as the "package gpu" command. See the "package"_package.html command doc page for details, including the default values used for all its options if it is not specified. 
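As an illustrative sketch of such options (the {neigh} and {split} keywords and their values are assumptions here and should be checked against the "package"_package.html command doc page), GPU neighbor list builds and dynamic CPU/GPU load balancing could be requested either on the command line,

mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 neigh yes split -1.0 -in in.script :pre

or equivalently by starting the input script with

package gpu 2 neigh yes split -1.0
suffix gpu :pre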
Note that the default for the "package gpu"_package.html command is to set the Newton flag to "off" for pairwise interactions. It does not affect the setting for bonded interactions (LAMMPS default is "on"). The "off" setting for pairwise interaction is currently required for GPU package pair styles. [Or run with the GPU package by editing an input script:] The discussion above for the mpirun/mpiexec command, MPI tasks/node, and use of multiple MPI tasks/GPU is the same. Use the "suffix gpu"_suffix.html command, or you can explicitly add a "gpu" suffix to individual styles in your input script, e.g. pair_style lj/cut/gpu 2.5 :pre You must also use the "package gpu"_package.html command to enable the GPU package, unless the "-sf gpu" or "-pk gpu" "command-line switches"_Section_start.html#start_7 were used. It specifies the number of GPUs/node to use, as well as other options. [Speed-ups to expect:] The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed). See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL. You should also experiment with how many MPI tasks per GPU to use to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used. Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster. [Guidelines for best performance:] Using multiple MPI tasks per GPU will often give the best performance, as allowed by most multi-core CPU/GPU configurations. :ulb,l If the number of particles per MPI task is small (e.g. 100s of particles), it can be more efficient to run with fewer MPI tasks per GPU, even if you do not use all the cores on the compute node. :l The "package gpu"_package.html command has several options for tuning performance. Neighbor lists can be built on the GPU or CPU. Force calculations can be dynamically balanced across the CPU cores and GPUs. GPU-specific settings can be made which can be optimized for different hardware. See the "package"_package.html command doc page for details. :l As described by the "package gpu"_package.html command, GPU accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with "bond"_bond_style.html, "angle"_angle_style.html, "dihedral"_dihedral_style.html, "improper"_improper_style.html, and "long-range"_kspace_style.html calculations will not be included in the "Pair" time. :l When the {mode} setting for the package gpu command is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc) are output only with the LAMMPS screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations. 
:l The output section "GPU Time Info (average)" reports "Max Mem / Proc". This is the maximum memory used at one time on the GPU for data storage by a single MPI process. :l,ule [Restrictions:] None. diff --git a/doc/accelerate_intel.html b/doc/accelerate_intel.html index 984302118..bc04eacac 100644 --- a/doc/accelerate_intel.html +++ b/doc/accelerate_intel.html @@ -1,484 +1,484 @@ <!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>5.USER-INTEL package — LAMMPS 15 May 2015 version documentation</title> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" /> <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> <script src="_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-nav-search"> <a href="Manual.html" class="icon icon-home"> LAMMPS </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. 
Future and history</a></li> </ul> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="Manual.html">LAMMPS</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="Manual.html">Docs</a> »</li> <li>5.USER-INTEL package</li> <li class="wy-breadcrumbs-aside"> <a href="http://lammps.sandia.gov">Website</a> <a href="Section_commands.html#comm">Commands</a> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p> <div class="section" id="user-intel-package"> <h1>5.USER-INTEL package<a class="headerlink" href="#user-intel-package" title="Permalink to this headline">¶</a></h1> <p>The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel(R) Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package). Additionally, it supports running simulations in single, mixed, or double precision with vectorization, even if a coprocessor is not present, i.e. on an Intel(R) CPU. The same C++ code is used for both cases. When offloading to a coprocessor, the routine is run twice, once with an offload flag.</p> <p>The USER-INTEL package can be used in tandem with the USER-OMP package. This is useful when offloading pair style computations to coprocessors, so that other styles not supported by the USER-INTEL package, e.g. bond, angle, dihedral, improper, and long-range electrostatics, can run simultaneously in threaded mode on the CPU cores. 
Since less MPI tasks than CPU cores will typically be invoked when running with coprocessors, this enables the extra CPU cores to be used for useful computation.</p> <p>If LAMMPS is built with both the USER-INTEL and USER-OMP packages intsalled, this mode of operation is made easier to use, because the “-suffix intel” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> or the <a class="reference internal" href="suffix.html"><em>suffix intel</em></a> command will both set a second-choice suffix to “omp” so that styles from the USER-OMP package will be used if available, after first testing if a style from the USER-INTEL package is available.</p> <p>When using the USER-INTEL package, you must choose at build time whether you are building for CPU-only acceleration or for using the Xeon Phi in offload mode.</p> <p>Here is a quick overview of how to use the USER-INTEL package for CPU-only acceleration:</p> <ul class="simple"> <li>specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost</li> <li>specify -openmp with LINKFLAGS in your Makefile.machine</li> <li>include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS</li> <li>specify how many OpenMP threads per MPI task to use</li> <li>use USER-INTEL and (optionally) USER-OMP styles in your input script</li> </ul> <p>Note that many of these settings can only be used with the Intel compiler, as discussed below.</p> <p>Using the USER-INTEL package to offload work to the Intel(R) Xeon Phi(TM) coprocessor is the same except for these additional steps:</p> <ul class="simple"> <li>add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine</li> <li>add the flag -offload to LINKFLAGS in your Makefile.machine</li> </ul> <p>The latter two steps in the first case and the last step in the coprocessor case can be done using the “-pk intel” and “-sf intel” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively. Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the <a class="reference internal" href="package.html"><em>package intel</em></a> or <a class="reference internal" href="suffix.html"><em>suffix intel</em></a> commands respectively to your input script.</p> <p><strong>Required hardware/software:</strong></p> <p>To use the offload option, you must have one or more Intel(R) Xeon Phi(TM) coprocessors and use an Intel(R) C++ compiler.</p> <p>Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in vectorization or give poor performance.</p> <p>Use of an Intel C++ compiler is recommended, but not required (though g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.</p> <p>The recommended version of the Intel(R) compiler is 14.0.1.106. Versions 15.0.1.133 and later are also supported. If using Intel(R) MPI, versions 15.0.2.044 and later are recommended.</p> <p><strong>Building LAMMPS with the USER-INTEL package:</strong></p> <p>You can choose to build with or without support for offload to a Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a coprocessor, the same binary can be used on nodes with and without coprocessors installed. 
However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary.</p> <p>You can do either in one line, using the src/Make.py script, described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual. Type “Make.py -h” for help. If run from the src directory, these commands will create src/lmp_intel_cpu and lmp_intel_phi using src/MAKE/Makefile.mpi as the starting Makefile.machine:</p> -<div class="highlight-python"><div class="highlight"><pre>Make.py -p intel omp -intel cpu -o intel_cpu -cc icc file mpi -Make.py -p intel omp -intel phi -o intel_phi -cc icc file mpi +<div class="highlight-python"><div class="highlight"><pre>Make.py -p intel omp -intel cpu -o intel_cpu -cc icc -a file mpi +Make.py -p intel omp -intel phi -o intel_phi -cc icc -a file mpi </pre></div> </div> <p>Note that this assumes that your MPI and its mpicxx wrapper is using the Intel compiler. If it is not, you should leave off the “-cc icc” switch.</p> <p>Or you can follow these steps:</p> <div class="highlight-python"><div class="highlight"><pre>cd lammps/src make yes-user-intel make yes-user-omp (if desired) make machine </pre></div> </div> <p>Note that if the USER-OMP package is also installed, you can use styles from both packages, as described below.</p> <p>The Makefile.machine needs a “-fopenmp” flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables. You also need to add -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.</p> <p>If you are compiling on the same architecture that will be used for the runs, adding the flag <em>-xHost</em> to CCFLAGS will enable vectorization with the Intel(R) compiler. Otherwise, you must provide the correct compute node architecture to the -x option (e.g. -xAVX).</p> <p>In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag <em>-offload</em> should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.</p> <p>Example makefiles Makefile.intel_cpu and Makefile.intel_phi are included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.</p> <p><strong>Notes on CPU and core affinity:</strong></p> <p>Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host will be set by default on the host when using offload to a coprocessor. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with the <em>no_affinity</em> option to the <a class="reference internal" href="package.html"><em>package intel</em></a> command or by disabling the option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). Disabling this option is not recommended, especially when running on a machine with hyperthreading disabled.</p> <p><strong>Running with the USER-INTEL package from the command line:</strong></p> <p>The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. 
Ditto for OpenMPI via -np and -npernode.</p> <p>If you plan to compute (any portion of) pairwise interactions using USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU, you need to choose how many OpenMP threads per MPI task to use. Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.</p> <p>If LAMMPS was built with coprocessor support for the USER-INTEL package, you also need to specify the number of coprocessor/node and the number of coprocessor threads per MPI task to use. Note that coprocessor threads (which run on the coprocessor) are totally independent from OpenMP threads (which run on the CPU). The default values for the settings that affect coprocessor threads are typically fine, as discussed below.</p> <p>Use the “-sf intel” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, which will automatically append “intel” to styles that support it. If a style does not support it, an “omp” suffix is tried next. OpenMP threads per MPI task can be set via the “-pk intel Nphi omp Nt” or “-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a>, which set Nt = # of OpenMP threads per MPI task to use. The “-pk omp” form is only allowed if LAMMPS was also built with the USER-OMP package.</p> <p>Use the “-pk intel Nphi” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> to set Nphi = # of Xeon Phi(TM) coprocessors/node, if LAMMPS was built with coprocessor support. All the available coprocessor threads on each Phi will be divided among MPI tasks, unless the <em>tptask</em> option of the “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a> is used to limit the coprocessor threads per MPI task. See the <a class="reference internal" href="package.html"><em>package intel</em></a> command for details.</p> <div class="highlight-python"><div class="highlight"><pre>CPU-only without USER-OMP (but using Intel vectorization on CPU): lmp_machine -sf intel -in in.script # 1 MPI task mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 
2 16-core nodes) </pre></div> </div> <div class="highlight-python"><div class="highlight"><pre>CPU-only with USER-OMP (and Intel vectorization on CPU): lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes </pre></div> </div> <div class="highlight-python"><div class="highlight"><pre>CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP: lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task </pre></div> </div> <p>Note that if the “-sf intel” switch is used, it also invokes two default commands: <a class="reference internal" href="package.html"><em>package intel 1</em></a>, followed by <a class="reference internal" href="package.html"><em>package omp 0</em></a>. These both set the number of OpenMP threads per MPI task via the OMP_NUM_THREADS environment variable. The first command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and the precision mode to “mixed”, as one of its option defaults). The latter command is not invoked if LAMMPS was not built with the USER-OMP package. The Nphi = 1 value for the first command is ignored if LAMMPS was not built with coprocessor support.</p> <p>Using the “-pk intel” or “-pk omp” switches explicitly allows for direct setting of the number of OpenMP threads per MPI task, and additional options for either of the USER-INTEL or USER-OMP packages. In particular, the “-pk intel” switch sets the number of coprocessors/node and can limit the number of coprocessor threads per MPI task. The syntax for these two switches is the same as the <a class="reference internal" href="package.html"><em>package omp</em></a> and <a class="reference internal" href="package.html"><em>package intel</em></a> commands. 
See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details, including the default values used for all its options if these switches are not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired.</p> <p><strong>Or run with the USER-INTEL package by editing an input script:</strong></p> <p>The discussion above for the mpirun/mpiexec command, MPI tasks/node, OpenMP threads per MPI task, and coprocessor threads per MPI task is the same.</p> <p>Use the <a class="reference internal" href="suffix.html"><em>suffix intel</em></a> command, or you can explicitly add an “intel” suffix to individual styles in your input script, e.g.</p> <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/intel 2.5 </pre></div> </div> <p>You must also use the <a class="reference internal" href="package.html"><em>package intel</em></a> command, unless the “-sf intel” or “-pk intel” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used. It specifies how many coprocessors/node to use, as well as other OpenMP threading and coprocessor options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired.</p> <p>If LAMMPS was also built with the USER-OMP package, you must also use the <a class="reference internal" href="package.html"><em>package omp</em></a> command to enable that package, unless the “-sf intel” or “-pk omp” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used. It specifies how many OpenMP threads per MPI task to use, as well as other options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired.</p> <p><strong>Speed-ups to expect:</strong></p> <p>If LAMMPS was not built with coprocessor support when including the USER-INTEL package, then acclerated styles will run on the CPU using vectorization optimizations and the specified precision. This may give a substantial speed-up for a pair style, particularly if mixed or single precision is used.</p> <p>If LAMMPS was built with coproccesor support, the pair styles will run on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The performance of a Xeon Phi versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/coprocessor, and the precision used on the coprocessor (double, single, mixed).</p> <p>See the <a class="reference external" href="http://lammps.sandia.gov/bench.html">Benchmark page</a> of the LAMMPS web site for performance of the USER-INTEL package on different hardware.</p> <p><strong>Guidelines for best performance on an Intel(R) Xeon Phi(TM) coprocessor:</strong></p> <ul class="simple"> <li>The default for the <a class="reference internal" href="package.html"><em>package intel</em></a> command is to have all the MPI tasks on a given compute node use a single Xeon Phi(TM) coprocessor. In general, running with a large number of MPI tasks on each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads available on the coprocessor. For example, if your card has 61 cores, with 60 cores available for offload and 4 hardware threads per core (240 total threads), running with 24 MPI tasks per node will cause each MPI task to use a subset of 10 threads on the coprocessor. 
Fine tuning of the number of threads to use per MPI task or the number of threads to use per core can be accomplished with keyword settings of the <a class="reference internal" href="package.html"><em>package intel</em></a> command.</li> <li>If desired, only a fraction of the pair style computation can be offloaded to the coprocessors. This is accomplished by using the <em>balance</em> keyword in the <a class="reference internal" href="package.html"><em>package intel</em></a> command. A balance of 0 runs all calculations on the CPU. A balance of 1 runs all calculations on the coprocessor. A balance of 0.5 runs half of the calculations on the coprocessor. Setting the balance to -1 (the default) will enable dynamic load balancing that continously adjusts the fraction of offloaded work throughout the simulation. This option typically produces results within 5 to 10 percent of the optimal fixed balance.</li> <li>When using offload with CPU hyperthreading disabled, it may help performance to use fewer MPI tasks and OpenMP threads than available cores. This is due to the fact that additional threads are generated internally to handle the asynchronous offload tasks.</li> <li>If running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer to find a near-optimal setting that will carry over to additional runs.</li> <li>If pair computations are being offloaded to an Intel(R) Xeon Phi(TM) coprocessor, a diagnostic line is printed to the screen (not to the log file), during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><em>atom sorting</em></a> is changed to 1 so that the per-atom data is effectively sorted at every rebuild of the neighbor lists.</li> <li>For simulations with long-range electrostatics or bond, angle, dihedral, improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI communications for these calculations on the host CPU. The USER-INTEL package has two modes for deciding which atoms will be handled by the coprocessor. This choice is controlled with the <em>ghost</em> keyword of the <a class="reference internal" href="package.html"><em>package intel</em></a> command. When set to 0, ghost atoms (atoms at the borders between MPI tasks) are not offloaded to the card. This allows for overlap of MPI communication of forces with computation on the coprocessor when the <a class="reference internal" href="newton.html"><em>newton</em></a> setting is “on”. The default is dependent on the style being used, however, better performance may be achieved by setting this option explictly.</li> </ul> <div class="section" id="restrictions"> <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2> <p>When offloading to a coprocessor, <a class="reference internal" href="pair_hybrid.html"><em>hybrid</em></a> styles that require skip lists for neighbor builds cannot be offloaded. Using <a class="reference internal" href="pair_hybrid.html"><em>hybrid/overlay</em></a> is allowed. Only one intel accelerated style may be used with hybrid styles. 
<a class="reference internal" href="special_bonds.html"><em>Special_bonds</em></a> exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the “inner”, “middle”, “outer” options for rRESPA integration via the <a class="reference internal" href="run_style.html"><em>run_style respa</em></a> command; only the “pair” option is supported.</p> </div> </div> </div> </div> <footer> <hr/> <div role="contentinfo"> <p> © Copyright . </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. </footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'./', VERSION:'15 May 2015 version', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script> <script type="text/javascript" src="_static/js/theme.js"></script> <script type="text/javascript"> jQuery(function () { SphinxRtdTheme.StickyNav.enable(); }); </script> </body> </html> \ No newline at end of file diff --git a/doc/accelerate_intel.txt b/doc/accelerate_intel.txt index c0cbafa44..879413893 100644 --- a/doc/accelerate_intel.txt +++ b/doc/accelerate_intel.txt @@ -1,347 +1,347 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.3 USER-INTEL package :h4 The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel(R) Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package). Additionally, it supports running simulations in single, mixed, or double precision with vectorization, even if a coprocessor is not present, i.e. on an Intel(R) CPU. The same C++ code is used for both cases. When offloading to a coprocessor, the routine is run twice, once with an offload flag. The USER-INTEL package can be used in tandem with the USER-OMP package. This is useful when offloading pair style computations to coprocessors, so that other styles not supported by the USER-INTEL package, e.g. bond, angle, dihedral, improper, and long-range electrostatics, can run simultaneously in threaded mode on the CPU cores. Since less MPI tasks than CPU cores will typically be invoked when running with coprocessors, this enables the extra CPU cores to be used for useful computation. 
If LAMMPS is built with both the USER-INTEL and USER-OMP packages installed, this mode of operation is made easier to use, because the "-suffix intel" "command-line switch"_Section_start.html#start_7 or the "suffix intel"_suffix.html command will both set a second-choice suffix to "omp" so that styles from the USER-OMP package will be used if available, after first testing if a style from the USER-INTEL package is available. When using the USER-INTEL package, you must choose at build time whether you are building for CPU-only acceleration or for using the Xeon Phi in offload mode. Here is a quick overview of how to use the USER-INTEL package for CPU-only acceleration: specify these CCFLAGS in your src/MAKE/Makefile.machine: -openmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost specify -openmp with LINKFLAGS in your Makefile.machine include the USER-INTEL package and (optionally) USER-OMP package and build LAMMPS specify how many OpenMP threads per MPI task to use use USER-INTEL and (optionally) USER-OMP styles in your input script :ul Note that many of these settings can only be used with the Intel compiler, as discussed below. Using the USER-INTEL package to offload work to the Intel(R) Xeon Phi(TM) coprocessor is the same except for these additional steps: add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine add the flag -offload to LINKFLAGS in your Makefile.machine :ul The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel" "command-line switches"_Section_start.html#start_7 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package intel"_package.html or "suffix intel"_suffix.html commands respectively to your input script. [Required hardware/software:] To use the offload option, you must have one or more Intel(R) Xeon Phi(TM) coprocessors and use an Intel(R) C++ compiler. Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in vectorization, or may give poor performance. Use of an Intel C++ compiler is recommended, but not required (though g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface. The recommended version of the Intel(R) compiler is 14.0.1.106. Versions 15.0.1.133 and later are also supported. If using Intel(R) MPI, versions 15.0.2.044 and later are recommended. [Building LAMMPS with the USER-INTEL package:] You can choose to build with or without support for offload to an Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a coprocessor, the same binary can be used on nodes with and without coprocessors installed. However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary. You can do either in one line, using the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for help. If run from the src directory, these commands will create src/lmp_intel_cpu and lmp_intel_phi using src/MAKE/Makefile.mpi as the starting Makefile.machine: -Make.py -p intel omp -intel cpu -o intel_cpu -cc icc file mpi -Make.py -p intel omp -intel phi -o intel_phi -cc icc file mpi :pre +Make.py -p intel omp -intel cpu -o intel_cpu -cc icc -a file mpi +Make.py -p intel omp -intel phi -o intel_phi -cc icc -a file mpi :pre Note that this assumes that your MPI and its mpicxx wrapper are using the Intel compiler. 
If it is not, you should leave off the "-cc icc" switch. Or you can follow these steps: cd lammps/src make yes-user-intel make yes-user-omp (if desired) make machine :pre Note that if the USER-OMP package is also installed, you can use styles from both packages, as described below. The Makefile.machine needs a "-fopenmp" flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables. You also need to add -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS. If you are compiling on the same architecture that will be used for the runs, adding the flag {-xHost} to CCFLAGS will enable vectorization with the Intel(R) compiler. Otherwise, you must provide the correct compute node architecture to the -x option (e.g. -xAVX). In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag {-offload} should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line. Example makefiles Makefile.intel_cpu and Makefile.intel_phi are included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not. [Notes on CPU and core affinity:] Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host will be set by default on the host when using offload to a coprocessor. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with the {no_affinity} option to the "package intel"_package.html command or by disabling the option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). Disabling this option is not recommended, especially when running on a machine with hyperthreading disabled. [Running with the USER-INTEL package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. If you plan to compute (any portion of) pairwise interactions using USER-INTEL pair styles on the CPU, or use USER-OMP styles on the CPU, you need to choose how many OpenMP threads per MPI task to use. Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer. If LAMMPS was built with coprocessor support for the USER-INTEL package, you also need to specify the number of coprocessor/node and the number of coprocessor threads per MPI task to use. Note that coprocessor threads (which run on the coprocessor) are totally independent from OpenMP threads (which run on the CPU). The default values for the settings that affect coprocessor threads are typically fine, as discussed below. Use the "-sf intel" "command-line switch"_Section_start.html#start_7, which will automatically append "intel" to styles that support it. If a style does not support it, an "omp" suffix is tried next. OpenMP threads per MPI task can be set via the "-pk intel Nphi omp Nt" or "-pk omp Nt" "command-line switches"_Section_start.html#start_7, which set Nt = # of OpenMP threads per MPI task to use. 
The "-pk omp" form is only allowed if LAMMPS was also built with the USER-OMP package. Use the "-pk intel Nphi" "command-line switch"_Section_start.html#start_7 to set Nphi = # of Xeon Phi(TM) coprocessors/node, if LAMMPS was built with coprocessor support. All the available coprocessor threads on each Phi will be divided among MPI tasks, unless the {tptask} option of the "-pk intel" "command-line switch"_Section_start.html#start_7 is used to limit the coprocessor threads per MPI task. See the "package intel"_package.html command for details. CPU-only without USER-OMP (but using Intel vectorization on CPU): lmp_machine -sf intel -in in.script # 1 MPI task mpirun -np 32 lmp_machine -sf intel -in in.script # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes) :pre CPU-only with USER-OMP (and Intel vectorization on CPU): lmp_machine -sf intel -pk intel 16 0 -in in.script # 1 MPI task on a 16-core node mpirun -np 4 lmp_machine -sf intel -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node mpirun -np 32 lmp_machine -sf intel -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre CPUs + Xeon Phi(TM) coprocessors with or without USER-OMP: lmp_machine -sf intel -pk intel 1 omp 16 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, all 240 coprocessor threads lmp_machine -sf intel -pk intel 1 omp 16 tptask 32 -in in.script # 1 MPI task, 16 OpenMP threads on CPU, 1 coprocessor, only 32 coprocessor threads mpirun -np 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # 4 MPI tasks, 4 OpenMP threads/task, 1 coprocessor, 60 coprocessor threads/task mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 1 omp 4 -in in.script # ditto on 8 16-core nodes mpirun -np 8 lmp_machine -sf intel -pk intel 4 omp 2 -in in.script # 8 MPI tasks, 2 OpenMP threads/task, 4 coprocessors, 120 coprocessor threads/task :pre Note that if the "-sf intel" switch is used, it also invokes two default commands: "package intel 1"_package.html, followed by "package omp 0"_package.html. These both set the number of OpenMP threads per MPI task via the OMP_NUM_THREADS environment variable. The first command sets the number of Xeon Phi(TM) coprocessors/node to 1 (and the precision mode to "mixed", as one of its option defaults). The latter command is not invoked if LAMMPS was not built with the USER-OMP package. The Nphi = 1 value for the first command is ignored if LAMMPS was not built with coprocessor support. Using the "-pk intel" or "-pk omp" switches explicitly allows for direct setting of the number of OpenMP threads per MPI task, and additional options for either of the USER-INTEL or USER-OMP packages. In particular, the "-pk intel" switch sets the number of coprocessors/node and can limit the number of coprocessor threads per MPI task. The syntax for these two switches is the same as the "package omp"_package.html and "package intel"_package.html commands. See the "package"_package.html command doc page for details, including the default values used for all its options if these switches are not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired. [Or run with the USER-INTEL package by editing an input script:] The discussion above for the mpirun/mpiexec command, MPI tasks/node, OpenMP threads per MPI task, and coprocessor threads per MPI task is the same. Use the "suffix intel"_suffix.html command, or you can explicitly add an "intel" suffix to individual styles in your input script, e.g. 
pair_style lj/cut/intel 2.5 :pre You must also use the "package intel"_package.html command, unless the "-sf intel" or "-pk intel" "command-line switches"_Section_start.html#start_7 were used. It specifies how many coprocessors/node to use, as well as other OpenMP threading and coprocessor options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired. If LAMMPS was also built with the USER-OMP package, you must also use the "package omp"_package.html command to enable that package, unless the "-sf intel" or "-pk omp" "command-line switches"_Section_start.html#start_7 were used. It specifies how many OpenMP threads per MPI task to use, as well as other options. Its doc page explains how to set the number of OpenMP threads via an environment variable if desired. [Speed-ups to expect:] If LAMMPS was not built with coprocessor support when including the USER-INTEL package, then accelerated styles will run on the CPU using vectorization optimizations and the specified precision. This may give a substantial speed-up for a pair style, particularly if mixed or single precision is used. If LAMMPS was built with coprocessor support, the pair styles will run on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The performance of a Xeon Phi versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/coprocessor, and the precision used on the coprocessor (double, single, mixed). See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the USER-INTEL package on different hardware. [Guidelines for best performance on an Intel(R) Xeon Phi(TM) coprocessor:] The default for the "package intel"_package.html command is to have all the MPI tasks on a given compute node use a single Xeon Phi(TM) coprocessor. In general, running with a large number of MPI tasks on each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads available on the coprocessor. For example, if your card has 61 cores, with 60 cores available for offload and 4 hardware threads per core (240 total threads), running with 24 MPI tasks per node will cause each MPI task to use a subset of 10 threads on the coprocessor. Fine tuning of the number of threads to use per MPI task or the number of threads to use per core can be accomplished with keyword settings of the "package intel"_package.html command. :ulb,l If desired, only a fraction of the pair style computation can be offloaded to the coprocessors. This is accomplished by using the {balance} keyword in the "package intel"_package.html command. A balance of 0 runs all calculations on the CPU. A balance of 1 runs all calculations on the coprocessor. A balance of 0.5 runs half of the calculations on the coprocessor. Setting the balance to -1 (the default) will enable dynamic load balancing that continuously adjusts the fraction of offloaded work throughout the simulation. This option typically produces results within 5 to 10 percent of the optimal fixed balance. :l When using offload with CPU hyperthreading disabled, it may help performance to use fewer MPI tasks and OpenMP threads than available cores. This is because additional threads are generated internally to handle the asynchronous offload tasks. 
:l If running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer to find a near-optimal setting that will carry over to additional runs. :l If pair computations are being offloaded to an Intel(R) Xeon Phi(TM) coprocessor, a diagnostic line is printed to the screen (not to the log file) during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for "atom sorting"_atom_modify.html is changed to 1 so that the per-atom data is effectively sorted at every rebuild of the neighbor lists. :l For simulations with long-range electrostatics or bond, angle, dihedral, improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI communications for these calculations on the host CPU. The USER-INTEL package has two modes for deciding which atoms will be handled by the coprocessor. This choice is controlled with the {ghost} keyword of the "package intel"_package.html command. When set to 0, ghost atoms (atoms at the borders between MPI tasks) are not offloaded to the card. This allows for overlap of MPI communication of forces with computation on the coprocessor when the "newton"_newton.html setting is "on". The default depends on the style being used; however, better performance may be achieved by setting this option explicitly. :l,ule [Restrictions:] When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles that require skip lists for neighbor builds cannot be offloaded. Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel-accelerated style may be used with hybrid styles. "Special_bonds"_special_bonds.html exclusion lists are not currently supported with offload; however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0, as sketched below. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the "run_style respa"_run_style.html command; only the "pair" option is supported. 
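For illustration, the cutoff workaround mentioned above might look like this for a two-type system (a sketch only; the epsilon/sigma values are placeholders, and whether a particular pair style accepts a cutoff of exactly 0.0 should be verified for your case):

pair_style lj/cut/intel 10.0
pair_coeff 1 1 1.0 1.0
pair_coeff 2 2 1.0 1.0
pair_coeff 1 2 1.0 1.0 0.0     # zero cutoff: type 1-2 pairs contribute no interaction :pre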
diff --git a/doc/accelerate_omp.html b/doc/accelerate_omp.html index 0663e1268..2a78e2f1f 100644 --- a/doc/accelerate_omp.html +++ b/doc/accelerate_omp.html @@ -1,355 +1,355 @@ <!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>5.USER-OMP package — LAMMPS 15 May 2015 version documentation</title> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" /> <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> <script src="_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-nav-search"> <a href="Manual.html" class="icon icon-home"> LAMMPS </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. 
Future and history</a></li> </ul> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="Manual.html">LAMMPS</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="Manual.html">Docs</a> »</li> <li>5.USER-OMP package</li> <li class="wy-breadcrumbs-aside"> <a href="http://lammps.sandia.gov">Website</a> <a href="Section_commands.html#comm">Commands</a> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p> <div class="section" id="user-omp-package"> <h1>5.USER-OMP package<a class="headerlink" href="#user-omp-package" title="Permalink to this headline">¶</a></h1> <p>The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.</p> <p>Here is a quick overview of how to use the USER-OMP package:</p> <ul class="simple"> <li>use the -fopenmp flag for compiling and linking in your Makefile.machine</li> <li>include the USER-OMP package and build LAMMPS</li> <li>use the mpirun command to set the number of MPI tasks/node</li> <li>specify how many threads per MPI task to use</li> <li>use USER-OMP styles in your input script</li> </ul> <p>The latter two steps can be done using the “-pk omp” and “-sf omp” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> respectively. Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the <a class="reference internal" href="package.html"><em>package omp</em></a> or <a class="reference internal" href="suffix.html"><em>suffix omp</em></a> commands respectively to your input script.</p> <p><strong>Required hardware/software:</strong></p> <p>Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by an MPI task running on a CPU.</p> <p><strong>Building LAMMPS with the USER-OMP package:</strong></p> <p>To do this in one line, use the src/Make.py script, described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual. Type “Make.py -h” for help. If run from the src directory, this command will create src/lmp_omp using src/MAKE/Makefile.mpi as the starting Makefile.machine:</p> -<div class="highlight-python"><div class="highlight"><pre>Make.py -p omp -o omp file mpi +<div class="highlight-python"><div class="highlight"><pre>Make.py -p omp -o omp -a file mpi </pre></div> </div> <p>Or you can follow these steps:</p> <div class="highlight-python"><div class="highlight"><pre>cd lammps/src make yes-user-omp make machine </pre></div> </div> <p>The CCFLAGS setting in Makefile.machine needs “-fopenmp” to add OpenMP support. This works for both the GNU and Intel compilers. Without this flag the USER-OMP styles will still be compiled and work, but will not support multi-threading. 
For the Intel compilers the CCFLAGS setting also needs to include “-restrict”.</p> <p><strong>Run with the USER-OMP package from the command line:</strong></p> <p>The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.</p> <p>You need to choose how many threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.</p> <p>Use the “-sf omp” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, which will automatically append “omp” to styles that support it. Use the “-pk omp Nt” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, to set Nt = # of OpenMP threads per MPI task to use.</p> <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf omp -pk omp 16 -in in.script # 1 MPI task on a 16-core node mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script # ditto on 8 16-core nodes </pre></div> </div> <p>Note that if the “-sf omp” switch is used, it also issues a default <a class="reference internal" href="package.html"><em>package omp 0</em></a> command, which sets the number of threads per MPI task via the OMP_NUM_THREADS environment variable.</p> <p>Using the “-pk” switch explicitly allows for direct setting of the number of threads and additional options. Its syntax is the same as the “package omp” command. See the <a class="reference internal" href="package.html"><em>package</em></a> command doc page for details, including the default values used for all its options if it is not specified, and how to set the number of threads via the OMP_NUM_THREADS environment variable if desired.</p> <p><strong>Or run with the USER-OMP package by editing an input script:</strong></p> <p>The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same.</p> <p>Use the <a class="reference internal" href="suffix.html"><em>suffix omp</em></a> command, or you can explicitly add an “omp” suffix to individual styles in your input script, e.g.</p> <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/omp 2.5 </pre></div> </div> <p>You must also use the <a class="reference internal" href="package.html"><em>package omp</em></a> command to enable the USER-OMP package, unless the “-sf omp” or “-pk omp” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switches</span></a> were used. It specifies how many threads per MPI task to use, as well as other options. 
Its doc page explains how to set the number of threads via an environment variable if desired.</p> <p><strong>Speed-ups to expect:</strong></p> <p>Depending on which styles are accelerated, you should look for a reduction in the “Pair time”, “Bond time”, “KSpace time”, and “Loop time” values printed at the end of a run.</p> <p>You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its standard (un-accelerated) styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, as described above.</p> <p>With multiple threads/task, the optimal choice of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.</p> <p>A description of the multi-threading strategy used in the USER-OMP package and some performance examples are <a class="reference external" href="http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented here</a></p> <p><strong>Guidelines for best performance:</strong></p> <p>For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.</p> <p>Using multiple threads/task can be more effective under the following circumstances:</p> <ul class="simple"> <li>Individual compute nodes have a significant number of CPU cores but the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx (Clovertown) and 54xx (Harpertown) quad core processors. Running one MPI task per CPU core will result in significant performance degradation, so that running with 4 or even only 2 MPI tasks per node is faster. Running in hybrid MPI+OpenMP mode will reduce the inter-node communication bandwidth contention in the same way, but offers an additional speedup by utilizing the otherwise idle CPU cores.</li> <li>The interconnect used for MPI communication does not provide sufficient bandwidth for a large number of MPI tasks per node. For example, this applies to running over gigabit ethernet or on Cray XT4 or XT5 series supercomputers. As in the aforementioned case, this effect worsens when using an increasing number of nodes.</li> <li>The system has a spatially inhomogeneous particle density which does not map well to the <a class="reference internal" href="processors.html"><em>domain decomposition scheme</em></a> or <a class="reference internal" href="balance.html"><em>load-balancing</em></a> options that LAMMPS provides. This is because multi-threading achives parallelism over the number of particles, not via their distribution in space.</li> <li>A machine is being used in “capability mode”, i.e. near the point where MPI parallelism is maxed out. For example, this can happen when using the <a class="reference internal" href="kspace_style.html"><em>PPPM solver</em></a> for long-range electrostatics on large numbers of nodes. 
The scaling of the KSpace calculation (see the <a class="reference internal" href="kspace_style.html"><em>kspace_style</em></a> command) becomes the performance-limiting factor. Using multi-threading allows less MPI tasks to be invoked and can speed-up the long-range solver, while increasing overall performance by parallelizing the pairwise and bonded calculations via OpenMP. Likewise additional speedup can be sometimes be achived by increasing the length of the Coulombic cutoff and thus reducing the work done by the long-range solver. Using the <a class="reference internal" href="run_style.html"><em>run_style verlet/split</em></a> command, which is compatible with the USER-OMP package, is an alternative way to reduce the number of MPI tasks assigned to the KSpace calculation.</li> </ul> <p>Additional performance tips are as follows:</p> <ul class="simple"> <li>The best parallel efficiency from <em>omp</em> styles is typically achieved when there is at least one MPI task per physical processor, i.e. socket or die.</li> <li>It is usually most efficient to restrict threading to a single socket, i.e. use one or more MPI task per socket.</li> <li>Several current MPI implementation by default use a processor affinity setting that restricts each MPI task to a single CPU core. Using multi-threading in this mode will force the threads to share that core and thus is likely to be counterproductive. Instead, binding MPI tasks to a (multi-core) socket, should solve this issue.</li> </ul> <div class="section" id="restrictions"> <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2> <p>None.</p> </div> </div> </div> </div> <footer> <hr/> <div role="contentinfo"> <p> © Copyright . </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
</footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'./', VERSION:'15 May 2015 version', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script> <script type="text/javascript" src="_static/js/theme.js"></script> <script type="text/javascript"> jQuery(function () { SphinxRtdTheme.StickyNav.enable(); }); </script> </body> </html> \ No newline at end of file diff --git a/doc/accelerate_omp.txt b/doc/accelerate_omp.txt index 08b9f3c75..9d461f519 100644 --- a/doc/accelerate_omp.txt +++ b/doc/accelerate_omp.txt @@ -1,201 +1,201 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.5 USER-OMP package :h4 The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading. Here is a quick overview of how to use the USER-OMP package: use the -fopenmp flag for compiling and linking in your Makefile.machine include the USER-OMP package and build LAMMPS use the mpirun command to set the number of MPI tasks/node specify how many threads per MPI task to use use USER-OMP styles in your input script :ul The latter two steps can be done using the "-pk omp" and "-sf omp" "command-line switches"_Section_start.html#start_7 respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the "package omp"_package.html or "suffix omp"_suffix.html commands respectively to your input script. [Required hardware/software:] Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by an MPI task running on a CPU. [Building LAMMPS with the USER-OMP package:] To do this in one line, use the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for help. If run from the src directory, this command will create src/lmp_omp using src/MAKE/Makefile.mpi as the starting Makefile.machine: -Make.py -p omp -o omp file mpi :pre +Make.py -p omp -o omp -a file mpi :pre Or you can follow these steps: cd lammps/src make yes-user-omp make machine :pre The CCFLAGS setting in Makefile.machine needs "-fopenmp" to add OpenMP support. This works for both the GNU and Intel compilers. Without this flag the USER-OMP styles will still be compiled and work, but will not support multi-threading. For the Intel compilers the CCFLAGS setting also needs to include "-restrict". 
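As a concrete illustration, the relevant lines of a hand-edited Makefile.machine for the GNU compilers might look roughly as follows (a sketch; the optimization flags are arbitrary examples, and the Intel compiler would additionally need -restrict in CCFLAGS):

CCFLAGS =	-g -O3 -fopenmp
LINKFLAGS =	-g -O3 -fopenmp :pre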
[Run with the USER-OMP package from the command line:] The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. You need to choose how many threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node); otherwise performance will suffer. Use the "-sf omp" "command-line switch"_Section_start.html#start_7, which will automatically append "omp" to styles that support it. Use the "-pk omp Nt" "command-line switch"_Section_start.html#start_7 to set Nt = # of OpenMP threads per MPI task. lmp_machine -sf omp -pk omp 16 -in in.script # 1 MPI task on a 16-core node mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script # 4 MPI tasks each with 4 threads on a single 16-core node mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script # ditto on 8 16-core nodes :pre Note that if the "-sf omp" switch is used, it also issues a default "package omp 0"_package.html command, which sets the number of threads per MPI task via the OMP_NUM_THREADS environment variable. Using the "-pk" switch explicitly allows for direct setting of the number of threads and additional options. Its syntax is the same as the "package omp" command. See the "package"_package.html command doc page for details, including the default values used for all its options if it is not specified, and how to set the number of threads via the OMP_NUM_THREADS environment variable if desired. [Or run with the USER-OMP package by editing an input script:] The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same. Use the "suffix omp"_suffix.html command, or you can explicitly add an "omp" suffix to individual styles in your input script, e.g. pair_style lj/cut/omp 2.5 :pre You must also use the "package omp"_package.html command to enable the USER-OMP package, unless the "-sf omp" or "-pk omp" "command-line switches"_Section_start.html#start_7 were used. It specifies how many threads per MPI task to use, as well as other options. Its doc page explains how to set the number of threads via an environment variable if desired. A minimal input-script example combining these commands is sketched below, after the discussion of expected speed-ups. [Speed-ups to expect:] Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run. You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its standard (un-accelerated) styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, as described above. With multiple threads/task, the optimal choice of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.
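To make the input-script route referenced above concrete, here is a minimal sketch. The thread count of 4 and the lj/cut pair style are arbitrary placeholders; with the "suffix omp"_suffix.html command active, the plain style name is automatically mapped to its /omp variant.

package omp 4
suffix omp
pair_style lj/cut 2.5 :pre

This has the same effect as launching with the "-sf omp -pk omp 4" command-line switches.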
A description of the multi-threading strategy used in the USER-OMP package and some performance examples are "presented here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1 [Guidelines for best performance:] For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles. Using multiple threads/task can be more effective under the following circumstances: Individual compute nodes have a significant number of CPU cores but the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx (Clovertown) and 54xx (Harpertown) quad-core processors. Running one MPI task per CPU core will result in significant performance degradation, so that running with 4 or even only 2 MPI tasks per node is faster. Running in hybrid MPI+OpenMP mode will reduce the inter-node communication bandwidth contention in the same way, but offers an additional speedup by utilizing the otherwise idle CPU cores. :ulb,l The interconnect used for MPI communication does not provide sufficient bandwidth for a large number of MPI tasks per node. For example, this applies to running over gigabit ethernet or on Cray XT4 or XT5 series supercomputers. As in the aforementioned case, this effect worsens when using an increasing number of nodes. :l The system has a spatially inhomogeneous particle density which does not map well to the "domain decomposition scheme"_processors.html or "load-balancing"_balance.html options that LAMMPS provides. This is because multi-threading achieves parallelism over the number of particles, not via their distribution in space. :l A machine is being used in "capability mode", i.e. near the point where MPI parallelism is maxed out. For example, this can happen when using the "PPPM solver"_kspace_style.html for long-range electrostatics on large numbers of nodes. The scaling of the KSpace calculation (see the "kspace_style"_kspace_style.html command) becomes the performance-limiting factor. Using multi-threading allows fewer MPI tasks to be invoked and can speed up the long-range solver, while increasing overall performance by parallelizing the pairwise and bonded calculations via OpenMP. Likewise, additional speedup can sometimes be achieved by increasing the length of the Coulombic cutoff and thus reducing the work done by the long-range solver. Using the "run_style verlet/split"_run_style.html command, which is compatible with the USER-OMP package, is an alternative way to reduce the number of MPI tasks assigned to the KSpace calculation. :l,ule Additional performance tips are as follows: The best parallel efficiency from {omp} styles is typically achieved when there is at least one MPI task per physical processor, i.e. socket or die. :ulb,l It is usually most efficient to restrict threading to a single socket, i.e. use one or more MPI tasks per socket. :l Several current MPI implementations by default use a processor affinity setting that restricts each MPI task to a single CPU core. Using multi-threading in this mode will force the threads to share that core and thus is likely to be counterproductive. Instead, binding MPI tasks to a (multi-core) socket should solve this issue (an example run command is sketched at the end of this section). :l,ule [Restrictions:] None.
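As referenced in the performance tips above, here is a sketch of a hybrid MPI+OpenMP launch on a hypothetical cluster with two 8-core sockets per node: one MPI task per socket, 8 threads per task, with each task bound to its socket. The binding option shown is OpenMPI syntax and is only an example; the flag name and availability differ between MPI implementations and versions, so consult your MPI documentation.

mpirun -np 8 -npernode 2 --bind-to socket lmp_machine -sf omp -pk omp 8 -in in.script # 4 nodes, 2 tasks/node, 8 threads/task :pre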
diff --git a/doc/accelerate_opt.html b/doc/accelerate_opt.html index 60f1afb2a..fc5c77952 100644 --- a/doc/accelerate_opt.html +++ b/doc/accelerate_opt.html @@ -1,250 +1,250 @@ <!DOCTYPE html> <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]--> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>5.OPT package — LAMMPS 15 May 2015 version documentation</title> <link rel="stylesheet" href="_static/css/theme.css" type="text/css" /> <link rel="stylesheet" href="_static/sphinxcontrib-images/LightBox2/lightbox2/css/lightbox.css" type="text/css" /> <link rel="top" title="LAMMPS 15 May 2015 version documentation" href="index.html"/> <script src="_static/js/modernizr.min.js"></script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side"> <div class="wy-side-nav-search"> <a href="Manual.html" class="icon icon-home"> LAMMPS </a> <div role="search"> <form id="rtd-search-form" class="wy-form" action="search.html" method="get"> <input type="text" name="q" placeholder="Search docs" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="Section_intro.html">1. Introduction</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_start.html">2. Getting Started</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_commands.html">3. Commands</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_packages.html">4. Packages</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_accelerate.html">5. Accelerating LAMMPS performance</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_howto.html">6. How-to discussions</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_example.html">7. Example problems</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_perf.html">8. Performance & scalability</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_tools.html">9. Additional tools</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_modify.html">10. Modifying & extending LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_python.html">11. Python interface to LAMMPS</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_errors.html">12. Errors</a></li> <li class="toctree-l1"><a class="reference internal" href="Section_history.html">13. 
Future and history</a></li> </ul> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="Manual.html">LAMMPS</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs navigation"> <ul class="wy-breadcrumbs"> <li><a href="Manual.html">Docs</a> »</li> <li>5.OPT package</li> <li class="wy-breadcrumbs-aside"> <a href="http://lammps.sandia.gov">Website</a> <a href="Section_commands.html#comm">Commands</a> </li> </ul> <hr/> </div> <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> <div itemprop="articleBody"> <p><a class="reference internal" href="Section_accelerate.html"><em>Return to Section accelerate overview</em></a></p> <div class="section" id="opt-package"> <h1>5.OPT package<a class="headerlink" href="#opt-package" title="Permalink to this headline">¶</a></h1> <p>The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.</p> <p>Here is a quick overview of how to use the OPT package:</p> <ul class="simple"> <li>include the OPT package and build LAMMPS</li> <li>use OPT pair styles in your input script</li> </ul> <p>The last step can be done using the “-sf opt” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>. Or the effect of the “-sf” switch can be duplicated by adding a <a class="reference internal" href="suffix.html"><em>suffix opt</em></a> command to your input script.</p> <p><strong>Required hardware/software:</strong></p> <p>None.</p> <p><strong>Building LAMMPS with the OPT package:</strong></p> <p>Include the package and build LAMMPS:</p> <p>To do this in one line, use the src/Make.py script, described in <a class="reference internal" href="Section_start.html#start-4"><span>Section 2.4</span></a> of the manual. Type “Make.py -h” for help. 
If run from the src directory, this command will create src/lmp_opt using src/MAKE/Makefile.mpi as the starting Makefile.machine:</p> -<div class="highlight-python"><div class="highlight"><pre>Make.py -p opt -o opt file mpi +<div class="highlight-python"><div class="highlight"><pre>Make.py -p opt -o opt -a file mpi </pre></div> </div> <p>Or you can follow these steps:</p> <div class="highlight-python"><div class="highlight"><pre>cd lammps/src make yes-opt make machine </pre></div> </div> <p>If you are using Intel compilers, then the CCFLAGS setting in Makefile.machine needs to include “-restrict”.</p> <p><strong>Run with the OPT package from the command line:</strong></p> <p>Use the “-sf opt” <a class="reference internal" href="Section_start.html#start-7"><span>command-line switch</span></a>, which will automatically append “opt” to styles that support it.</p> <div class="highlight-python"><div class="highlight"><pre>lmp_machine -sf opt -in in.script mpirun -np 4 lmp_machine -sf opt -in in.script </pre></div> </div> <p><strong>Or run with the OPT package by editing an input script:</strong></p> <p>Use the <a class="reference internal" href="suffix.html"><em>suffix opt</em></a> command, or you can explicitly add an “opt” suffix to individual styles in your input script, e.g.</p> <div class="highlight-python"><div class="highlight"><pre>pair_style lj/cut/opt 2.5 </pre></div> </div> <p><strong>Speed-ups to expect:</strong></p> <p>You should see a reduction in the “Pair time” value printed at the end of a run. On most machines for reasonable problem sizes, it will be a 5 to 20% savings.</p> <p><strong>Guidelines for best performance:</strong></p> <p>None. Just try out an OPT pair style to see how it performs.</p> <div class="section" id="restrictions"> <h2>Restrictions<a class="headerlink" href="#restrictions" title="Permalink to this headline">¶</a></h2> <p>None.</p> </div> </div> </div> </div> <footer> <hr/> <div role="contentinfo"> <p> © Copyright . </p> </div> Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
</footer> </div> </div> </section> </div> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT:'./', VERSION:'15 May 2015 version', COLLAPSE_INDEX:false, FILE_SUFFIX:'.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/jquery-1.11.0.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2/js/lightbox.min.js"></script> <script type="text/javascript" src="_static/sphinxcontrib-images/LightBox2/lightbox2-customize/jquery-noconflict.js"></script> <script type="text/javascript" src="_static/js/theme.js"></script> <script type="text/javascript"> jQuery(function () { SphinxRtdTheme.StickyNav.enable(); }); </script> </body> </html> \ No newline at end of file diff --git a/doc/accelerate_opt.txt b/doc/accelerate_opt.txt index 726f32686..23e853aec 100644 --- a/doc/accelerate_opt.txt +++ b/doc/accelerate_opt.txt @@ -1,82 +1,82 @@ "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line "Return to Section accelerate overview"_Section_accelerate.html 5.3.6 OPT package :h4 The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code. Here is a quick overview of how to use the OPT package: include the OPT package and build LAMMPS use OPT pair styles in your input script :ul The last step can be done using the "-sf opt" "command-line switch"_Section_start.html#start_7. Or the effect of the "-sf" switch can be duplicated by adding a "suffix opt"_suffix.html command to your input script. [Required hardware/software:] None. [Building LAMMPS with the OPT package:] Include the package and build LAMMPS: To do this in one line, use the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type "Make.py -h" for help. If run from the src directory, this command will create src/lmp_opt using src/MAKE/Makefile.mpi as the starting Makefile.machine: -Make.py -p opt -o opt file mpi :pre +Make.py -p opt -o opt -a file mpi :pre Or you can follow these steps: cd lammps/src make yes-opt make machine :pre If you are using Intel compilers, then the CCFLAGS setting in Makefile.machine needs to include "-restrict". [Run with the OPT package from the command line:] Use the "-sf opt" "command-line switch"_Section_start.html#start_7, which will automatically append "opt" to styles that support it. lmp_machine -sf opt -in in.script mpirun -np 4 lmp_machine -sf opt -in in.script :pre [Or run with the OPT package by editing an input script:] Use the "suffix opt"_suffix.html command, or you can explicitly add an "opt" suffix to individual styles in your input script, e.g. pair_style lj/cut/opt 2.5 :pre [Speed-ups to expect:] You should see a reduction in the "Pair time" value printed at the end of a run. 
On most machines for reasonable problem sizes, it will be a 5 to 20% savings. [Guidelines for best performance:] None. Just try out an OPT pair style to see how it performs. [Restrictions:] None.