accelerate_omp.html
File Metadata

Created: Tue, Oct 1, 06:08

	<HTML>
	<CENTER><A HREF = "Section_packages.html">Previous Section</A> - <A HREF = "http://lammps.sandia.gov">LAMMPS WWW Site</A> -
	<A HREF = "Manual.html">LAMMPS Documentation</A> - <A HREF = "Section_commands.html#comm">LAMMPS Commands</A>
	</CENTER>






	<HR>

	<P><A HREF = "Section_accelerate.html">Return to Section accelerate overview</A>
	</P>
	<H4>5.3.5 USER-OMP package
	</H4>
	<P>The USER-OMP package was developed by Axel Kohlmeyer at Temple
	University. It provides multi-threaded versions of most pair styles,
	nearly all bonded styles (bond, angle, dihedral, improper), several
	Kspace styles, and a few fix styles. The package currently uses the
	OpenMP interface for multi-threading.
	</P>
	<P>Here is a quick overview of how to use the USER-OMP package, assuming
	one or more 16-core nodes. More details follow.
	</P>
	<PRE>use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine
	make yes-user-omp
	make mpi # build with USER-OMP package, if settings added to Makefile.mpi
	make omp # or Makefile.omp already has settings
	Make.py -v -p omp -o mpi -a file mpi # or one-line build via Make.py
	</PRE>
	<PRE>lmp_mpi -sf omp -pk omp 16 < in.script # 1 MPI task, 16 threads
	mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 4 MPI tasks, 4 threads/task
	mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script # 8 nodes, 4 MPI tasks/node, 4 threads/task
	</PRE>
	<P><B>Required hardware/software:</B>
	</P>
	<P>Your compiler must support the OpenMP interface. You should have one
	or more multi-core CPUs so that multiple threads can be launched by
	each MPI task running on a CPU.
	</P>
	<P><B>Building LAMMPS with the USER-OMP package:</B>
	</P>
	<P>The lines above illustrate how to include/build with the USER-OMP
	package in two steps, using the "make" command, or in one step via the
	src/Make.py script, described in <A HREF = "Section_start.html#start_4">Section
	2.4</A> of the manual. Type "Make.py -h" for
	help.
	</P>
	<P>Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
	include "-fopenmp". Likewise, if you use an Intel compiler, the
	CCFLAGS setting must include "-restrict". The Make.py command will
	add these automatically.
	</P>
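	<P>As a minimal sketch of the settings described above, the relevant
	lines of a Makefile.machine for a GNU-style compiler might look as
	follows. The compiler wrapper name (mpicxx) and the "-g -O3" options
	are illustrative assumptions, not requirements; only "-fopenmp" (and
	"-restrict" for an Intel compiler) comes from the text above.
	</P>
	<PRE># add -restrict to CCFLAGS if using an Intel compiler
	CC =        mpicxx
	CCFLAGS =   -g -O3 -fopenmp
	LINK =      mpicxx
	LINKFLAGS = -g -O3 -fopenmp
	</PRE>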
	<P><B>Run with the USER-OMP package from the command line:</B>
	</P>
	<P>The mpirun or mpiexec command sets the total number of MPI tasks used
	by LAMMPS (one or multiple per compute node) and the number of MPI
	tasks used per node. For example, the mpirun command in MPICH does
	this via its -np and -ppn switches; OpenMPI uses -np and -npernode.
	</P>
	<P>You need to choose how many OpenMP threads per MPI task will be used
	by the USER-OMP package. Note that the product of MPI tasks/node *
	threads/task should not exceed the number of physical cores on a
	node; otherwise performance will suffer.
	</P>
	<P>As in the lines above, use the "-sf omp" <A HREF = "Section_start.html#start_7">command-line
	switch</A>, which will automatically append
	"omp" to styles that support it. The "-sf omp" switch also issues a
	default <A HREF = "package.html">package omp 0</A> command, which will set the
	number of threads per MPI task via the OMP_NUM_THREADS environment
	variable.
	</P>
	<P>You can also use the "-pk omp Nt" <A HREF = "Section_start.html#start_7">command-line
	switch</A>, to explicitly set Nt = # of OpenMP
	threads per MPI task to use, as well as additional options. Its
	syntax is the same as the <A HREF = "package.html">package omp</A> command whose doc
	page gives details, including the default values used if it is not
	specified. It also gives more details on how to set the number of
	threads via the OMP_NUM_THREADS environment variable.
	</P>
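	<P>For instance, the following two invocations would be expected to
	request the same layout of 4 MPI tasks with 4 threads each, one via
	the environment variable and one via the explicit switch. This is a
	sketch that assumes a Bourne-style shell; whether mpirun forwards
	OMP_NUM_THREADS to remote nodes depends on the MPI implementation, so
	check its documentation.
	</P>
	<PRE>export OMP_NUM_THREADS=4
	mpirun -np 4 lmp_mpi -sf omp -in in.script            # threads taken from OMP_NUM_THREADS
	mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script  # threads set explicitly
	</PRE>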
	<P><B>Or run with the USER-OMP package by editing an input script:</B>
	</P>
	<P>The discussion above for the mpirun/mpiexec command, MPI tasks/node,
	and threads/MPI task is the same.
	</P>
	<P>Use the <A HREF = "suffix.html">suffix omp</A> command, or you can explicitly add an
	"omp" suffix to individual styles in your input script, e.g.
	</P>
	<PRE>pair_style lj/cut/omp 2.5
	</PRE>
	<P>You must also use the <A HREF = "package.html">package omp</A> command to enable the
	USER-OMP package. When you do this you also specify how many threads
	per MPI task to use. The command doc page explains other options and
	how to set the number of threads via the OMP_NUM_THREADS environment
	variable.
	</P>
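	<P>Taken together, the top of an input script might look like the
	following sketch. The thread count of 4 is an arbitrary example, and
	the package command is assumed to appear before the simulation box is
	defined; see the <A HREF = "package.html">package</A> doc page for the
	authoritative rules.
	</P>
	<PRE>package omp 4            # enable USER-OMP with 4 threads per MPI task
	suffix omp               # append "omp" to styles that support it
	pair_style lj/cut 2.5    # runs as lj/cut/omp because of the suffix command
	</PRE>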
	<P><B>Speed-ups to expect:</B>
	</P>
	<P>Depending on which styles are accelerated, you should look for a
	reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
	time" values printed at the end of a run.
	</P>
	<P>You may see a small performance advantage (5 to 20%) when running a
	USER-OMP style (in serial or parallel) with a single thread per MPI
	task, versus running standard LAMMPS with its standard un-accelerated
	styles (in serial or all-MPI parallelization with 1 task/core). This
	is because many of the USER-OMP styles contain similar optimizations
	to those used in the OPT package, described in <A HREF = "accelerate_opt.html">Section accelerate
	5.3.6</A>.
	</P>
	<P>With multiple threads/task, the optimal choice of number of MPI
	tasks/node and OpenMP threads/task can vary a lot and should always be
	tested via benchmark runs for a specific simulation running on a
	specific machine, paying attention to guidelines discussed in the next
	sub-section.
	</P>
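	<P>One way to organize such benchmark runs on a single 16-core node is
	sketched below. It assumes a Bourne-style shell and reuses the
	executable and input script names from the examples above; the log
	file names are hypothetical. Each run writes its own log file via the
	"-log" command-line switch, so the "Loop time" values can be compared
	afterwards.
	</P>
	<PRE>for tasks in 16 8 4 2 1; do
	  threads=$(( 16 / tasks ))    # keep tasks * threads = 16 cores
	  mpirun -np $tasks lmp_mpi -sf omp -pk omp $threads -in in.script -log log.${tasks}x${threads}
	done
	grep "Loop time" log.*x*       # compare total run times
	</PRE>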
	<P>A description of the multi-threading strategy used in the USER-OMP
	package and some performance examples are <A HREF = "http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1">presented
	here</A>.
	</P>
	<P><B>Guidelines for best performance:</B>
	</P>
	<P>For many problems on current generation CPUs, running the USER-OMP
	package with a single thread/task is faster than running with multiple
	threads/task. This is because the MPI parallelization in LAMMPS is
	often more efficient than multi-threading as implemented in the
	USER-OMP package. The parallel efficiency (in a threaded sense) also
	varies for different USER-OMP styles.
	</P>
	<P>Using multiple threads/task can be more effective under the following
	circumstances:
	</P>
	<UL><LI>Individual compute nodes have a significant number of CPU cores but
	the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
	(Clovertown) and 54xx (Harpertown) quad-core processors. Running one
	MPI task per CPU core will result in significant performance
	degradation, so that running with 4 or even only 2 MPI tasks per node
	is faster. Running in hybrid MPI+OpenMP mode will reduce the
	inter-node communication bandwidth contention in the same way, but
	offers an additional speedup by utilizing the otherwise idle CPU
	cores.

	<LI>The interconnect used for MPI communication does not provide
	sufficient bandwidth for a large number of MPI tasks per node. For
	example, this applies to running over gigabit ethernet or on Cray XT4
	or XT5 series supercomputers. As in the aforementioned case, this
	effect worsens when using an increasing number of nodes.

	<LI>The system has a spatially inhomogeneous particle density which does
	not map well to the <A HREF = "processors.html">domain decomposition scheme</A> or
	<A HREF = "balance.html">load-balancing</A> options that LAMMPS provides. This is
	because multi-threading achieves parallelism over the number of
	particles, not via their distribution in space.

	<LI>A machine is being used in "capability mode", i.e. near the point
	where MPI parallelism is maxed out. For example, this can happen when
	using the <A HREF = "kspace_style.html">PPPM solver</A> for long-range
	electrostatics on large numbers of nodes. The scaling of the KSpace
	calculation (see the <A HREF = "kspace_style.html">kspace_style</A> command) becomes
	the performance-limiting factor. Using multi-threading allows fewer
	MPI tasks to be invoked and can speed up the long-range solver, while
	increasing overall performance by parallelizing the pairwise and
	bonded calculations via OpenMP. Likewise, additional speedup can
	sometimes be achieved by increasing the length of the Coulombic cutoff
	and thus reducing the work done by the long-range solver. Using the
	<A HREF = "run_style.html">run_style verlet/split</A> command, which is compatible
	with the USER-OMP package, is an alternative way to reduce the number
	of MPI tasks assigned to the KSpace calculation.
	</UL>
	<P>Additional performance tips are as follows:
	</P>
	<UL><LI>The best parallel efficiency from <I>omp</I> styles is typically achieved
	when there is at least one MPI task per physical CPU chip, i.e. socket
	or die.

	<LI>It is usually most efficient to restrict threading to a single
	socket, i.e. use one or more MPI tasks per socket.

	<LI>IMPORTANT NOTE: By default, several current MPI implementations use a
	processor affinity setting that restricts each MPI task to a single
	CPU core. Using multi-threading in this mode will force all threads
	to share the one core and thus is likely to be counterproductive.
	Instead, binding MPI tasks to a (multi-core) socket should solve this
	issue.
	</UL>
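	<P>As an illustration of the last point, with Open MPI on a node that
	is assumed to have two 8-core sockets, a launch line like the
	following binds one MPI task to each socket and lets its 8 threads
	use that socket's cores. The option names are those of recent Open
	MPI versions and differ between MPI implementations and versions, so
	treat this as a sketch and consult your mpirun documentation.
	</P>
	<PRE>mpirun -np 2 --map-by socket --bind-to socket lmp_mpi -sf omp -pk omp 8 -in in.script
	</PRE>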
	<P><B>Restrictions:</B>
	</P>
	<P>None.
	</P>
	</HTML>
