README

Benchmarks

in.intel.lj - Atomic fluid (LJ Benchmark)
in.intel.rhodo - Protein (Rhodopsin Benchmark)
in.intel.lc - Liquid Crystal w/ Gay-Berne potential
in.intel.eam - Copper benchmark with Embedded Atom Method
in.intel.sw - Silicon benchmark with Stillinger-Weber
in.intel.tersoff - Silicon benchmark with Tersoff
in.intel.water - Coarse-grain water benchmark using Stillinger-Weber

Expected Timesteps/second with turbo on and HT enabled, LAMMPS June-2017
 - Compiled w/ Intel Parallel Studio 2017u2 and Makefile.intel_cpu_intelmpi

                       Xeon E5-2697v4     Xeon Phi 7250
in.intel.lj      -         199.5              282.3
in.intel.rhodo   -          12.4               17.5
in.intel.lc      -          19.0               25.7
in.intel.eam     -          59.4               92.8
in.intel.sw      -         132.4              161.9
in.intel.tersoff -          83.3              101.1
in.intel.water   -          53.4               90.3
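
To compare a measured rate against the expected values, a small helper can express it as a percentage of the reference. The sketch below is only an illustration: the function names `expected_tps` and `pct_of_expected` are hypothetical, the reference numbers are copied from the Xeon E5-2697v4 column of the table above, and the measurement of 180 timesteps/second is made up for the example.

```shell
#!/bin/sh
# expected_tps (hypothetical helper): print the expected timesteps/second
# for a benchmark, using the Xeon E5-2697v4 column of the table above.
expected_tps() {
    case "$1" in
        in.intel.lj)      echo 199.5 ;;
        in.intel.rhodo)   echo 12.4  ;;
        in.intel.lc)      echo 19.0  ;;
        in.intel.eam)     echo 59.4  ;;
        in.intel.sw)      echo 132.4 ;;
        in.intel.tersoff) echo 83.3  ;;
        in.intel.water)   echo 53.4  ;;
        *)                return 1   ;;   # unknown benchmark name
    esac
}

# pct_of_expected (hypothetical helper): print the measured rate as a
# percentage of the expected rate, rounded to one decimal place.
pct_of_expected() {
    ref=$(expected_tps "$1") || return 1
    awk -v m="$2" -v r="$ref" 'BEGIN { printf "%.1f\n", 100 * m / r }'
}

# Made-up measurement of 180 timesteps/second for the LJ benchmark.
pct_of_expected in.intel.lj 180
```

Results well below 100 on comparable hardware may point to differences in build flags, turbo, or hyperthreading settings relative to the configuration listed above.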

For Skylake server (Xeon) architectures, see notes in the USER-INTEL/README
for build flags that should be used.

For Haswell (Xeon v3) architectures, depending on the compiler version,
it may give better performance to compile for an AVX target (with -xAVX
compiler option) instead of -xHost or -xCORE-AVX2 for some of the
workloads. In most cases, FMA sensitive routines will still use AVX2
(MKL and SVML detect the processor at runtime). For Broadwell (Xeon v4)
architectures, -xCORE-AVX2 or -xHost will work best for all workloads.

With the default number of timesteps, each benchmark runs for between
30 seconds and 1 minute with the Intel package. You can specify a multiplier
for all of the benchmarks to increase or decrease the runtime. Example
commandline arguments:

-v m 2 # Run for twice as long
-v m 0.5 # Run for half as long
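
As a sketch of how the multiplier fits into a full command line, the snippet below prints (rather than runs) a USER-INTEL run at a chosen multiplier. The function name `run_scaled` is hypothetical, the fallback values for LMP_CORES, LMP_BIN, and bench are the example settings from this README, and the default of 1 assumes that a multiplier of 1 leaves the benchmark length unchanged.

```shell
#!/bin/sh
# Fallbacks mirror the example settings in this README; adjust for your system.
LMP_CORES=${LMP_CORES:-28}
LMP_BIN=${LMP_BIN:-../../lmp_intel_cpu}
bench=${bench:-in.intel.lj}

# run_scaled (hypothetical helper): print the mpirun line for a given
# runtime multiplier. Assumption: m=1 keeps the default benchmark length.
run_scaled() {
    echo "mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 -sf intel -v m ${1:-1}"
}

run_scaled 0.5   # half the default runtime
```

Printing the command first makes it easy to check the arguments before launching a long run; the printed line can then be pasted into the shell.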

The LAMMPS newton setting can be controlled from the commandline for the
benchmarks with the N variable:

-v N on # newton on
-v N off # newton off

The default is on for all of the benchmarks except for LJ where the off
setting performs best with the USER-INTEL package.
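
When scripting runs over several benchmarks, the defaults described above can be made explicit. This is a minimal sketch, not part of the benchmark inputs: `newton_default` is a hypothetical helper name, and it simply encodes the rule stated in this README (off for LJ, on for everything else).

```shell
#!/bin/sh
# newton_default (hypothetical helper): print the newton setting that this
# README describes as the default for a given benchmark input file.
newton_default() {
    case "$1" in
        in.intel.lj) echo off ;;   # LJ performs best with newton off
        *)           echo on  ;;   # all other benchmarks default to on
    esac
}

bench=${bench:-in.intel.lj}
# Print the argument pair to append to the LAMMPS command line.
echo "-v N $(newton_default "$bench")"
```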

Example for running benchmarks (see run_benchmarks.sh for script):

Number of physical cores per node not including hyperthreads

export LMP_CORES=28

If hyperthreading is enabled, number of hyperthreads to use per core
(2 for Xeon; 2 or 4 for Xeon Phi)

export OMP_NUM_THREADS=2

Name of the LAMMPS binary

export LMP_BIN=../../lmp_intel_cpu

LAMMPS root directory

export LMP_ROOT=../../../

source /opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh
export KMP_BLOCKTIME=0
export I_MPI_PIN_DOMAIN=core
export I_MPI_FABRICS=shm # For single node

ONLY FOR INTEL XEON PHI x200 SERIES PROCESSORS

export I_MPI_SHM_LMT=shm

Generate the restart file for use with liquid crystal benchmark

mpirun -np $LMP_CORES $LMP_BIN -in in.lc_generate_restart -log none

Benchmark to run

export bench=in.intel.lj

For Intel Xeon Phi x200 series processors best performance is achieved by
using MCDRAM. In flat mode, this can be achieved with numactl,
MPI environment variables, or other options provided by batch schedulers.
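
As one illustration of the numactl route, the sketch below prints a command line that asks for memory to be allocated preferentially on the MCDRAM NUMA node. It assumes that MCDRAM is exposed as NUMA node 1, which should be checked with `numactl -H` on the target system. `MCDRAM_NODE` and `mcdram_cmd` are hypothetical names, and the other fallback values are the example settings from this README.

```shell
#!/bin/sh
# Fallbacks mirror the example settings in this README; adjust for your system.
LMP_CORES=${LMP_CORES:-28}
LMP_BIN=${LMP_BIN:-../../lmp_intel_cpu}
bench=${bench:-in.intel.lj}

# Hypothetical variable. Assumption: MCDRAM is NUMA node 1 in flat mode;
# verify the node number with "numactl -H" before relying on it.
MCDRAM_NODE=${MCDRAM_NODE:-1}

# mcdram_cmd (hypothetical helper): print an mpirun line that wraps each
# rank in numactl. --preferred falls back to other nodes if the preferred
# node runs out of memory.
mcdram_cmd() {
    echo "mpirun -np $LMP_CORES numactl --preferred=$MCDRAM_NODE $LMP_BIN -in $bench -log none -pk intel 0 -sf intel"
}

mcdram_cmd
```

Using `--preferred` rather than a strict binding is a design choice for this sketch: a job that does not fit in MCDRAM still runs, at the cost of some allocations landing in regular memory.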

To run without an optimization package

mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -v N on

To run with USER-OMP package

mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk omp 0 -sf omp -v N on

To run with USER-INTEL package and no coprocessor

mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 -sf intel

To run with USER-INTEL and automatic load balancing to 1 coprocessor

mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 1 -sf intel

If using PPPM (e.g. in.intel.rhodo) on Intel Xeon Phi x200 series
or Skylake processors

export KMP_AFFINITY=none
rthreads=$((OMP_NUM_THREADS-1))
mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 omp $rthreads lrt yes -sf intel
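
The thread arithmetic above yields 0 when OMP_NUM_THREADS is 1. The sketch below prints the command line and guards that case. It rests on an assumption not stated in this README: that with a single thread per rank there is no spare thread to dedicate to the long-range calculation, so the plain USER-INTEL command line is used instead. `pppm_cmd` is a hypothetical helper name, and the fallback values are the example settings from this README.

```shell
#!/bin/sh
# Fallbacks mirror the example settings in this README; adjust for your system.
LMP_CORES=${LMP_CORES:-28}
LMP_BIN=${LMP_BIN:-../../lmp_intel_cpu}
bench=${bench:-in.intel.rhodo}
OMP_NUM_THREADS=${OMP_NUM_THREADS:-2}

# pppm_cmd (hypothetical helper): print the command line for a PPPM benchmark.
pppm_cmd() {
    if [ "$OMP_NUM_THREADS" -ge 2 ]; then
        # One thread per rank is set aside, so pass one fewer to "omp".
        rthreads=$(($OMP_NUM_THREADS - 1))
        echo "KMP_AFFINITY=none mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 omp $rthreads lrt yes -sf intel"
    else
        # Assumption: with one thread per rank, skip the lrt option.
        echo "mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 -sf intel"
    fi
}

pppm_cmd
```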
