diff --git a/doc/src/balance.txt b/doc/src/balance.txt index 4aa11a2d5..e90547956 100644 --- a/doc/src/balance.txt +++ b/doc/src/balance.txt @@ -1,379 +1,410 @@ "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line balance command :h3 [Syntax:] balance thresh style args ... keyword value ... :pre thresh = imbalance threshold that must be exceeded to perform a re-balance :ulb,l one style/arg pair can be used (or multiple for {x},{y},{z}) :l style = {x} or {y} or {z} or {shift} or {rcb} :l {x} args = {uniform} or Px-1 numbers between 0 and 1 {uniform} = evenly spaced cuts between processors in x dimension numbers = Px-1 ascending values between 0 and 1, Px = # of processors in x dimension {x} can be specified together with {y} or {z} {y} args = {uniform} or Py-1 numbers between 0 and 1 {uniform} = evenly spaced cuts between processors in y dimension numbers = Py-1 ascending values between 0 and 1, Py = # of processors in y dimension {y} can be specified together with {x} or {z} {z} args = {uniform} or Pz-1 numbers between 0 and 1 {uniform} = evenly spaced cuts between processors in z dimension numbers = Pz-1 ascending values between 0 and 1, Pz = # of processors in z dimension {z} can be specified together with {x} or {y} {shift} args = dimstr Niter stopthresh dimstr = sequence of letters containing "x" or "y" or "z", each not more than once Niter = # of times to iterate within each dimension of dimstr sequence stopthresh = stop balancing when this imbalance threshold is reached {rcb} args = none :pre -zero or more keyword/value pairs may be appended :l -keyword = {out} or {clock} or {group} :l - {out} value = filename +zero or more optional keywords with their respective arguments may be appended :l +keyword = {out} or {weight} :l + {out} arg = filename filename = write each processor's sub-domain to a file - {clock} value = weight - weight = weight factor between 0 and 1 for including wall clock data - {group} args = Ngroup groupID-1 weight-1 groupID-2 weight-2... - Ngroup = number of groups with assigned weights - groupID-1, groupID-2, ... = group names - weight-1, weight-2, ... = corresponding weight factors :pre + {weight} style args = use weighted atom counts to compute the per-processor load + {style} = {group} or {neigh} or {time} or {var} or {store} + {group} args = Ngroup groupID-1 weight-1 groupID-2 weight-2... + Ngroup = number of groups with assigned weights + groupID-1, groupID-2, ... = group names + weight-1, weight-2, ...
= corresponding weight factors + {neigh} factor = compute weight based on number of neighbors + factor = scaling factor (> 0) + {time} factor = compute weight based on time spent computing + factor = scaling factor (> 0) + {var} name = take weight from atom-style variable + name = name of the atom-style variable + {store} name = store weight in custom atom property + name = name of the atom property (without d_ prefix) :pre :ule [Examples:] balance 0.9 x uniform y 0.4 0.5 0.6 balance 1.2 shift xz 5 1.1 balance 1.0 shift xz 5 1.1 balance 1.1 rcb -balance 1.0 shift x 10 1.1 group 2 fast 0.5 slow 2.0 -balance 1.0 shift x 10 1.1 clock 0.8 +balance 1.0 shift x 10 1.1 weight group 2 fast 0.5 slow 2.0 +balance 1.0 shift x 10 1.1 weight time 0.8 weight neigh 0.5 weight store balance balance 1.0 shift x 20 1.0 out tmp.balance :pre [Description:] This command adjusts the size and shape of processor sub-domains within the simulation box, to attempt to balance the number of particles and thus indirectly the computational cost (load) more evenly across processors. The load balancing is "static" in the sense that this command performs the balancing once, before or between simulations. The processor sub-domains will then remain static during the subsequent run. To perform "dynamic" balancing, see the "fix balance"_fix_balance.html command, which can adjust processor sub-domain sizes and shapes on-the-fly during a "run"_run.html. -With the optional {group} keyword different weight factors can be -assigned to particles according to which groups they belong to. -The {group} keyword is followed by the number of groups with -custom weights (default weight is 1.0) and pairs of group ID and -the corresponding weight factor. The final weight for each atom -is the product of all weight factors from the groups it belongs to. +With the optional {weight} keyword different weight factors can be +assigned to particles according to several styles, and balancing will +be performed on the weighted particle counts. Multiple weight +styles may be given; they are processed in order by multiplying +the existing weight factor, which defaults to 1.0, with the newly +computed weight factor. The {store} weight style is an exception, +as it does not compute a weight, but instead stores the current +accumulated weights in a custom per-atom property defined with +"fix property/atom"_fix_property_atom.html. + +The {group} weight style assigns fixed weight factors according +to the group(s) atoms belong to. The {group} style keyword is +followed by the number of groups with custom weights +(default weight is 1.0) and pairs of group ID and the corresponding +weight factor. The final weight for each atom is the product of +all individual weight factors from the groups it belongs to. An atom with a total weight of 5 will then be considered to have 5x the computational cost of an atom with the default weight of 1.0. -With the optional {clock} keyword, "timer data"_timer.html is -incorporated into the load balancing cost function. The required -weight factor argument (a number between 0 and 1) determines to -which degree timing information is included. The timer information -is taken from the preceding run. If no such information is -available, e.g. at the beginning of an input, of when the -"timer"_timer.html level is set to either {loop} or {off}, -the clock keyword has no effect. +The {neigh} weight style assigns weights computed from the number +of neighbors divided by the average number of neighbors. The +scaling factor argument determines the relative impact of this +weight factor. This weight style will use the first suitable neighbor +list that is internally available; it will be inactive and print a +warning if no suitable list is available. This is typically +the case before the first "run"_run.html or "minimize"_minimize.html +command is issued. + +The {time} weight style allows incorporating "timer data"_timer.html +into the load balancing cost function. The required weight factor +argument (a number > 0) determines to which degree timing information +is included. The timer information is taken from the preceding run. +If no such information is available, e.g. at the beginning of an input, +or when the "timer"_timer.html level is set to either {loop} or {off}, +this style is ignored. + +The {var} weight style sets per-atom weights in the load balancing +cost function from an atom-style "variable"_variable.html.
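For illustration, here is a minimal sketch combining several of these
weight styles; the variable name {myweight}, the property name
{wfinal}, the group name {solvent}, and the assumption that type 2
atoms cost about twice as much are all hypothetical:

fix 1 all property/atom d_wfinal          # hypothetical property name
variable myweight atom 1.0+(type==2)      # hypothetical cost model
balance 1.1 shift xy 10 1.05 weight var myweight weight group 1 solvent 0.8 weight store wfinal :pre

The final per-atom weight is the product of the individual factors;
the trailing {store} style records it in the custom property d_wfinal,
where it can be inspected, e.g. with the "dump custom"_dump.html
command.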
Load-balancing is typically most useful if the particles in the simulation box have a spatially-varying density distribution or if the computational cost varies significantly between different atoms. E.g. a model of a vapor/liquid interface, or a solid with an irregular-shaped geometry containing void regions, or "hybrid pair style simulations"_pair_hybrid.html which combine pair styles with different computational cost. In these cases, the LAMMPS default of dividing the simulation box volume into a regular-spaced grid of 3d bricks, with one equal-volume sub-domain per processor, may assign numbers of particles per processor in a way that the computational effort varies significantly. This can lead to poor performance when the simulation is run in parallel. Note that the "processors"_processors.html command allows some control over how the box volume is split across processors. Specifically, for a Px by Py by Pz grid of processors, it allows choice of Px, Py, and Pz, subject to the constraint that Px * Py * Pz = P, the total number of processors. This is sufficient to achieve good load-balance for some problems on some processor counts. However, all the processor sub-domains will still have the same shape and same volume. The requested load-balancing operation is only performed if the current "imbalance factor" in particles owned by each processor exceeds the specified {thresh} parameter. The imbalance factor is defined as the maximum number of particles owned by any processor, divided by the average number of particles per processor. Thus an imbalance factor of 1.0 is perfect balance. As an example, for 10000 particles running on 10 processors, if the most heavily loaded processor has 1200 particles, then the factor is 1.2, meaning there is a 20% imbalance. Note that a re-balance can be forced even if the current balance is perfect (1.0) by specifying a {thresh} < 1.0. NOTE: Balancing is performed even if the imbalance factor does not exceed the {thresh} parameter if a "grid" style is specified when the current partitioning is "tiled". The meaning of "grid" vs "tiled" is explained below. This is to allow forcing of the partitioning to "grid" so that the "comm_style brick"_comm_style.html command can then be used to replace a current "comm_style tiled"_comm_style.html setting. When the balance command completes, it prints statistics about the result, including the change in the imbalance factor and the change in the maximum number of particles on any processor.
For "grid" methods (defined below) that create a logical 3d grid of processors, the positions of all cutting planes in each of the 3 dimensions (as fractions of the box length) are also printed. NOTE: This command attempts to minimize the imbalance factor, as defined above. But depending on the method a perfect balance (1.0) may not be achieved. For example, "grid" methods (defined below) that create a logical 3d grid cannot achieve perfect balance for many irregular distributions of particles. Likewise, if a portion of the system is a perfect lattice, e.g. the initial system is generated by the "create_atoms"_create_atoms.html command, then "grid" methods may be unable to achieve exact balance. This is because entire lattice planes will be owned or not owned by a single processor. NOTE: The imbalance factor is also an estimate of the maximum speed-up you can hope to achieve by running a perfectly balanced simulation versus an imbalanced one. In the example above, the 10000 particle simulation could run up to 20% faster if it were perfectly balanced, versus when imbalanced. However, computational cost is not strictly proportional to particle count, and changing the relative size and shape of processor sub-domains may lead to additional computational and communication overheads, e.g. in the PPPM solver used via the "kspace_style"_kspace_style.html command. Thus you should benchmark the run times of a simulation before and after balancing. :line The method used to perform a load balance is specified by one of the listed styles (or more in the case of {x},{y},{z}), which are described in detail below. There are 2 kinds of styles. The {x}, {y}, {z}, and {shift} styles are "grid" methods which produce a logical 3d grid of processors. They operate by changing the cutting planes (or lines) between processors in 3d (or 2d), to adjust the volume (area in 2d) assigned to each processor, as in the following 2d diagram where processor sub-domains are shown and atoms are colored by the processor that owns them. The leftmost diagram is the default partitioning of the simulation box across processors (one sub-box for each of 16 processors); the middle diagram is after a "grid" method has been applied. :c,image(JPG/balance_uniform_small.jpg,JPG/balance_uniform.jpg),image(JPG/balance_nonuniform_small.jpg,JPG/balance_nonuniform.jpg),image(JPG/balance_rcb_small.jpg,JPG/balance_rcb.jpg) The {rcb} style is a "tiling" method which does not produce a logical 3d grid of processors. Rather it tiles the simulation domain with rectangular sub-boxes of varying size and shape in an irregular -fashion so as to have equal numbers of particles in each sub-box, as -in the rightmost diagram above. +fashion so as to have equal numbers of particles (or an equal +load in case weighted load-balancing is requested) in each sub-box, +as in the rightmost diagram above. The "grid" methods can be used with either of the "comm_style"_comm_style.html command options, {brick} or {tiled}. The "tiling" methods can only be used with "comm_style tiled"_comm_style.html. Note that it can be useful to use a "grid" method with "comm_style tiled"_comm_style.html to return the domain partitioning to a logical 3d grid of processors so that "comm_style brick" can afterwords be specified for subsequent "run"_run.html commands. When a "grid" method is specified, the current domain partitioning can be either a logical 3d grid or a tiled partitioning. 
When a "grid" method is specified, the current domain partitioning can be either a logical 3d grid or a tiled partitioning. In the former case, the current logical 3d grid is used as a starting point and changes are made to improve the imbalance factor. In the latter case, the tiled partitioning is discarded and a logical 3d grid is created with uniform spacing in all dimensions. This becomes the starting point for the balancing operation. When a "tiling" method is specified, the current domain partitioning ("grid" or "tiled") is ignored, and a new partitioning is computed from scratch. :line The {x}, {y}, and {z} styles invoke a "grid" method for balancing, as described above. Note that any or all of these 3 styles can be specified together, one after the other, but they cannot be used with any other style. This style adjusts the position of cutting planes between processor sub-domains in specific dimensions. Only the specified dimensions are altered. The {uniform} argument spaces the planes evenly, as in the left diagrams above. The {numeric} argument requires listing Ps-1 numbers that specify the position of the cutting planes. This requires knowing Ps = Px or Py or Pz = the number of processors assigned by LAMMPS to the relevant dimension. This assignment is made (and the Px, Py, Pz values printed out) when the simulation box is created by the "create_box" or "read_data" or "read_restart" command and is influenced by the settings of the "processors"_processors.html command. Each of the numeric values must be between 0 and 1, and they must be listed in ascending order. They represent the fractional position of the cutting plane. The left (or lower) edge of the box is 0.0, and the right (or upper) edge is 1.0. Neither of these values is specified. Only the interior Ps-1 positions are specified. Thus if there are 2 processors in the x dimension, you specify a single value such as 0.75, which would make the left processor's sub-domain 3x larger than the right processor's sub-domain.
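For example, with Px = 4 processors in the x dimension, the following
two commands (a minimal sketch, using a threshold of 1.0 as in the
examples above) produce the same evenly spaced cuts:

balance 1.0 x uniform
balance 1.0 x 0.25 0.5 0.75 :pre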
:line The {shift} style invokes a "grid" method for balancing, as described above. It changes the positions of cutting planes between processors in an iterative fashion, seeking to reduce the imbalance factor, similar to how the "fix balance shift"_fix_balance.html command operates. The {dimstr} argument is a string of characters, each of which must be an "x" or "y" or "z". Each character can appear zero or one time, since there is no advantage to balancing on a dimension more than once. You should normally only list dimensions where you expect there to be a density variation in the particles. Balancing proceeds by adjusting the cutting planes in each of the dimensions listed in {dimstr}, one dimension at a time. For a single dimension, the balancing operation (described below) is iterated on up to {Niter} times. After each dimension finishes, the imbalance factor is re-computed, and the balancing operation halts if the {stopthresh} criterion is met. A rebalance operation in a single dimension is performed using a recursive multisectioning algorithm, where the position of each cutting plane (line in 2d) in the dimension is adjusted independently. This is similar to a recursive bisectioning for a single value, except that the bounds used for each bisectioning take advantage of information from neighboring cuts if possible. At each iteration, the count of particles on either side of each plane is tallied. If the counts do not match the target value for the plane, the position of the cut is adjusted to be halfway between a low and high bound. The low and high bounds are adjusted on each iteration, using new count information, so that they become closer together over time. Thus as the recursion progresses, the count of particles on either side of the plane gets closer to the target value. Once the rebalancing is complete and final processor sub-domains assigned, particles are migrated to their new owning processor, and the balance procedure ends. NOTE: At each rebalance operation, the bisectioning for each cutting plane (line in 2d) typically starts with low and high bounds separated by the extent of a processor's sub-domain in one dimension. The size of this bracketing region shrinks by 1/2 every iteration. Thus if {Niter} is specified as 10, the cutting plane will typically be positioned to 1 part in 1000 accuracy (relative to the perfect target position). For {Niter} = 20, it will be accurate to 1 part in a million. Thus there is no need to set {Niter} to a large value. LAMMPS will check if the threshold accuracy is reached (in a dimension) in fewer iterations than {Niter} and exit early. However, {Niter} should also not be set too small, since it will take roughly the same number of iterations to converge even if the cutting plane is initially close to the target value. :line The {rcb} style invokes a "tiled" method for balancing, as described above. It performs a recursive coordinate bisectioning (RCB) of the simulation domain. The basic idea is as follows. The simulation domain is cut into 2 boxes by an axis-aligned cut in the longest dimension, leaving one new box on either side of the cut. All the processors are also partitioned into 2 groups, half assigned to the box on the lower side of the cut, and half to the box on the upper side. (If the processor count is odd, one side gets an extra processor.) The cut is positioned so that the number of atoms in the lower box is exactly the number that the processors assigned to that box should own for load balance to be perfect. This also makes load balance for the upper box perfect. The positioning is done iteratively, by a bisectioning method. Note that counting atoms on either side of the cut requires communication between all processors at each iteration. That is the procedure for the first cut. Subsequent cuts are made recursively, in exactly the same manner. The subset of processors assigned to each box make a new cut in the longest dimension of that box, splitting the box, the subset of processors, and the atoms in the box in two. The recursion continues until every processor is assigned a sub-box of the entire simulation domain, and owns the atoms in that sub-box.
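Since {rcb} is a "tiling" method, it requires tiled communication;
a minimal sketch:

comm_style tiled
balance 1.1 rcb :pre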
:line The {out} keyword writes a text file to the specified {filename} with the results of the balancing operation. The file contains the bounds of the sub-domain for each processor after the balancing operation completes. The format of the file is compatible with the "Pizza.py"_pizza {mdump} tool which has support for manipulating and visualizing mesh files. An example is shown here for a balancing by 4 processors for a 2d problem: ITEM: TIMESTEP 0 ITEM: NUMBER OF NODES 16 ITEM: BOX BOUNDS 0 10 0 10 0 10 ITEM: NODES 1 1 0 0 0 2 1 5 0 0 3 1 5 5 0 4 1 0 5 0 5 1 5 0 0 6 1 10 0 0 7 1 10 5 0 8 1 5 5 0 9 1 0 5 0 10 1 5 5 0 11 1 5 10 0 12 1 10 5 0 13 1 5 5 0 14 1 10 5 0 15 1 10 10 0 16 1 5 10 0 ITEM: TIMESTEP 0 ITEM: NUMBER OF SQUARES 4 ITEM: SQUARES 1 1 1 2 3 4 2 1 5 6 7 8 3 1 9 10 11 12 4 1 13 14 15 16 :pre The coordinates of all the vertices are listed in the NODES section, 4 per processor. Note that the 4 sub-domains share vertices, so there will be duplicate nodes in the list. The "SQUARES" section lists the node IDs of the 4 vertices in a rectangle for each processor (1 to 4). For a 3d problem, the syntax is similar with 8 vertices listed for each processor, instead of 4, and "SQUARES" replaced by "CUBES". :line [Restrictions:] For 2d simulations, the {z} style cannot be used. Nor can a "z" appear in {dimstr} for the {shift} style. [Related commands:] "group"_group.html, "processors"_processors.html, "fix balance"_fix_balance.html [Default:] none diff --git a/doc/src/fix_balance.txt b/doc/src/fix_balance.txt index 8502ce753..7318c2b1c 100644 --- a/doc/src/fix_balance.txt +++ b/doc/src/fix_balance.txt @@ -1,375 +1,406 @@ "LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c :link(lws,http://lammps.sandia.gov) :link(ld,Manual.html) :link(lc,Section_commands.html#comm) :line fix balance command :h3 [Syntax:] fix ID group-ID balance Nfreq thresh style args keyword value ... :pre ID, group-ID are documented in "fix"_fix.html command :ulb,l balance = style name of this fix command :l Nfreq = perform dynamic load balancing every this many steps :l thresh = imbalance threshold that must be exceeded to perform a re-balance :l style = {shift} or {rcb} :l shift args = dimstr Niter stopthresh dimstr = sequence of letters containing "x" or "y" or "z", each not more than once Niter = # of times to iterate within each dimension of dimstr sequence stopthresh = stop balancing when this imbalance threshold is reached - rcb args = none :pre -zero or more keyword/value pairs may be appended :l -keyword = {out} or {clock} or {group} :l - {out} value = filename - filename = write each processor's sub-domain to a file, at each re-balancing - {clock} value = weight - weight = weight factor between 0 and 1 for including wall clock data - {group} args = Ngroup groupID-1 weight-1 groupID-2 weight-2... - Ngroup = number of groups with assigned weights - groupID-1, groupID-2, ... = group names - weight-1, weight-2, ... = corresponding weights :pre + {rcb} args = none :pre +zero or more optional keywords with their respective arguments may be appended :l +keyword = {out} or {weight} :l + {out} arg = filename + filename = write each processor's sub-domain to a file + {weight} style args = use weighted atom counts to compute the per-processor load + {style} = {group} or {neigh} or {time} or {var} or {store} + {group} args = Ngroup groupID-1 weight-1 groupID-2 weight-2... + Ngroup = number of groups with assigned weights + groupID-1, groupID-2, ... = group names + weight-1, weight-2, ...
= corresponding weight factors + {neigh} factor = compute weight based on number of neighbors + factor = scaling factor (> 0) + {time} factor = compute weight based on time spent computing + factor = scaling factor (> 0) + {var} name = take weight from atom-style variable + name = name of the atom-style variable + {store} name = store weight in custom atom property + name = name of the atom property (without d_ prefix) :pre :ule [Examples:] fix 2 all balance 1000 1.05 shift x 10 1.05 fix 2 all balance 100 0.9 shift xy 20 1.1 out tmp.balance -fix 2 all balance 100 0.9 shift xy 20 1.1 group 3 substrate 3.0 solvent 1.0 solute 0.8 out tmp.balance -fix 2 all balance 100 1.0 shift x 10 1.1 clock 0.8 -fix 2 all balance 100 1.0 shift xy 5 1.1 group 2 fast 0.8 slow 2.0 clock 0.8 +fix 2 all balance 100 0.9 shift xy 20 1.1 weight group 3 substrate 3.0 solvent 1.0 solute 0.8 out tmp.balance +fix 2 all balance 100 1.0 shift x 10 1.1 weight time 0.8 +fix 2 all balance 100 1.0 shift xy 5 1.1 weight var myweight weight neigh 0.6 weight store allweight fix 2 all balance 1000 1.1 rcb :pre [Description:] This command adjusts the size and shape of processor sub-domains within the simulation box, to attempt to balance the number of particles and thus the computational cost (load) evenly across processors. The load balancing is "dynamic" in the sense that rebalancing is performed periodically during the simulation. To perform "static" balancing, before or between runs, see the "balance"_balance.html command. -With the optional {group} keyword different weight factors can be -assigned to particles according to which groups they belong to. -The {group} keyword is followed by the number of groups with -custom weights (default weight is 1.0) and pairs of group ID and -the corresponding weight factor. The final weight for each atom -is the product of all weight factors from the groups it belongs to. +With the optional {weight} keyword different weight factors can be +assigned to particles according to several styles, and balancing will +be performed on the weighted particle counts. Multiple weight +styles may be given; they are processed in order by multiplying +the existing weight factor, which defaults to 1.0, with the newly +computed weight factor. The {store} weight style is an exception, +as it does not compute a weight, but instead stores the current +accumulated weights in a custom per-atom property defined with +"fix property/atom"_fix_property_atom.html. + +The {group} weight style assigns fixed weight factors according +to the group(s) atoms belong to. The {group} style keyword is +followed by the number of groups with custom weights +(default weight is 1.0) and pairs of group ID and the corresponding +weight factor. The final weight for each atom is the product of +all individual weight factors from the groups it belongs to. An atom with a total weight of 5 will then be considered to have 5x the computational cost of an atom with the default weight of 1.0. -With the optional {clock} keyword, "timer data"_timer.html is -incorporated into the load balancing cost function. The required -weight factor argument (a number between 0 and 1) determines to -which degree timing information is included. The timer information -is taken from the preceding run. If no such information is -available, e.g. at the beginning of an input, of when the -"timer"_timer.html level is set to either {loop} or {off}, -the clock keyword has no effect. +The {neigh} weight style assigns weights computed from the number +of neighbors divided by the average number of neighbors. The +scaling factor argument determines the relative impact of this +weight factor. This weight style will use the first suitable neighbor +list that is internally available; it will be inactive and print a +warning if no suitable list is available. This is typically +the case before the first "run"_run.html or "minimize"_minimize.html +command is issued. + +The {time} weight style allows incorporating "timer data"_timer.html +into the load balancing cost function. The required weight factor +argument (a number > 0) determines to which degree timing information +is included. The timer information is taken from the preceding run. +If no such information is available, e.g. at the beginning of an input, +or when the "timer"_timer.html level is set to either {loop} or {off}, +this style is ignored. + +The {var} weight style sets per-atom weights in the load balancing +cost function from an atom-style "variable"_variable.html.
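As a minimal sketch (the fix IDs, the scaling factor of 0.8, and the
property name {wbal} are hypothetical), dynamic balancing with
neighbor-count weights that are also stored for later inspection
could look like this:

fix 1 all property/atom d_wbal                # hypothetical property name
fix 2 all balance 1000 1.1 shift xy 10 1.05 weight neigh 0.8 weight store wbal :pre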
Load-balancing is typically most useful if the particles in the simulation box have a spatially-varying density distribution or if the computational cost varies significantly between different atoms. E.g. a model of a vapor/liquid interface, or a solid with an irregular-shaped geometry containing void regions, or "hybrid pair style simulations"_pair_hybrid.html which combine pair styles with different computational cost. In these cases, the LAMMPS default of dividing the simulation box volume into a regular-spaced grid of 3d bricks, with one equal-volume sub-domain per processor, may assign numbers of particles per processor in a way that the computational effort varies significantly. This can lead to poor performance when the simulation is run in parallel. Note that the "processors"_processors.html command allows some control over how the box volume is split across processors. Specifically, for a Px by Py by Pz grid of processors, it allows choice of Px, Py, and Pz, subject to the constraint that Px * Py * Pz = P, the total number of processors. This is sufficient to achieve good load-balance for some problems on some processor counts. However, all the processor sub-domains will still have the same shape and same volume. On a particular timestep, a load-balancing operation is only performed if the current "imbalance factor" in particles owned by each processor exceeds the specified {thresh} parameter. The imbalance factor is defined as the maximum number of particles owned by any processor, divided by the average number of particles per processor. Thus an imbalance factor of 1.0 is perfect balance. As an example, for 10000 particles running on 10 processors, if the most heavily loaded processor has 1200 particles, then the factor is 1.2, meaning there is a 20% imbalance. Note that re-balances can be forced even if the current balance is perfect (1.0) by specifying a {thresh} < 1.0. NOTE: This command attempts to minimize the imbalance factor, as defined above. But depending on the method a perfect balance (1.0) may not be achieved. For example, "grid" methods (defined below) that create a logical 3d grid cannot achieve perfect balance for many irregular distributions of particles. Likewise, if a portion of the system is a perfect lattice, e.g. the initial system is generated by the "create_atoms"_create_atoms.html command, then "grid" methods may be unable to achieve exact balance.
This is because entire lattice planes will be owned or not owned by a single processor. NOTE: The imbalance factor is also an estimate of the maximum speed-up you can hope to achieve by running a perfectly balanced simulation versus an imbalanced one. In the example above, the 10000 particle simulation could run up to 20% faster if it were perfectly balanced, versus when imbalanced. However, computational cost is not strictly proportional to particle count, and changing the relative size and shape of processor sub-domains may lead to additional computational and communication overheads, e.g. in the PPPM solver used via the "kspace_style"_kspace_style.html command. Thus you should benchmark the run times of a simulation before and after balancing. :line The method used to perform a load balance is specified by one of the listed styles, which are described in detail below. There are 2 kinds of styles. The {shift} style is a "grid" method which produces a logical 3d grid of processors. It operates by changing the cutting planes (or lines) between processors in 3d (or 2d), to adjust the volume (area in 2d) assigned to each processor, as in the following 2d diagram where processor sub-domains are shown and atoms are colored by the processor that owns them. The leftmost diagram is the default partitioning of the simulation box across processors (one sub-box for each of 16 processors); the middle diagram is after a "grid" method has been applied. :c,image(JPG/balance_uniform_small.jpg,JPG/balance_uniform.jpg),image(JPG/balance_nonuniform_small.jpg,JPG/balance_nonuniform.jpg),image(JPG/balance_rcb_small.jpg,JPG/balance_rcb.jpg) The {rcb} style is a "tiling" method which does not produce a logical 3d grid of processors. Rather it tiles the simulation domain with rectangular sub-boxes of varying size and shape in an irregular -fashion so as to have equal numbers of particles in each sub-box, as -in the rightmost diagram above. +fashion so as to have equal numbers of particles (or an equal +load in case weighted load-balancing is requested) in each sub-box, +as in the rightmost diagram above. The "grid" methods can be used with either of the "comm_style"_comm_style.html command options, {brick} or {tiled}. The "tiling" methods can only be used with "comm_style tiled"_comm_style.html. When a "grid" method is specified, the current domain partitioning can be either a logical 3d grid or a tiled partitioning. In the former case, the current logical 3d grid is used as a starting point and changes are made to improve the imbalance factor. In the latter case, the tiled partitioning is discarded and a logical 3d grid is created with uniform spacing in all dimensions. This is the starting point for the balancing operation. When a "tiling" method is specified, the current domain partitioning ("grid" or "tiled") is ignored, and a new partitioning is computed from scratch. :line The {group-ID} is currently ignored. Load-balancing will always affect all atoms. However, the different impact of different groups of atoms in -a simulation can be considered through the {group} keyword and assigning -different weight factors != 1.0 to atoms in these groups. +a simulation can be considered through the {group} weight style and +assigning different weight factors != 1.0 to atoms in these groups. The {Nfreq} setting determines how often a rebalance is performed. If {Nfreq} > 0, then rebalancing will occur every {Nfreq} steps. Each time a rebalance occurs, a reneighboring is triggered, so {Nfreq} should not be too small. If {Nfreq} = 0, then rebalancing will be done every time reneighboring normally occurs, as determined by the "neighbor"_neighbor.html and "neigh_modify"_neigh_modify.html command settings. On rebalance steps, rebalancing will only be attempted if the current imbalance factor, as defined above, exceeds the {thresh} setting.
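To sketch the two modes (the fix ID and thresholds are arbitrary,
and only one such fix would be defined at a time):

fix lb all balance 1000 1.05 shift x 10 1.05  # check every 1000 steps
fix lb all balance 0 1.1 rcb                  # check at every reneighboring :pre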
:line The {shift} style invokes a "grid" method for balancing, as described above. It changes the positions of cutting planes between processors in an iterative fashion, seeking to reduce the imbalance factor. The {dimstr} argument is a string of characters, each of which must be an "x" or "y" or "z". Each character can appear zero or one time, since there is no advantage to balancing on a dimension more than once. You should normally only list dimensions where you expect there to be a density variation in the particles. Balancing proceeds by adjusting the cutting planes in each of the dimensions listed in {dimstr}, one dimension at a time. For a single dimension, the balancing operation (described below) is iterated on up to {Niter} times. After each dimension finishes, the imbalance factor is re-computed, and the balancing operation halts if the {stopthresh} criterion is met. A rebalance operation in a single dimension is performed using a density-dependent recursive multisectioning algorithm, where the position of each cutting plane (line in 2d) in the dimension is adjusted independently. This is similar to a recursive bisectioning for a single value, except that the bounds used for each bisectioning take advantage of information from neighboring cuts if possible, as well as counts of particles at the bounds on either side of each cut, which themselves were cuts in previous iterations. The latter is used to infer a density of particles near each of the current cuts. At each iteration, the count of particles on either side of each plane is tallied. If the counts do not match the target value for the plane, the position of the cut is adjusted based on the local density. The low and high bounds are adjusted on each iteration, using new count information, so that they become closer together over time. Thus as the recursion progresses, the count of particles on either side of the plane gets closer to the target value. The density-dependent part of this algorithm is often an advantage when you rebalance a system that is already nearly balanced. It typically converges more quickly than the geometric bisectioning algorithm used by the "balance"_balance.html command. However, it can be a disadvantage if you attempt to rebalance a system that is far from balanced, where it may converge more slowly. In this case you probably want to use the "balance"_balance.html command before starting a run, so that you begin the run with a balanced system. Once the rebalancing is complete and final processor sub-domains assigned, particles migrate to their new owning processor as part of the normal reneighboring procedure. NOTE: At each rebalance operation, the bisectioning for each cutting plane (line in 2d) typically starts with low and high bounds separated by the extent of a processor's sub-domain in one dimension. The size of this bracketing region shrinks based on the local density, as described above, which should typically be 1/2 or more every iteration. Thus if {Niter} is specified as 10, the cutting plane will typically be positioned to better than 1 part in 1000 accuracy (relative to the perfect target position).
For {Niter} = 20, it will be accurate to better than 1 part in a million. Thus there is no need to set {Niter} to a large value. This is especially true if you are rebalancing often enough that each time you expect that only an incremental adjustment of the cutting planes is necessary. LAMMPS will check if the threshold accuracy is reached (in a dimension) in fewer iterations than {Niter} and exit early. :line The {rcb} style invokes a "tiled" method for balancing, as described above. It performs a recursive coordinate bisectioning (RCB) of the simulation domain. The basic idea is as follows. The simulation domain is cut into 2 boxes by an axis-aligned cut in the longest dimension, leaving one new box on either side of the cut. All the processors are also partitioned into 2 groups, half assigned to the box on the lower side of the cut, and half to the box on the upper side. (If the processor count is odd, one side gets an extra processor.) The cut is positioned so that the number of atoms in the lower box is exactly the number that the processors assigned to that box should own for load balance to be perfect. This also makes load balance for the upper box perfect. The positioning is done iteratively, by a bisectioning method. Note that counting atoms on either side of the cut requires communication between all processors at each iteration. That is the procedure for the first cut. Subsequent cuts are made recursively, in exactly the same manner. The subset of processors assigned to each box make a new cut in the longest dimension of that box, splitting the box, the subset of processors, and the atoms in the box in two. The recursion continues until every processor is assigned a sub-box of the entire simulation domain, and owns the atoms in that sub-box. :line The {out} keyword writes a text file to the specified {filename} with the results of each rebalancing operation. The file contains the bounds of the sub-domain for each processor after the balancing operation completes. The format of the file is compatible with the "Pizza.py"_pizza {mdump} tool which has support for manipulating and visualizing mesh files. An example is shown here for a balancing by 4 processors for a 2d problem: ITEM: TIMESTEP 0 ITEM: NUMBER OF NODES 16 ITEM: BOX BOUNDS 0 10 0 10 0 10 ITEM: NODES 1 1 0 0 0 2 1 5 0 0 3 1 5 5 0 4 1 0 5 0 5 1 5 0 0 6 1 10 0 0 7 1 10 5 0 8 1 5 5 0 9 1 0 5 0 10 1 5 5 0 11 1 5 10 0 12 1 10 5 0 13 1 5 5 0 14 1 10 5 0 15 1 10 10 0 16 1 5 10 0 ITEM: TIMESTEP 0 ITEM: NUMBER OF SQUARES 4 ITEM: SQUARES 1 1 1 2 3 4 2 1 5 6 7 8 3 1 9 10 11 12 4 1 13 14 15 16 :pre The coordinates of all the vertices are listed in the NODES section, 4 per processor. Note that the 4 sub-domains share vertices, so there will be duplicate nodes in the list. The "SQUARES" section lists the node IDs of the 4 vertices in a rectangle for each processor (1 to 4). For a 3d problem, the syntax is similar with 8 vertices listed for each processor, instead of 4, and "SQUARES" replaced by "CUBES". :line [Restart, fix_modify, output, run start/stop, minimize info:] No information about this fix is written to "binary restart files"_restart.html. None of the "fix_modify"_fix_modify.html options are relevant to this fix. This fix computes a global scalar which is the imbalance factor after the most recent rebalance and a global vector of length 3 with additional information about the most recent rebalancing. The 3 values in the vector are as follows: 1 = max # of particles per processor 2 = total # iterations performed in last rebalance 3 = imbalance factor right before the last rebalance was performed :ul As explained above, the imbalance factor is the ratio of the maximum number of particles (or total weight) on any processor to the average number of particles (or total weight) per processor. These quantities can be accessed by various "output commands"_Section_howto.html#howto_15. The scalar and vector values calculated by this fix are "intensive".
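As a sketch of accessing these values (the fix and variable IDs are
arbitrary), the global scalar can be monitored with thermodynamic
output:

fix lb all balance 1000 1.05 shift xy 10 1.05
variable imb equal f_lb                       # imbalance after last rebalance
thermo_style custom step atoms v_imb :pre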
No parameter of this fix can be used with the {start/stop} keywords of the "run"_run.html command. This fix is not invoked during "energy minimization"_minimize.html. :line [Restrictions:] For 2d simulations, a "z" cannot appear in {dimstr} for the {shift} style. [Related commands:] "group"_group.html, "processors"_processors.html, "balance"_balance.html [Default:] none