diff --git a/doc/sphinx/source/index.rst b/doc/sphinx/source/index.rst
index 3d83d1f5..530a1e31 100644
--- a/doc/sphinx/source/index.rst
+++ b/doc/sphinx/source/index.rst
@@ -1,51 +1,52 @@
.. Tamaas documentation master file, created by
   sphinx-quickstart on Mon Apr 29 18:08:33 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Tamaas --- A high-performance library for periodic rough surface contact
=========================================================================

.. image:: https://img.shields.io/badge/code-gitlab.com%2Ftamaas-9cf.svg
   :target: https://gitlab.com/tamaas/tamaas
.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3479236.svg
   :target: https://doi.org/10.5281/zenodo.3479236
.. image:: https://joss.theoj.org/papers/86903c51f3c66964eef7776d8aeaf17d/status.svg
   :target: https://joss.theoj.org/papers/86903c51f3c66964eef7776d8aeaf17d
.. image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gl/tamaas%2Ftutorials/HEAD

Tamaas (from تماس meaning "contact" in Arabic and Farsi) is a high-performance
periodic rough-surface contact code based on boundary and volume integral
equations. The clever mathematical formulation of the underlying numerical
methods (see e.g. :doi:`10.1007/s00466-017-1392-5` and :arxiv:`1811.11558`)
allows the use of the fast Fourier transform, a great help in achieving peak
performance: Tamaas is consistently :doc:`two orders of magnitude faster
<performance>` (and lighter) than traditional FEM!

Tamaas is aimed at researchers and practitioners wishing to compute realistic
contact solutions for the study of interface phenomena. You can see Tamaas in
action with our interactive :ref:`tutorials <sect-tutorials>`.

.. toctree::
   :maxdepth: 2
   :caption: Table of Contents:

   ./overview
   ./quickstart
   ./rough_surfaces
   ./model
   ./contact
+  ./mpi
   ./examples
   ./performance
   ./api_reference

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

diff --git a/doc/sphinx/source/mpi.rst b/doc/sphinx/source/mpi.rst
new file mode 100644
index 00000000..0363464f
--- /dev/null
+++ b/doc/sphinx/source/mpi.rst
@@ -0,0 +1,160 @@
+Working with MPI
+================
+
+Distributed memory parallelism in Tamaas is implemented with `MPI
+<https://www.mpi-forum.org/>`_. Because the fast Fourier transform is the
+bottleneck of Tamaas' core routines, Tamaas follows the data layout of `FFTW
+<http://fftw.org/fftw3_doc/MPI-Data-Distribution.html#MPI-Data-Distribution>`_.
+Tamaas therefore inherits some of FFTW's limitations: MPI only works with model
+types that have a 2D boundary, i.e. ``basic_2d``, ``surface_2d`` and
+``volume_2d`` (which are the most important ones anyway, since rough contact
+mechanics can yield different scaling laws in 1D).
+
+MPI support in Tamaas is still experimental, but the following parts are tested:
+
+- Rough surface generation
+
+- Surface statistics computation
+
+- Elastic normal contact
+
+- Elastic-plastic contact (with :py:class:`DFSANECXXSolver
+  <tamaas.nonlinear_solvers.DFSANECXXSolver>`)
+
+- Dumping models with :py:class:`H5Dumper <tamaas.dumpers.H5Dumper>`
+
+.. tip:: See ``examples/plasticity.py`` for a full example of an elastic-plastic
+         contact simulation that can run in MPI.
+
+
+Transparent MPI context
+-----------------------
+
+Some parts of Tamaas work transparently with MPI, and no additional work or
+logic is needed.
+
+Creating a model
+~~~~~~~~~~~~~~~~
+
+The following snippet creates a model whose global shape is
+``[16, 2048, 2048]``::
+
+    import tamaas as tm
+
+    model = tm.ModelFactory.createModel(tm.model_type.volume_2d,
+                                        [0.1, 1, 1], [16, 2048, 2048])
+    print(model.shape, model.global_shape)
+
+Running this code with ``mpirun -np 3`` will print the following (not
+necessarily in this order)::
+
+    [16, 683, 2048] [16, 2048, 2048]
+    [16, 682, 2048] [16, 2048, 2048]
+    [16, 683, 2048] [16, 2048, 2048]
+
+Note that the partitioning occurs along the :math:`x` dimension of the model
+(see below for more information on the data layout imposed by FFTW).
+
+Creating a rough surface
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Similarly, rough surface generators expect a global shape and return the
+partitioned data::
+
+    iso = tm.Isopowerlaw2D()
+    iso.q0, iso.q1, iso.q2, iso.hurst = 4, 4, 32, .5
+    gen = tm.SurfaceGeneratorRandomPhase2D([2048, 2048])
+    gen.spectrum = iso
+
+    surface = gen.buildSurface()
+    print(surface.shape, tm.mpi.global_shape(surface.shape))
+
+With ``mpirun -np 3`` this should print::
+
+    (682, 2048) [2048, 2048]
+    (683, 2048) [2048, 2048]
+    (683, 2048) [2048, 2048]
+
+Computing statistics
+~~~~~~~~~~~~~~~~~~~~
+
+With a model's data distributed among independent processes, computing global
+properties, such as the true contact area, must be done in a collective
+fashion. This is transparently handled by the :py:class:`Statistics
+<tamaas._tamaas.Statistics2D>` class, e.g. with::
+
+    contact = tm.Statistics2D.contact(model.traction)
+
+This gives the correct contact fraction, whereas something like
+``np.mean(model.traction > 0)`` will give a different result on each processor.
+
+Nonlinear solvers
+~~~~~~~~~~~~~~~~~
+
+The only nonlinear solver (for plastic contact) that works with MPI is
+:py:class:`DFSANECXXSolver <tamaas.nonlinear_solvers.DFSANECXXSolver>`, a C++
+implementation of :py:class:`DFSANESolver
+<tamaas.nonlinear_solvers.DFSANESolver>` adapted to an MPI context.
+
+.. note:: Scipy and Numpy use optimized BLAS routines for array operations,
+          while Tamaas does not, so the *serial* performance of the C++
+          implementation of the DF-SANE algorithm is lower than that of the
+          Scipy version.
+
+Dumping models
+~~~~~~~~~~~~~~
+
+The only dumper that works properly in MPI is the :py:class:`H5Dumper
+<tamaas.dumpers.H5Dumper>`. Output is then as simple as::
+
+    from tamaas.dumpers import H5Dumper
+
+    H5Dumper('output', all_fields=True) << model
+
+This is useful for separating post-processing from the main simulation: the
+post-processing can then be done in serial.
+
+
+MPI convenience methods
+-----------------------
+
+Not every use case can be handled transparently. Adapting existing scripts to
+work in an MPI context can require some work, especially if they rely on numpy
+and scipy for pre- and post-processing (e.g. constructing a parabolic surface
+for Hertzian contact, or computing the total contact area). The module
+:py:mod:`mpi <tamaas._tamaas.mpi>` provides convenience functions to make that
+task easier. The functions :py:func:`mpi.scatter <tamaas._tamaas.mpi.scatter>`
+and :py:func:`mpi.gather <tamaas._tamaas.mpi.gather>` can be used to
+scatter/gather 2D data using the partitioning scheme expected by FFTW (see
+figure below). The functions :py:func:`mpi.rank <tamaas._tamaas.mpi.rank>` and
+:py:func:`mpi.size <tamaas._tamaas.mpi.size>` give the local process rank and
+the total number of processes, respectively.
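+
+For illustration, the sketch below builds a parabolic (Hertzian) surface with
+numpy on the root process, distributes it, and gathers it back for serial
+post-processing. This is a minimal sketch only: the exact calling conventions
+of :py:func:`mpi.scatter <tamaas._tamaas.mpi.scatter>` and :py:func:`mpi.gather
+<tamaas._tamaas.mpi.gather>` (in particular what non-root processes pass and
+receive) should be checked against the API reference::
+
+    import numpy as np
+    import tamaas as tm
+
+    N = 1024
+
+    # Construct the full 2D field with numpy on the root process only
+    if tm.mpi.rank() == 0:
+        x, y = np.meshgrid(np.linspace(-.5, .5, N),
+                           np.linspace(-.5, .5, N), indexing='ij')
+        parabola = x**2 + y**2  # e.g. a parabolic indenter profile
+    else:
+        parabola = None  # assumption: non-root processes provide no data
+
+    # Distribute the data following the FFTW slab partitioning
+    local = tm.mpi.scatter(parabola)
+    print(tm.mpi.rank(), local.shape)
+
+    # ... use `local`, e.g. as the rigid surface in a contact solve ...
+
+    # Gather a distributed field back, e.g. for serial post-processing
+    full = tm.mpi.gather(local)
+    if tm.mpi.rank() == 0:
+        print(full.shape)  # (N, N)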
+
+If finer control is needed, the function :py:func:`mpi.local_shape
+<tamaas._tamaas.mpi.local_shape>` gives the 2D shape of the local data when
+given the global 2D shape (its counterpart :py:func:`mpi.global_shape
+<tamaas._tamaas.mpi.global_shape>` does the exact opposite), while
+:py:func:`mpi.local_offset <tamaas._tamaas.mpi.local_offset>` gives the offset
+of the local data in the global :math:`x` dimension. These functions mirror
+FFTW's own data distribution `functions
+<http://fftw.org/fftw3_doc/Basic-and-advanced-distribution-interfaces.html#Basic-and-advanced-distribution-interfaces>`_.
+
+.. figure:: figures/mpi_data_distribution.svg
+   :align: center
+   :width: 75%
+
+   2D data distribution scheme of FFTW. ``N0`` and ``N1`` are the number of
+   points in the :math:`x` and :math:`y` directions respectively. The array
+   ``local_N0``, indexed by the process rank, gives the local size of the
+   :math:`x` dimension. The :py:func:`local_offset
+   <tamaas._tamaas.mpi.local_offset>` function gives the offset in :math:`x`
+   for each process rank.
+
+The :py:mod:`mpi <tamaas._tamaas.mpi>` module also contains a function
+:py:func:`sequential <tamaas._tamaas.mpi.sequential>` whose return value is
+meant to be used as a context manager. Within the sequential context, the
+default communicator is ``MPI_COMM_SELF`` instead of ``MPI_COMM_WORLD``.
+
+For MPI functionality that Tamaas does not cover, one can use `mpi4py
+<https://mpi4py.readthedocs.io>`_, which in conjunction with the methods in
+:py:mod:`mpi <tamaas._tamaas.mpi>` should handle just about any use case.
diff --git a/doc/sphinx/source/performance.rst b/doc/sphinx/source/performance.rst
index adb870fd..cd2ab6df 100644
--- a/doc/sphinx/source/performance.rst
+++ b/doc/sphinx/source/performance.rst
@@ -1,212 +1,158 @@
Performance
===========

Parallelism
-----------

Tamaas implements shared-memory parallelism using `thrust
<https://github.com/thrust/thrust>`_. The Thrust backend can be controlled with
the following values of the ``backend`` build option:

``omp``
   Thrust uses its OpenMP backend (the default). The number of threads is
   controlled by OpenMP.

``cpp``
   Thrust does not run in threads (i.e. sequential). This is the recommended
   option if running multiple MPI tasks.

``tbb``
   Thrust uses its `TBB
   <https://en.wikipedia.org/wiki/Threading_Building_Blocks>`_ backend. Note
   that this option is not fully supported by Tamaas.

.. tip:: When using the OpenMP or TBB backend, the number of threads can be
         manually controlled by the :py:func:`initialize
         <tamaas._tamaas.initialize>` function. When OpenMP is selected for the
         backend, the environment variable ``OMP_NUM_THREADS`` can also be used
         to set the number of threads.

FFTW has its own system for thread-level parallelism, which can be controlled
via the ``fftw_threads`` option:

``none``
   FFTW does not use threads.

``threads``
   FFTW uses POSIX/Win32 threads for parallelism.

``omp``
   FFTW uses OpenMP.

.. note:: As with the Thrust backend, the number of threads for FFTW can be
          controlled with :py:func:`initialize <tamaas._tamaas.initialize>`.

Finally, the boolean build option ``use_mpi`` controls whether Tamaas is
compiled with MPI parallelism. If it is enabled, Tamaas is linked against
``libfftw3_mpi`` regardless of the thread model.

-Multi-process parallelism
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Distributed memory parallelism in Tamaas is implemented with `MPI
-<https://www.mpi-forum.org/>`_. Due to the bottleneck role of the fast-Fourier
-transform in Tamaas' core routines, the data layout of Tamaas is that of `FFTW
-<http://fftw.org/fftw3_doc/MPI-Data-Distribution.html#MPI-Data-Distribution>`_.
-Tamaas is somewhat affected by limitations of FFTW, and MPI only works on
-systems with a 2D boundary, i.e. ``basic_2d``, ``surface_2d`` and ``volume_2d``
-model types (which are the most important anyways, since rough contact mechanics
-can yield different scaling laws in 1D).
-
-MPI support in Tamaas is still experimental, but the following parts are tested:
-
-- Rough surface generation
-
-- Surface statistics computation
-
-- Elastic normal contact
-
-- Elastic-plastic contact (with :py:class:`DFSANECXXSolver
-  <tamaas.nonlinear_solvers.DFSANECXXSolver>`).
-
-Adapting existing scripts to work in an MPI context can require some work,
-especially if said scripts rely on numpy and scipy for pre- and post-processing
-(e.g. constructing a parabolic surface for hertzian contact, computing the total
-contact area). It is the user's responsibility to make the changes necessary,
-but Tamaas provides some convenience functions to make this task easier. In the
-module :py:mod:`mpi <tamaas._tamaas.mpi>`, the functions :py:mod:`mpi.scatter
-<tamaas._tamaas.mpi.scatter>` and :py:mod:`mpi.gather
-<tamaas._tamaas.mpi.gather>` can be used to scatter/gather 2D data using the
-partitioning scheme expected from FFTW. If finer control is needed, the function
-:py:func:`mpi.local_shape <tamaas._tamaas.mpi.local_shape>` gives the 2D shape
-of the local data if given the global 2D shape (its counterpart
-:py:func:`mpi.global_shape <tamaas._tamaas.mpi.global_shape>` does the exact
-opposite), while :py:func:`mpi.local_offset <tamaas._tamaas.mpi.local_offset>`
-gives the offset of the local data in the global :math:`x` dimension. These two
-functions mirror FFTW's own data distribution `functions
-<http://fftw.org/fftw3_doc/Basic-and-advanced-distribution-interfaces.html#Basic-and-advanced-distribution-interfaces>`_.
-
-.. figure:: figures/mpi_data_distribution.svg
-   :align: center
-   :width: 75%
-
-   2D Data distribution scheme from FFTW. ``N0`` and ``N1`` are the
-   number of points in the :math:`x` and :math:`y` directions
-   respectively. The array ``local_N0``, indexed by the process rank,
-   give the local size of the :math:`x` dimension. The
-   :py:func:`local_offset <tamaas._tamaas.mpi.local_offset>` function
-   gives the offset in :math:`x` for each process rank.
-
-The :py:mod:`mpi <tamaas._tamaas.mpi>` module also contains a function
-:py:func:`sequential <tamaas._tamaas.mpi.sequential>` whose return value is
-meant to be used as a context manager. Within the sequential context the default
-communicator is ``MPI_COMM_SELF`` instead of ``MPI_COMM_WORLD``.
-
-In addition to the low-level functions in :py:mod:`mpi <tamaas._tamaas.mpi>`,
-many functions in Tamaas do not require special treatment, e.g. the statistics
-functions.
+.. important:: Users wary of performance should use MPI, as it yields remarkably
+               better scaling properties than the shared-memory parallelism
+               models. Care should also be taken when compiling with both OpenMP
+               and MPI support: setting the number of threads to more than one
+               in an MPI context can decrease performance.
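+
+For example, a script meant to be run under ``mpirun`` with the OpenMP backend
+compiled in can pin each MPI process to a single thread. This is a minimal
+sketch; it assumes, as stated above, that the argument of :py:func:`initialize
+<tamaas._tamaas.initialize>` is the desired number of threads::
+
+    import tamaas as tm
+
+    # One thread per MPI process, to avoid oversubscribing the node when
+    # the OpenMP backend is active but parallelism comes from MPI
+    tm.initialize(1)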

Integration algorithm
---------------------

In its implementation of the volume integral operators necessary for
elastic-plastic solutions, Tamaas distinguishes between two ways of computing
the intermediate integral along :math:`z` in the partial Fourier domain:

- Cutoff integration: because the intermediate integral involves kernels of the
  form :math:`\exp(q(x-y))`, it is easy to truncate the integral when :math:`x`
  and :math:`y` are far apart, especially for large values of :math:`q`. This
  changes the complexity of the intermediate integral from
  :math:`O(N_1N_2N_3^2)` (the naive implementation) to
  :math:`O(\sqrt{N_1^2 + N_2^2}N_3^2)`.

- Linear integration: this method relies on a separation of variables
  :math:`\exp(q(x-y)) = \exp(qx)\cdot\exp(-qy)`. This makes it possible to
  break the :math:`N_3^2` dependency of the number of operations, so that the
  overall complexity of the intermediate integral is :math:`O(N_1N_2N_3)`.

Details on both algorithms can be found in [1]_.

Tamaas uses linear integration by default because it is faster in many cases
without introducing a truncation error. Unfortunately, it has a severe drawback
for systems with a fine surface discretization: because :math:`q` increases
with the number of points on the surface, the separated terms :math:`\exp(qx)`
and :math:`\exp(-qy)` may overflow and underflow respectively. Tamaas will warn
if that is the case, and users have two options to remedy the situation:

- Change the integration method by calling :func:`setIntegrationMethod
  <tamaas._tamaas.Residual.setIntegrationMethod>` with the desired
  :class:`integration_method <tamaas._tamaas.integration_method>` on the
  :class:`Residual <tamaas._tamaas.Residual>` object used in the computation.

- Compile Tamaas with the option ``real_type='long double'``. To make
  manipulation of numpy arrays easier, a :class:`dtype` is provided in the
  :py:mod:`tamaas` module; it can be used to create numpy arrays compatible
  with Tamaas' floating point type
  (e.g. ``x = np.linspace(0, 1, dtype=tamaas.dtype)``).

Both of these options negatively affect performance, and it is up to the user
to select the optimal solution for their particular use case.

Computational methods & Citations
---------------------------------

Tamaas uses specialized numerical methods to efficiently solve elastic and
elastoplastic periodic contact problems. For elastic contact, using a boundary
integral formulation and a half-space geometry allows (a) focusing
computational power on the contact interface, since the bulk response can be
represented exactly, and (b) using the fast Fourier transform to compute
convolution integrals. In conjunction with the boundary integral formulation of
the bulk state equations, a conjugate gradient approach is used to solve the
contact problem.

.. note:: The above methods are state-of-the-art in the domain of rough surface
          contact. Below are selected publications detailing the methods used
          in elastic contact with and without adhesion:

          - Boundary integral formulation:

            - Stanley and Kato (:doi:`J. of Tribology, 1997
              <10.1115/1.2833523>`)

          - Conjugate Gradient:

            - Polonsky and Keer (:doi:`Wear, 1999
              <10.1016/S0043-1648(99)00113-1>`)

            - Rey, Anciaux and Molinari (:doi:`Computational Mechanics, 2017
              <10.1007/s00466-017-1392-5>`)

          - Frictional contact:

            - Condat (:doi:`J. of Optimization Theory and Applications, 2012
              <10.1007/s10957-012-0245-9>`)

For elastic-plastic contact, Tamaas uses a similar approach by implementing a
*volume* integral formulation of the bulk equilibrium equations.
Thanks to kernel expressions that are directly formulated in the Fourier
domain, the method reduces the algorithmic complexity, memory requirements and
sampling errors compared to traditional volume integral methods (Frérot,
Bonnet, Anciaux and Molinari, :doi:`Computer Methods in Applied Mechanics and
Engineering, 2019 <10.1016/j.cma.2019.04.006>`, :arxiv:`1811.11558`). The
figure below shows a comparison of run times for an elasticity problem (only a
single solve step) between Tamaas and `Akantu
<https://gitlab.com/akantu/akantu>`_, a high-performance FEM code using the
direct solver `MUMPS <http://mumps.enseeiht.fr/>`_.

.. figure:: figures/complexity.svg
   :align: center
   :width: 75%

   Comparison of run times between the volume integral implementation (with
   cutoff integration) of Tamaas and an FEM solve step with a Cholesky
   factorization performed by Akantu+MUMPS. :math:`N` is the total number of
   points.

Further discussion about the elastic-plastic solver implemented in Tamaas can
be found in Frérot, Bonnet, Anciaux and Molinari (:doi:`Computer Methods in
Applied Mechanics and Engineering, 2019 <10.1016/j.cma.2019.04.006>`,
:arxiv:`1811.11558`).

.. [1] L. Frérot, “Bridging scales in wear modeling with volume integral
   methods for elastic-plastic contact,” École Polytechnique Fédérale de
   Lausanne, 2020 (Section 2.3.2). :doi:`10.5075/epfl-thesis-7640`.