 Docs/source/index.rst                           |  1
 Docs/source/usage/domain_decomposition.rst      | 84
 Docs/source/usage/parameters.rst                |  2
 Docs/source/usage/workflows.rst                 |  1
 Docs/source/usage/workflows/parallelization.rst | 88
5 files changed, 86 insertions, 90 deletions
diff --git a/Docs/source/index.rst b/Docs/source/index.rst index d34bd6788..c1755f44f 100644 --- a/Docs/source/index.rst +++ b/Docs/source/index.rst @@ -75,6 +75,7 @@ Usage :hidden: usage/how_to_run + usage/domain_decomposition usage/parameters usage/python usage/examples diff --git a/Docs/source/usage/domain_decomposition.rst b/Docs/source/usage/domain_decomposition.rst new file mode 100644 index 000000000..d3df5b78f --- /dev/null +++ b/Docs/source/usage/domain_decomposition.rst @@ -0,0 +1,84 @@ +.. _usage_domain_decomposition: + +Domain Decomposition +==================== + +WarpX relies on a spatial domain decomposition for MPI parallelization. It provides two different ways for users to specify this decomposition: a `simple` way, recommended for most users, and a `flexible` way, recommended when more control is desired. The `flexible` method is required for dynamic load balancing to be useful. + +1. Simple Method +---------------- + +The first and simplest method is to provide the ``warpx.numprocs = nx ny nz`` parameter, either at the command line or somewhere in your inputs deck. In this case, WarpX will split up the overall problem domain into exactly the specified number of subdomains, or `Boxes <https://amrex-codes.github.io/amrex/docs_html/Basics.html#box-intvect-and-indextype>`__ in the AMReX terminology, with the data defined on each `Box` having its own guard cells. The product of ``nx``, ``ny``, and ``nz`` must be exactly the desired number of MPI ranks. Note that, because there is exactly one `Box` per MPI rank when run this way, dynamic load balancing will not be possible, as there is no way of shifting `Boxes` around to achieve a more even load. This is the approach recommended for new users as it is the easiest to use. + +.. note:: + + If ``warpx.numprocs`` is *not* specified, WarpX will fall back on using the ``amr.max_grid_size`` and ``amr.blocking_factor`` parameters, described below. + +2. More General Method +---------------------- + +The second way of specifying the domain decomposition provides greater flexibility and enables dynamic load balancing, but is not as easy to use. In this method, the user specifies the input parameters ``amr.max_grid_size`` and ``amr.blocking_factor``, which can be thought of as the maximum and minimum allowed `Box` sizes. The overall problem domain (specified by the ``amr.n_cell`` input parameter) will then be broken up into some number of `Boxes` with the specified characteristics. By default, WarpX will make the `Boxes` as big as possible given the constraints. + +For example, if ``amr.n_cell = 768 768 768``, ``amr.max_grid_size = 128``, and ``amr.blocking_factor = 32``, then AMReX will make 6 `Boxes` in each direction, for a total of 216 (the ``amr.blocking_factor`` does not factor in yet; however, see the section on mesh refinement below). If this problem is then run on 54 MPI ranks, there will be 4 `Boxes` per rank initially. This problem could be run on as many as 216 ranks without performing any splitting. + +.. note:: + + Both ``amr.n_cell`` and ``amr.max_grid_size`` must be divisible by ``amr.blocking_factor`` in each direction. + +When WarpX is run using this approach to domain decomposition, the number of MPI ranks does not need to be exactly equal to the number of ``Boxes``.
Note also that if you run WarpX with more MPI ranks than there are boxes on the base level, WarpX will attempt to split the available ``Boxes`` until there is at least one for each rank to work on; this may cause it to violate the constraints of ``amr.max_grid_size`` and ``amr.blocking_factor``. + +.. note:: + + The AMReX documentation on `Grid Creation <https://amrex-codes.github.io/amrex/docs_html/GridCreation.html#sec-grid-creation>`__ may also be helpful. + +You can also specify a separate `max_grid_size` and `blocking_factor` for each direction, using the parameters ``amr.max_grid_size_x``, ``amr.max_grid_size_y``, etc. This allows you to request, for example, a "pencil" type domain decomposition that is long in one direction (an example is sketched below, after the performance considerations). Note that, in RZ geometry, the parameters corresponding to the longitudinal direction are ``amr.max_grid_size_y`` and ``amr.blocking_factor_y``. + +3. Performance Considerations +----------------------------- + +In terms of performance, there is generally a trade-off. Having many small boxes provides flexibility in terms of load balancing; however, the cost is increased time spent in communication due to surface-to-volume effects and increased kernel launch overhead when running on GPUs. The ideal number of boxes per rank depends on how important dynamic load balancing is for your problem. If your problem is intrinsically well-balanced, like in a uniform plasma, then having a few large `Boxes` is best. But if the problem is non-uniform and achieving a good load balance is critical for performance, having more, smaller `Boxes` can be worth it. In general, we find that running with something in the range of 4-8 `Boxes` per process is a good compromise for most problems. + +.. note:: + + For specific information on the dynamic load balancer used in WarpX, visit the + `Load Balancing <https://amrex-codes.github.io/amrex/docs_html/LoadBalancing.html>`__ + page in the AMReX documentation. + +The best values for these parameters can also depend strongly on a number of +numerical parameters: + +* Algorithms used (Maxwell/spectral field solver, filters, order of the + particle shape factor) + +* Number of guard cells (which depends on the particle shape factor and + the type and order of the Maxwell solver, the filters used, `etc.`) + +* Number of particles per cell, and the number of species + +and the details of the on-node parallelization and computer architecture used for the run: + +* GPU or CPU + +* Number of OpenMP threads + +* Amount of high-bandwidth memory. + +Because these parameters put additional constraints on the domain size for a +simulation, it can be cumbersome to calculate the number of cells and the +physical size of the computational domain for a given resolution. This +:download:`Python script <../../../Tools/DevUtils/compute_domain.py>` does it +automatically. + +When using the RZ spectral solver, the values of ``amr.max_grid_size`` and ``amr.blocking_factor`` are constrained since the solver +requires that the full radial extent be within each block. +For the radial values, any input is ignored and the max grid size and blocking factor are both set equal to the number of radial cells. +For the longitudinal values, the blocking factor has a minimum size of 8, allowing the computational domain of each block to be large enough relative to the guard cells for reasonable performance, but the max grid size and blocking factor must also be small enough so that there will be at least one block per processor. +If the max grid size and/or blocking factor are too large, they will be silently reduced as needed. +If there are more processors than there are blocks, WarpX will abort.
+ +4. Mesh Refinement +------------------ + +With mesh refinement, the above picture is more complicated, as in general the number of boxes cannot be predicted at the start of the simulation. The decomposition of the base level will proceed as outlined above. The refined region, however, will be covered by some number of `Boxes` whose sizes are consistent with ``amr.max_grid_size`` and ``amr.blocking_factor``. In this case, the blocking factor is important, as WarpX may decide to use `Boxes` smaller than ``amr.max_grid_size`` so as not to over-refine outside of the requested area. Note that you can specify a vector of values to make these parameters vary by level. For example, ``amr.max_grid_size = 128 64`` will make the max grid size be 128 on level 0 and 64 on level 1. + +In general, the above performance considerations still apply: varying these values such that there are 4-8 `Boxes` per rank on each level is a good guideline. diff --git a/Docs/source/usage/parameters.rst b/Docs/source/usage/parameters.rst index bb82c28b5..36855a52d 100644 --- a/Docs/source/usage/parameters.rst +++ b/Docs/source/usage/parameters.rst @@ -2067,7 +2067,7 @@ In-situ capabilities can be used by turning on Sensei or Ascent (provided they a Reduce size of the field output by this ratio in each dimension. (This is done by averaging the field over 1 or 2 points along each direction, depending on the staggering). If ``blocking_factor`` and ``max_grid_size`` are used for the domain decomposition, as detailed in - the :ref:`parallelization <parallelization_warpx>` section, ``coarsening_ratio`` should be an integer + the :ref:`domain decomposition <usage_domain_decomposition>` section, ``coarsening_ratio`` should be an integer divisor of ``blocking_factor``. If ``warpx.numprocs`` is used instead, the total number of cells in a given dimension must be a multiple of the ``coarsening_ratio`` multiplied by ``numprocs`` in that dimension. diff --git a/Docs/source/usage/workflows.rst b/Docs/source/usage/workflows.rst index 74999c9df..6fefcd9a2 100644 --- a/Docs/source/usage/workflows.rst +++ b/Docs/source/usage/workflows.rst @@ -8,7 +8,6 @@ This section collects typical user workflows and best practices for WarpX. .. toctree:: :maxdepth: 2 - workflows/parallelization workflows/debugging workflows/libensemble workflows/plot_timestep_duration diff --git a/Docs/source/usage/workflows/parallelization.rst b/Docs/source/usage/workflows/parallelization.rst deleted file mode 100644 index baef93858..000000000 --- a/Docs/source/usage/workflows/parallelization.rst +++ /dev/null @@ -1,88 +0,0 @@ -.. _parallelization_warpx: - -Parallelization in WarpX -======================== - -When running a simulation, the domain is split into independent -rectangular sub-domains (called **grids**). This is the way AMReX, a core -component of WarpX, handles parallelization and/or mesh refinement. Furthermore, -this decomposition makes load balancing possible: each MPI rank typically computes -a few grids, and a rank with a lot of work can transfer one or several **grids** -to their neighbors. - -A user -does not specify this decomposition explicitly. Instead, the user gives hints to -the code, and the actual decomposition is determined at runtime, depending on -the parallelization. 
The main user-defined parameters are -``amr.max_grid_size`` and ``amr.blocking_factor``. - -AMReX ``max_grid_size`` and ``blocking_factor`` ------------------------------------------------ - -* ``amr.max_grid_size`` is the maximum number of cells per **grid** along each - direction (default ``amr.max_grid_size=32`` in 3D). - -* ``amr.blocking_factor``: is the minimum number of cells per **grid** along each - direction (default ``amr.blocking_factor=8``). - Note that both the domain (at each level) and ``max_grid_size`` must be divisible by ``blocking_factor``. - - .. note:: - - You can use the parameters above if you want the same number of cells in all directions. - Or you can set ``amr.max_grid_size_x``, ``amr.max_grid_size_y`` and ``amr.max_grid_size_z``; - ``amr.blocking_factor_x``, ``amr.blocking_factor_y`` and ``amr.blocking_factor_z`` to different numbers of cells. - Note that, in RZ geometry, the parameters corresponding to the longitudinal direction are ``amr.max_grid_size_y`` and ``amr.blocking_factor_y``. - -The total number of **grids** is determined using those two restrictions and the number of -ranks used to run the simulation. You can visit `AMReX <https://amrex-codes.github.io/amrex/docs_html/GridCreation.html?highlight=blocking_factor>`_ -documentation for more information on the two parameters. - -These parameters can have a dramatic impact on the code performance. Each -**grid** in the decomposition is surrounded by guard cells, thus increasing the -amount of data, computation and communication. Hence having a too small -``max_grid_size``, may ruin the code performance. - -On the other hand, a too-large ``max_grid_size`` is likely to result in a single -grid per MPI rank, thus preventing load balancing. By setting these two -parameters, the user wants to give some flexibility to the code while avoiding -pathological behaviors. - -For more information on this decomposition, see the -`Gridding and Load Balancing <https://amrex-codes.github.io/amrex/docs_html/ManagingGridHierarchy_Chapter.html>`__ -page on AMReX documentation. - -For specific information on the dynamic load balancer used in WarpX, visit the -`Load Balancing <https://amrex-codes.github.io/amrex/docs_html/LoadBalancing.html>`__ -page on the AMReX documentation. - -The best values for these parameters strongly depends on a number of parameters, -among which numerical parameters: - -* Algorithms used (Maxwell/spectral field solver, filters, order of the - particle shape factor) - -* Number of guard cells (that depends on the particle shape factor and - the type and order of the Maxwell solver, the filters used, `etc.`) - -* Number of particles per cell, and the number of species - -and MPI decomposition and computer architecture used for the run: - -* GPU or CPU - -* Number of OpenMP threads - -* Amount of high-bandwidth memory. - -Because these parameters put additional constraints on the domain size for a -simulation, it can be cumbersome to calculate the number of cells and the -physical size of the computational domain for a given resolution. This -:download:`Python script <../../../../Tools/DevUtils/compute_domain.py>` does it -automatically. - -When using the RZ spectral solver, the values of ``amr.max_grid_size`` and ``amr.blocking_factor`` are constrained since the solver -requires that the full radial extent be within a each block. -For the radial values, any input is ignored and the max grid size and blocking factor are both set equal to the number of radial cells. 
-For the longitudinal values, the blocking factor has a minimum size of 8, allowing the computational domain of each block to be large enough relative to the guard cells for reasonable performance, but the max grid size and blocking factor must also be small enough so that there will be at least one block per processor. -If max grid size and/or blocking factor are too large, they will be silently reduced as needed. -If there are too many processors so that there is not enough blocks for the number processors, WarpX will abort.