 Docs/source/running_cpp/platforms.rst    | 22 ++++++++++++++++++++++
 Tools/BatchScripts/batch_cori_haswell.sh | 36 ++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 0 deletions(-)
diff --git a/Docs/source/running_cpp/platforms.rst b/Docs/source/running_cpp/platforms.rst
index 1d7093325..2a3ead381 100644
--- a/Docs/source/running_cpp/platforms.rst
+++ b/Docs/source/running_cpp/platforms.rst
@@ -37,6 +37,28 @@ regime), the following set of parameters provided good performance:
* **2 grids per MPI**, *i.e.*, 16 grids per KNL node.
+Running on Cori Haswell at NERSC
+--------------------------------
+
+The batch script below can be used to run a WarpX simulation on 1 `Haswell node <https://docs.nersc.gov/systems/cori/>`_ on the supercomputer Cori at NERSC.
+
+.. literalinclude:: ../../../Tools/BatchScripts/batch_cori_haswell.sh
+ :language: bash
+
+To run a simulation, copy the lines above to a file ``batch_cori_haswell.sh`` and
+run
+::
+
+ sbatch batch_cori_haswell.sh
+
+to submit the job.
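+
+Once submitted, the job can be monitored and, if needed, cancelled with the
+standard Slurm commands, for instance
+::
+
+    squeue -u $USER    # show your pending and running jobs
+    scancel <job ID>   # cancel a job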
+
+For a 3D simulation with a few (1-4) particles per cell using the FDTD Maxwell
+solver on Cori Haswell, for a well load-balanced problem (in our case, a laser
+wakefield acceleration simulation in a boosted frame in the quasi-linear
+regime), the following set of parameters provided good performance:
+
+* **4 MPI ranks per Haswell node** (2 MPI ranks per `Intel Xeon E5-2698 v3 <https://ark.intel.com/content/www/us/en/ark/products/81060/intel-xeon-processor-e5-2698-v3-40m-cache-2-30-ghz.html>`_), with ``OMP_NUM_THREADS=16`` (which uses `2x hyperthreading <https://docs.nersc.gov/jobs/affinity/>`_).
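+
+As a quick sanity check of this mapping (the batch script above notes that each
+MPI rank on a half-socket gets 8 physical, i.e. 16 logical, cores)::
+
+    4 MPI ranks/node x 16 OpenMP threads/rank = 64 logical cores
+                                              = 2 sockets x 16 physical cores/socket x 2 (hyperthreading)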
.. _running-cpp-summit:
diff --git a/Tools/BatchScripts/batch_cori_haswell.sh b/Tools/BatchScripts/batch_cori_haswell.sh
new file mode 100644
index 000000000..a1a21defc
--- /dev/null
+++ b/Tools/BatchScripts/batch_cori_haswell.sh
@@ -0,0 +1,36 @@
+#!/bin/bash -l
+
+# Just increase this number if you need more nodes.
+#SBATCH -N 1
+#SBATCH -t 03:00:00
+#SBATCH -q regular
+#SBATCH -C haswell
+#SBATCH -J <job name>
+#SBATCH -A <allocation ID>
+#SBATCH -e error.txt
+#SBATCH -o output.txt
+# one MPI rank per half-socket (see below)
+#SBATCH --tasks-per-node=4
+# request all logical (virtual) cores per half-socket
+#SBATCH --cpus-per-task=16
+
+
+# each Cori Haswell node has 2 sockets of Intel Xeon E5-2698 v3
+# each Xeon CPU is divided into 2 bus rings that each have direct L3 access
+export WARPX_NMPI_PER_NODE=4
+
+# each MPI rank per half-socket has 8 physical cores
+# or 16 logical (virtual) cores
+# over-subscribing each physical core with 2x
+# hyperthreading leads to a slight (3.5%) speedup
+# the settings below make sure threads are close to the
+# controlling MPI rank (process) per half socket and
+# distribute equally over close-by physical cores and,
+# for OMP_NUM_THREADS > 8, also equally over close-by logical cores
+export OMP_PROC_BIND=spread
+export OMP_PLACES=threads
+export OMP_NUM_THREADS=16
+
+EXE="<path/to/executable>"
+
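+# launch one MPI rank per half-socket on every allocated node:
+# total ranks = (number of nodes) x (WARPX_NMPI_PER_NODE)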
+srun --cpu_bind=cores -n $(( ${SLURM_JOB_NUM_NODES} * ${WARPX_NMPI_PER_NODE} )) ${EXE} <input file>