author    | 2023-04-26 20:48:43 -0700
----------|-------------------------------------------------------------------------------------------
committer | 2023-04-26 20:48:43 -0700
commit    | d907aef7bad98b80cc4aa941027f4969dfa99741 (patch)
tree      | 8eb7a2ee2bd2bd44bc7a37dbff765e9db8534635 /Source/Particles/WarpXParticleContainer.cpp
parent    | 3991a0e4fe8ff55b9ca0dc68b863794657bd0408 (diff)
Shared memory charge and current deposition (#3368)
* Use GPU shared memory to accelerate charge deposition (#66)
* WIP Apply charge deposition unconditionally in scratch memory
* Ensure enough threads to touch every value in the array, even if there are no particles
* Zero out the shared memory before accumulating into it
* Replace box-aware accumulation of final results with simple pointers
* Remove unused code
* WIP
* Account for shared memory being allocated per-block, not per grid/kernel
* Wording
* Fall back to non-shared memory for cases where the grid size is too big to fit, for now
* Filter out additions of 0.0 from atomic accumulation (see the sketch after the commit message)
* Restore non-GPU code path
* Pick apart #if stuff to allow better formatting and comprehension
* Fix egregious whitespace failure
* Abort on insufficient shared memory, rather than falling back to global memory
* Fix silly whitespace
* Fix stray tab character
* Sort and bin particles, pass bins to charge deposition
* Contribute on a binned tile basis; memory errors now
* Initialize array to the extent we actually allocated
* Make sure we initialize the vector of tboxes with invalid Box objects
* in 2D, we make sure we use the same particle position to tile box mapping as amrex
* go ahead and skip empty bins in deposit charge
* Quiet warning from HIP
* Avoid signed/unsigned comparison
* Code compiles for CPU ...
* Rename intermediate buffer back to reduce extraneous diff bits
* Remove another extraneous diff bit
* Leave DPC++ out in the cold, since it expects different syntax for GPU-specific code
* Reset failing value of rho that only slightly changes
* Try tiling over ng_rho to capture particles moved into guard cells
* Match box expansion in both call sites - maybe the third is also necessary
* Grow last box by ng_rho as well.
* Match macro syntax
* Use WarpX dimensionality macros instead of AMREX_SPACEDIM
* Add support for 1D case
* Update benchmark checksums for ME tests
* Fix macro used for 1D
* Fix CUDA compilation
* Rename variable to simplify diff
* Clear up assertions now that stuff is working
* Fix comment referring to current in ChargeDeposition.H
* Fix warning about unused variable after assertion change
* add runtime option
* Convert flag variable from int to bool
* Switch AMREX_SPACEDIM conditions to use WARPX_DIM_* macros
* Once again, leave DPC++ out in the cold, as it doesn't support the same syntax as CUDA and HIP
* Grow charge deposition boxes by the necessary amount
* Mark a variable only used for assertions to suppress warnings
* Fix compilation error for 1D
* Re-add missing 1D support
* Fix other bits of code specific to CUDA and HIP, and not DPC++
* restore missing accumulation of thread local charge into main fab.
* reset benchmark for background_mcc because randomization makes it very sensitive
* reset benchmark for Langmuir_multi_psatd_div_cleaning because diffing field is a numerical artifact
* Calm nvcc about function missing a return
* reset benchmark for background_mcc because it's randomized and numerically chaotic
* reset benchmark for LaserAccelerationBoost because of numerical shift in momentum from charge deposition order
* Remove extra nesting level
* Skip sorting the particles and just access them according to the binned permutation
* Load permutation pointer outside GPU kernel
* Revert background_mcc benchmark values
* Loosen overly-strict checksum tolerances in single-precision tests, rather than changing target values
* Revert embedded_circle
* Convert AMREX_ALWAYS_ASSERT to AMREX_ASSERT for particle bounds checking
* Match assertion macro change from #2939
* Fix indentation
* Disable shared memory charge deposition by default
* Ignore variable only used in assertion
* Add documentation of added input parameter warpx.do_shared_mem_charge_deposition
* Add comments as suggested by Remi
* Docs: Fix syntax issues in parameters.rst
* Convert error check to unconditional assertion as requested
* Make some arguments const to ease refactoring
* Finished DepositCurrent function
Ready to call the function from CurrentDeposition.H, but currently there
is only a dummy function there
* AMReX: Weekly Update
* Reset: `reduced_diags_single_precision`
* Reset: `background_mcc_dp_psp`
* Merged with develop; runs on MPI, no GPU
* All functions implemented; compiles, but with bugs
* Fixed typo in CurrentDeposition
* Working on 2D version; there is a bug:
jz doesn't line up correctly
* Fixed 2d bug
* Removed some debugging lines
* Cleaning up comments
* Added an input param for threads per block
* Added a variable NS and START/STOP
* Added a region for kernel
* Not working on tilesize > 1 1 1
* Implemented Andrew's new algo for max tilesize
* Reduce the amount of shared memory needed by re-using the same buffer for all three components
* Made default tilesize sort_bin_size LAST V1 COMMIT
* bugfix - don't add 0.0 cells back to global memory.
* Need to take abs before checking > 0.0
* Ran whitespace fixer, as instructed
* Updated Comments
* change default tpb for current deposition to 128
* clean up comments
* quiet compiler warning
* remove unused variables
* refactor shared current depo code
* forgot to check in file
* fix cpu compilation
* fix uninitialized
* fix typo
* fix bad merge
* Fixed default tilesize bug
Previously had defaulted shared_tilesize to sort_bin_size. This was
overwriting the shared_tilesize. Some scanning shows that sort_bin_size
isn't a very good default for tilesize anyway, so the new default is
1 1 1.
* changed shared tilesize default to 6 6 8
Decision based on scan over tilesizes and ppc by @atmyers and @kaplannp
* Put in switches for default tilesize: 288 in 3D, 144 in 2D
Tested correctness in 2D and 3D
* Simplified particle contribution section
In accordance with @AlexanderSinn's feedback; tested RZ, 2D, 3D
* Clean up tbox construction and depos->(depos+1)/2
In accordance with changes proposed by @AlexanderSinn. Tested on 3D, 2D,
and RZ
* Restored shared to previous version from 6acab48
Changes from before broke single precision
* Found new spot to benefit from (depos_order+1)/2
* Cleaned up sloppy comments
* Throw error on shared if no HIP or CUDA
This commit assumes that if you use shared memory, you must be
using HIP or CUDA. This allows us to remove a bunch of macros that tried
to quietly revert to non-shared deposition if you didn't use HIP/CUDA; we now
throw an error if you try to run without HIP/CUDA
* More cleanup to compile and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add cost to GPU clock conditional
Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>
* Update Source/Particles/Deposition/CurrentDeposition.H
Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>
* add cost to GPU clock conditional
Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>
* whitespace fix
Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>
* Updated tilesize docs and changed 1D/RZ default
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Added docs for tpb
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Changed grow (depos_order+1)/2->depos_order
Turns out the above fails for shape 2
* Change default to non-shared, and add error check
Throws errors if you try Vay or Esirkepov with shared memory, and defaults to
not using shared memory for all algorithms.
* Update to use ablastr kernel timer
Compiles and runs, but at step 80 diverges from dev
---------
Co-authored-by: Phil Miller <unmobile+gh@gmail.com>
Co-authored-by: Phil Miller <phil@intensecomputing.com>
Co-authored-by: Tools <warpx@lbl.gov>
Co-authored-by: kaplannp <kaplannp@gmail.com>
Co-authored-by: Axel Huebl <axel.huebl@plasma.ninja>
Co-authored-by: kaplannp <56896283+kaplannp@users.noreply.github.com>
Co-authored-by: kaplannp <kaplannp@whitman.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
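The bullet list above describes the shared-memory trick only in prose (see the reference in the "Filter out additions of 0.0" bullet). Below is a minimal, hedged CUDA C++ sketch of that idea; every name in it (`deposit_tile_kernel`, `cell_in_tile`, `tile_cell_offset`, `TILE_NCELLS`) is a hypothetical placeholder rather than the actual kernels added in `SharedDepositionUtils.H`, and each particle deposits into a single cell for brevity, whereas the real shape-N kernels spread each particle over a stencil of cells.

```cpp
// Minimal CUDA C++ sketch of per-block shared-memory deposition (hypothetical names).
// One block owns one tile; particles reach the block through a DenseBins-style
// offsets/permutation pair instead of being physically sorted.
constexpr int TILE_NCELLS = 288;   // e.g. a 6 x 6 x 8 tile, like the default chosen above

__global__ void deposit_tile_kernel (const float* wp,                 // particle weights
                                     const int* cell_in_tile,         // per-particle local cell index
                                     const unsigned int* bin_offsets, // one bin per block
                                     const unsigned int* permutation, // binned particle indices
                                     const long* tile_cell_offset,    // first global cell of each tile
                                     float* rho_global)               // global charge array
{
    __shared__ float rho_tile[TILE_NCELLS];

    // 1) Zero out the shared-memory tile before accumulating into it.
    for (int i = threadIdx.x; i < TILE_NCELLS; i += blockDim.x) { rho_tile[i] = 0.0f; }
    __syncthreads();

    // 2) Each thread handles a subset of the particles binned to this tile,
    //    accumulating with cheap shared-memory atomics.
    for (unsigned int k = bin_offsets[blockIdx.x] + threadIdx.x;
         k < bin_offsets[blockIdx.x + 1]; k += blockDim.x) {
        const unsigned int ip = permutation[k];
        atomicAdd(&rho_tile[cell_in_tile[ip]], wp[ip]);
    }
    __syncthreads();

    // 3) Add the tile back to global memory, skipping cells that stayed at 0.0
    //    (take the absolute value before comparing, as the commit message notes).
    //    A flattened per-tile global indexing is assumed here for brevity.
    const long offset = tile_cell_offset[blockIdx.x];
    for (int i = threadIdx.x; i < TILE_NCELLS; i += blockDim.x) {
        if (fabsf(rho_tile[i]) > 0.0f) { atomicAdd(&rho_global[offset + i], rho_tile[i]); }
    }
}
```

At run time the feature is opt-in: the commit documents the new input parameter `warpx.do_shared_mem_charge_deposition`, and the diff below additionally gates current deposition on `WarpX::do_shared_mem_current_deposition` with a `WarpX::shared_tilesize` tile shape; the exact input names and defaults for the latter two should be taken from `parameters.rst`, not from this sketch.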
Diffstat (limited to 'Source/Particles/WarpXParticleContainer.cpp')
-rw-r--r-- | Source/Particles/WarpXParticleContainer.cpp | 541
1 file changed, 429 insertions(+), 112 deletions(-)
diff --git a/Source/Particles/WarpXParticleContainer.cpp b/Source/Particles/WarpXParticleContainer.cpp
index de49bb12b..b6e33180a 100644
--- a/Source/Particles/WarpXParticleContainer.cpp
+++ b/Source/Particles/WarpXParticleContainer.cpp
@@ -12,6 +12,7 @@
 #include "ablastr/particles/DepositCharge.H"
 #include "Deposition/ChargeDeposition.H"
 #include "Deposition/CurrentDeposition.H"
+#include "Deposition/SharedDepositionUtils.H"
 #include "Pusher/GetAndSetPosition.H"
 #include "Pusher/UpdatePosition.H"
 #include "ParticleBoundaries_K.H"
@@ -372,6 +373,13 @@ WarpXParticleContainer::DepositCurrent (WarpXParIter& pti,
     const std::array<Real,3>& dx = WarpX::CellSize(std::max(depos_lev,0));
     amrex::ParticleReal q = this->charge;
 
+    WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::Sorting", blp_sort);
+    WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::FindMaxTilesize",
+                         blp_get_max_tilesize);
+    WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::DirectCurrentDepKernel",
+                         direct_current_dep_kernel);
+    WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::EsirkepovCurrentDepKernel",
+                         esirkepov_current_dep_kernel);
     WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::CurrentDeposition", blp_deposit);
     WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCurrent::Accumulate", blp_accumulate);
 
@@ -445,75 +453,159 @@ WarpXParticleContainer::DepositCurrent (WarpXParIter& pti,
     amrex::LayoutData<amrex::Real> * const costs = WarpX::getCosts(lev);
     amrex::Real * const cost = costs ? &((*costs)[pti.index()]) : nullptr;
 
-    if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Esirkepov) {
-        if (WarpX::nox == 1){
-            doEsirkepovDepositionShapeN<1>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 2){
-            doEsirkepovDepositionShapeN<2>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 3){
-            doEsirkepovDepositionShapeN<3>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
+    // If doing shared mem current deposition, get tile info
+    if (WarpX::do_shared_mem_current_deposition) {
+        const Geometry& geom = Geom(lev);
+        const auto dxi = geom.InvCellSizeArray();
+        const auto plo = geom.ProbLoArray();
+        const auto domain = geom.Domain();
+
+        Box box = pti.validbox();
+        box.grow(ng_J);
+        amrex::IntVect bin_size = WarpX::shared_tilesize;
+
+        //sort particles by bin
+        WARPX_PROFILE_VAR_START(blp_sort);
+        amrex::DenseBins<ParticleType> bins;
+        {
+            auto& ptile = ParticlesAt(lev, pti);
+            auto& aos = ptile.GetArrayOfStructs();
+            auto pstruct_ptr = aos().dataPtr();
+
+            int ntiles = numTilesInBox(box, true, bin_size);
+
+            bins.build(ptile.numParticles(), pstruct_ptr, ntiles,
+                       [=] AMREX_GPU_HOST_DEVICE (const ParticleType& p) -> unsigned int
+                       {
+                           Box tbox;
+                           auto iv = getParticleCell(p, plo, dxi, domain);
+                           AMREX_ASSERT(box.contains(iv));
+                           auto tid = getTileIndex(iv, box, true, bin_size, tbox);
+                           return static_cast<unsigned int>(tid);
+                       });
         }
-    } else if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Vay) {
-        // jx_fab, jy_fab and jz_fab are Vay currents (D), not physical currents (j)
-        if (WarpX::nox == 1){
-            doVayDepositionShapeN<1>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 2){
-            doVayDepositionShapeN<2>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 3){
-            doVayDepositionShapeN<3>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
-                WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
+        WARPX_PROFILE_VAR_STOP(blp_sort);
+        WARPX_PROFILE_VAR_START(blp_get_max_tilesize);
+        //get the maximum size necessary for shared mem
+        // get tile boxes
+        //get the maximum size necessary for shared mem
+#if AMREX_SPACEDIM > 0
+        int sizeX = getMaxTboxAlongDim(box.size()[0], WarpX::shared_tilesize[0]);
+#endif
+#if AMREX_SPACEDIM > 1
+        int sizeZ = getMaxTboxAlongDim(box.size()[1], WarpX::shared_tilesize[1]);
+#endif
+#if AMREX_SPACEDIM > 2
+        int sizeY = getMaxTboxAlongDim(box.size()[2], WarpX::shared_tilesize[2]);
+#endif
+        amrex::IntVect max_tbox_size( AMREX_D_DECL(sizeX,sizeZ,sizeY) );
+        WARPX_PROFILE_VAR_STOP(blp_get_max_tilesize);
+
+        // Now pick current deposition algorithm
+        if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Esirkepov) {
+            amrex::Abort("Cannot do shared memory deposition with Esirkepov algorithm");
         }
-    } else {
-        if (WarpX::nox == 1){
-            doDepositionShapeN<1>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
-                xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 2){
-            doDepositionShapeN<2>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
-                xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
-        } else if (WarpX::nox == 3){
-            doDepositionShapeN<3>(
-                GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
-                uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
-                jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
-                xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
-                WarpX::load_balance_costs_update_algo);
+        else if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Vay) {
+            amrex::Abort("Cannot do shared memory deposition with Vay algorithm");
+        }
+        else {
+            WARPX_PROFILE_VAR_START(direct_current_dep_kernel);
+            if (WarpX::nox == 1){
+                doDepositionSharedShapeN<1>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+            } else if (WarpX::nox == 2){
+                doDepositionSharedShapeN<2>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+            } else if (WarpX::nox == 3){
+                doDepositionSharedShapeN<3>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+            }
+            WARPX_PROFILE_VAR_STOP(direct_current_dep_kernel);
+        }
+    }
+    // If not doing shared memory deposition, call normal kernels
+    else {
+        if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Esirkepov) {
+            if (WarpX::nox == 1){
+                doEsirkepovDepositionShapeN<1>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 2){
+                doEsirkepovDepositionShapeN<2>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 3){
+                doEsirkepovDepositionShapeN<3>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_arr, jy_arr, jz_arr, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            }
+        } else if (WarpX::current_deposition_algo == CurrentDepositionAlgo::Vay) {
+            if (WarpX::nox == 1){
+                doVayDepositionShapeN<1>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 2){
+                doVayDepositionShapeN<2>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 3){
+                doVayDepositionShapeN<3>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, dt, relative_time, dx, xyzmin, lo, q,
+                    WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            }
+        } else { // Direct deposition
+            if (WarpX::nox == 1){
+                doDepositionShapeN<1>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 2){
+                doDepositionShapeN<2>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            } else if (WarpX::nox == 3){
+                doDepositionShapeN<3>(
+                    GetPosition, wp.dataPtr() + offset, uxp.dataPtr() + offset,
+                    uyp.dataPtr() + offset, uzp.dataPtr() + offset, ion_lev,
+                    jx_fab, jy_fab, jz_fab, np_to_depose, relative_time, dx,
+                    xyzmin, lo, q, WarpX::n_rz_azimuthal_modes, cost,
+                    WarpX::load_balance_costs_update_algo);
+            }
         }
     }
     WARPX_PROFILE_VAR_STOP(blp_deposit);
@@ -590,59 +682,284 @@ WarpXParticleContainer::DepositCurrent (
 void
 WarpXParticleContainer::DepositCharge (WarpXParIter& pti, RealVector const& wp,
                                        const int * const ion_lev,
-                                       amrex::MultiFab* rho, int icomp,
+                                       amrex::MultiFab* rho, const int icomp,
                                        const long offset, const long np_to_depose,
-                                       int thread_num, int lev, int depos_lev)
+                                       const int thread_num, const int lev, const int depos_lev)
 {
-    WarpX& warpx = WarpX::GetInstance();
+    if (WarpX::do_shared_mem_charge_deposition)
+    {
+        WARPX_ALWAYS_ASSERT_WITH_MESSAGE((depos_lev==(lev-1)) ||
+                                         (depos_lev==(lev  )),
+                                         "Deposition buffers only work for lev-1");
 
-    // deposition guards
-    // note: this is smaller than rho->nGrowVect() for PSATD
-    const amrex::IntVect& ng_rho = warpx.get_ng_depos_rho();
+        // If no particles, do not do anything
+        if (np_to_depose == 0) return;
 
-    const std::array<amrex::Real,3>& dx = WarpX::CellSize(std::max(depos_lev,0));
-    amrex::IntVect ref_ratio;
-    if (lev == depos_lev) {
-        ref_ratio = IntVect(AMREX_D_DECL(1, 1, 1 ));
-    } else {
-        ref_ratio = WarpX::RefRatio(depos_lev);
-    }
-    const int nc = WarpX::ncomps;
+        // Number of guard cells for local deposition of rho
+        WarpX& warpx = WarpX::GetInstance();
+        const amrex::IntVect& ng_rho = warpx.get_ng_depos_rho();
 
-    // Get tile box where charge is deposited.
-    // The tile box is different when depositing in the buffers (depos_lev<lev)
-    // or when depositing inside the level (depos_lev=lev)
-    amrex::Box tilebox;
-    if (lev == depos_lev) {
-        tilebox = pti.tilebox();
+        // Extract deposition order and check that particles shape fits within the guard cells.
+        // NOTE: In specific situations where the staggering of rho and the charge deposition algorithm
+        // are not trivial, this check might be too strict and we might need to relax it, as currently
+        // done for the current deposition.
+
+#if   defined(WARPX_DIM_1D_Z)
+        const amrex::IntVect shape_extent = amrex::IntVect(static_cast<int>(WarpX::noz/2));
+#elif defined(WARPX_DIM_XZ) || defined(WARPX_DIM_RZ)
+        const amrex::IntVect shape_extent = amrex::IntVect(static_cast<int>(WarpX::nox/2+1),
+                                                           static_cast<int>(WarpX::noz/2+1));
+#elif defined(WARPX_DIM_3D)
+        const amrex::IntVect shape_extent = amrex::IntVect(static_cast<int>(WarpX::nox/2+1),
+                                                           static_cast<int>(WarpX::noy/2+1),
+                                                           static_cast<int>(WarpX::noz/2+1));
+#endif
+
+        // On CPU: particles deposit on tile arrays, which have a small number of guard cells ng_rho
+        // On GPU: particles deposit directly on the rho array, which usually have a larger number of guard cells
+#ifndef AMREX_USE_GPU
+        const amrex::IntVect range = ng_rho - shape_extent;
+#else
+        const amrex::IntVect range = rho->nGrowVect() - shape_extent;
+#endif
+
+        AMREX_ASSERT_WITH_MESSAGE(
+            amrex::numParticlesOutOfRange(pti, range) == 0,
+            "Particles shape does not fit within tile (CPU) or guard cells (GPU) used for charge deposition");
+        amrex::ignore_unused(range); // In case the assertion isn't compiled
+
+        const std::array<Real,3>& dx = WarpX::CellSize(std::max(depos_lev,0));
+        const Real q = this->charge;
+
+        WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCharge::Sorting", blp_sort);
+        WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCharge::ChargeDeposition", blp_ppc_chd);
+        WARPX_PROFILE_VAR_NS("WarpXParticleContainer::DepositCharge::Accumulate", blp_accumulate);
+
+        // Get tile box where charge is deposited.
+        // The tile box is different when depositing in the buffers (depos_lev<lev)
+        // or when depositing inside the level (depos_lev=lev)
+        Box tilebox;
+        if (lev == depos_lev) {
+            tilebox = pti.tilebox();
+        } else {
+            const IntVect& ref_ratio = WarpX::RefRatio(depos_lev);
+            tilebox = amrex::coarsen(pti.tilebox(),ref_ratio);
+        }
+
+        const auto ix_type = rho->ixType().toIntVect();
+#ifndef AMREX_USE_GPU
+        // Staggered tile box
+        Box tb = amrex::convert( tilebox, ix_type );
+#endif
+
+        tilebox.grow(ng_rho);
+
+        const int nc = WarpX::ncomps;
+
+#ifdef AMREX_USE_GPU
+        amrex::ignore_unused(thread_num);
+        // GPU, no tiling: rho_fab points to the full rho array
+        MultiFab rhoi(*rho, amrex::make_alias, icomp*nc, nc);
+        auto & rho_fab = rhoi.get(pti);
+#else
+        tb.grow(ng_rho);
+
+        // CPU, tiling: rho_fab points to local_rho[thread_num]
+        local_rho[thread_num].resize(tb, nc);
+
+        // local_rho[thread_num] is set to zero
+        local_rho[thread_num].setVal(0.0);
+
+        auto & rho_fab = local_rho[thread_num];
+#endif
+
+        // Lower corner of tile box physical domain
+        // Note that this includes guard cells since it is after tilebox.ngrow
+        // Take into account Galilean shift
+        Real dt = warpx.getdt(lev);
+        const amrex::Real time_shift_delta = (icomp == 0 ? 0.0_rt : dt);
+        const std::array<amrex::Real,3>& xyzmin = WarpX::LowerCorner(
+            tilebox, depos_lev, time_shift_delta);
+
+        // Indices of the lower bound
+        const Dim3 lo = lbound(tilebox);
+
+        // HACK - sort particles by bin here.
+        WARPX_PROFILE_VAR_START(blp_sort);
+        amrex::DenseBins<ParticleType> bins;
+        {
+            const Geometry& geom = Geom(lev);
+            const auto dxi = geom.InvCellSizeArray();
+            const auto plo = geom.ProbLoArray();
+            const auto domain = geom.Domain();
+
+            auto& ptile = ParticlesAt(lev, pti);
+            auto& aos = ptile.GetArrayOfStructs();
+            auto pstruct_ptr = aos().dataPtr();
+
+            Box box = pti.validbox();
+            box.grow(ng_rho);
+            amrex::IntVect bin_size = WarpX::shared_tilesize;
+            int ntiles = numTilesInBox(box, true, bin_size);
+
+            bins.build(ptile.numParticles(), pstruct_ptr, ntiles,
+                       [=] AMREX_GPU_HOST_DEVICE (const ParticleType& p) -> unsigned int
+                       {
+                           Box tbx;
+                           auto iv = getParticleCell(p, plo, dxi, domain);
+                           AMREX_ASSERT(box.contains(iv));
+                           auto tid = getTileIndex(iv, box, true, bin_size, tbx);
+                           return static_cast<unsigned int>(tid);
+                       });
+        }
+        WARPX_PROFILE_VAR_STOP(blp_sort);
+
+        // get tile boxes
+        amrex::Gpu::DeviceVector<Box> tboxes(bins.numBins(), amrex::Box());
+        {
+            const Geometry& geom = Geom(lev);
+            const auto dxi = geom.InvCellSizeArray();
+            const auto plo = geom.ProbLoArray();
+            const auto domain = geom.Domain();
+
+            auto& ptile = ParticlesAt(lev, pti);
+            auto& aos = ptile.GetArrayOfStructs();
+            auto pstruct_ptr = aos().dataPtr();
+
+            Box box = pti.validbox();
+            box.grow(ng_rho);
+            amrex::IntVect bin_size = WarpX::shared_tilesize;
+
+            const auto offsets_ptr = bins.offsetsPtr();
+            auto tbox_ptr = tboxes.dataPtr();
+            auto permutation = bins.permutationPtr();
+            amrex::ParallelFor(bins.numBins(),
+                [=] AMREX_GPU_DEVICE (int ibin) {
+                    const int bin_start = offsets_ptr[ibin];
+                    const int bin_stop = offsets_ptr[ibin+1];
+                    if (bin_start < bin_stop) {
+                        auto p = pstruct_ptr[permutation[bin_start]];
+                        Box tbx;
+                        auto iv = getParticleCell(p, plo, dxi, domain);
+                        AMREX_ASSERT(box.contains(iv));
+                        [[maybe_unused]] auto tid = getTileIndex(iv, box, true, bin_size, tbx);
+                        AMREX_ASSERT(tid == ibin);
+                        AMREX_ASSERT(tbx.contains(iv));
+                        tbox_ptr[ibin] = tbx;
+                    }
+                });
+        }
+
+        // compute max tile box size in each direction
+        amrex::IntVect max_tbox_size;
+        {
+            ReduceOps<AMREX_D_DECL(ReduceOpMax, ReduceOpMax, ReduceOpMax)> reduce_op;
+            ReduceData<AMREX_D_DECL(int, int, int)> reduce_data(reduce_op);
+            using ReduceTuple = typename decltype(reduce_data)::Type;
+
+            const auto boxes_ptr = tboxes.dataPtr();
+            reduce_op.eval(tboxes.size(), reduce_data,
+                           [=] AMREX_GPU_DEVICE (int i) -> ReduceTuple
+                           {
+                               const Box& box = boxes_ptr[i];
+                               if (box.ok()) {
+                                   IntVect si = box.length();
+                                   return {AMREX_D_DECL(si[0], si[1], si[2])};
+                               } else {
+                                   return {AMREX_D_DECL(0, 0, 0)};
+                               }
+                           });
+
+            ReduceTuple hv = reduce_data.value();
+
+            max_tbox_size = IntVect(AMREX_D_DECL(amrex::get<0>(hv),
+                                                 amrex::get<1>(hv),
+                                                 amrex::get<2>(hv)));
+        }
+
+        WARPX_PROFILE_VAR_START(blp_ppc_chd);
+        amrex::LayoutData<amrex::Real>* costs = WarpX::getCosts(lev);
+        amrex::Real* cost = costs ? &((*costs)[pti.index()]) : nullptr;
+
+        const auto GetPosition = GetParticlePosition(pti, offset);
+        const Geometry& geom = Geom(lev);
+        Box box = pti.validbox();
+        box.grow(ng_rho);
+
+        if (WarpX::nox == 1){
+            doChargeDepositionSharedShapeN<1>(GetPosition, wp.dataPtr()+offset, ion_lev,
+                                              rho_fab, ix_type, np_to_depose, dx, xyzmin, lo, q,
+                                              WarpX::n_rz_azimuthal_modes, cost,
+                                              WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+        } else if (WarpX::nox == 2){
+            doChargeDepositionSharedShapeN<2>(GetPosition, wp.dataPtr()+offset, ion_lev,
+                                              rho_fab, ix_type, np_to_depose, dx, xyzmin, lo, q,
+                                              WarpX::n_rz_azimuthal_modes, cost,
+                                              WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+        } else if (WarpX::nox == 3){
+            doChargeDepositionSharedShapeN<3>(GetPosition, wp.dataPtr()+offset, ion_lev,
+                                              rho_fab, ix_type, np_to_depose, dx, xyzmin, lo, q,
+                                              WarpX::n_rz_azimuthal_modes, cost,
+                                              WarpX::load_balance_costs_update_algo, bins, box, geom, max_tbox_size);
+        }
+#ifndef AMREX_USE_GPU
+        // CPU, tiling: atomicAdd local_rho into rho
+        WARPX_PROFILE_VAR_START(blp_accumulate);
+        (*rho)[pti].atomicAdd(local_rho[thread_num], tb, tb, 0, icomp*nc, nc);
+        WARPX_PROFILE_VAR_STOP(blp_accumulate);
+#endif
     } else {
-        tilebox = amrex::coarsen(pti.tilebox(), ref_ratio);
-    }
-    tilebox.grow(ng_rho);
 
-    // Lower corner of tile box physical domain
-    // Note that this includes guard cells since it is after tilebox.ngrow
-    // Take into account Galilean shift
-    const amrex::Real dt = warpx.getdt(lev);
-    const amrex::Real time_shift_delta = (icomp == 0 ? 0.0_rt : dt);
-    const std::array<amrex::Real,3>& xyzmin = WarpX::LowerCorner(tilebox, depos_lev, time_shift_delta);
+        WarpX& warpx = WarpX::GetInstance();
 
-    // pointer to costs data
-    amrex::LayoutData<amrex::Real>* costs = WarpX::getCosts(lev);
-    amrex::Real* cost = costs ? &((*costs)[pti.index()]) : nullptr;
-
-    AMREX_ALWAYS_ASSERT(WarpX::nox == WarpX::noy);
-    AMREX_ALWAYS_ASSERT(WarpX::nox == WarpX::noz);
-
-    ablastr::particles::deposit_charge<WarpXParticleContainer>(
-        pti, wp, this->charge, ion_lev,
-        rho, local_rho[thread_num],
-        WarpX::noz, dx, xyzmin, WarpX::n_rz_azimuthal_modes,
-        ng_rho, depos_lev, ref_ratio,
-        offset, np_to_depose,
-        icomp, nc,
-        cost, WarpX::load_balance_costs_update_algo, WarpX::do_device_synchronize
-    );
+        // deposition guards
+        // note: this is smaller than rho->nGrowVect() for PSATD
+        const amrex::IntVect& ng_rho = warpx.get_ng_depos_rho();
+
+        const std::array<amrex::Real,3>& dx = WarpX::CellSize(std::max(depos_lev,0));
+        amrex::IntVect ref_ratio;
+        if (lev == depos_lev) {
+            ref_ratio = IntVect(AMREX_D_DECL(1, 1, 1 ));
+        } else {
+            ref_ratio = WarpX::RefRatio(depos_lev);
+        }
+        const int nc = WarpX::ncomps;
+
+        // Get tile box where charge is deposited.
+        // The tile box is different when depositing in the buffers (depos_lev<lev)
+        // or when depositing inside the level (depos_lev=lev)
+        amrex::Box tilebox;
+        if (lev == depos_lev) {
+            tilebox = pti.tilebox();
+        } else {
+            tilebox = amrex::coarsen(pti.tilebox(), ref_ratio);
+        }
+        tilebox.grow(ng_rho);
+
+        // Lower corner of tile box physical domain
+        // Note that this includes guard cells since it is after tilebox.ngrow
+        // Take into account Galilean shift
+        const amrex::Real dt = warpx.getdt(lev);
+        const amrex::Real time_shift_delta = (icomp == 0 ? 0.0_rt : dt);
+        const std::array<amrex::Real,3>& xyzmin = WarpX::LowerCorner(tilebox, depos_lev, time_shift_delta);
+
+        // pointer to costs data
+        amrex::LayoutData<amrex::Real>* costs = WarpX::getCosts(lev);
+        amrex::Real* cost = costs ? &((*costs)[pti.index()]) : nullptr;
+
+        AMREX_ALWAYS_ASSERT(WarpX::nox == WarpX::noy);
+        AMREX_ALWAYS_ASSERT(WarpX::nox == WarpX::noz);
+
+        ablastr::particles::deposit_charge<WarpXParticleContainer>(
+            pti, wp, this->charge, ion_lev,
+            rho, local_rho[thread_num],
+            WarpX::noz, dx, xyzmin, WarpX::n_rz_azimuthal_modes,
+            ng_rho, depos_lev, ref_ratio,
+            offset, np_to_depose,
+            icomp, nc,
+            cost, WarpX::load_balance_costs_update_algo, WarpX::do_device_synchronize
+        );
+    }
 }
 
 void
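A recurring pattern in the hunks above is to bin the particles of a grown valid box with `amrex::DenseBins` and then visit each bin through the permutation array instead of physically sorting the particles. The sketch below isolates that pattern, using only the `DenseBins` methods that appear in the diff (`build`, `numBins`, `offsetsPtr`, `permutationPtr`); the particle type and the bin-index functor are simplified placeholders, not WarpX code.

```cpp
#include <AMReX.H>
#include <AMReX_DenseBins.H>
#include <AMReX_GpuLaunch.H>

// Placeholder particle type; WarpX uses its own ParticleType here.
struct MyParticle { amrex::Real w; int cell; };

void bin_and_visit (const MyParticle* particles, int np, int nbins)
{
    amrex::DenseBins<MyParticle> bins;

    // Assign each particle to a bin without moving it. The functor below is a
    // placeholder; the diff maps the particle's cell to a tile index instead.
    bins.build(np, particles, nbins,
        [=] AMREX_GPU_HOST_DEVICE (const MyParticle& p) -> unsigned int
        {
            return static_cast<unsigned int>(p.cell) % static_cast<unsigned int>(nbins);
        });

    // Walk the particles of each bin through the permutation array, exactly as
    // the deposition code above iterates the particles of one tile.
    const auto offsets     = bins.offsetsPtr();
    const auto permutation = bins.permutationPtr();
    amrex::ParallelFor(bins.numBins(),
        [=] AMREX_GPU_DEVICE (int ibin)
        {
            for (auto i = offsets[ibin]; i < offsets[ibin+1]; ++i) {
                const MyParticle& p = particles[permutation[i]];
                amrex::ignore_unused(p); // a real kernel would deposit p here
            }
        });
}
```

In the diff itself the bin functor maps a particle's cell to a tile index via `getParticleCell` and `getTileIndex`, so each bin corresponds to one shared-memory tile; the first particle of each non-empty bin is also used to reconstruct that tile's box (`tboxes`), which feeds the `max_tbox_size` reduction passed to the shared deposition kernels.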