
Conversation

JustinRayAngus
Contributor

PR #6156 is failing a CI test because adding one additional int parameter to the ImplicitPushXPSubOrbits() routine pushes the kernel argument size beyond the 2048-byte limit.

This PR is an attempt to clean up the ImplicitPushXPSubOrbits() function and to reduce the kernel argument size as much as possible.
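For context, a quick way to see where the bytes go (a standalone AMReX sketch, not the WarpX code): every Array4 the ParallelFor lambda captures by value becomes part of the kernel argument block, so the nine mass-matrix arrays alone account for 9 * sizeof(Array4) bytes.

```cpp
// Minimal sketch: report how many bytes N Array4 captures contribute to the
// kernel argument block. Illustrative only; it just measures sizeof(Array4).
#include <AMReX.H>
#include <AMReX_Array4.H>
#include <AMReX_REAL.H>
#include <cstdio>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        constexpr std::size_t n_mass_matrix_arrays = 9; // Sxx ... Szz
        std::printf("sizeof(Array4<Real>) = %zu bytes; %zu captures = %zu bytes\n",
                    sizeof(amrex::Array4<amrex::Real>),
                    n_mass_matrix_arrays,
                    n_mass_matrix_arrays * sizeof(amrex::Array4<amrex::Real>));
    }
    amrex::Finalize();
    return 0;
}
```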

@WeiqunZhang @ax3l @atmyers Any ideas for reducing this kernel argument size?

@WeiqunZhang
Member

This type of change might help reduce the kernel argument size.

diff --git a/Source/Particles/Pusher/ImplicitPushPX.cpp b/Source/Particles/Pusher/ImplicitPushPX.cpp
index 091ca3079..8686d84d9 100644
--- a/Source/Particles/Pusher/ImplicitPushPX.cpp
+++ b/Source/Particles/Pusher/ImplicitPushPX.cpp
@@ -699,6 +699,9 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
     amrex::Array4<amrex::Real> const & Szy_arr = (deposit_mass_matrices ? Szy->array(pti) : amrex::Array4<amrex::Real>());
     amrex::Array4<amrex::Real> const & Szz_arr = (deposit_mass_matrices ? Szz->array(pti) : amrex::Array4<amrex::Real>());
 
+    amrex::Gpu::Buffer<amrex::Array4<amrex::Real>> buf({Sxx_arr, Sxy_arr, Sxz_arr, Syx_arr, Syy_arr, Syz_arr, Szx_arr, Szy_arr, Szz_arr});
+    auto const* pbuf = buf.data();
+
     auto& attribs = pti.GetAttribs();
     amrex::ParticleReal* const AMREX_RESTRICT ux = attribs[PIdx::ux].dataPtr();
     amrex::ParticleReal* const AMREX_RESTRICT uy = attribs[PIdx::uy].dataPtr();
@@ -900,7 +903,7 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
                     // in a constexpr-if context.
                     amrex::ignore_unused(full_mass_matrices, max_crossings);
                     amrex::ignore_unused(Jx_arr, Jy_arr, Jz_arr, invvol);
-                    amrex::ignore_unused(Sxx_arr, Sxy_arr, Sxz_arr, Syx_arr, Syy_arr, Syz_arr, Szx_arr, Szy_arr, Szz_arr);
+                    amrex::ignore_unused(pbuf);
                     if constexpr (depos_order_control == order_one) {
                         if (!full_mass_matrices) {
                             doVillasenorJandSigmaDepositionKernel<1,false>(
@@ -911,9 +914,7 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
                                                                   fpzx, fpzy, fpzz,
                                                                   Jx_arr, Jy_arr, Jz_arr,
                                                                   max_crossings,
-                                                                  Sxx_arr, Sxy_arr, Sxz_arr,
-                                                                  Syx_arr, Syy_arr, Syz_arr,
-                                                                  Szx_arr, Szy_arr, Szz_arr,
+                                                                  pbuf[0], pbuf[1], ..., pbuf[8],
                                                                   dt_suborbit, dinv, xyzmin, lo );
                         } else if (full_mass_matrices) {
                             doVillasenorJandSigmaDepositionKernel<1,true>(
@@ -1115,4 +1116,5 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
 
     });
 
+    amrex::Gpu::streamSynchronize();
 }

@JustinRayAngus
Contributor Author

@WeiqunZhang Thanks for the suggestion. I plan to remove the mass matrices from this function in a near-future PR, but this is a great suggestion for now. I have a few questions.

  1. How much do you expect this to reduce the kernel size?
  2. Why is amrex::Gpu::streamSynchronize() needed after the ParallelFor, at the end of the routine?

@JustinRayAngus changed the title from [WIP] Cleanup/streamline ImplicitPushXPSubOrbits() to Cleanup/streamline ImplicitPushXPSubOrbits() on Sep 12, 2025
@WeiqunZhang
Copy link
Member

Instead of capturing N*sizeof(Array4) bytes, the lambda only captures the size of a pointer. Gpu::streamSynchronize is needed because the device buffer behind that pointer must remain valid while the asynchronous kernel is still running.
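In other words, the pattern looks like the following minimal sketch (plain AMReX with placeholder MultiFabs a, b, c sharing the same BoxArray/DistributionMapping and a toy kernel body, not the WarpX code): the Array4s are packed into a device-side Gpu::Buffer, the lambda captures only the buffer pointer, and streamSynchronize keeps the buffer alive until the asynchronous kernel finishes.

```cpp
// Sketch of the capture-by-pointer pattern suggested above.
// Assumes AMReX only; the MultiFabs and kernel body are illustrative.
#include <AMReX_MultiFab.H>
#include <AMReX_Gpu.H>

void scale_three (amrex::MultiFab& a, amrex::MultiFab& b, amrex::MultiFab& c,
                  amrex::Real factor)
{
    for (amrex::MFIter mfi(a); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.validbox();

        // Pack the Array4s into a device-side buffer. The lambda below then
        // captures a single pointer instead of 3 * sizeof(Array4) bytes.
        amrex::Gpu::Buffer<amrex::Array4<amrex::Real>> buf(
            {a.array(mfi), b.array(mfi), c.array(mfi)});
        auto const* parr = buf.data();

        amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k)
            {
                for (int n = 0; n < 3; ++n) {
                    parr[n](i,j,k) *= factor;
                }
            });

        // The launch is asynchronous: synchronize before buf is destroyed so
        // the device copy of the Array4s stays valid while the kernel runs.
        amrex::Gpu::streamSynchronize();
    }
}
```

The trade-off is a small per-launch cost for the host-to-device copy of the buffer and for the synchronization, in exchange for a much smaller kernel argument block.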

#ifdef WARPX_QED
, do_sync, t_chi_max, p_optical_depth_QSR, evolve_opt
#endif
);

if (!skip_deposition && doing_deposition) {
Member


This ignores the skip_deposition flag. The presumption here is that with the implicit advance, the deposition will always be done?

Contributor Author


The logic was just consolidated and moved below; see lines 1058-1059.

Member

@dpgrote left a comment


Looks good!

@ax3l added the backend: cuda, backend: hip, backend: sycl, component: implicit solvers, and Performance optimization labels on Sep 12, 2025
Member

@ax3l left a comment


Thanks Justin & Weiqun, LGTM 👍

@ax3l merged commit 7dbbea1 into BLAST-WarpX:development on Sep 12, 2025
50 checks passed