
Conversation

JustinRayAngus
Contributor

PR #6156 is failing a CI test because adding one additional int parameter to the ImplicitPushXPSubOrbits() routine pushes the kernel argument size beyond the 2048-byte limit.

This PR is an attempt to clean up the ImplicitPushXPSubOrbits() function and to reduce the kernel argument size as much as possible.
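For context, a quick way to see where the bytes go (a standalone AMReX sketch, not the WarpX code): every Array4 the ParallelFor lambda captures by value becomes part of the kernel argument block, so the nine mass-matrix arrays alone account for 9 * sizeof(Array4) bytes.

```cpp
// Minimal sketch: report how many bytes N Array4 captures contribute to the
// kernel argument block. Illustrative only; it just measures sizeof(Array4).
#include <AMReX.H>
#include <AMReX_Array4.H>
#include <AMReX_REAL.H>
#include <cstdio>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        constexpr std::size_t n_mass_matrix_arrays = 9; // Sxx ... Szz
        std::printf("sizeof(Array4<Real>) = %zu bytes; %zu captures = %zu bytes\n",
                    sizeof(amrex::Array4<amrex::Real>),
                    n_mass_matrix_arrays,
                    n_mass_matrix_arrays * sizeof(amrex::Array4<amrex::Real>));
    }
    amrex::Finalize();
    return 0;
}
```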

@WeiqunZhang @ax3l @atmyers Any ideas for reducing this kernel argument size?

@WeiqunZhang
Member

This type of change might help reduce the kernel argument size.

diff --git a/Source/Particles/Pusher/ImplicitPushPX.cpp b/Source/Particles/Pusher/ImplicitPushPX.cpp
index 091ca3079..8686d84d9 100644
--- a/Source/Particles/Pusher/ImplicitPushPX.cpp
+++ b/Source/Particles/Pusher/ImplicitPushPX.cpp
@@ -699,6 +699,9 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
     amrex::Array4<amrex::Real> const & Szy_arr = (deposit_mass_matrices ? Szy->array(pti) : amrex::Array4<amrex::Real>());
     amrex::Array4<amrex::Real> const & Szz_arr = (deposit_mass_matrices ? Szz->array(pti) : amrex::Array4<amrex::Real>());
 
+    amrex::Gpu::Buffer<amrex::Array4<amrex::Real>> buf({Sxx_arr, Sxy_arr, Sxz_arr, Syx_arr, Syy_arr, Syz_arr, Szx_arr, Szy_arr, Szz_arr});
+    auto const* pbuf = buf.data();
+
     auto& attribs = pti.GetAttribs();
     amrex::ParticleReal* const AMREX_RESTRICT ux = attribs[PIdx::ux].dataPtr();
     amrex::ParticleReal* const AMREX_RESTRICT uy = attribs[PIdx::uy].dataPtr();
@@ -900,7 +903,7 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
                     // in a constexpr-if context.
                     amrex::ignore_unused(full_mass_matrices, max_crossings);
                     amrex::ignore_unused(Jx_arr, Jy_arr, Jz_arr, invvol);
-                    amrex::ignore_unused(Sxx_arr, Sxy_arr, Sxz_arr, Syx_arr, Syy_arr, Syz_arr, Szx_arr, Szy_arr, Szz_arr);
+                    amrex::ignore_unused(pbuf);
                     if constexpr (depos_order_control == order_one) {
                         if (!full_mass_matrices) {
                             doVillasenorJandSigmaDepositionKernel<1,false>(
@@ -911,9 +914,7 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
                                                                   fpzx, fpzy, fpzz,
                                                                   Jx_arr, Jy_arr, Jz_arr,
                                                                   max_crossings,
-                                                                  Sxx_arr, Sxy_arr, Sxz_arr,
-                                                                  Syx_arr, Syy_arr, Syz_arr,
-                                                                  Szx_arr, Szy_arr, Szz_arr,
+                                                                  pbuf[0], pbuf[1], ..., pbuf[8],
                                                                   dt_suborbit, dinv, xyzmin, lo );
                         } else if (full_mass_matrices) {
                             doVillasenorJandSigmaDepositionKernel<1,true>(
@@ -1115,4 +1116,5 @@ PhysicalParticleContainer::ImplicitPushXPSubOrbits (WarpXParIter& pti,
 
     });
 
+    amrex::Gpu::streamSynchronize();
 }

@JustinRayAngus
Contributor Author

@WeiqunZhang Thanks for the suggestion. I plan to remove the mass matrices from this function in a near-future PR, but this is a great suggestion for now. I have a few questions.

  1. How much do you expect this to reduce the kernel size?
  2. Why is amrex::Gpu::streamSynchronize() needed after the ParallelFor, at the end of the routine?

@JustinRayAngus changed the title from [WIP] Cleanup/streamline ImplicitPushXPSubOrbits() to Cleanup/streamline ImplicitPushXPSubOrbits() on Sep 12, 2025
@WeiqunZhang
Copy link
Member

Instead of capturing N*sizeof(Array4) bytes, the lambda only captures the size of a pointer. Gpu::streamSynchronize is needed because the device buffer behind that pointer must remain valid while the asynchronous kernel is still running.
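In other words, the pattern looks like the following minimal sketch (plain AMReX with placeholder MultiFabs a, b, c sharing the same BoxArray/DistributionMapping and a toy kernel body, not the WarpX code): the Array4s are packed into a device-side Gpu::Buffer, the lambda captures only the buffer pointer, and streamSynchronize keeps the buffer alive until the asynchronous kernel finishes.

```cpp
// Sketch of the capture-by-pointer pattern suggested above.
// Assumes AMReX only; the MultiFabs and kernel body are illustrative.
#include <AMReX_MultiFab.H>
#include <AMReX_Gpu.H>

void scale_three (amrex::MultiFab& a, amrex::MultiFab& b, amrex::MultiFab& c,
                  amrex::Real factor)
{
    for (amrex::MFIter mfi(a); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.validbox();

        // Pack the Array4s into a device-side buffer. The lambda below then
        // captures a single pointer instead of 3 * sizeof(Array4) bytes.
        amrex::Gpu::Buffer<amrex::Array4<amrex::Real>> buf(
            {a.array(mfi), b.array(mfi), c.array(mfi)});
        auto const* parr = buf.data();

        amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k)
            {
                for (int n = 0; n < 3; ++n) {
                    parr[n](i,j,k) *= factor;
                }
            });

        // The launch is asynchronous: synchronize before buf is destroyed so
        // the device copy of the Array4s stays valid while the kernel runs.
        amrex::Gpu::streamSynchronize();
    }
}
```

The trade-off is a small per-launch cost for the host-to-device copy of the buffer and for the synchronization, in exchange for a much smaller kernel argument block.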

#ifdef WARPX_QED
, do_sync, t_chi_max, p_optical_depth_QSR, evolve_opt
#endif
);

if (!skip_deposition && doing_deposition) {
Member


This ignores the skip_deposition flag. The presumption here is that with the implicit advance, the deposition will always be done?

Contributor Author


The logic was just consolidated and moved below; see lines 1058-1059.

Member

@dpgrote left a comment


Looks good!

@ax3l added the backend: cuda, backend: hip, backend: sycl, component: implicit solvers, and Performance optimization labels on Sep 12, 2025
Member

@ax3l left a comment


Thanks Justin & Weiqun, LGTM 👍

@ax3l merged commit 7dbbea1 into BLAST-WarpX:development on Sep 12, 2025
50 checks passed