Split batched solver compilation #1629
base: develop
Conversation
(force-pushed 259f2c1 to 8c25a83)
This should have a huge impact; see the excerpt from the HIP 5.14 debug build log.
core/solver/batch_dispatch.hpp
#define GKO_BATCH_INSTANTIATE_STOP(macro, ...)                            \
    macro(__VA_ARGS__,                                                    \
          ::gko::batch::solver::device::batch_stop::SimpleAbsResidual);   \
    template macro( \
The `template` here (and in the other macros below) could be removed if the value/index type instantiation macros accepted a variable number of arguments.
That doesn't work until C++20. A macro declared as `macro(arg, ...)` requires at least two arguments before C++20.
In general, the idea looks good, but the pipelines are failing.
One thing against this approach is that readability and maintainability are seriously affected. The already complex batched code is even more complex and annoying to read now. Maybe instead of this split approach we should do what Jacobi does: have fewer cases by default, and only add full instantiations when necessary.
(force-pushed 8c25a83 to 870ad69)
IMO the Jacobi instantiation is more complex than what is here. The kernel and the instantiations are directly together, instead of being generated by CMake, which makes it easier to follow for me. But I agree that the batch system needs an overhaul in general.
(force-pushed d04f06c to fa6d091)
(force-pushed fa6d091 to e59ab55)
An alternative approach: https://github.com/ginkgo-project/ginkgo/tree/batch-optim
This seems to be quite orthogonal to this PR. With full optimizations enabled, there would be the same issue as before, so the fix from this PR is still needed. I don't see a reason why we should burden people who want full optimizations enabled with those long compile times, for which we already have a fix available.
@MarcelKoch, can you please rebase this when you have some time so we can try to get it merged?
(force-pushed 48fe94b to 045ad1c)
- adds header guard

Co-authored-by: Pratik Nayak <[email protected]>
Co-authored-by: Tobias Ribizel <[email protected]>
(force-pushed 8258bc9 to 45e6a9b)
@@ -22,14 +22,14 @@ namespace csr {
/**
 * Encapsulates one matrix from a batch of csr matrices.
 */
-template <typename ValueType, typename IndexType>
+template <typename ValueType, typename IndexType = const int32>
Does it need to be `const int32`? My personal preference is `int32`, using `const IndexType` where necessary.
It was necessary to add a default parameter, because CSR and ELL have two template arguments, while Dense only has one. So, to handle them the same way in the template instantiation, I had to either add another (unused) template argument to Dense, or add a default argument to CSR and ELL. Since CSR and ELL already expect to use only int32 as index type, I chose the latter.
Additionally, it had to be const, because the default argument has to match how the solver functions are called. Otherwise, there would be a mismatch between the instantiated functions and what is actually called, leading to linker errors.
Any more comments? If not, I will merge this today or tomorrow morning.
Codecov Report: all modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##           develop    #1629   +/-   ##
========================================
  Coverage    90.37%   90.37%
========================================
  Files          782      782
  Lines        63428    63429       +1
========================================
+ Hits         57325    57326       +1
  Misses        6103     6103
Sorry, I did not intend to hold up this PR with my previous comments.
One other comment here: when getting the number of registers in CUDA, you take the maximum between the worst-shared-memory and best-shared-memory case.
I do not see this in the previous release, when we still had the optimization selection.
I assume it introduced some illegal configuration.
Could you elaborate on it more?
get_num_regs(
    batch_single_kernels::apply_kernel<StopType, 9, true, PrecType,
                                       LogType, BatchMatrixType,
                                       ValueType>),
get_num_regs(
    batch_single_kernels::apply_kernel<StopType, 0, false, PrecType,
                                       LogType, BatchMatrixType,
                                       ValueType>));
Where is the second one from?
I think the first one is everything in shared memory, and the second one is nothing in shared memory.
I think it will lead to an issue with single mode.
// begin
GKO_INSTANTIATE_FOR_EACH_VALUE_TYPE(GKO_DECLARE_BATCH_BICGSTAB_LAUNCH_0);
It will not work with GINKGO_DPCPP_SINGLE_MODE=ON. We use the instantiation to provide the specialization with an unsupported exception for double precision. With GKO_BATCH_INSTANTION, it will be wrong: only the last one gets the specialization, while the others will still be instantiated.
template macro {GKO_UNSUPPORTED;}
->
template first_...;
template second_...;
...
template last {GKO_UNSUPPORTED;}
You're right. Thanks for bringing this up. I changed the order of macro application, so now it should be fixed.
Quality Gate failed: failed conditions.
I would wait for the CI to finish before merging this (maybe also the Intel SYCL pipelines), but it looks good to me otherwise.
const int max_threads_regs =
    ((max_regs_blk / static_cast<int>(num_regs_used)) / warp_sz) * warp_sz;
int max_threads = std::min(max_threads_regs, device_max_threads);
max_threads = max_threads <= max_bicgstab_threads ? max_threads
                                                  : max_bicgstab_threads;
Just a comment and something for me to do in the future: I think this whole logic needs to be simplified. It now also seems possible to set the maximum number of registers, similar to the launch bounds in CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximum-number-of-registers-per-thread
But of course, that might mean we cannot unify HIP and CUDA anymore; something we need to investigate.
LGTM. It is a bit hard to understand now, though.
This PR splits up the compilation of the batched solvers in order to reduce compilation times. It splits the instantiations of the kernel launches by the number of vectors kept in shared memory. This is based on the same CMake mechanism as the csr and fbcsr kernels use.