Split batched solver compilation #1629
base: develop
Conversation
(force-pushed 259f2c1 to 8c25a83)
This should have a huge impact; see the excerpt from the HIP 5.14 debug build log.
core/solver/batch_dispatch.hpp
#define GKO_BATCH_INSTANTIATE_STOP(macro, ...)                            \
    macro(__VA_ARGS__,                                                    \
          ::gko::batch::solver::device::batch_stop::SimpleAbsResidual);   \
    template macro( \
The `template` here (and in the other macros below) could be removed if the value/index type instantiation macros accepted a variable number of arguments.
That doesn't work until C++20. A macro declared as `macro(arg, ...)` requires at least two arguments before C++20.
In general, the idea looks good, but the pipelines are failing.
One thing against this approach is that readability and maintainability are seriously affected. The already complex batched code is even more complex and annoying to read now. Maybe instead of this split approach we should do what Jacobi does: have fewer cases by default, and only add full instantiations when necessary.
(force-pushed 8c25a83 to 870ad69)
IMO the Jacobi instantiation is more complex than what is here. The kernel and the instantiations are directly together, instead of being generated by CMake, which makes it easier to follow for me. But I agree that the batch system needs an overhaul in general.
(force-pushed d04f06c to fa6d091)
(force-pushed fa6d091 to e59ab55)
An alternative approach: https://github.com/ginkgo-project/ginkgo/tree/batch-optim
This seems to be quite orthogonal to this PR. With full optimizations enabled, there would be the same issue as before, so the fix from this PR is still needed. I don't see a reason why we should burden people who want full optimizations enabled with those long compile times, for which we already have a fix available.
@MarcelKoch, can you please rebase this when you have some time so we can try to get it merged?
(force-pushed 48fe94b to 045ad1c)
- adds header guard

Co-authored-by: Pratik Nayak <[email protected]>
Co-authored-by: Tobias Ribizel <[email protected]>
(force-pushed 8258bc9 to 45e6a9b)
@@ -22,14 +22,14 @@ namespace csr {
/**
 * Encapsulates one matrix from a batch of csr matrices.
 */
-template <typename ValueType, typename IndexType>
+template <typename ValueType, typename IndexType = const int32>
Does it need to be `const int32`? My personal preference is `int32`, using `const IndexType` where necessary.
It was necessary to add a default parameter, because CSR and ELL have two template arguments, while Dense only has one. So, to handle them the same way in the template instantiation, I had to either add another (unused) template argument to Dense, or add a default argument to CSR and ELL. Since CSR and ELL already expect to use only int32 as index type, I chose the latter.
Additionally, it had to be const, because the default argument has to match how the solver functions are called. Otherwise, there would be a mismatch between the instantiated functions and what is actually called, leading to linker errors.
Any more comments? If not, I will merge this today or tomorrow morning.
Codecov Report: all modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##           develop    #1629   +/-   ##
========================================
  Coverage    90.37%   90.37%
========================================
  Files          782      782
  Lines        63428    63429       +1
========================================
+ Hits         57325    57326       +1
  Misses        6103     6103
Sorry, I did not intend to hold up this PR with my previous comments.
One other comment here: when getting the number of registers in CUDA, you take the maximum between the worst-shared-memory and best-shared-memory case.
I do not see this in the previous release, when we still had the optimization selection.
I assume it introduced some illegal configuration.
Could you elaborate on it more?
get_num_regs(
    batch_single_kernels::apply_kernel<StopType, 9, true, PrecType,
                                       LogType, BatchMatrixType,
                                       ValueType>),
get_num_regs(
    batch_single_kernels::apply_kernel<StopType, 0, false, PrecType,
                                       LogType, BatchMatrixType,
                                       ValueType>));
Where is the second one from?
I think the first one is everything in shared memory, and the second one is nothing in shared memory.
I think it will lead to an issue with single mode.
// begin
GKO_INSTANTIATE_FOR_EACH_VALUE_TYPE(GKO_DECLARE_BATCH_BICGSTAB_LAUNCH_0);
It will not work with GINKGO_DPCPP_SINGLE_MODE=ON. We use the instantiation to provide the specialization with an unsupported exception for double precision. With GKO_BATCH_INSTANTION, it will be wrong: only the last one gets the specialization, while the others will still be instantiated.
template macro {GKO_UNSUPPORTED;}
->
template first_...;
template second_...;
...
template last {GKO_UNSUPPORTED;}
You're right. Thanks for bringing this up. I changed the order of macro application, so now it should be fixed.
Quality Gate failed: failed conditions.
I would wait for the CI to finish before merging this (maybe also the Intel SYCL pipelines), but it looks good to me otherwise.
const int max_threads_regs =
    ((max_regs_blk / static_cast<int>(num_regs_used)) / warp_sz) * warp_sz;
int max_threads = std::min(max_threads_regs, device_max_threads);
max_threads = max_threads <= max_bicgstab_threads ? max_threads
                                                  : max_bicgstab_threads;
Just a comment and something for me to do in the future: I think this whole logic needs to be simplified. It now also seems possible to set the maximum number of registers, similar to the launch bounds in CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximum-number-of-registers-per-thread
But of course, that might mean we cannot unify HIP and CUDA anymore; something we need to investigate.
LGTM. It is a bit hard to understand now, though.
This PR splits up the compilation of the batched solvers in order to reduce compilation times. It splits the instantiations of the kernel launches by the number of vectors kept in shared memory. This is based on the same CMake mechanism as the csr and fbcsr kernels use.