
Add device aggregators used by shared memory groupby #17031

Open · wants to merge 12 commits into base: branch-24.12

Conversation

@PointKernel (Member) commented Oct 9, 2024

Description

This work is part of splitting the original bulk shared memory groupby PR #16619.

It introduces two device-side element aggregators:

  • shmem_element_aggregator: aggregates data from global memory sources to shared memory targets,
  • gmem_element_aggregator: aggregates from shared memory sources to global memory targets.

These two aggregators are similar to the elementwise_aggregator functionality. Follow-up work is tracked via #17032.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Oct 9, 2024
@PointKernel PointKernel self-assigned this Oct 9, 2024
@PointKernel PointKernel added the 3 - Ready for Review Ready for review by team label Oct 9, 2024
@PointKernel PointKernel marked this pull request as ready for review October 9, 2024 21:53
@PointKernel PointKernel requested a review from a team as a code owner October 9, 2024 21:53
using DeviceTarget = cudf::detail::underlying_target_t<Source, aggregation::MIN>;
using DeviceSource = cudf::detail::underlying_source_t<Source, aggregation::MIN>;

DeviceTarget* target_casted = reinterpret_cast<DeviceTarget*>(target);

Contributor:

How will this work for strings columns?
The target type will be a string_view and this cast will be incorrect.


Member Author:

Good question. I wanted to give you a thorough response, so I took a deeper dive into the details to see how this works. The more I look into the current code, the more surprised I am that it's producing the correct results.

Still working on it. Need more time to fully understand this.


Member Author:


OK, finally got this: string columns don't have atomic support, so they always fall back to the sort-based groupby.


Contributor:

This gets into a more philosophical discussion about column_device_view and shared memory. Technically, atomics only work with simple fixed-width types (excluding timestamps and durations as well), so you wouldn't need something as fancy as a column_device_view wrapper for these types: they can always be represented by a pointer, a size (a device span), and a validity mask. You could simply cast the std::byte* to the underlying type without really needing a column_device_view (IMO).


Member Author:


> simply cast the std::byte* to the underlying type without really needing a column-device-view

Totally makes sense.

After removing dictionaries from the shared memory groupby, it now only handles integers and decimals. A pointer to the underlying type is sufficient for this. Let me unify the shared and global memory aggregators first. I’m really happy it has come down to such a simple solution. Thanks a lot for your input! @davidwendt


Member Author:

I've been working on unifying the aggregators locally, and it requires non-trivial effort around performance and API design.

I started by updating the null masks to use bitmask_type* instead of bool*, but I noticed around a 10% slowdown because updating a bitmask_type mask always requires atomic operations. Beyond that, there are behavioral differences between the shared memory aggregator, the global memory aggregator, and the existing row aggregator, particularly in how they handle nulls and which data types they support. I suggest we merge the current PR as is and track the follow-up work through #17032.


@mhaseeb123 (Member) left a comment:


Looks good to me. The discussion with @davidwendt cleared up a few of my confusions as well. I'm interested to see these aggregators unified and in action, so I'm subscribing to PR #17032 as well.

cudf::column_device_view source,
cudf::size_type source_index) const noexcept
{
if constexpr (k != cudf::aggregation::COUNT_ALL) {

@mhaseeb123 (Member) commented Oct 16, 2024:


Non-blocking nit: maybe we can add the same comment as above here and in the other aggregators as well.
