
Add device aggregators used by shared memory groupby #17031

Open · wants to merge 12 commits into base: branch-24.12

Conversation

@PointKernel (Member) commented Oct 9, 2024

Description

This work is part of splitting the original bulk shared memory groupby PR #16619.

It introduces two device-side element aggregators:

  • shmem_element_aggregator: aggregates data from global memory sources to shared memory targets,
  • gmem_element_aggregator: aggregates from shared memory sources to global memory targets.

These two aggregators are similar to the elementwise_aggregator functionality. Follow-up work is tracked via #17032.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Oct 9, 2024
@PointKernel PointKernel self-assigned this Oct 9, 2024
@PointKernel PointKernel added the 3 - Ready for Review Ready for review by team label Oct 9, 2024
@PointKernel PointKernel marked this pull request as ready for review October 9, 2024 21:53
@PointKernel PointKernel requested a review from a team as a code owner October 9, 2024 21:53
using DeviceTarget = cudf::detail::underlying_target_t<Source, aggregation::MIN>;
using DeviceSource = cudf::detail::underlying_source_t<Source, aggregation::MIN>;

DeviceTarget* target_casted = reinterpret_cast<DeviceTarget*>(target);

Contributor:

How will this work for strings columns?
The target type will be a string_view and this cast will be incorrect.


Member Author:

Good question. I wanted to give you a thorough response, so I took a deeper dive into the details to see how this works. The more I look into the current code, the more surprised I am that it's producing the correct results.

Still working on it. Need more time to fully understand this.


Member Author:


OK, finally got this: string columns don't have atomic support, so they always fall back to the sort-based groupby.


Contributor:

This gets into a more philosophical discussion about column_device_view and shared memory. Technically, atomics only work with simple fixed-width types (excluding timestamps and durations as well), so you wouldn't need something as fancy as a column_device_view wrapper for these types: they can always be represented by a pointer, a size (a device span), and a validity mask. You could simply cast the std::byte* to the underlying type without really needing a column_device_view (IMO).


Member Author:


> simply cast the std::byte* to the underlying type without really needing a column-device-view

Totally makes sense.

After removing dictionaries from the shared memory groupby, it now only handles integers and decimals. A pointer to the underlying type is sufficient for this. Let me unify the shared and global memory aggregators first. I’m really happy it has come down to such a simple solution. Thanks a lot for your input! @davidwendt


Member Author:

I've been working on unifying the aggregators locally, and it requires non-trivial effort around performance and API design.

I started by updating the null masks to use bitmask_type* instead of bool*, but I noticed around a 10% slowdown because updating a bitmask_type mask always requires atomic operations. Beyond that, there are behavioral differences between the shared memory aggregator, the global memory aggregator, and the existing row aggregator, particularly in how they handle nulls and which data types they support. I suggest we merge the current PR as is and track the follow-up work through #17032.


@mhaseeb123 (Member) left a comment:


Looks good to me. The discussion with @davidwendt cleared up a few of my confusions as well. I'm interested to see these aggregators unified and in action, so I'm subscribing to PR #17032 as well.

cudf::column_device_view source,
cudf::size_type source_index) const noexcept
{
if constexpr (k != cudf::aggregation::COUNT_ALL) {

@mhaseeb123 (Member) commented Oct 16, 2024:


Non-blocking nit: maybe we can add the same comment as above here and in the other aggregators as well.
