[FEA] Refactor to eliminate redundant device aggregation logic #17032

Open
PointKernel opened this issue Oct 9, 2024 · 0 comments
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.

Comments


PointKernel commented Oct 9, 2024

Is your feature request related to a problem? Please describe.
Once #17031 is merged, libcudf will contain three copies of similar device aggregator logic. This duplication should be eliminated.

Currently, we cannot share the same code path because the existing device aggregator only accepts column_device_view as input, and libcudf is unable to construct a column_device_view from shared memory at this point.

Describe the solution you'd like
Based on our offline discussions, adding an overload of the column_device_view constructor may not be the best approach. Instead, we should consider introducing a new object, similar to column_device_view, designed specifically for use in shared memory.

Ultimately, we aim to have a single aggregator that manages all types of aggregations: global-global, shared-global, and global-shared.
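To make the idea concrete, here is a minimal host-only C++ sketch of one aggregator that is generic over the view types it reads from and writes to. All names in it (`raw_view`, `sum_op`, `min_op`, `element_aggregator`, `two_stage_sum`) are hypothetical and are not existing libcudf APIs. It assumes fixed-width values with no nulls, and it uses plain updates where real device code would need atomic operations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical lightweight view: only a pointer and a size, so it could wrap
// a buffer in either global or shared memory. NOT an existing libcudf type.
template <typename T>
struct raw_view {
  T* data;
  std::size_t size;
  T& element(std::size_t i) const { return data[i]; }
};

// Hypothetical aggregation kinds. Real device code would use atomics here.
struct sum_op {
  template <typename T>
  void operator()(T& target, T const& source) const { target += source; }
};

struct min_op {
  template <typename T>
  void operator()(T& target, T const& source) const
  {
    if (source < target) { target = source; }
  }
};

// One aggregator for every direction (global-global, shared-global,
// global-shared): it only requires that both views provide element().
template <typename Op>
struct element_aggregator {
  template <typename TargetView, typename SourceView>
  void operator()(TargetView target, std::size_t target_index,
                  SourceView source, std::size_t source_index) const
  {
    Op{}(target.element(target_index), source.element(source_index));
  }
};

// Host-side stand-in for a two-stage grouped sum: rows are first reduced into
// a scratch buffer (playing the role of shared memory), then the scratch
// buffer is merged into the final result (playing the role of global memory).
inline std::vector<int> two_stage_sum(std::vector<int> const& values,
                                      std::vector<std::size_t> const& keys,
                                      std::size_t num_groups)
{
  std::vector<int> scratch(num_groups, 0);
  std::vector<int> result(num_groups, 0);

  raw_view<int const> source{values.data(), values.size()};
  raw_view<int> scratch_view{scratch.data(), scratch.size()};
  raw_view<int> result_view{result.data(), result.size()};

  element_aggregator<sum_op> agg;

  // Stage 1: input rows -> scratch buffer, indexed by group key.
  for (std::size_t i = 0; i < values.size(); ++i) {
    agg(scratch_view, keys[i], source, i);
  }
  // Stage 2: scratch buffer -> final result, same aggregator type.
  for (std::size_t g = 0; g < num_groups; ++g) {
    agg(result_view, g, scratch_view, g);
  }
  return result;
}
```

Both stages call the same `element_aggregator`, so the choice of memory space lives only in which view is passed in. Whether a real shared-memory view can stay this thin once nulls and non-fixed-width types are involved is an open design question.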

Additional context

  • The new gmem_element_aggregator and shmem_element_aggregator cannot handle dictionary columns, for two reasons:
    • CUDA error: on V100, querying the available dynamic shared memory size with cudaOccupancyAvailableDynamicSMemPerBlock fails with cudaErrorInvalidValue. The likely cause is the dictionary template instantiation of the aggregator, which triggers a nested invocation of the type dispatcher. The error occurs on V100 but not on RTX8000.
    • Performance: merely adding the dictionary instantiations, even when no dictionary columns take part in the computation, can make groupby up to 5x slower.
PointKernel added the feature request, libcudf, and improvement labels and removed the feature request label on Oct 9, 2024.