[FEA] Refactor to eliminate redundant device aggregation logic #17032
Labels
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Is your feature request related to a problem? Please describe.
Once #17031 is merged, three copies of similar device aggregator logic will exist in libcudf, and we need to address this issue.
Currently, we cannot share the same code path because the existing device aggregator only accepts
column_device_view
as input, and libcudf is unable to construct acolumn_device_view
from shared memory at this point.Describe the solution you'd like
Based on our offline discussions, adding an overload for the
column_device_view
ctor may not be the best approach. Instead, we should consider introducing a new object similar todevice_column_view
specifically designed for use in shared memory.Ultimately, we aim to have a single aggregator that manages all types of aggregations: global-global, shared-global, and global-shared.
Additional context
gmem_element_aggregator
andshmem_element_aggregator
cannot be used to deal with dictionary columns due to two reasons:cudaErrorInvalidValue
when querying the available dynamic shared memory size withcudaOccupancyAvailableDynamicSMemPerBlock
. This issue is likely caused by the dictionary template instantiation of the aggregator, which triggers a nested invocation of the type dispatcher. Note this error is seen on V100 but not on RTX8000.The text was updated successfully, but these errors were encountered: