Skip to content

Conversation

@res-life
Copy link
Contributor

@res-life res-life commented Oct 28, 2025

Description

Supports value indices for contiguousSplitGroupsAndGenUniqKeys

contributes to NVIDIA/spark-rapids#13679

Signed-off-by: Chong Gao [email protected]

API changes in detail

the original API contiguousSplitGroupsAndGenUniqKeys()

Only specify key indices, the split tables contains both key columns and other columns and keep the column order.
E.g.:
Input table is [c0, c1, c2, c3], key indices is [2, 0]
Outoup split table columns is [c0, c1, c2, c3], keeps the original column order.

new API contiguousSplitGroupsAndGenUniqKeys(valueIndices)

specify both key indices and value indices.
Input table is [c0, c1, c2, c3, c4], key indices is [2, 0], values indices is [3, 1]
Output split table columns is [c3, c1]

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@res-life res-life requested a review from a team as a code owner October 28, 2025 10:06
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Oct 28, 2025
@res-life res-life added feature request New feature or request non-breaking Non-breaking change and removed Java Affects Java cuDF API. labels Oct 28, 2025
@res-life
Copy link
Contributor Author

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 28, 2025

/ok to test

@res-life, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@res-life res-life marked this pull request as draft October 28, 2025 10:07
Signed-off-by: Chong Gao <[email protected]>
@github-actions github-actions bot added the Java Affects Java cuDF API. label Oct 28, 2025
@res-life
Copy link
Contributor Author

/ok to test 4485fda

Signed-off-by: Chong Gao <[email protected]>
@res-life
Copy link
Contributor Author

build

@res-life
Copy link
Contributor Author

/ok to test 996c9ae

Signed-off-by: Chong Gao <[email protected]>
@res-life
Copy link
Contributor Author

/ok to test 175769c

@res-life res-life marked this pull request as ready for review October 29, 2025 06:54
@res-life res-life requested review from mythrocks and ttnghia October 29, 2025 06:54
@mythrocks mythrocks changed the title Supports value indices for contiguousSplitGroupsAndGenUniqKeys [Java] Supports value indices for contiguousSplitGroupsAndGenUniqKeys Oct 30, 2025
@mythrocks
Copy link
Contributor

It took me some time to understand the intent here. The problem I had was with the term "value indices". Values are usually associated with the table rows, not references to columns.

If we could refer to "value indices" with a different term, say "projection column indices", I think it will be a clearer change.

I will continue to review this change again under that new lens.

* Similar to the above {@link #contiguousSplitGroupsAndGenUniqKeys}.
*
* The diff with the above method is:
* - Provide an extra input `valueIndices` which defines the columns to output.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projectionColumnIndices would be a clearer term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (std::find(key_indices.begin(), key_indices.end(), index) == key_indices.end()) {
// not key column, so adds it as value column.
value_indices.emplace_back(index);
auto num_value_cols = [&]() -> size_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We might be able to remove the parentheses:

Suggested change
auto num_value_cols = [&]() -> size_t {
auto num_value_cols = [&] -> size_t {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding () causes a warning:

TableJni.cpp:4650:31: error: parameter declaration before lambda trailing return type only optional with ‘-std=c++2b’ or ‘-std=gnu++2b’ [-Werror=c++23-extensions]
[INFO]      [exec]  4650 |     auto num_value_cols = [&] -> size_t {
[INFO]      [exec]       |                               ^~
[INFO]      [exec] cc1plus: all warnings being treated as errors

}();

std::vector<cudf::column_view> grouped_cols(num_grouped_cols);
[&]() -> void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a need for an IIFE here. Why not just put the body here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 4679 to 4687
auto num_grouped_cols = [&]() -> size_t {
if (jvalue_indices == NULL) {
// output both key columns and value columns
return key_indices.size() + num_value_cols;
} else {
// only output value columns
return num_value_cols;
}
}();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have a simpler phrasing.

Suggested change
auto num_grouped_cols = [&]() -> size_t {
if (jvalue_indices == NULL) {
// output both key columns and value columns
return key_indices.size() + num_value_cols;
} else {
// only output value columns
return num_value_cols;
}
}();
// Include key columns if output projection is not specified.
size_t const num_grouped_cols = num_value_cols + (jvalue_indices == NULL)? key_indices.size() : 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

auto key_view = groups.keys->view();
auto key_view_it = key_view.begin();
for (auto key_id : key_indices) {
grouped_cols.at(key_id) = std::move(*key_view_it);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're just copying column views here, I think this might have been easier:

Suggested change
grouped_cols.at(key_id) = std::move(*key_view_it);
grouped_cols[key_id] = *key_view_it;

Column views are meant to be copied, IIRC. Maybe I've missed why this is written this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it's meaningless using std::move(view).
Previously it's written in this way, so I followed.
Let's use what you proposed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits and questions. But I see where this is going.

@res-life
Copy link
Contributor Author

res-life commented Nov 6, 2025

/ok to test 398668a

@res-life
Copy link
Contributor Author

res-life commented Nov 6, 2025

/ok to test 398668a

Signed-off-by: Chong Gao <[email protected]>
@res-life
Copy link
Contributor Author

res-life commented Nov 6, 2025

/ok to test 45b4ef2

Signed-off-by: Chong Gao <[email protected]>
@res-life
Copy link
Contributor Author

res-life commented Nov 6, 2025

/ok to test cd929e6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants