-
Notifications
You must be signed in to change notification settings - Fork 981
[Java] Supports value indices for contiguousSplitGroupsAndGenUniqKeys
#20391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test |
@res-life, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test 4485fda |
Signed-off-by: Chong Gao <[email protected]>
|
build |
|
/ok to test 996c9ae |
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test 175769c |
contiguousSplitGroupsAndGenUniqKeys
|
It took me some time to understand the intent here. The problem I had was with the term "value indices". Values are usually associated with the table rows, not references to columns. If we could refer to "value indices" with a different term, say "projection column indices", I think it will be a clearer change. I will continue to review this change again under that new lens. |
| * Similar to the above {@link #contiguousSplitGroupsAndGenUniqKeys}. | ||
| * | ||
| * The diff with the above method is: | ||
| * - Provide an extra input `valueIndices` which defines the columns to output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
projectionColumnIndices would be a clearer term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| if (std::find(key_indices.begin(), key_indices.end(), index) == key_indices.end()) { | ||
| // not key column, so adds it as value column. | ||
| value_indices.emplace_back(index); | ||
| auto num_value_cols = [&]() -> size_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We might be able to remove the parentheses:
| auto num_value_cols = [&]() -> size_t { | |
| auto num_value_cols = [&] -> size_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding () causes a warning:
TableJni.cpp:4650:31: error: parameter declaration before lambda trailing return type only optional with ‘-std=c++2b’ or ‘-std=gnu++2b’ [-Werror=c++23-extensions]
[INFO] [exec] 4650 | auto num_value_cols = [&] -> size_t {
[INFO] [exec] | ^~
[INFO] [exec] cc1plus: all warnings being treated as errors
| }(); | ||
|
|
||
| std::vector<cudf::column_view> grouped_cols(num_grouped_cols); | ||
| [&]() -> void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a need for an IIFE here. Why not just put the body here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| auto num_grouped_cols = [&]() -> size_t { | ||
| if (jvalue_indices == NULL) { | ||
| // output both key columns and value columns | ||
| return key_indices.size() + num_value_cols; | ||
| } else { | ||
| // only output value columns | ||
| return num_value_cols; | ||
| } | ||
| }(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might have a simpler phrasing.
| auto num_grouped_cols = [&]() -> size_t { | |
| if (jvalue_indices == NULL) { | |
| // output both key columns and value columns | |
| return key_indices.size() + num_value_cols; | |
| } else { | |
| // only output value columns | |
| return num_value_cols; | |
| } | |
| }(); | |
| // Include key columns if output projection is not specified. | |
| size_t const num_grouped_cols = num_value_cols + (jvalue_indices == NULL)? key_indices.size() : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| auto key_view = groups.keys->view(); | ||
| auto key_view_it = key_view.begin(); | ||
| for (auto key_id : key_indices) { | ||
| grouped_cols.at(key_id) = std::move(*key_view_it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're just copying column views here, I think this might have been easier:
| grouped_cols.at(key_id) = std::move(*key_view_it); | |
| grouped_cols[key_id] = *key_view_it; |
Column views are meant to be copied, IIRC. Maybe I've missed why this is written this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it's meaningless using std::move(view).
Previously it's written in this way, so I followed.
Let's use what you proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits and questions. But I see where this is going.
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test 398668a |
|
/ok to test 398668a |
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test 45b4ef2 |
Signed-off-by: Chong Gao <[email protected]>
|
/ok to test cd929e6 |
Description
Supports value indices for contiguousSplitGroupsAndGenUniqKeys
contributes to NVIDIA/spark-rapids#13679
Signed-off-by: Chong Gao [email protected]
API changes in detail
the original API contiguousSplitGroupsAndGenUniqKeys()
Only specify key indices, the split tables contains both key columns and other columns and keep the column order.
E.g.:
Input table is [c0, c1, c2, c3], key indices is [2, 0]
Outoup split table columns is [c0, c1, c2, c3], keeps the original column order.
new API contiguousSplitGroupsAndGenUniqKeys(valueIndices)
specify both key indices and value indices.
Input table is [c0, c1, c2, c3, c4], key indices is [2, 0], values indices is [3, 1]
Output split table columns is [c3, c1]
Checklist