feat: rpad support column for second arg instead of just literal #2099
Conversation
Force-pushed from 8c5e0aa to bb0c2f3
@andygrove, the issue is with the implementation of
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2099      +/-   ##
============================================
+ Coverage    56.12%   57.62%    +1.49%
- Complexity     976     1297      +321
============================================
  Files          119      147       +28
  Lines        11743    13497     +1754
  Branches      2251     2390      +139
============================================
+ Hits          6591     7777     +1186
- Misses        4012     4451      +439
- Partials      1140     1269      +129

☔ View full report in Codecov by Sentry.
Thanks @coderfender. Wondering, should we port the code to https://github.com/apache/datafusion/tree/main/datafusion/spark/src/function and then reuse the Spark function from the DF spark crate?
Thank you for the review @comphead. Moving the expressions to the datafusion-spark crate is indeed the goal once this change is merged into main.
@andygrove, @comphead, could you please review the code whenever you get a chance? Thank you very much.
Thanks @coderfender! First round of feedback.
native/spark-expr/src/static_invoke/char_varchar_utils/read_side_padding.rs
Force-pushed from bb0c2f3 to 3d68aa9
@mbutrovich, it seems like a test failed due to a perhaps transient Spark env issue. Could you rerun the failed check whenever you get a chance, please?
```rust
DataType::Utf8 => {
    spark_read_side_padding_internal::<i32>(array, truncate, rpad_arg)
}
DataType::LargeUtf8 => {
```
When we bring this to DataFusion we will need to support Utf8View. We can't really test that in Comet without a unit test in the file, but something to prepare for.
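As a sketch of what that future arm might look like, here is a std-only stand-in (the `DataType` enum below is a minimal mock of arrow's, not Comet's actual code): `Utf8` and `LargeUtf8` share one generic path parameterized by offset width, while `Utf8View` would need its own path because the view layout has no offsets buffer.

```rust
// Minimal mock of arrow's DataType, only to show where a Utf8View arm slots in.
enum DataType {
    Utf8,      // string array with i32 offsets
    LargeUtf8, // string array with i64 offsets
    Utf8View,  // variable-size view layout, no offsets buffer
}

fn dispatch(dt: &DataType) -> &'static str {
    match dt {
        DataType::Utf8 => "generic path with i32 offsets",
        DataType::LargeUtf8 => "generic path with i64 offsets",
        // A DataFusion port would need a dedicated path here, since a
        // string-view array is not generic over an offset type.
        DataType::Utf8View => "dedicated view path",
    }
}

fn main() {
    assert_eq!(dispatch(&DataType::Utf8), "generic path with i32 offsets");
    assert_eq!(dispatch(&DataType::Utf8View), "dedicated view path");
}
```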
Thank you for sharing this @mbutrovich. I will add this info to the GitHub issue I plan to create for porting these changes to the DataFusion crate.
```rust
@@ -71,44 +100,78 @@ fn spark_read_side_padding2(
    }
}

enum RPadArgument {
```
Why do we need a new enum type instead of relying on ColumnarValue, which can already represent a scalar or an array?
Thank you, this is a great suggestion. I went ahead and leveraged ColumnarValue to fork to the right logic.
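The idea can be sketched with a std-only stand-in for DataFusion's ColumnarValue (the enum and `pad_lengths` below are illustrative, not the PR's code): one variant covers the literal second argument, the other covers the per-row column argument, so no new enum is needed.

```rust
// Minimal mock of DataFusion's ColumnarValue for the rpad length argument.
enum ColumnarValue {
    Scalar(i32),     // literal: rpad(col, 5)
    Array(Vec<i32>), // column: rpad(col, len_col)
}

// Resolve a target pad length for every row from either variant.
fn pad_lengths(num_rows: usize, arg: &ColumnarValue) -> Vec<i32> {
    match arg {
        // Literal second argument: every row gets the same target length.
        ColumnarValue::Scalar(len) => vec![*len; num_rows],
        // Column second argument: each row carries its own target length.
        ColumnarValue::Array(lens) => lens.clone(),
    }
}

fn main() {
    assert_eq!(pad_lengths(3, &ColumnarValue::Scalar(5)), vec![5, 5, 5]);
    assert_eq!(pad_lengths(2, &ColumnarValue::Array(vec![2, 7])), vec![2, 7]);
}
```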
```rust
    truncate,
    ColumnarValue::Scalar(ScalarValue::Int32(Some(*length))),
),
// Dictionary support required for SPARK-48498
```
This seems related to padding. How does this affect dictionary encoded columns?
array, | ||
truncate, | ||
ColumnarValue::Array(Arc::<dyn arrow::array::Array>::clone(array_int)), | ||
), | ||
// Dictionary support required for SPARK-48498 |
Same question.
Great catch! My understanding is that dictionary support ensures SQL-compliant CHAR type literals, which always have a fixed length (this change already existed by the time I picked up this issue). Therefore, my support for the array argument there is obsolete.
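For context on why dictionary encoding comes up at all, here is a std-only sketch (the types are illustrative, not arrow's API): a dictionary-encoded string column stores per-row keys into a small values table, so padding has to be applied to the looked-up value for each row, not to the dictionary's entries or keys directly.

```rust
// Toy model of a dictionary-encoded string column: keys index into values.
fn decode_dictionary<'a>(keys: &[usize], values: &'a [&'a str]) -> Vec<&'a str> {
    keys.iter().map(|&k| values[k]).collect()
}

fn main() {
    // Two distinct strings, three rows: classic dictionary encoding.
    let values = ["ab", "c"];
    let keys = [0, 1, 0];
    // After decoding, each row can be padded independently.
    assert_eq!(decode_dictionary(&keys, &values), vec!["ab", "c", "ab"]);
}
```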
This is looking very close, just questions about the comments at this point. Thanks for your patience @coderfender!
```scala
@@ -322,6 +322,16 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
    checkSparkAnswer("SELECT try_add(_1, _2) FROM tbl")
  }
}
test("fix_rpad") {
```
Can we get a more meaningful test name? What exactly is the fix being tested?
Sure, thank you for the review. I will update the test name to add more context.
} | ||
} | ||
|
||
fn add_padding_string(string: String, length: usize, truncate: bool) -> String { |
Perhaps we can think of an impl like:

```rust
fn add_padding_string(input: String, length: usize, truncate: bool) -> String {
    let char_len = input.chars().count();
    if char_len >= length {
        if truncate {
            // Take the first `length` chars safely
            input.chars().take(length).collect()
        } else {
            input
        }
    } else {
        // Pad with only the needed spaces
        let padding = " ".repeat(length - char_len);
        input + &padding
    }
}
```

so we don't allocate spaces if it's not needed, and no unwrap.
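To show the behavior of the suggested implementation above on multibyte input (this usage sketch reproduces the function verbatim and only adds assertions), counting by `chars()` rather than bytes keeps both padding and truncation Unicode-safe:

```rust
fn add_padding_string(input: String, length: usize, truncate: bool) -> String {
    let char_len = input.chars().count();
    if char_len >= length {
        if truncate {
            // Take the first `length` chars safely
            input.chars().take(length).collect()
        } else {
            input
        }
    } else {
        // Pad with only the needed spaces
        let padding = " ".repeat(length - char_len);
        input + &padding
    }
}

fn main() {
    // "héllo" is 5 chars but 6 bytes; char-based counting pads correctly.
    assert_eq!(add_padding_string("héllo".to_string(), 7, false), "héllo  ");
    // Truncation also splits on char boundaries, never mid-codepoint.
    assert_eq!(add_padding_string("héllo".to_string(), 3, true), "hél");
    // Already long enough and no truncation requested: returned unchanged.
    assert_eq!(add_padding_string("héllo".to_string(), 3, false), "héllo");
}
```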
Referring to the string by index, is it Unicode safe? 🤔
This is a great suggestion. My goal for now was to keep the original implementation intact and not introduce changes that don't directly solve the issue.
Force-pushed from e507c16 to 0fc9f93
Thanks @coderfender, it's LGTM. Please add a test for a Unicode string to see if there is an issue; if there is, we need to comment the test to be fixed in the future, and we probably also need to document this limitation.
```rust
for string in string_array.iter() {
    match string {
        Some(string) => builder.append_value(add_padding_string(
            string.parse().unwrap(),
```
It's good to avoid unwraps and return an Err instead.
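A minimal std-only sketch of that pattern (`parse_length` is a hypothetical helper, and the error type here is std's `ParseIntError`; the real code would map failures into DataFusionError): let `?`-style propagation replace the panicking `unwrap()`.

```rust
use std::num::ParseIntError;

// Instead of `raw.parse().unwrap()`, return the Result so a bad value
// becomes an Err the caller can handle rather than a panic.
fn parse_length(raw: &str) -> Result<usize, ParseIntError> {
    raw.parse::<usize>()
}

fn main() {
    assert_eq!(parse_length("12").unwrap(), 12);
    // A malformed value surfaces as Err, never as a process abort.
    assert!(parse_length("twelve").is_err());
}
```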
```rust
for (string, length) in string_array.iter().zip(int_pad_array) {
    match string {
        Some(string) => builder.append_value(add_padding_string(
            string.parse().unwrap(),
```
ditto
Which issue does this PR close?
Closes #2096
Implements Comet native logic to support the rpad(column, column) API in Spark. Currently Comet only supports rpad(column, int).
What changes are included in this PR?
PR to implement native code to support rpad(col, col).
How are these changes tested?
Unit tests in CometExpressionSuite.