Conversation

coderfender
Contributor

@coderfender coderfender commented Aug 8, 2025

Which issue does this PR close?

Closes #2096

Implement Comet native logic to support the rpad(column, column) API in Spark. Currently Comet only supports rpad(column, int).

What changes are included in this PR?

Native code to support rpad(col, col), in addition to the existing rpad(col, int).

How are these changes tested?

Unit tests in CometExpressionSuite.

@coderfender coderfender marked this pull request as draft August 8, 2025 16:54
@coderfender coderfender marked this pull request as ready for review August 9, 2025 01:03
@coderfender
Contributor Author

@andygrove, the issue is that the rpad implementation only supports the (col, int) signature. Rather than falling back to native Spark code, I went ahead and implemented native code for (col, col) input and added a test case in CometExpressionSuite. Please take a look at the changes and let me know your thoughts.

@codecov-commenter

codecov-commenter commented Aug 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.62%. Comparing base (f09f8af) to head (0fc9f93).
⚠️ Report is 483 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2099      +/-   ##
============================================
+ Coverage     56.12%   57.62%   +1.49%     
- Complexity      976     1297     +321     
============================================
  Files           119      147      +28     
  Lines         11743    13497    +1754     
  Branches       2251     2390     +139     
============================================
+ Hits           6591     7777    +1186     
- Misses         4012     4451     +439     
- Partials       1140     1269     +129     


Contributor

@comphead comphead left a comment

Thanks @coderfender. I'm wondering whether we should port the code to https://github.com/apache/datafusion/tree/main/datafusion/spark/src/function and then reuse the Spark function from the DataFusion Spark crate?

@coderfender
Contributor Author

coderfender commented Aug 11, 2025

Thank you for the review, @comphead. Moving expressions to the datafusion-spark crate is indeed the goal once this change is merged into main.

@coderfender
Contributor Author

@andygrove, @comphead, could you please review the code whenever you get a chance? Thank you very much.

Contributor

@mbutrovich mbutrovich left a comment

Thanks @coderfender! First round of feedback.

@coderfender
Contributor Author

@mbutrovich, it seems a test failed due to what may be a transient Spark environment issue. Could you please rerun the failed check whenever you get a chance?

@mbutrovich mbutrovich changed the title from "fix: rpad_bug_fix" to "feat: rpad support column for second arg instead of just literal" on Aug 19, 2025
DataType::Utf8 => {
    spark_read_side_padding_internal::<i32>(array, truncate, rpad_arg)
}
DataType::LargeUtf8 => {
Contributor

When we bring this to DataFusion we will need to support Utf8View. We can't really test that in Comet without a unit test in the file, but something to prepare for.

Contributor Author

Thank you for sharing this, @mbutrovich. I will note it in the GitHub issue I plan to create for porting these changes to the DataFusion crate.

@@ -71,44 +100,78 @@ fn spark_read_side_padding2(
}
}

enum RPadArgument {
Contributor

Why do we need a new enum type instead of relying on ColumnarValue, which can already represent a scalar or an array?

Contributor Author

Thank you. This is a great suggestion, and I went ahead and used ColumnarValue to branch to the right logic.
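
For readers following along, here is a minimal, std-only sketch of the idea in this exchange: one type holds either a single pad length or one length per row, so the same function can serve both rpad(col, int) and rpad(col, col). `PadLength` and `rpad_rows` are hypothetical names that only stand in for DataFusion's `ColumnarValue` and the Comet kernel; this is not the code in the PR.

```rust
/// Hypothetical stand-in for a scalar-or-array argument (not DataFusion's type).
enum PadLength {
    Scalar(usize),
    Array(Vec<usize>),
}

/// Right-pad each row with spaces up to its target length, counted in chars.
/// Rows already at or past the target are returned unchanged (no truncation).
fn rpad_rows(rows: &[&str], length: &PadLength) -> Vec<String> {
    rows.iter()
        .enumerate()
        .map(|(i, row)| {
            let target = match length {
                PadLength::Scalar(n) => *n,
                // Assumes the array holds one entry per row.
                PadLength::Array(lens) => lens[i],
            };
            let char_len = row.chars().count();
            let mut out = String::from(*row);
            out.push_str(&" ".repeat(target.saturating_sub(char_len)));
            out
        })
        .collect()
}
```

Calling `rpad_rows(&["a", "bcd"], &PadLength::Array(vec![2, 5]))` pads each row to its own length, while `PadLength::Scalar(3)` applies one length to every row.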

@coderfender coderfender requested a review from mbutrovich August 27, 2025 02:09
truncate,
ColumnarValue::Scalar(ScalarValue::Int32(Some(*length))),
),
// Dictionary support required for SPARK-48498
Contributor

apache/spark#46832

This seems related to padding. How does this affect dictionary encoded columns?

array,
truncate,
ColumnarValue::Array(Arc::<dyn arrow::array::Array>::clone(array_int)),
),
// Dictionary support required for SPARK-48498
Contributor

Same question.

Contributor Author

Great catch! My understanding is that dictionary support ensures SQL-compliant CHAR type literals, which always have a fixed length (this change already existed by the time I picked up this issue). Therefore, my support for the array argument is obsolete.

Contributor

@mbutrovich mbutrovich left a comment

This is looking very close, just questions about the comments at this point. Thanks for your patience @coderfender!

@@ -322,6 +322,16 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
checkSparkAnswer("SELECT try_add(_1, _2) FROM tbl")
}
}
test("fix_rpad") {
Contributor

Can we get a meaningful test name? What exactly is the fix being tested?

Contributor Author

Sure. Thank you for the review. I will update the test name to add more context.

}
}

fn add_padding_string(string: String, length: usize, truncate: bool) -> String {
Contributor

Perhaps we can consider an implementation like:

fn add_padding_string(input: String, length: usize, truncate: bool) -> String {
    let char_len = input.chars().count();

    if char_len >= length {
        if truncate {
            // Take the first `length` chars safely
            input.chars().take(length).collect()
        } else {
            input
        }
    } else {
        // Pad with only the needed spaces
        let padding = " ".repeat(length - char_len);
        input + &padding
    }
}

This way we don't allocate spaces when they aren't needed, and there is no unwrap.

Contributor

Referring to a string by index: is that Unicode-safe? 🤔
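
To make the concern concrete, here is a small sketch contrasting byte length with character count on a non-ASCII string. `pad_by_chars` and `byte_and_char_len` are hypothetical helpers written for illustration; they count Unicode scalar values (Rust `char`s), not grapheme clusters, and they are not the code under review.

```rust
/// Pad or truncate by Unicode scalar values (chars), never by byte offsets.
fn pad_by_chars(input: &str, length: usize, truncate: bool) -> String {
    let char_len = input.chars().count();
    if char_len >= length {
        if truncate {
            // Take the first `length` chars; this cannot split a code point.
            input.chars().take(length).collect()
        } else {
            input.to_string()
        }
    } else {
        let missing = length - char_len;
        let mut out = String::with_capacity(input.len() + missing);
        out.push_str(input);
        out.extend(std::iter::repeat(' ').take(missing));
        out
    }
}

/// Byte length and char count differ for non-ASCII text, which is why
/// slicing with `&s[..n]` can panic when `n` is not on a char boundary.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For `"h\u{e9}llo"` the byte length is 6 but the char count is 5, so a byte-indexed truncation to 2 would land inside the two-byte `\u{e9}`, whereas `pad_by_chars("h\u{e9}llo", 2, true)` keeps the first two characters intact.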

Contributor Author

This is a great suggestion. My goal for now was to keep the original implementation intact and not introduce changes that don't directly solve the issue.

Contributor

@comphead comphead left a comment

Thanks @coderfender, LGTM.
Please add a test for a Unicode string to see whether there is an issue. If there is, we need to add a comment on the test noting that it is to be fixed in the future, and we should probably also document this limitation.

for string in string_array.iter() {
    match string {
        Some(string) => builder.append_value(add_padding_string(
            string.parse().unwrap(),
Contributor

It's good to avoid unwraps and return an Err instead.
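
One way to read this suggestion, sketched with std-only types: let the per-row loop propagate a `Result` rather than unwrap. As a side note, `str::parse::<String>()` has `Infallible` as its error type, so that particular `unwrap` cannot panic and `to_string()` would do the same job. The fallible step assumed here is converting a signed length to `usize`. `pad_rows` and the decision to reject negative lengths are illustrative assumptions, not Spark or Comet behavior.

```rust
use std::convert::TryFrom;

/// Pad each non-null row to its length; a null in either input yields a null row.
/// Returns Err on a negative length (an assumption made only for this sketch).
fn pad_rows(
    rows: &[Option<&str>],
    lengths: &[Option<i32>],
) -> Result<Vec<Option<String>>, String> {
    let mut out = Vec::with_capacity(rows.len());
    for (row, len) in rows.iter().zip(lengths) {
        match (row, len) {
            (Some(s), Some(n)) => {
                // Propagate the conversion failure with `?` instead of unwrapping.
                let target = usize::try_from(*n)
                    .map_err(|_| format!("invalid pad length: {}", n))?;
                let missing = target.saturating_sub(s.chars().count());
                out.push(Some(format!("{}{}", s, " ".repeat(missing))));
            }
            _ => out.push(None),
        }
    }
    Ok(out)
}
```

The caller then decides how to surface the error, for example by mapping it into the engine's own error type.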

for (string, length) in string_array.iter().zip(int_pad_array) {
    match string {
        Some(string) => builder.append_value(add_padding_string(
            string.parse().unwrap(),
Contributor

ditto

Development

Successfully merging this pull request may close these issues.

rpad expression panics if length input is not a literal value
4 participants