[SPARK-48715][SQL] Integrate UTF8String validation into collation-aware string function implementations #47131

Closed
wants to merge 11 commits into from

@uros-db (Contributor) commented Jun 27, 2024

What changes were proposed in this pull request?

Apply UTF8String's own replacement logic for invalid UTF-8 byte sequences before every .toString() call in the collation-aware string function implementations.

Why are the changes needed?

Avoid relying on Java to perform invalid UTF-8 byte sequence replacement, and ensure consistent results.
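
For context, here is a minimal sketch of the JDK behavior this PR stops relying on. It uses only standard JDK classes and does not show Spark's UTF8String implementation; the class name `Utf8ReplacementDemo` and its two helper methods are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Spark.
class Utf8ReplacementDemo {
    // Lenient decoding: the String constructor silently substitutes
    // U+FFFD (the replacement character) for malformed input.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict check: a decoder configured with REPORT throws on
    // malformed input instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(isValidUtf8(bad));                      // false
        System.out.println(decodeLenient(bad).equals("a\uFFFDb")); // true
    }
}
```

Because the lenient path performs the replacement inside the JDK, the caller has no control over how malformed sequences are mapped, which is the dependency the PR description wants to remove.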

Does this PR introduce any user-facing change?

Yes. Collation-aware string function implementations now rely on our own replacement of invalid UTF-8 byte sequences instead of Java's.

How was this patch tested?

Existing tests, with some changes in UTF8StringSuite and CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Jun 27, 2024
@uros-db requested a review from mkaravel July 2, 2024 20:44

@uros-db (Contributor, Author) left a comment

@mkaravel ready for another round of review

@uros-db changed the title from "[WIP][SPARK-48715][SQL] Integrate UTF8String.makeValid into string expressions" to "[WIP][SPARK-48715][SQL] Integrate UTF8String validation into collation-aware string function implementations" Jul 2, 2024
@uros-db requested a review from mkaravel July 3, 2024 15:25

@mkaravel (Contributor) left a comment

LGTM

@uros-db (Contributor, Author) left a comment

@cloud-fan ready for review

@cloud-fan changed the title from "[WIP][SPARK-48715][SQL] Integrate UTF8String validation into collation-aware string function implementations" to "[SPARK-48715][SQL] Integrate UTF8String validation into collation-aware string function implementations" Jul 4, 2024

@cloud-fan (Contributor) commented
thanks, merging to master!

@cloud-fan closed this in bf25f0a Jul 4, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
…re string function implementations

Closes apache#47131 from uros-db/make-valid.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>