[SPARK-39901][CORE][SQL] Redesign ignoreCorruptFiles to make it more accurate by adding a new config spark.files.ignoreCorruptFiles.errorClasses #47090

Open: wayneguow wants to merge 4 commits into master

@wayneguow wayneguow commented Jun 25, 2024

What changes were proposed in this pull request?

As described in SPARK-39901, the current ignoreCorruptFiles feature has a design flaw: it matches IOException too broadly, so a file may be ignored by mistake when a transient or sporadic IO exception occurs.

This PR proposes a new config, spark.files.ignoreCorruptFiles.errorClasses. By setting this config, Spark users can ignore corrupt files only when specific exceptions cause the failure.

For example, if the config value is set as follows:

  • java.io.IOException:not a Sequence file,java.io.EOFException
    (config format: className[:keyMsg],className[:keyMsg])

then the file is treated as corrupt and ignored in two cases: when a java.io.IOException is thrown and its error message contains the key text "not a Sequence file", or when a java.io.EOFException is thrown (here only the class is checked, because no key message is given).
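The matching rule described above can be sketched in a few lines. This is only an illustration of the config format, not the code in this PR: the names `ErrorRule`, `parse_error_classes`, and `matches` are hypothetical, an exception is modeled as a (class name, message) pair, and exact class-name matching is an assumption (the PR may also consider subclasses). The example entries use the JDK classes java.io.IOException and java.io.EOFException.

```python
from typing import List, NamedTuple, Optional


class ErrorRule(NamedTuple):
    """One entry of the config value (hypothetical helper type)."""
    class_name: str                 # fully qualified exception class name
    key_msg: Optional[str] = None   # text the error message must contain, if given


def parse_error_classes(conf_value: str) -> List[ErrorRule]:
    """Parse 'className[:keyMsg],className[:keyMsg]' into a list of rules."""
    rules = []
    for entry in conf_value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the default value "" yields an empty rule list
        # Split on the first ':' only, so the key message may itself contain ':'.
        class_name, _, key_msg = entry.partition(":")
        key_msg = key_msg.strip()
        rules.append(ErrorRule(class_name.strip(), key_msg or None))
    return rules


def matches(rules: List[ErrorRule], class_name: str, message: Optional[str]) -> bool:
    """Return True if the exception matches any rule.

    Assumption: the class name must match exactly; when a key message is
    configured, the exception message must contain it.
    """
    for rule in rules:
        if rule.class_name != class_name:
            continue
        if rule.key_msg is None:
            return True  # class-only rule
        if message is not None and rule.key_msg in message:
            return True
    return False


rules = parse_error_classes(
    "java.io.IOException:not a Sequence file,java.io.EOFException"
)
```

With these rules, an IOException whose message mentions "not a Sequence file" matches, while an IOException with an unrelated message such as a connection reset does not. Note that a comma-separated format like this cannot express a key message that itself contains a comma.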

The default value of this config is "", which means that no error class list is set. In that case, the behavior of ignoreCorruptFiles remains exactly the same as before.

Why are the changes needed?

To fix the defect in the current ignoreCorruptFiles feature, which matches IOException too broadly.

Does this PR introduce any user-facing change?

Yes. Spark users can change the behavior of ignoreCorruptFiles by setting the new config. By default the behavior remains the same as before, so this is not a breaking change for users.

How was this patch tested?

Added new test cases and passed GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@wayneguow wayneguow changed the title from "[SPARK-39901][CORE][SQL] Redesign ignoreCorruptFiles feature to make it more accurate by adding a new config spark.files.ignoreCorruptFiles.errorClasses" to "[SPARK-39901][CORE][SQL] Redesign ignoreCorruptFiles to make it more accurate by adding a new config spark.files.ignoreCorruptFiles.errorClasses" Jun 25, 2024
@wayneguow wayneguow marked this pull request as ready for review June 26, 2024 04:22
@wayneguow (Contributor Author) commented

cc @JoshRosen @LuciferYang, please take a look when you have time. Thanks.
