[SPARK-39901][CORE][SQL] Redesign ignoreCorruptFiles to make it more accurate by adding a new config spark.files.ignoreCorruptFiles.errorClasses
#47090
What changes were proposed in this pull request?
As described in issue SPARK-39901, the current design of the ignoreCorruptFiles feature has certain flaws. It matches IOExceptions too broadly, which may cause a file to be ignored by mistake when some transient, sporadic IO exception is encountered. This PR proposes a new config, spark.files.ignoreCorruptFiles.errorClasses. By setting this config, Spark users can accurately ignore only corrupt files caused by specific exceptions. For example, the config value could be set as follows:

java.io.IOException:not a Sequence file,java.io.EOFException

(config format: className[:keyMsg],className[:keyMsg])
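The matching rule implied by this format can be sketched as below. This is a hypothetical illustration under assumed semantics, not Spark's actual implementation; the class name ErrorClassMatcher and method shouldIgnore are invented for this sketch. An entry matches an exception when the exception's class name equals className and, if keyMsg is present, the exception message contains it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the matching rule implied by the config format
// "className[:keyMsg],className[:keyMsg]". Names are illustrative only.
public class ErrorClassMatcher {
    // Each entry is {className, keyMsg-or-null}.
    private final List<String[]> entries = new ArrayList<>();

    public ErrorClassMatcher(String conf) {
        for (String entry : conf.split(",")) {
            if (entry.isEmpty()) continue;
            // Split on the first ':' only, so key messages may contain spaces.
            String[] parts = entry.split(":", 2);
            entries.add(new String[] {
                parts[0].trim(),
                parts.length == 2 ? parts[1] : null
            });
        }
    }

    // True when the exception's class name equals an entry's class name and,
    // if that entry has a key message, the exception's message contains it.
    public boolean shouldIgnore(Throwable e) {
        for (String[] entry : entries) {
            boolean classMatches = e.getClass().getName().equals(entry[0]);
            boolean msgMatches = entry[1] == null
                || (e.getMessage() != null && e.getMessage().contains(entry[1]));
            if (classMatches && msgMatches) {
                return true;
            }
        }
        return false;
    }
}
```

With the example config above, an IOException whose message contains "not a Sequence file" matches the first entry, and any java.io.EOFException matches the second entry regardless of its message.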
It means that corrupt files should be ignored when an IOException is encountered whose error message contains the key message "not a Sequence file", or when a java.io.EOFException is encountered (note that only the class needs to match in that case). The default value of this config is "", which means no error class list has been set for ignoring corrupt files; in that case, the behavior of ignoreCorruptFiles remains exactly the same as before.

Why are the changes needed?
It addresses the defects of the current ignoreCorruptFiles feature: exceptions can now be matched precisely instead of ignoring a file on every IOException.

Does this PR introduce any user-facing change?
Yes, Spark users can change the behavior of ignoreCorruptFiles by setting the new config. By default, however, the behavior remains the same as before, so this is not a breaking change for users.

How was this patch tested?
Added new test cases and passed GA.
Was this patch authored or co-authored using generative AI tooling?
No.