[SPARK-39901][CORE][SQL] Redesign `ignoreCorruptFiles` to make it more accurate by adding a new config `spark.files.ignoreCorruptFiles.errorClasses` #47090

wayneguow · 2024-06-25T18:21:37Z

What changes were proposed in this pull request?

As described in SPARK-39901, the current design of the ignoreCorruptFiles feature is flawed: it matches IOException too broadly, so a file may be ignored by mistake when a transient, sporadic I/O error occurs.

This PR proposes a new config, spark.files.ignoreCorruptFiles.errorClasses. By setting it, Spark users can ignore only those corrupt files whose read failures raise specific exceptions.

For example, suppose the config value is set as follows:

java.io.IOException:not a Sequence file,java.io.EOFException
(config format: className[:keyMsg],className[:keyMsg])

This means a corrupt file is ignored when an IOException is thrown whose error message contains the key text "not a Sequence file", or when a java.io.EOFException is thrown (for this entry only the class is checked).

The default value of this config is "", which means that no error class list for ignoring corrupt files is set. In that case, the behavior of ignoreCorruptFiles remains exactly the same as before.
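
To make the matching rule concrete, here is a minimal Java sketch of how a value in the className[:keyMsg] format could be parsed and checked against a thrown exception. It is not the code from this PR: the class CorruptFileRules and its methods parse and matchesAny are hypothetical names, and the sketch assumes an exact class-name comparison (no subclass or cause-chain checks) and splits each entry at its first colon. It uses java.io.IOException and java.io.EOFException, the packages where those classes live. An empty value parses to an empty rule list; the legacy fallback that applies in that case is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not code from this PR.
final class CorruptFileRules {

    // One parsed entry: an exception class name and an optional message fragment.
    static final class Rule {
        final String className;
        final String keyMsg; // null when the entry has no ":keyMsg" part

        Rule(String className, String keyMsg) {
            this.className = className;
            this.keyMsg = keyMsg;
        }

        // Assumption: exact class-name match, no subclass or cause-chain checks.
        boolean matches(Throwable t) {
            if (!t.getClass().getName().equals(className)) {
                return false;
            }
            if (keyMsg == null) {
                return true; // class-only entry: the message is not inspected
            }
            String msg = t.getMessage();
            return msg != null && msg.contains(keyMsg);
        }
    }

    // Parses "className[:keyMsg],className[:keyMsg]"; an empty value yields no rules.
    static List<Rule> parse(String conf) {
        List<Rule> rules = new ArrayList<>();
        if (conf == null) {
            return rules;
        }
        for (String entry : conf.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int idx = trimmed.indexOf(':');
            if (idx < 0) {
                rules.add(new Rule(trimmed, null));
            } else {
                String name = trimmed.substring(0, idx).trim();
                String key = trimmed.substring(idx + 1).trim();
                // Treat an empty fragment ("Foo:") like a class-only entry.
                rules.add(new Rule(name, key.isEmpty() ? null : key));
            }
        }
        return rules;
    }

    // True when at least one rule matches the given exception.
    static boolean matchesAny(List<Rule> rules, Throwable t) {
        for (Rule rule : rules) {
            if (rule.matches(t)) {
                return true;
            }
        }
        return false;
    }
}
```

With the rules parsed from "java.io.IOException:not a Sequence file,java.io.EOFException", an IOException whose message contains "not a Sequence file" matches, an IOException with an unrelated message such as "Connection reset by peer" does not, and any EOFException matches regardless of its message. One limitation of a comma-separated format like this is that a keyMsg containing a comma cannot be expressed in this sketch.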

Why are the changes needed?

To fix the flaws in the current ignoreCorruptFiles feature, which ignores files on overly broad IOException matches.

Does this PR introduce any user-facing change?

Yes. Spark users can change the behavior of ignoreCorruptFiles by setting the new config. By default the behavior remains the same as before, so this is not a breaking change for users.

How was this patch tested?

Added new test cases; GA passes.

Was this patch authored or co-authored using generative AI tooling?

No.

wayneguow · 2024-06-26T04:23:47Z

cc @JoshRosen @LuciferYang, please take a look when you have time. Thanks.

github-actions bot added SQL DOCS CORE AVRO labels Jun 25, 2024

wayneguow force-pushed the SPARK-39901 branch from 11424df to 9de45b7 June 26, 2024 02:55

wayneguow marked this pull request as ready for review June 26, 2024 04:22

wayneguow added 4 commits June 26, 2024 19:54

redesign

4c6b43f

fix test cases and import order

31e39c5

optimize test cases code

ec096dc

fix style

698685c

wayneguow force-pushed the SPARK-39901 branch from 2e35379 to 698685c June 26, 2024 11:56
