
[BUG] Malformed fixed length byte array Parquet file loads corrupted data instead of error #14104

Open
jlowe opened this issue Sep 13, 2023 · 5 comments
Labels
1 - On Deck To be worked on next bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Member

jlowe commented Sep 13, 2023

Describe the bug
Using libcudf to load a malformed Parquet file "succeeds" by producing a table with some corrupted rows rather than returning an error as expected. Spark 3.5, parquet-mr 1.13.1, and pyarrow 13 all fail with "unexpected EOF" errors when trying to load the same file.

Steps/Code to reproduce bug
Load https://github.com/apache/parquet-testing/blob/master/data/fixed_length_byte_array.parquet using libcudf. Note that it produces a table of 1000 rows with no nulls, and some rows have a list of bytes with more than 4 entries. According to the docs for the file, the data is supposed to be a single column of fixed-length byte arrays of size 4, yet some rows load with more than four bytes and some with none.
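
A minimal repro sketch against libcudf's public C++ reader API (the local file path is an assumption; it should point at a downloaded copy of the file above):

```cpp
// Repro sketch: read the malformed file with libcudf's parquet reader.
// Expected: an exception for the malformed file; observed: a 1000-row
// table containing corrupted list-of-bytes rows.
#include <cudf/io/parquet.hpp>

#include <iostream>

int main()
{
  auto const opts = cudf::io::parquet_reader_options::builder(
                      cudf::io::source_info{"fixed_length_byte_array.parquet"})
                      .build();
  auto const result = cudf::io::read_parquet(opts);
  std::cout << "rows: " << result.tbl->num_rows() << std::endl;
  return 0;
}
```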

Expected behavior
libcudf should return an error when trying to load the file rather than producing corrupted rows.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS cuIO cuIO issue labels Sep 13, 2023
jlowe added a commit to jlowe/spark-rapids that referenced this issue Sep 13, 2023
jlowe added a commit to NVIDIA/spark-rapids that referenced this issue Sep 13, 2023
@etseidl
Contributor

etseidl commented Sep 14, 2023

These are like puzzles 😅 The issue with this file is that the schema says the flba_field is required, but the column index indicates there are nulls. Because the schema says the field is required, there is no definition level data either. This one could be detected with a sanity check during page header parsing (make sure the uncompressed size makes sense for the number of values that should be present), or even right after reading the file metadata (schema says required but metadata says nulls are present). The page reader should also exit when it reaches the end of the page data, but it does not: it trusts the value counts and doesn't currently detect buffer overruns. And as with other errors that occur on the device, it's hard to communicate back to the host that some exception occurred.
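
For the page-header route, a rough sketch of what such a sanity check could look like (the struct and field names below are illustrative stand-ins, not libcudf's actual page-header types, and this covers only the PLAIN-encoded, no-levels case relevant to this file):

```cpp
#include <cstdint>

// Simplified view of the fields the check needs from a parsed page header.
struct PageHeaderView {
  int32_t num_values;              // value count from the data page header
  int32_t uncompressed_page_size;  // byte count from the page header
};

// For a PLAIN-encoded FIXED_LEN_BYTE_ARRAY column with no definition or
// repetition levels, the page data must be exactly num_values * type_length
// bytes. Any mismatch means the value count and byte count disagree, so the
// reader should reject the page instead of decoding past the buffer end.
bool plain_flba_page_size_is_sane(PageHeaderView const& page,
                                  int32_t type_length /* FLBA width, 4 here */)
{
  return page.uncompressed_page_size ==
         static_cast<int64_t>(page.num_values) * type_length;
}
```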

@GregoryKimball
Contributor

GregoryKimball commented Sep 18, 2023

Thank you @etseidl for looking into this. Of your proposals I prefer:

after reading the file metadata (schema says required but metadata says nulls are present)

I don't mind doing more work if we are going to crash anyway. What do you think is the simplest check to implement?
(FYI @PointKernel)

@etseidl
Contributor

etseidl commented Sep 18, 2023

@GregoryKimball I think the simplest would be to walk the schema, find the max definition level for each column, and then check the ColumnIndex for each column chunk of that column to see if the num_nulls field is consistent with the max definition level (i.e., if max_def == 0 and num_nulls > 0, then error). This would be doable on the host without digging into the page data, but it requires that column indexes are present (which they are for this file). The next option would be to do the same thing but walk the page headers in the file to get the null counts; that, however, would require V2 data page headers.
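
A host-side sketch of that consistency check, using simplified stand-ins for the Thrift metadata types (the real ColumnIndex stores per-page null counts as an optional list):

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Simplified stand-in for the parquet ColumnIndex metadata.
struct ColumnIndexView {
  std::optional<std::vector<int64_t>> null_counts;  // one entry per page
};

// If a column's max definition level is 0, the schema says it is required
// all the way up, so every ColumnIndex for its chunks must report zero nulls.
void check_required_column_has_no_nulls(
  int max_definition_level, std::vector<ColumnIndexView> const& chunk_indexes)
{
  if (max_definition_level > 0) { return; }  // nulls representable; nothing to check
  for (auto const& idx : chunk_indexes) {
    if (!idx.null_counts) { continue; }  // no column index: cannot check on host
    for (auto const null_count : *idx.null_counts) {
      if (null_count > 0) {
        throw std::runtime_error(
          "Malformed Parquet: required column reports nulls in ColumnIndex");
      }
    }
  }
}
```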

The only surefire way is to detect the buffer overrun when decoding the values (which is what parquet-mr and arrow seem to do), but as I've said, erroring out of the kernel when that is detected and communicating the error back to the host is an issue.

@vuule
Contributor

vuule commented Sep 20, 2023

The decode kernel does not detect the error; the page_state_s.error flag stays at zero when reading the linked file.
If this flag were raised, we could use it to communicate the error to the host. I think the overhead of such a solution would be acceptable (a 4-byte D2H copy when there are no errors, plus an atomicOr when an error occurs).
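
A sketch of that scheme with illustrative names (libcudf's actual decode kernels and page state are considerably more involved):

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <stdexcept>

// Decode kernels OR an error code into a single device word; the no-error
// path never touches it, so the only steady-state cost is one D2H copy.
__device__ void set_error(uint32_t* error_code, uint32_t err)
{
  atomicOr(error_code, err);
}

__global__ void decode_pages_kernel(/* page state ... */ uint32_t* error_code)
{
  // ... decode values; on a buffer overrun or value-count mismatch:
  bool overrun_detected = false;  // placeholder for the real check
  if (overrun_detected) { set_error(error_code, 1u); }
}

void decode_and_check(uint32_t* d_error_code, cudaStream_t stream)
{
  decode_pages_kernel<<<1, 1, 0, stream>>>(d_error_code);
  uint32_t h_error = 0;  // the 4-byte device-to-host copy mentioned above
  cudaMemcpyAsync(&h_error, d_error_code, sizeof(h_error),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  if (h_error != 0) { throw std::runtime_error("Parquet decode error on device"); }
}
```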

@GregoryKimball
Contributor

#14167 is taking the first step toward solving this case. We will also need to update the decode kernel to detect this error.

@GregoryKimball GregoryKimball added 1 - On Deck To be worked on next and removed Needs Triage Need team to review and classify labels Sep 27, 2023