[BUG] JSON reader metadata contains an extra child for string column inside (deeply) nested structs/lists #17108

Open
mhaseeb123 opened this issue Oct 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@mhaseeb123
Member

Describe the bug
The table metadata returned by the JSON reader contains an extra child for string columns located within deeply nested structs and lists. The extra child is caught by an assert when writing the read table, together with that metadata, to (say) parquet here. Writing the read table to parquet without using the metadata from read_json succeeds.

Steps/Code to reproduce bug

  1. Create the JSON file on an RDSLab machine
import cudf
from io import StringIO
import shutil

df = cudf.read_parquet(
    "/datasets/gkimball/spark_json/20241001/part-00000-505e98e9-a5c8-4720-8bb4-d6cc96625744-c000.snappy.parquet"
)
print("cudf read input parquet")

buf = StringIO(df["columnC"].str.cat(sep="\n", na_rep="{}"))
print("StringIO made JSONL buffer")

with open("/home/coder/json_buffer.json", "w") as fd:
    buf.seek(0)
    shutil.copyfileobj(buf, fd)
  2. Paste the contents of the attached parquet_io.txt file into the parquet_io.cpp example as is, then build libcudf and the parquet_io example.
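
For readers without access to the dataset, step 1 can be sketched in plain Python. This is only an approximation of what `str.cat(sep="\n", na_rep="{}")` does to the column: rows are joined with newlines and null rows become empty JSON objects. The helper name `rows_to_jsonl` and the sample rows are hypothetical and are not taken from the actual data.

```python
from io import StringIO


def rows_to_jsonl(rows, na_rep="{}"):
    """Join JSON strings into one JSONL buffer, replacing None with na_rep."""
    return "\n".join(na_rep if row is None else row for row in rows)


# Hypothetical rows: a string nested inside a struct and a list, plus a null row.
sample_rows = ['{"a": {"b": ["x"]}}', None, '{"a": {"b": []}}']
buf = StringIO(rows_to_jsonl(sample_rows))
```

Each line of the resulting buffer is one JSON record, which is the shape the JSON Lines reader expects.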

Expected behavior
The read table metadata should not have an extra child for string columns.
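
The expectation above can be illustrated with a toy model. The sketch below is not libcudf code: `Node` and `extra_string_children` are hypothetical names, and the tree is a simplified stand-in for the column name hierarchy carried in the table metadata. It walks the hierarchy and reports any string entry that carries children.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one entry in the metadata hierarchy."""
    name: str
    kind: str  # "struct", "list", or "string"
    children: list[Node] = field(default_factory=list)


def extra_string_children(meta: Node, path: str = "") -> list[str]:
    """Return the paths of string entries that have one or more children."""
    here = f"{path}/{meta.name}" if path else meta.name
    found = [here] if meta.kind == "string" and meta.children else []
    for child in meta.children:
        found.extend(extra_string_children(child, here))
    return found


# Expected shape: the string leaf has no children.
good = Node("a", "struct", [Node("b", "list", [Node("element", "string")])])
# Shape described in this issue: the nested string leaf carries an extra child.
bad = Node("a", "struct", [
    Node("b", "list", [Node("element", "string", [Node("extra", "string")])]),
])
```

Calling `extra_string_children(good)` returns an empty list, while `extra_string_children(bad)` returns the path of the offending string entry.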

Environment overview (please complete the following information)
Machine: dgx05 at RDS Lab, cudf branch-24.12, cudf conda devcontainer with CUDA 12.5

Environment details
N/A

Additional context
Note that once write_parquet succeeds with the fix, the last read_parquet (verification) portion of the example may still fail until the changes from #17059 have been pulled in.

@mhaseeb123 mhaseeb123 added the bug Something isn't working label Oct 17, 2024
@mhaseeb123
Member Author

For more context, the schema seen by write_parquet (column names normalized) when using the metadata from read_json is in with_schema.txt, and the schema seen by write_parquet otherwise is in without_schema.txt.
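
One way to spot the extra child is to diff the two attached schema dumps. The sketch below assumes both files have been downloaded and read into strings; `schema_diff` is a hypothetical helper, and the sample text in the usage example is made up rather than copied from the attachments.

```python
import difflib


def schema_diff(with_schema: str, without_schema: str) -> list[str]:
    """Return unified-diff lines showing what the read_json metadata adds."""
    return list(
        difflib.unified_diff(
            without_schema.splitlines(),
            with_schema.splitlines(),
            fromfile="without_schema.txt",
            tofile="with_schema.txt",
            lineterm="",
        )
    )


# Made-up schema text, for illustration only.
without_text = "a\n  b\n    element"
with_text = "a\n  b\n    element\n      extra"
diff_lines = schema_diff(with_text, without_text)
```

Lines that start with `+` in the result exist only in the schema built from the read_json metadata, so an extra child would show up there.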
