Describe the bug
Hi!
`awswrangler.athena.to_iceberg()` might have a race condition between `s3.to_parquet()` and `_merge_iceberg()`.

The logical sequence of calling `s3.to_parquet()` followed by `_merge_iceberg()` seems correct. However, there is a potential race condition where the Athena query execution request might proceed before the Parquet file upload to S3 is complete. This could lead to errors in the Athena engine, such as failing to find the file or attempting to read an incomplete upload.

(In fact, in my dev environment, when using `to_iceberg()` with large datasets, I frequently encounter failures with the error `HIVE_BAD_DATA: Not valid Parquet file`, which I suspect is caused by the query accessing files before they are fully uploaded.)

For this, I propose two potential solutions:

1. Implement S3 and Glue verification (more robust, but requires additional API calls)
2. Add a callback mechanism (enables asynchronous processing, but may increase code complexity)

If development time is a concern with those two, I propose a simpler alternative: add a `delay_time` parameter and implement a simple delay between the upload and the query execution.
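A minimal sketch of this interim approach (the function and parameter names below are hypothetical, standing in for the internal upload and merge steps):

```python
import time

def write_then_merge(write_parquet, merge_iceberg, delay_time: float = 0.0) -> None:
    """Upload the Parquet data, optionally pause, then run the Iceberg merge.

    `write_parquet` and `merge_iceberg` stand in for the internal
    s3.to_parquet() and _merge_iceberg() calls; `delay_time` is the
    proposed configurable pause (in seconds) between them.
    """
    write_parquet()
    if delay_time > 0:
        # Give S3 a moment before Athena tries to read the new files.
        time.sleep(delay_time)
    merge_iceberg()
```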
This would provide a basic mechanism to ensure logic stability by allowing for a configurable delay between the upload and query execution. While not ideal, it could serve as a quick interim solution to improve reliability.
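For the more robust verification option, a rough sketch of the retry loop is below; `object_exists` is a hypothetical callable abstracting the S3 HeadObject check (e.g. a thin wrapper around boto3's `head_object`), so the polling logic itself stays testable:

```python
import time

def wait_for_objects(object_exists, keys, timeout: float = 30.0, poll: float = 1.0) -> None:
    """Poll until every uploaded key is visible, or raise TimeoutError.

    object_exists: callable(key) -> bool, e.g. wrapping an S3 HeadObject call.
    keys: the Parquet object keys written by s3.to_parquet().
    """
    deadline = time.monotonic() + timeout
    pending = set(keys)
    while pending:
        # Drop every key that is now visible; keep polling the rest.
        pending = {k for k in pending if not object_exists(k)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"objects not yet visible in S3: {sorted(pending)}")
        time.sleep(poll)
```

This mirrors what boto3's built-in `object_exists` waiter does under the hood; the Glue-side verification would be an analogous check before issuing the merge query.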
Best regards,
How to Reproduce
OS
Amazon Linux 2, JupyterLab 3 (notebook-al2-v2)
Python version
3.10.8
AWS SDK for pandas version
3.10.0