[Python] read_table from s3 randomly fails due to timeout #45432

Open
eladc opened this issue Feb 5, 2025 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

Hello,

This is very similar to bug #36007.

The requesting machine is in the same region as the S3 bucket.
joblib is used to parallelize the download, with up to 56 threads.
The failure is very difficult to reproduce: it happens at least once a day, to random users who run the same download code against different Parquet files.
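
To make the setup above concrete, and to record which file and which attempt timed out, the read can be wrapped in a small retry helper. This is a minimal sketch, not the code that actually runs for these users: `read_with_retry`, `read_all_parallel`, the logger name, and the attempt and delay values are all hypothetical, and it assumes the timeout surfaces as `OSError` (which `IOError` aliases in Python 3).

```python
import logging
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("s3_read_debug")  # hypothetical logger name


def read_with_retry(read: Callable[[str], T], path: str,
                    attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call read(path), retrying on OSError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            return read(path)
        except OSError as exc:
            elapsed = time.monotonic() - start
            # Record enough context to correlate failures across users and files.
            log.warning("read failed: path=%s attempt=%d/%d elapsed=%.1fs error=%s",
                        path, attempt, attempts, elapsed, exc)
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def read_all_parallel(paths: Sequence[str], n_threads: int = 56) -> list:
    """Read every path with joblib threads, mirroring the setup described above."""
    # Imported lazily so read_with_retry stays usable without these packages.
    from joblib import Parallel, delayed
    import pyarrow.parquet as pq

    return Parallel(n_jobs=n_threads, prefer="threads")(
        delayed(read_with_retry)(pq.read_table, path) for path in paths
    )
```

The elapsed time logged for each failed attempt would show whether the failures cluster around a fixed timeout value or vary with load.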

Installed packages:
arrow 1.3.0
pyarrow 14.0.1

  File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 3003, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,   
  File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2631, in read
    table = self._dataset.to_table(  
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3713, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
Error: IOError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
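
curlCode 28 is libcurl's `CURLE_OPERATION_TIMEDOUT`, so one thing that may be worth trying is passing an explicitly configured filesystem to `read_table` instead of relying on the defaults. The sketch below is a starting point under stated assumptions, not a confirmed fix: the timeout values, the retry count, and the helper names `s3_options`, `make_s3_filesystem`, and `read_table_with_timeouts` are hypothetical, while `request_timeout`, `connect_timeout`, and `retry_strategy` are parameters of `pyarrow.fs.S3FileSystem`.

```python
def s3_options(region: str, request_timeout: float = 30.0,
               connect_timeout: float = 10.0, max_attempts: int = 5) -> dict:
    """Collect the settings in one place so they can be logged next to failures."""
    return {
        "region": region,
        "request_timeout": request_timeout,  # seconds; handed to the AWS SDK as its request timeout
        "connect_timeout": connect_timeout,  # seconds; handed to the AWS SDK as its connect timeout
        "max_attempts": max_attempts,        # retry attempts made inside the AWS SDK
    }


def make_s3_filesystem(options: dict):
    """Build an S3FileSystem from the options above (values are assumptions)."""
    from pyarrow import fs  # imported lazily; requires pyarrow with S3 support

    return fs.S3FileSystem(
        region=options["region"],
        request_timeout=options["request_timeout"],
        connect_timeout=options["connect_timeout"],
        retry_strategy=fs.AwsStandardS3RetryStrategy(
            max_attempts=options["max_attempts"]
        ),
    )


def read_table_with_timeouts(bucket_and_key: str, options: dict):
    """Read one Parquet file; the path is 'bucket/key' with no s3:// prefix."""
    import pyarrow.parquet as pq

    return pq.read_table(bucket_and_key, filesystem=make_s3_filesystem(options))
```

Comparing failure rates before and after raising the timeouts would indicate whether the SDK defaults are simply too tight for 56 concurrent reads.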

How can I debug this further?

Thank you.

Component(s)

Python
