[Python] read_table from s3 randomly fails due to timeout #45432

eladc · 2025-02-05T14:27:40Z

Describe the bug, including details regarding any error messages, version, and platform.

Hello,

This is very similar to bug #36007.

The requesting machine is in the same region as the S3 bucket.
joblib is used to parallelize the download, with up to 56 threads.
The failure is very difficult to reproduce. It happens at least once a day to random users who all run the same download code, but on different Parquet files.

Installed packages:
arrow 1.3.0
pyarrow 14.0.1

  File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 3003, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,   
  File "/opt/venv/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2631, in read
    table = self._dataset.to_table(  
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3713, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
Error: IOError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
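
One way to collect more detail on which reads time out is to wrap each read in a retry helper that logs the path, attempt number, and elapsed time. The following is only a minimal sketch, not the code used in production: `read_with_retry`, its parameters, and the logger name are hypothetical, and it assumes the failure surfaces as `OSError` (the traceback above reports `IOError`, which is the same class in Python 3).

```python
import logging
import random
import time

logger = logging.getLogger("s3_read_retry")  # hypothetical logger name


def read_with_retry(read_fn, path, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call read_fn(path), retrying on OSError with exponential backoff.

    read_fn stands in for whatever performs the read, for example
    pyarrow.parquet.read_table. The last failure is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            return read_fn(path)
        except OSError as exc:
            elapsed = time.monotonic() - start
            # Record which file failed, on which attempt, and after how long.
            logger.warning(
                "attempt %d/%d for %s failed after %.1fs: %s",
                attempt, max_attempts, path, elapsed, exc,
            )
            if attempt == max_attempts:
                raise
            # Exponential backoff plus a little jitter so that parallel
            # workers do not all retry at the same instant.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))


# Hypothetical usage (requires pyarrow; the path is a placeholder):
# import pyarrow.parquet as pq
# table = read_with_retry(pq.read_table, "s3://some-bucket/some-file.parquet")
```

The logged elapsed times would show whether the failures cluster around a fixed timeout value. If they do, passing an explicit `pyarrow.fs.S3FileSystem` with larger `request_timeout` and `connect_timeout` values may be worth trying, although whether that helps in this case is untested.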

How can I debug this further?

Thank you.

Component(s)

Python

eladc added the Type: bug label Feb 5, 2025

github-actions bot added the Component: Python label Feb 5, 2025
