IncompleteRead exception crashes the JobManager #601

Open
VictorVerhaert opened this issue Aug 14, 2024 · 2 comments

@VictorVerhaert
Contributor

While running long jobs, the JobManager crashes when it tries to download the results.

Traceback (most recent call last):
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 748, in _error_catcher
    yield
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 894, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
urllib3.exceptions.IncompleteRead: IncompleteRead(1293825744 bytes read, 627790347 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/requests/models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 1060, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 977, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 872, in _raw_read
    with self._error_catcher():
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/urllib3/response.py", line 772, in _error_catcher
    raise ProtocolError(arg, e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(1293825744 bytes read, 627790347 more expected)', IncompleteRead(1293825744 bytes read, 627790347 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/victor.verhaert/LCFM/lcfm-production/notebooks/JM-LCFM.py", line 137, in <module>
    job_manager.run_jobs(
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/extra/job_management.py", line 273, in run_jobs
    self._update_statuses(df)
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/extra/job_management.py", line 433, in _update_statuses
    self.on_job_done(the_job, df.loc[i])
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/extra/job_management.py", line 373, in on_job_done
    job.get_results().download_files(target=job_dir)
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/rest/job.py", line 502, in download_files
    downloaded = [a.download(target) for a in self.get_assets()]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/rest/job.py", line 502, in <listcomp>
    downloaded = [a.download(target) for a in self.get_assets()]
                  ^^^^^^^^^^^^^^^^^^
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/openeo/rest/job.py", line 378, in download
    for block in response.iter_content(chunk_size=chunk_size):
  File "/home/victor.verhaert/LCFM/lcfm-production/.conda/lib/python3.11/site-packages/requests/models.py", line 822, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(1293825744 bytes read, 627790347 more expected)', IncompleteRead(1293825744 bytes read, 627790347 more expected))

We need to make the job manager more robust to these types of exceptions.
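One way to read "more robust" here is to retry the download a few times before giving up. The following is only a minimal sketch of that idea, not something the openeo client provides: `call_with_retries` and all its parameters are hypothetical names, and the caller chooses which exception types count as retryable.

```python
import logging
import time

log = logging.getLogger(__name__)


def call_with_retries(func, *, retry_on, max_attempts=3, backoff_seconds=5.0, sleep=time.sleep):
    """Call func() and retry when it raises one of the `retry_on` exceptions.

    Hypothetical helper: `retry_on` is a tuple of exception classes,
    the wait time doubles after every failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the failure stays visible.
                raise
            delay = backoff_seconds * 2 ** (attempt - 1)
            log.warning(
                "Attempt %d/%d failed (%r), retrying in %.0f s",
                attempt, max_attempts, exc, delay,
            )
            sleep(delay)


# Possible usage for the download in the traceback above
# (needs `requests` and an openeo job, so it is left as a comment):
#
# call_with_retries(
#     lambda: job.get_results().download_files(target=job_dir),
#     retry_on=(requests.exceptions.ChunkedEncodingError,),
# )
```

Note that retrying `download_files` as a whole starts every asset again from scratch, which is expensive for assets of nearly 2 GB like the one in the traceback. The final failure is still raised, so it cannot be silently overlooked.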

@soxofaan
Member

Do you know whether that ChunkedEncodingError is just a temporary glitch, or can you reproduce the failure each time you try to (manually) download the result assets?

@soxofaan
Member

> We need to make the job manager more robust to these types of exceptions

The question is what can be done better purely at the level of the Python client implementation.

Skipping the failure with a warning is tempting, but that might not be better as default behavior, because the end user might easily overlook the warning and get the wrong impression that everything went fine.

An alternative, simple improvement that could help here is to add an option to not automatically download the results of jobs.
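Until such an option exists, the same effect can probably be approximated by overriding `on_job_done`, the hook that triggers the download in the traceback above. The sketch below is an assumption-laden illustration: `SkipDownloadMixin` and `finished_job_ids` are hypothetical names, and it assumes the job object exposes a `job_id` attribute as openeo's batch job objects do.

```python
import logging

log = logging.getLogger(__name__)


class SkipDownloadMixin:
    """Hypothetical mixin: record finished jobs instead of downloading results."""

    def on_job_done(self, job, row):
        # Assumption: `job` has a `job_id` attribute; `row` is the job's
        # row in the manager's dataframe and is not used here.
        finished = self.__dict__.setdefault("finished_job_ids", [])
        finished.append(job.job_id)
        log.info("Job %s finished; results left on the backend", job.job_id)


# Possible usage (needs openeo installed, so it is left as a comment):
#
# from openeo.extra.job_management import MultiBackendJobManager
#
# class MyJobManager(SkipDownloadMixin, MultiBackendJobManager):
#     pass
```

This replaces everything the default hook does, not only the download, so any other side effects of the default `on_job_done` are skipped as well. The recorded job ids can be used afterwards to fetch results separately, where a failed download no longer stops job tracking.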
