Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix is_time to avoid memory overload #397

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

fix is_time to avoid memory overload #397

wants to merge 19 commits into from

Conversation

cehbrecht
Copy link
Collaborator

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes issue #xyz
  • Tests for the changes have been added (for bug fixes / features)
  • Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)

What kind of change does this PR introduce?:

This is a fix for is_time function to avoid memory overload when using coord.values for a data variable.

Does this PR introduce a breaking change?:

Other information:

This issue occurred when testing atlas-v2 data with very large datasets (several GBs):

https://github.com/roocs/rook/blob/dev-atlas-v2/notebooks/atlas-v2.ipynb

@cehbrecht cehbrecht marked this pull request as draft February 26, 2025 16:28
@coveralls
Copy link

coveralls commented Feb 26, 2025

Pull Request Test Coverage Report for Build 13785665732

Details

  • 30 of 33 (90.91%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 86.818%

Changes Missing Coverage Covered Lines Changed/Added Lines %
clisops/utils/dataset_utils.py 30 33 90.91%
Files with Coverage Reduction New Missed Lines %
clisops/utils/dataset_utils.py 2 79.41%
Totals Coverage Status
Change from base Build 13766759433: 0.2%
Covered Lines: 3102
Relevant Lines: 3573

💛 - Coveralls

def is_time(coord):
"""
Determines if a coordinate is time.

:param coord: coordinate of xarray dataset e.g. coord = ds.coords[coord_id]
:return: (bool) True if the coordinate is time.
"""
if coord.ndim >= 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it really be False by default for time_bnds, so should time_bnds not be considered as time coordinate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know ... but if I skip the check/filter than I have to deal with it at other places ... see L106.

Copy link
Contributor

@sol1105 sol1105 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In l64 in function get_coord_by_type one could replace
coord_vars = list(ds.coords) + list(ds.data_vars) with coord_vars = (list(ds.coords) + list(ds.data_vars)).remove(get_main_variable(ds)), then is_time should no longer be run for variables that do not fit in memory, as it was before.

return coord_id, [x for x in coords if x != coord_id]
else:
return coord_id
if coord_id in ds.coords:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure coord_id is in ds.coords (lat_bnds is not)

@cehbrecht
Copy link
Collaborator Author

In l64 in function get_coord_by_type one could replace coord_vars = list(ds.coords) + list(ds.data_vars) with coord_vars = (list(ds.coords) + list(ds.data_vars)).remove(get_main_variable(ds)), then is_time should no longer be run for variables that do not fit in memory, as it was before.

I have added this. there was another issue that get_main_variable failed for some regrid tests. See regrid.detect_coordinate L858.

@Zeitsperre Zeitsperre mentioned this pull request Feb 27, 2025
4 tasks
@cehbrecht
Copy link
Collaborator Author

cehbrecht commented Mar 6, 2025

@sol1105 tests are working now. We have updated to the latests xarray ... some regrid tests are failing. I have marked them with xfail. Are you ok with this?

Not sure how we solve this in future. We could patch xarray ...

@Zeitsperre
Copy link
Collaborator

Zeitsperre commented Mar 6, 2025

@cehbrecht I asked my colleague @aulemahal to look into the issue on the xarray side. There might be some movement there soon!

For reference: pydata/xarray#8821

@cehbrecht cehbrecht marked this pull request as ready for review March 11, 2025 10:22
@cehbrecht
Copy link
Collaborator Author

@Zeitsperre @sol1105 good to go?

Copy link
Collaborator

@Zeitsperre Zeitsperre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Small changes.

Comment on lines +109 to +110
for coord_id in coords:
if coord_id in ds.coords:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably flatten this with:

if coord_id in coords and coord_id in ds.coords:

Comment on lines +225 to +226
Check if a coordinate uses cftime datetime objects.
Handles Dask-backed arrays for lazy evaluation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Check if a coordinate uses cftime datetime objects.
Handles Dask-backed arrays for lazy evaluation.
Check if a coordinate uses cftime datetime objects.
Handles Dask-backed arrays for lazy evaluation.

@pytest.mark.skipif(xe is None, reason=XESMF_IMPORT_MSG)
def test_regrid_basic(tmpdir, tmp_path, mini_esgf_data, xfail_if_xarray_incompatible):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the xfail_if_xarray_incompatible fixture obsolete now? If so, we should remove it from tests/conf.py

@Zeitsperre Zeitsperre mentioned this pull request Mar 11, 2025
4 tasks
@Zeitsperre
Copy link
Collaborator

@cehbrecht Don't forget to update the HISTORY.rst file!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants