Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of DataModel reading for single (or simple) metadata access #284

Open
braingram opened this issue Mar 14, 2024 · 4 comments

Comments

@braingram
Copy link
Collaborator

braingram commented Mar 14, 2024

There are use cases (like the jwst resample step) where loading a single keyword from many DataModel containing fits files may be useful. As the number of files might be very large and opening every model might exceed reasonable amounts of RAM it will be important to have a performant way to perform these simple keyword accesses.

Using meta.wcsinfo.s_region (contained in the ASDF extension) as an example there are a few ways this keyword can be read:

  1. Using stdatamodels.jwst.datamodels.open:
with dm.open(fn) as m:
    return m.meta.wcsinfo.s_region
  1. Using stdatamodels.asdf_in_fits:
with asdf_in_fits.open(fn) as af:
    return af['meta']['wcsinfo']['s_region']
  1. Using astropy.io.fits and asdf.open:
with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    with asdf.open(bs) as af:
        return af['meta']['wcsinfo']['s_region']
  1. Using astropy.io.fits and asdf.util.load_yaml:
with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    tree = asdf.util.load_yaml(bs)
    return tree['meta']['wcsinfo']['s_region']

Of the 4 options above 1-3 are similar in performance (using both a ImageModel and a larger IFUImageModel as test files). With performance being limited primarily by asdf.open (more on that below). 4 is much faster in both cases. See the below table for performance (run with cProfile so slightly slower than real).

ImageModel IFUImageModel
dm.open 0.245 3.945
asdf_in_fits 0.137 3.815
asdf.open 0.143 3.820
load_yaml 0.036 0.332

The table shows it's ~10x faster to use load_yaml as this skips:

  • validation
  • mapping fits based on the schema
  • loading arrays
  • conversion from asdf flavored yaml to custom python objects

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

Below is a snakeviz generated graph of the call to dm.open for the IFUImageModel data file showing the bulk of the time spent in asdf.open:
Screen Shot 2024-03-14 at 5 28 29 PM

@nden
Copy link
Collaborator

nden commented Mar 15, 2024

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

I agree having this function would be very useful. What dangers do you foresee?

@braingram
Copy link
Collaborator Author

braingram commented Mar 15, 2024

Thanks for taking a look at this issue. Most of the dangers are that it bypasses all of the stdatamodels and asdf machinery. This means:

  • no validation of the tree (although this was validated on write)
  • no checking that the tree matches the fits data (even via the FITS_HASH, so if a user modifies a fits header the modification will not be seen with load_yaml)
  • no filling of defaults or reference resolution (although I'm not sure either of these are used)
  • no conversion from tagged to custom objects (so many objects, like arrays, will come through as dictionaries)
  • no mapping of arrays to fits data

I think for the s_region use case all of the above are ok. Are there other places in jwst where this might be useful?

@perrygreenfield
Copy link
Collaborator

If one were concerned with validation, the option could be to do it the current way (though I would not make that the option for operations since the file should have been validated, and no sneaky, invalid updates should have been done). But I agree that an easy-to-use version of 4) should be provided. I can see mining of the files meta data is extremely useful, particularly out of operations and should be as efficient as possible.

@braingram
Copy link
Collaborator Author

ModelContainer in jwst also provides a models_grouped method that groups models based on a small set of metadata keywords:
https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L466
These keywords are all strings in the primary fits header. Related to this issue, it would be useful to have an efficient way to read these keywords without incurring the overhead of traversing the schema. These attributes are duplicated in the ASDF extension, however the values in the fits header should be preferred (as is already done for datamodels.open).

It would be great if the efficient metadata access code could also load multiple fits keywords.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants