Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JP-3768: Move outlier detection median computers to stcal #292

Merged
merged 16 commits into from
Oct 1, 2024

Conversation

emolter
Copy link
Collaborator

@emolter emolter commented Sep 27, 2024

Resolves JP-3768

This PR ports the median calculation machinery added by spacetelescope/jwst#8782 into stcal for use by other missions. See spacetelescope/jwst#8840 for the related changes in jwst.

Tasks

  • update or add relevant tests
  • update relevant docstrings and / or docs/ page
  • Does this PR change any API used downstream? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • run regression tests with this branch installed ("git+https://github.com/<fork>/stcal@<branch>")
news fragment change types...
  • changes/<PR#>.apichange.rst: change to public API
  • changes/<PR#>.bugfix.rst: fixes an issue
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change

Copy link

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 99.23664% with 2 lines in your changes missing coverage. Please review.

Project coverage is 85.19%. Comparing base (8988d2c) to head (c2d10b7).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/stcal/outlier_detection/median.py 98.60% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #292      +/-   ##
==========================================
+ Coverage   84.76%   85.19%   +0.43%     
==========================================
  Files          44       46       +2     
  Lines        8542     8804     +262     
==========================================
+ Hits         7241     7501     +260     
- Misses       1301     1303       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@emolter
Copy link
Collaborator Author

emolter commented Sep 27, 2024

jwst regtests started here
edit: no failures

@emolter emolter marked this pull request as ready for review September 27, 2024 19:19
@emolter emolter requested a review from a team as a code owner September 27, 2024 19:19
Copy link
Collaborator

@kmacdonald-stsci kmacdonald-stsci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some incomplete docstrings missing a brief description of functions/methods. Other than that, looks fine.

memory efficiency optimizations. np.nanmedian always uses at least 64-bit
precision internally, and this is too memory-intensive. Instead, loop over
the median calculation to avoid the memory usage of the internal upcasting
and temporary array allocation. The additional runtime of this loop is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the largest index of shape[0] tested? When testing ramp fitting the file jw02589006001_04101_00001-seg001_nrs1_uncal.fits has 6103 integrations. This seriously slowed down computation time in ramp fitting when running python compared to data with the same dimensions, but a much lower number of integrations.

Copy link
Collaborator Author

@emolter emolter Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the question you are asking about the runtime performance of the nanmedian3D function, right? This conversation on the JWST PR may be relevant, although no test was done with the zeroth (time/n_groups) axis being as large as that.

When you say that it seriously slowed down runtime in ramp fitting, do you mean that a median calculation was done, or just that the entire step scaled poorly with the number of integrations?

For imaging data, where we've seen the slowest processing, the largest n_groups we have seen is only ~100, since the step takes _cal files as input instead of _calints. However, I can look into whether there would be any coronagraphic data that had a huge number of integrations in their _calints files, as this is also used by jwst for coronagraphic data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this on a relatively large coronagraphy dataset I had lying around from JWSTDMS-921. The shape of the array going into this function was (1250, 224, 288), so still not 6000 but this is on the large side of what is processed here in practice. The median calculation does not blow up the runtime in this case: the entire outlier detection step took only 5 seconds to run, as compared with ~1 hour for the coronagraphy-specific align_refs and klip steps. So I don't think this is a concern for runtime but let me know if you would like to see additional tests run.

data: np.ndarray,
idx: int | None = None
) -> None:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docstring describing method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the updated docstring looks ok to you

self._median_computer.add_image(data)

def evaluate(self: MedianComputer) -> np.ndarray:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docstring describing method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the updated docstring looks ok to you

Copy link
Collaborator

@braingram braingram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most comments are docs/docstring related.

Would you add median.py to the docs? It will likely be easier if the submodule has an __all__ (I added a comment about this) which will also help to define the "public" API.

src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
src/stcal/outlier_detection/median.py Outdated Show resolved Hide resolved
tests/outlier_detection/test_median.py Outdated Show resolved Hide resolved
tests/outlier_detection/test_median.py Outdated Show resolved Hide resolved
@emolter
Copy link
Collaborator Author

emolter commented Oct 1, 2024

@braingram I think I incorporated all your comments with the most recent push. Thanks for the docstring updates, I think they improved things a lot, and I agree with your idea to make the OnDiskMedian and DiskAppendableArray private

@emolter emolter requested a review from braingram October 1, 2024 14:41
@emolter emolter requested a review from braingram October 1, 2024 16:04
pyproject.toml Outdated Show resolved Hide resolved
@braingram
Copy link
Collaborator

Would you add median.py to the docs? It will likely be easier if the submodule has an __all__ (I added a comment about this) which will also help to define the "public" API.

Was the addition of this new API to the docs not pushed?

@emolter
Copy link
Collaborator Author

emolter commented Oct 1, 2024

Would you add median.py to the docs? It will likely be easier if the submodule has an __all__ (I added a comment about this) which will also help to define the "public" API.

Was the addition of this new API to the docs not pushed?

I didn't even see the comment, my fault. I will look at the docs from this branch and see what changes are needed. I did already add the __all__

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 1, 2024
@braingram
Copy link
Collaborator

Downstream romancal failures are what I'd consider unrelated. It appears the CI is using the pytest configuration from stcal when running those tests (instead of using the one in romancal). The jwst tests haven't finished but I wouldn't be surprised if it does the same thing (and is now turning some unclosed file warnings into errors).

Thanks for updating the docs, they look good to me:
https://stcal--292.org.readthedocs.build/en/292/stcal/outlier_detection/index.html

I think #292 (comment) is the only unresolved comment I made.

Copy link
Collaborator

@braingram braingram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks great to me.

@emolter emolter merged commit e4dbc44 into spacetelescope:main Oct 1, 2024
24 of 26 checks passed
@emolter emolter deleted the JP-3768 branch October 1, 2024 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation installation outlier-detection testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants