
Add footprint finder code #127

Open · wants to merge 17 commits into main from footprint-finder

Conversation

snbianco (Collaborator) commented Sep 18, 2024

This is a draft PR that adds the footprint finder code from TESScut. Leaving as draft since I haven't added tests yet, but I'd welcome any feedback on the actual implementation.

I tried to make things as general as possible in terms of functions and variable names, but some things are more TESS-specific and will need to be pulled out into a wrapper later when we generalize. Namely, _extract_sequence_information, _create_sequence_list, and _get_cube_files_from_sequence_obs are more TESS-specific as of now. The same is true of certain parts of cube_cut_from_footprint, mainly variables.

Something I was unsure about is how to best handle multiprocessing. The cube_cut_from_footprint function takes a threads parameter to use as max_workers when using cube_cut on the gathered cube files. However, the cube_cut function also takes its own threads parameter. Should these be two separate parameters to cube_cut_from_footprint, or the same? Should threads in cube_cut just be set to 'auto'?

scfleming (Collaborator) commented Sep 18, 2024

My initial reaction is to say let's set up a single n_threads parameter for multi-threading of any function that can use it. While on paper it might be nice to say "use 8 threads for this one but 16 for that one", that sounds like over-engineering to me at this stage of the process, and it would be much simpler to have a single n_threads parameter used globally.


# Generate cutout from each cube file
cutout_files = []
if threads == 'auto' or threads > 1:

Member

I am not sure threads here will help, and it may in fact hurt. cube_cut already uses threads for accessing the S3 data, so with this change, each cutout file would spawn that many threads. In my testing, there are diminishing returns after 8 threads. Since this could end up creating many times that number of threads, I expect we'd see thread contention here.

If you set threads to 0, versus setting threads to "auto" or "8", what are the results here?

snbianco (Collaborator Author), Sep 19, 2024

Testing on my machine, I do see a performance improvement with a larger number of threads. It's more apparent when a sector isn't provided and more than 1 cutout is being generated. For example, these commands each generate 7 cutouts.

cube_cut_from_footprint('130 30', cutout_size=50, threads=0) --> 1 min, 23.4 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads=8) --> 57.6 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads='auto') --> 46.4 sec

Member

is this before or after you made the change to use the same threads variable to pass into the cube_cut function?

snbianco (Collaborator Author), Sep 19, 2024

Looks like I ran the test before, when threads was set to auto for cube_cut. When using the same threads variable, the call with threads=0 takes a lot longer, which makes sense. I also see that there is less than a second difference between threads=8 and threads=auto.

Maybe it would be best to keep threads for cube_cut constant at some value, like 'auto' or 8? I think that using threads in cube_cut_from_footprint is still worthwhile for the performance improvement when making several cutouts at once, but performance does seem to stagnate after a certain point.

I'm also thinking that the default value for threads in cube_cut_from_footprint should be set to 8 rather than 1, since performance is consistently better.
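
A minimal sketch of what that single-parameter approach could look like, assuming a hypothetical helper name; the inner cube_cut thread count is pinned to 'auto' so the total number of threads stays bounded:

from concurrent.futures import ThreadPoolExecutor

from astrocut import CutoutFactory

def _cutout_all_cubes(cube_files, coordinates, cutout_size, threads=8):
    """Generate a cutout from each cube file, fanning out across threads."""
    factory = CutoutFactory()

    def _one_cutout(cube_file):
        # Keep the inner thread count fixed so each cutout doesn't multiply the pool.
        return factory.cube_cut(cube_file, coordinates, cutout_size, threads='auto')

    if threads == 1:
        return [_one_cutout(f) for f in cube_files]
    max_workers = None if threads == 'auto' else threads  # None lets the executor decide
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one_cutout, cube_files))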

Comment on lines 198 to 201
sequences : int, List[int], optional
    Default None. Sequence(s) from which to generate cutouts. Can provide a single
    sequence number as an int or a list of sequence numbers. If not specified,
    cutouts will be generated from all sequences that contain the cutout.

Member

Maybe you were trying to generalize this but I'm not sure what sequences are. Are these sectors?

Collaborator Author

Yes, they refer to sectors! I was trying to make the parameter more general and borrowed "sequence" from the CAOM field descriptions: https://mast.stsci.edu/api/v0/_c_a_o_mfields.html

Member

i guess for a user, it might be a little unclear that for TESS this is sectors, and not cameras or ccds or anything like that. so maybe some documentation or examples would help here.

snbianco (Collaborator Author), Sep 19, 2024

Added some more info and an example to the docstring in the latest commit! I'll also be updating the documentation at some point (probably next sprint) and will definitely include examples there.
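
For illustration, a call along those lines might look like this (the sector numbers are made up, and this assumes cube_cut_from_footprint is exported at the package level):

from astrocut import cube_cut_from_footprint

# Cutouts from TESS sectors 27 and 28 only; omitting `sequences` would use
# every sector that contains the cutout footprint.
cutout_files = cube_cut_from_footprint('130 30', cutout_size=50, sequences=[27, 28])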

snbianco (Collaborator Author) commented

I added unit tests for the module, but there seems to be a problem with accessing the public footprint files on S3 in the runners. From what I can find online, this is a permissions issue. The odd thing is, we have other tests that access S3 resources and work fine.

botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

snbianco (Collaborator Author) commented

I added a fixture to mock opening the footprint files with fsspec, and tests are passing now.
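
A minimal sketch of what such a fixture could look like, assuming pytest's monkeypatch and the checked-in test data file; the fixture name is hypothetical, not the PR's actual code:

import fsspec
import pytest

@pytest.fixture
def mock_footprint_open(monkeypatch):
    """Serve the local footprint JSON instead of fetching it from S3."""
    def _local_open(urlpath, *args, **kwargs):
        return open('astrocut/tests/data/tess_ffi_footprints.json', 'rb')
    monkeypatch.setattr(fsspec, 'open', _local_open)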

snbianco marked this pull request as ready for review September 26, 2024 19:32
snbianco (Collaborator Author) commented

Marking as ready as tests and documentation have been added.

falkben (Member) left a comment

There's an issue with accessing the footprint files in the bucket -- I have a ticket to work on that next week.

    load_polys: Convert the s_region column to an array of SphericalPolygon objects
    """
    # Open footprint file with fsspec
    # Use caching to help performance, but check that remote UID matches the local

Member

is this comment still valid? I'm not seeing anything about UID in the code?

Comment on lines 65 to 67
s3_cache = os.path.join(os.path.dirname(os.path.abspath(__file__)), 's3_cache')
with fsspec.open('filecache::' + s3_uri, s3={'anon': True},
                 filecache={'cache_storage': s3_cache, 'check_files': True}) as f:

Member

This will put the cache directory alongside the module's source files, which is a bit of a weird place: when working on the project, the cache ends up in the working tree.

So, at a minimum we should add an entry for this in the .gitignore file to avoid committing it to the repo.

It's also difficult for users to clean up, since once they've installed astrocut it would likely sit in a nested directory inside a virtual environment. I've seen programs put stuff like this in the current user's home directory. On UNIX there's a variable XDG_CACHE_HOME that we could use to figure out where to put it, but we'd need to support Windows and Mac as well, and each of those platforms does something else.

How long do we want the cache to live? I'm wondering if a week is too long? I'm also having a hard time finding where that is documented (in fsspec or s3fs) or how to control it.

Or maybe we should just download the cache every time someone makes their first cutout (keep it in memory) and we continue to use it until they exit?

In tesscut, we use a TTL cache for this purpose: https://cachetools.readthedocs.io/en/latest/#cachetools.TTLCache

Collaborator Author

Here is the documentation on local caching in fsspec: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally

And here is the API for CachingFileSystem where the cache options are described: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.cached.CachingFileSystem

It takes less than a second to fetch the file from S3, so an in-memory store like TTLCache is probably the way to go. This is all great to know!
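
A sketch of that in-memory approach, assuming cachetools; the TTL value and helper name are placeholders that would need tuning:

import json
from threading import Lock

import fsspec
from cachetools import TTLCache, cached

FFI_TTLCACHE = TTLCache(maxsize=10, ttl=900)  # placeholder TTL: refetch after 15 minutes

@cached(cache=FFI_TTLCACHE, lock=Lock())
def _fetch_ffis(s3_uri: str) -> dict:
    """Fetch the footprint file from S3 once and reuse it until the TTL expires."""
    with fsspec.open(s3_uri, s3={'anon': True}) as f:
        return json.load(f)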

if verbose:
    print(f'Found {len(cube_files_mapping)} matching cube files.')
base_file_path = "s3://stpubdata/tess/public/mast/" if product == 'SPOC' \
    else "s3://stpubdata/tess/public/mast/tica/"

Member

Are there some situations where someone might still want to make cutouts from a different storage path?

Could we make this an option?

One instance where someone might want a different option is if they have downloaded the cube, to make direct cutouts. But in that case, maybe they are just directly using cube_cut?

Another option might be if they've mounted stpubdata cube data onto their machine or cloud environment (w/ fuse or something else) in which case they'd rather use that path than the s3 path. E.g. TIKE cloud platform?

I do think we can probably default to these paths, though.

Or maybe we just indicate that this function is only for making cutouts from s3 files? I guess it's already in the docstring, but it's not obvious from the module name or the function name.

Collaborator Author

My thought was that users would use cube_cut in the case that they already have the path (whether local or cloud) to a single cube file. I think that providing a single file kind of defeats the purpose of the footprint lookup.

A mounted filesystem or a local path to many cube files is worth considering, but I do wonder how common that use case would be. Could we guarantee that the cube files match the footprints coming from S3?

I'm inclined to rename the function to something like s3_cube_cut_from_footprint and make a new issue to explore other options at a later time.

Member

maybe cloud_cube_cut_from_footprint though it is a bit long?

though thinking more on it, i think what you have also works. i don't think there's a need to make an issue now. we can wait until a use case comes up.

Collaborator Author

Do you think we could abbreviate and use cloud_cube_cut_from_fp? That may cause some confusion though since "fp" isn't too obvious.

Member

between the two, i think i'd prefer just leaving off cloud. i'm not a big fan of acronyms, and fp isn't obvious.

falkben (Member) commented Oct 2, 2024

Since we're having a delay in opening up the cached footprint files on S3, another approach could be to download the footprint directly from CAOM through the vo-tap interface on the first cutout.

This query gets the SPOC footprint table from CAOM and can be run from anywhere:

https://mast.stsci.edu/vo-tap/api/v0.1/caom/sync?FORMAT=json&LANG=ADQL&QUERY=SELECT+obs_id,+t_min,+t_max,+s_region,+target_name,+sequence_number+FROM+dbo.ObsPointing+WHERE+obs_collection=%27TESS%27+AND+dataproduct_type=%27image%27+AND+target_name=%27TESS+FFI%27

And this gets the TICA footprint (it takes a bit longer since it's an HLSP):

https://mast.stsci.edu/vo-tap/api/v0.1/caom/sync?FORMAT=json&LANG=ADQL&QUERY=SELECT%20obs_id,%20t_min,%20t_max,%20s_region,%20target_name,%20sequence_number%20FROM%20dbo.ObsPointing%20WHERE%20obs_collection=%27HLSP%27%20AND%20dataproduct_type=%27image%27%20AND%20target_name=%27TICA%20FFI%27

We manipulate that response in tesscut to create the footprint JSON file we store in S3 with a small bit of code, but we could add that into astrocut as well.

Taking advantage of the cached footprint file in S3 is likely better long term, but we could use this method initially for this PR.
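
A rough sketch of that interim approach using the SPOC query above; the helper name is hypothetical, and the reshaping into the footprint JSON (as tesscut does) is left out:

import requests

CAOM_TAP_SYNC = 'https://mast.stsci.edu/vo-tap/api/v0.1/caom/sync'
SPOC_QUERY = ("SELECT obs_id, t_min, t_max, s_region, target_name, sequence_number "
              "FROM dbo.ObsPointing WHERE obs_collection='TESS' "
              "AND dataproduct_type='image' AND target_name='TESS FFI'")

def fetch_spoc_footprints():
    """Pull the SPOC FFI footprints straight from CAOM on the first cutout."""
    resp = requests.get(CAOM_TAP_SYNC,
                        params={'FORMAT': 'json', 'LANG': 'ADQL', 'QUERY': SPOC_QUERY},
                        timeout=600)
    resp.raise_for_status()
    return resp.json()  # reshape into the footprint table before crossmatching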

falkben (Member) commented Oct 3, 2024

Do we want to revisit any of the mocking now that we can access the footprints?

    return np.vectorize(single_intersect)(ffi_list['polygon'], polygon)


def _ra_dec_crossmatch(all_ffis: Table, coord: SkyCoord, cutout_size, arcsec_per_px: int):

Member

I think it might be useful to provide this function (or one that gets the footprint for you as well) as an external interface. I could see this being useful for people who want more control over the cube_cut.

Could be done in a separate PR or issue if you don't want to do it here.
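
As a sketch, that external interface might look something like this (the wrapper name and footprint_uri parameter are assumptions; the underscored helpers are the ones in this PR):

from astropy.coordinates import SkyCoord

from astrocut.footprint_cutouts import (TESS_ARCSEC_PER_PX, _get_s3_ffis,
                                        _ra_dec_crossmatch)

def get_matching_ffis(coordinates, cutout_size, footprint_uri):
    """Return the FFI footprint rows whose s_regions intersect the requested cutout."""
    coord = SkyCoord(coordinates, unit='deg')
    all_ffis = _get_s3_ffis(footprint_uri, as_table=True, load_polys=True)
    return _ra_dec_crossmatch(all_ffis, coord, cutout_size, TESS_ARCSEC_PER_PX)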


from .utils.utils import parse_size_input

TESS_ARCSEC_PER_PX = 21 # Number of arcseconds per pixel in a TESS image

Member

we may want to make an issue or note to come back to this and generalize this somehow so it could be used for other missions. i think it could wait for now though, as I think TESS is the only mission we'd do this for.
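
One possible shape for that generalization, sketched purely as an assumption: a per-mission lookup replacing the module-level constant.

ARCSEC_PER_PX = {
    'TESS': 21,  # number of arcseconds per pixel in a TESS image
}

def _get_arcsec_per_px(mission='TESS'):
    """Look up the pixel scale for a mission, failing loudly for unknown ones."""
    try:
        return ARCSEC_PER_PX[mission]
    except KeyError:
        raise ValueError(f'No pixel scale registered for mission {mission!r}')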


def _s_region_to_polygon(s_region: Column):
    """
    Takes in a s_region string of type POLYGON or CIRCLE and returns it as

Member

i don't think this docstring is correct -- this function as currently written only supports POLYGON, not CIRCLE
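
For reference, a sketch of what handling both shapes could look like for a single s_region string; the parsing details, including the optional frame token (e.g. ICRS), are assumptions:

from spherical_geometry.polygon import SphericalPolygon

def _s_region_to_polygon_single(s_region):
    """Convert one s_region string (POLYGON or CIRCLE) to a SphericalPolygon."""
    parts = s_region.strip().split()
    shape = parts[0].upper()
    # Drop a possible frame token (e.g. 'ICRS') before parsing the numbers
    values = [float(v) for v in parts[1:] if not v.isalpha()]
    if shape == 'POLYGON':
        return SphericalPolygon.from_radec(values[0::2], values[1::2])
    if shape == 'CIRCLE':
        ra, dec, radius = values
        return SphericalPolygon.from_cone(ra, dec, radius)
    raise ValueError(f'Unsupported s_region type: {shape}')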

Comment on lines 92 to 112
@cached(cache=FFI_TTLCACHE, lock=Lock())
def _get_s3_ffis(s3_uri, as_table: bool = False, load_polys: bool = False):
    """
    Fetch the S3 footprint file containing a dict of all FFIs and a polygon column
    that holds the s_regions as polygon points and vectors.

    Optional Parameters:
    as_table: Return the footprint file as an Astropy Table
    load_polys: Convert the s_region column to an array of SphericalPolygon objects
    """
    # Open footprint file with fsspec
    with fsspec.open(s3_uri, s3={'anon': True}) as f:
        ffis = json.load(f)

    if load_polys:
        ffis['polygon'] = _s_region_to_polygon(ffis['s_region'])

    if as_table:
        ffis = Table(ffis)

    return ffis

Member

Should this function be removed, for now?

Collaborator Author

I figured that we could leave it in since we'll need it shortly, but it's true that we don't want users trying to call it while the bucket isn't accessible. I'll remove it for now.

falkben (Member) commented Oct 3, 2024

I was looking for the coverage results.

Doesn't need to be handled in this PR, but it looks like there's a problem with the codecov upload in the GitHub action. From the Python 3.10 with numpy 1.23 and full coverage job:

[2024-10-03T16:46:10.843Z] ['info'] => Project root located at: /home/runner/work/astrocut/astrocut
[2024-10-03T16:46:10.846Z] ['info'] -> No token specified or token is empty
[2024-10-03T16:46:10.936Z] ['info'] Searching for coverage files...
[2024-10-03T16:46:10.973Z] ['info'] => Found 1 possible coverage files:
  ./coverage.xml
[2024-10-03T16:46:10.973Z] ['info'] Processing ./coverage.xml...
[2024-10-03T16:46:10.976Z] ['info'] Detected GitHub Actions as the CI provider.
[2024-10-03T16:46:11.306Z] ['info'] Pinging Codecov: https://codecov.io/upload/v4?package=github-action-2.1.0-uploader-0.8.0&token=*******&branch=footprint-finder&build=11166073335&build_url=https%3A%2F%2Fgithub.com%2Fspacetelescope%2Fastrocut%2Factions%2Fruns%2F11166073335&commit=9552cbb9116481980b67cbea56921b10ebb327db&job=CI&pr=127&service=github-actions&slug=spacetelescope%2Fastrocut&name=&tag=&flags=&parent=
[2024-10-03T16:46:11.478Z] ['error'] There was an error running the uploader: Error uploading to [https://codecov.io:](https://codecov.io/) Error: There was an error fetching the storage URL during POST: 429 - {"message":"Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected time to availability: 366s."}

[2024-10-03T16:46:11.479Z] ['info'] Codecov will exit with status code 0. If you are expecting a non-zero exit code, please pass in the `-Z` flag

Anyways, coverage looks pretty good:

                                Stmts   Miss  Cover
---------------------------------------------------
astrocut/__init__.py               14      1    93%
astrocut/asdf_cutouts.py           82      3    96%
astrocut/cube_cut.py              388      4    99%
astrocut/cutout_processing.py     247     13    95%
astrocut/cutouts.py               244     26    89%
astrocut/exceptions.py             11      0   100%
astrocut/footprint_cutouts.py     139     11    92%
astrocut/make_cube.py             427     26    94%
astrocut/utils/__init__.py          0      0   100%
astrocut/utils/utils.py            86      4    95%
astrocut/utils/wcs_fitting.py      31      5    84%
---------------------------------------------------
                                 1669     93    94%

snbianco (Collaborator Author) commented Oct 3, 2024

Made an issue here: https://jira.stsci.edu/browse/ASB-29119
