Add footprint finder code #127
base: main
Conversation
My initial reaction is to say let's set up a single
astrocut/cube_cut_from_footprint.py
Outdated
# Generate cutout from each cube file
cutout_files = []
if threads == 'auto' or threads > 1:
I am not sure threads will help here, and it may in fact hurt. cube_cut already uses threads for accessing the S3 data, so with this change, each cutout file would spawn that many threads. In my testing, there are diminishing returns after 8 threads. Since this could end up creating many times that number of threads, I expect we'd see thread contention here.
If you set threads to 0, versus setting threads to 'auto' or 8, what are the results here?
Testing on my machine, I do see a performance improvement with a larger number of threads. It's more apparent when a sector isn't provided and more than 1 cutout is being generated. For example, these commands each generate 7 cutouts.
cube_cut_from_footprint('130 30', cutout_size=50, threads=0)      --> 1 min, 23.4 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads=8)      --> 57.6 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads='auto') --> 46.4 sec
Is this before or after you made the change to use the same threads variable to pass into the cube_cut function?
Looks like I ran the test before, when threads was set to 'auto' for cube_cut. When using the same threads variable, the call with threads=0 takes a lot longer, which makes sense. I also see that there is less than a second difference between threads=8 and threads='auto'.
Maybe it would be best to keep threads for cube_cut constant at some value, like 'auto' or 8? I think that using threads in cube_cut_from_footprint is still worthwhile for the performance improvement when making several cutouts at once, but performance does seem to stagnate after a certain point.
I'm also thinking that the default value for threads in cube_cut_from_footprint should be set to 8 rather than 1, since performance is consistently better.
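The trade-off being debated here, an outer pool that fans out across cube files while each inner call keeps its own fixed thread count for S3 reads, could be sketched roughly as follows. Note this is a hypothetical illustration: `generate_cutouts` and `make_cutout` are placeholder names, with `make_cutout` standing in for cube_cut.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_cutouts(cube_files, make_cutout, threads=8):
    # Outer pool: `threads` maps to max_workers across cube files.
    # Each make_cutout call (standing in for cube_cut) would keep its
    # own fixed internal thread count, e.g. 8, for S3 reads, so the
    # total thread count is roughly threads * 8.
    if threads == 'auto' or (isinstance(threads, int) and threads > 1):
        max_workers = None if threads == 'auto' else threads
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(make_cutout, cube_files))
    # threads of 0 or 1 falls back to a plain serial loop
    return [make_cutout(f) for f in cube_files]
```

With `threads='auto'`, `max_workers=None` lets ThreadPoolExecutor pick its own CPU-count-based default, which matches the observed behavior that 'auto' and 8 perform similarly.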
astrocut/cube_cut_from_footprint.py
Outdated
sequences : int, List[int], optional
    Default None. Sequence(s) from which to generate cutouts. Can provide a single
    sequence number as an int or a list of sequence numbers. If not specified,
    cutouts will be generated from all sequences that contain the cutout.
Maybe you were trying to generalize this but I'm not sure what sequences are. Are these sectors?
Yes, they refer to sectors! I was trying to make the parameter more general and borrowed "sequence" from the CAOM field descriptions: https://mast.stsci.edu/api/v0/_c_a_o_mfields.html
i guess for a user, it might be a little unclear that for TESS this is sectors, and not cameras or ccds or anything like that. so maybe some documentation or examples would help here.
Added some more info and an example to the docstring in the latest commit! I'll also be updating the documentation at some point (probably next sprint) and will definitely include examples there.
b842116 to 9e51987
I added unit tests for the module, but there seems to be a problem with accessing the public footprint files on S3 in the runners. From what I can find online, this is a permissions issue. The odd thing is, we have other tests that access S3 resources and work fine.
fix type annotation
Use json.load() documentation, set threads to 8 for cube_cut
9e51987 to 735112e
I added a fixture to mock opening the footprint files with
Marking as ready as tests and documentation have been added.
There's an issue with accessing the footprint files in the bucket -- I have a ticket to work on that next week.
astrocut/footprint_cutouts.py
Outdated
load_polys: Convert the s_region column to an array of SphericalPolygon objects
"""
# Open footprint file with fsspec
# Use caching to help performance, but check that remote UID matches the local
is this comment still valid? I'm not seeing anything about UID in the code?
astrocut/footprint_cutouts.py
Outdated
s3_cache = os.path.join(os.path.dirname(os.path.abspath(__file__)), 's3_cache')
with fsspec.open('filecache::' + s3_uri, s3={'anon': True},
                 filecache={'cache_storage': s3_cache, 'check_files': True}) as f:
This will put the cache directory in the current directory of the module. This is a little bit of a weird place because when working on the project, it puts it into the current directory.
So, at a minimum, we should add an entry for this in the .gitignore file to avoid committing it to the repo.
It's also difficult for users to clean up, since it would likely be inside a nested directory inside a virtual environment once they've installed astrocut. I've seen programs put stuff like this in the current user's home directory. On UNIX there's a variable XDG_CACHE_HOME that we could use to figure out where to put it. But we'd need to support Windows and Mac as well, and each of those platforms does something else.
How long do we want the cache to live? I'm wondering if a week is too long? I'm also having a hard time finding where that is documented (in fsspec or s3fs) or how to control it.
Or maybe we should just download the cache every time someone makes their first cutout (keep it in memory) and continue to use it until they exit?
In tesscut, we use a TTL cache for this purpose: https://cachetools.readthedocs.io/en/latest/#cachetools.TTLCache
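A cross-platform cache location along the lines of the XDG_CACHE_HOME suggestion could be resolved with a small helper. This is a sketch, not astrocut code: the helper name is hypothetical, and the Windows/Mac fallback paths are common conventions rather than anything specified in this PR.

```python
import os
import sys
from pathlib import Path


def default_cache_dir(app_name='astrocut'):
    # Hypothetical helper: pick a per-platform user cache directory
    # instead of writing next to the installed module.
    if sys.platform.startswith('win'):
        # Windows convention: %LOCALAPPDATA%
        base = os.environ.get('LOCALAPPDATA', str(Path.home() / 'AppData' / 'Local'))
    elif sys.platform == 'darwin':
        # macOS convention: ~/Library/Caches
        base = str(Path.home() / 'Library' / 'Caches')
    else:
        # XDG spec: honor XDG_CACHE_HOME, fall back to ~/.cache
        base = os.environ.get('XDG_CACHE_HOME', str(Path.home() / '.cache'))
    return Path(base) / app_name
```

The resulting directory could then be passed as `cache_storage` to fsspec's filecache options instead of the module-relative `s3_cache` path.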
Here is the documentation on local caching in fsspec: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
And here is the API for CachingFileSystem where the cache options are described: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.cached.CachingFileSystem
It takes less than a second to fetch the file from S3, so an in-memory store like TTLCache is probably the way to go. This is all great to know!
if verbose:
    print(f'Found {len(cube_files_mapping)} matching cube files.')
base_file_path = "s3://stpubdata/tess/public/mast/" if product == 'SPOC' \
    else "s3://stpubdata/tess/public/mast/tica/"
Are there some situations where someone might still want to make cutouts from a different storage path? Could we make this an option?
One instance where someone might want a different option is if they have downloaded the cube, to make direct cutouts. But in that case, maybe they are just directly using cube_cut?
Another option might be if they've mounted stpubdata cube data onto their machine or cloud environment (with fuse or something else), in which case they'd rather use that path than the s3 path. E.g. the TIKE cloud platform?
I do think we can probably default to these paths, though.
Or maybe we just indicate that this function is only for making cutouts from s3 files? I guess it's already in the docstring, but it's not obvious from the module name or the function name.
My thought was that users would use cube_cut in the case that they already have the path (whether local or cloud) to a single cube file. I think that providing a single file kind of defeats the purpose of the footprint lookup.
A mounted filesystem or a local path to many cube files is worth considering, but I do wonder how common that use case would be. Could we guarantee that the cube files match the footprints coming from S3?
I'm inclined to rename the function to something like s3_cube_cut_from_footprint and make a new issue to explore other options at a later time.
maybe cloud_cube_cut_from_footprint, though it is a bit long?
though thinking more on it, i think what you have also works. i don't think there's a need to make an issue now. we can wait until a use case comes up.
Do you think we could abbreviate and use cloud_cube_cut_from_fp? That may cause some confusion though since "fp" isn't too obvious.
between the two, i think i'd prefer just leaving off cloud. i'm not a big fan of acronyms, and fp isn't obvious.
Since we're having a delay in opening up the cached footprint files on S3, another approach could be to download the footprint directly from CAOM through the VO-TAP interface on the first cutout. This query gets the SPOC footprint table from CAOM and can be run from anywhere:
And this gets the TICA footprint (takes a bit longer since it's an HLSP):
We manipulate that response in tesscut to create the footprint JSON file we store in S3 with a small bit of code, but we could add that into astrocut as well. Taking advantage of the cached footprint file in S3 is likely better long term, but we could use this method initially, for this PR.
Do we want to revisit any of the mocking now that we can access the footprints?
astrocut/footprint_cutouts.py
Outdated
return np.vectorize(single_intersect)(ffi_list['polygon'], polygon)

def _ra_dec_crossmatch(all_ffis: Table, coord: SkyCoord, cutout_size, arcsec_per_px: int):
I think it might be useful to provide this function (or one that gets the footprint for you as well) as an external interface. I could see this being useful for people who want more control over the cube_cut.
Could be done in a separate PR or issue if you don't want to do it here.
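The np.vectorize pattern in the snippet above can be shown self-contained. In this sketch, integer bitmasks stand in for the SphericalPolygon objects, so it illustrates the broadcasting shape of the crossmatch rather than real spherical geometry:

```python
import numpy as np


def _intersects(footprint, target):
    # Placeholder predicate: nonzero overlap of bitmask "footprints".
    # The real code would call a SphericalPolygon intersection test.
    return bool(footprint & target)


def crossmatch(footprints, target):
    # Broadcast the scalar predicate over a column of footprint objects,
    # returning a boolean mask of footprints that overlap `target`.
    return np.vectorize(_intersects)(footprints, target)


mask = crossmatch(np.array([0b001, 0b010, 0b110]), 0b100)
# mask.tolist() → [False, False, True]
```

The returned mask can then be used to select the matching FFI rows, mirroring how `_ra_dec_crossmatch` filters the footprint table.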
from .utils.utils import parse_size_input

TESS_ARCSEC_PER_PX = 21  # Number of arcseconds per pixel in a TESS image
we may want to make an issue or note to come back to this and generalize this somehow so it could be used for other missions. i think it could wait for now though, as I think TESS is the only mission we'd do this for.
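One possible direction for that generalization is simply threading the plate scale through as a parameter, so another mission could pass its own value. `size_to_pixels` is a hypothetical helper for illustration, not existing astrocut code:

```python
import math

TESS_ARCSEC_PER_PX = 21  # TESS plate scale, as in the module constant above


def size_to_pixels(size_arcsec, arcsec_per_px=TESS_ARCSEC_PER_PX):
    # Convert an angular cutout size to a pixel count, rounding up so
    # the requested area is fully covered. Defaulting the plate scale
    # keeps TESS behavior while leaving room for other missions.
    return math.ceil(size_arcsec / arcsec_per_px)
```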
astrocut/footprint_cutouts.py
Outdated
def _s_region_to_polygon(s_region: Column):
    """
    Takes in a s_region string of type POLYGON or CIRCLE and returns it as
i don't think this docstring is correct -- this function as currently written only supports POLYGON, not CIRCLE
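A self-contained sketch of the POLYGON-only parsing the docstring should actually describe. `s_region_to_points` is hypothetical; the real `_s_region_to_polygon` builds SphericalPolygon objects rather than plain coordinate arrays, and the example coordinates are made up:

```python
import numpy as np


def s_region_to_points(s_region):
    # Parse an STC-S string like
    # "POLYGON 282.2 -71.5 284.1 -70.2 286.0 -71.9"
    # into an (N, 2) array of (ra, dec) vertices. CIRCLE is rejected
    # explicitly rather than silently mishandled.
    parts = s_region.strip().split()
    region_type = parts[0].upper()
    if region_type != 'POLYGON':
        raise ValueError(f'Unsupported s_region type: {region_type}')
    return np.array(parts[1:], dtype=float).reshape(-1, 2)
```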
astrocut/footprint_cutouts.py
Outdated
@cached(cache=FFI_TTLCACHE, lock=Lock())
def _get_s3_ffis(s3_uri, as_table: bool = False, load_polys: bool = False):
    """
    Fetch the S3 footprint file containing a dict of all FFIs and a polygon column
    that holds the s_regions as polygon points and vectors.

    Optional Parameters:
    as_table: Return the footprint file as an Astropy Table
    load_polys: Convert the s_region column to an array of SphericalPolygon objects
    """
    # Open footprint file with fsspec
    with fsspec.open(s3_uri, s3={'anon': True}) as f:
        ffis = json.load(f)

    if load_polys:
        ffis['polygon'] = _s_region_to_polygon(ffis['s_region'])

    if as_table:
        ffis = Table(ffis)

    return ffis
Should this function be removed, for now?
I figured that we could leave it in since we'll need it shortly, but it's true that we don't want users trying to call it while the bucket isn't accessible. I'll remove it for now.
I was looking for the coverage results. Doesn't need to be handled in this PR, but it looks like there's a problem with the codecov upload in the GitHub action. From the Python 3.10 with numpy 1.23 and full coverage job:
[2024-10-03T16:46:10.843Z] ['info'] => Project root located at: /home/runner/work/astrocut/astrocut
[2024-10-03T16:46:10.846Z] ['info'] -> No token specified or token is empty
[2024-10-03T16:46:10.936Z] ['info'] Searching for coverage files...
[2024-10-03T16:46:10.973Z] ['info'] => Found 1 possible coverage files:
  ./coverage.xml
[2024-10-03T16:46:10.973Z] ['info'] Processing ./coverage.xml...
[2024-10-03T16:46:10.976Z] ['info'] Detected GitHub Actions as the CI provider.
[2024-10-03T16:46:11.306Z] ['info'] Pinging Codecov: https://codecov.io/upload/v4?package=github-action-2.1.0-uploader-0.8.0&token=*******&branch=footprint-finder&build=11166073335&build_url=https%3A%2F%2Fgithub.com%2Fspacetelescope%2Fastrocut%2Factions%2Fruns%2F11166073335&commit=9552cbb9116481980b67cbea56921b10ebb327db&job=CI&pr=127&service=github-actions&slug=spacetelescope%2Fastrocut&name=&tag=&flags=&parent=
[2024-10-03T16:46:11.478Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io/: Error: There was an error fetching the storage URL during POST: 429 - {"message":"Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected time to availability: 366s."}
[2024-10-03T16:46:11.479Z] ['info'] Codecov will exit with status code 0. If you are expecting a non-zero exit code, please pass in the `-Z` flag
Anyways, coverage looks pretty good:
Made an issue here: https://jira.stsci.edu/browse/ASB-29119
This is a draft PR that adds the footprint finder code from TESScut. Leaving as draft since I haven't added tests yet, but I'd welcome any feedback on the actual implementation.
I tried to make things as general as possible in terms of functions and variable names, but some things are more TESS specific and will need to be pulled out into a wrapper later when we generalize. Namely, _extract_sequence_information, _create_sequence_list, and _get_cube_files_from_sequence_obs are more TESS-specific as of now. The same is true about certain parts of cube_cut_from_footprint, mainly variables.
Something I was unsure about is how to best handle multiprocessing. The cube_cut_from_footprint function takes a threads parameter to use as max_workers when using cube_cut on the gathered cube files. However, the cube_cut function also takes its own threads parameter. Should these be two separate parameters to cube_cut_from_footprint, or the same? Should threads in cube_cut just be set to 'auto'?