Multi-bbox timeseries querying #2408

Draft
wants to merge 2 commits into
base: main

sfalkena
Contributor

@sfalkena sfalkena commented Nov 13, 2024

This PR is part of TorchGeo time series support (#2382 - Return time series). The goal is to provide an interface that supports the following ways of querying:

sample = dataset[bbox]
sample = dataset[[bbox1, bbox2, bbox3]]
sample = dataset[([dataset1_bbox1], [dataset2_bbox1, dataset2_bbox2])]

The idea is that anything within a single bbox gets merged into a single raster, and that any query with multiple bboxes is stacked along the time dimension. Querying with a tuple of (iterable) bboxes would split the subqueries across different datasets.
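A minimal sketch of how the three query forms could be told apart, not the code in this PR. `BBox` is a hypothetical stand-in for TorchGeo's `BoundingBox`, and `classify_query` is an illustrative helper name. The bbox check has to come first here, because a named tuple is itself a tuple and would otherwise be mistaken for the third form.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical stand-in for a spatiotemporal bounding box."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


def classify_query(query):
    """Return (kind, normalized_query) for the three query forms.

    kind is 'single', 'multi' or 'per_dataset'.
    """
    # Check for a bbox first: a NamedTuple is also an instance of tuple.
    if isinstance(query, BBox):
        return "single", [query]
    # A non-empty list of bboxes: one merged raster per bbox, stacked over time.
    if isinstance(query, list) and query and all(isinstance(b, BBox) for b in query):
        return "multi", list(query)
    # A tuple of lists of bboxes: one sub-query per dataset.
    if isinstance(query, tuple) and query and all(
        isinstance(sub, list) and all(isinstance(b, BBox) for b in sub)
        for sub in query
    ):
        return "per_dataset", [list(sub) for sub in query]
    raise TypeError(f"unsupported query type: {type(query).__name__}")
```

A dataset that wraps a single source could call such a helper in `__getitem__` and raise an error when the result is `'per_dataset'`.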

As long as this PR is in draft mode, I will keep test.ipynb for anyone interested in helping to try out the new method.

Current status: The first two ways of querying have been implemented for the RasterDataset and tested with a small example.

What has changed:

  • I've refactored quite a bit to reduce the complexity of the __init__ and __getitem__ methods. try_set_metadata, _get_bounds, _compile_and_check_filename_regex and _populate_index are examples of this.
  • Existing functionality to merge everything within a bbox has been moved into __merge_single_bbox, where the biggest difference is that I keep track of a dataframe for all regex metadata instead of lists and dicts. The main reason is that this makes it easy to group filepaths per band, as well as to keep track of which dates went into which merged raster. More about that later.
  • Multi-bbox queries are stacked along a new dimension (t, c, h, w) by __merge_query. We need to agree on whether non-temporal datasets should also get a time dimension of size 1.
  • Apart from the tensor with time series imagery, I am also returning sample['dates'] in datetime format. This is a list of the dates that went into each bbox, so in the multi-bbox case it will be [[bbox1_t1, bbox1_t2, ...], [bbox2_t1]]. I chose datetime format for now since that worked well for me in practice, but transforms can convert it to any other format. Instead of a list of dates we could maybe return a date range or something similar, but that mostly depends on the downstream use.
  • The filename_glob has been relaxed, so that all files/bands end up in the dataset index.
  • A new class variable nodata_value has been added, and drop_nodata has been added to the init of the class. Setting drop_nodata to True will ignore any merged raster that contains nodata values. This came from using the class in practice with Sentinel-2 and seeing that some timestamps contained (partly) black imagery, since some Sentinel tiles are not square. In theory, this could become a separate PR, but I chose to add it here because the effect of nodata pixels becomes more pronounced with time series.
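To make the stacking and nodata behaviour above concrete, here is a rough sketch using NumPy arrays in place of tensors. It is not the implementation in this PR: `stack_merged` is a hypothetical helper, the `'image'` key and the nodata value of 0 are assumptions, and only `sample['dates']`, `nodata_value` and `drop_nodata` are names taken from the description.

```python
from datetime import datetime

import numpy as np

NODATA_VALUE = 0  # assumed value; the PR exposes this as a class variable


def stack_merged(merged, dates, drop_nodata=False, nodata_value=NODATA_VALUE):
    """Stack per-bbox rasters of shape (c, h, w) into one (t, c, h, w) array.

    merged: list of arrays, one merged raster per bbox
    dates:  list of lists of datetimes, one inner list per bbox
    """
    kept_images, kept_dates = [], []
    for image, bbox_dates in zip(merged, dates):
        # With drop_nodata set, skip any merged raster containing nodata pixels.
        if drop_nodata and np.any(image == nodata_value):
            continue
        kept_images.append(image)
        kept_dates.append(bbox_dates)
    if not kept_images:
        raise ValueError("every merged raster contained nodata values")
    # The new leading axis is the time dimension.
    return {"image": np.stack(kept_images, axis=0), "dates": kept_dates}


# Two bboxes; the second merged raster has one nodata pixel.
full = np.ones((3, 4, 4))
partial = np.ones((3, 4, 4))
partial[0, 0, 0] = NODATA_VALUE
all_dates = [[datetime(2024, 1, 1), datetime(2024, 1, 2)], [datetime(2024, 2, 1)]]

sample = stack_merged([full, partial], all_dates, drop_nodata=True)
```

With drop_nodata=True the second raster is skipped, so the image has a time dimension of 1 and the dates list has one entry; with drop_nodata=False both rasters are kept.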

What is still left to do:

  • Implement method 3, query with tuple (or raise an error if trying to index a "single" dataset with a tuple).
  • Implement/copy querying strategy to other datasets.
  • Pass all pre-commit checks.

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Nov 13, 2024
@adamjstewart
Collaborator

The filename_glob has been relaxed, so that all files/bands end up in the dataset index.

Why?

@adamjstewart adamjstewart added this to the 0.7.0 milestone Nov 13, 2024