Multi-bbox timeseries querying #2408

sfalkena · 2024-11-13T13:54:00Z

This is a PR that is part of TorchGeo Timeseries support (#2382 - Return time series). The goal is to provide an interface that allows for the following options for querying:

sample = dataset[bbox]
sample = dataset[[bbox1, bbox2, bbox3]]
sample = dataset[([dataset1_bbox1], [dataset2_bbox1, dataset2_bbox2])]

The idea is that anything within a single bbox gets merged into a single raster, and that any query with multiple bboxes will be stacked along the time dimension. Querying with a tuple of (iterable) bboxes would split the subqueries across different datasets.
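To make the three query forms concrete, here is a minimal sketch of how they could be normalized into one shape (a list of bbox lists, one per dataset) before any merging happens. `BBox` and `normalize_query` are hypothetical names used only for illustration; they are not taken from this PR or from TorchGeo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical stand-in for a spatiotemporal bounding box."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


def normalize_query(query):
    """Return one list of bboxes per dataset (illustrative helper, not the PR's code).

    - a single bbox     -> one dataset, one time step
    - a list of bboxes  -> one dataset, one time step per bbox
    - a tuple of lists  -> one list of bboxes per dataset
    """
    if isinstance(query, BBox):
        return [[query]]
    if isinstance(query, list):
        return [list(query)]
    if isinstance(query, tuple):
        return [list(sub) for sub in query]
    raise TypeError(f"Unsupported query type: {type(query).__name__}")


bbox1 = BBox(0, 10, 0, 10, 0, 1)
bbox2 = BBox(0, 10, 0, 10, 1, 2)
per_dataset = normalize_query(([bbox1], [bbox1, bbox2]))
# per_dataset[0] holds the bboxes for the first dataset,
# per_dataset[1] those for the second.
```

With a normalized shape like this, the merge step only has to handle one case: each inner bbox is merged into a single raster, and each inner list is stacked along time.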

As long as this PR is in draft mode, I will keep test.ipynb for anyone interested in helping to try out the new method.

Current status: The first two ways of querying have been implemented for the RasterDataset and tested with a small example.

What has changed:

I've refactored quite a bit to reduce the complexity of the __init__ and __getitem__ methods. try_set_metadata, _get_bounds, _compile_and_check_filename_regex and _populate_index are examples of this refactoring.
Existing functionality to merge everything within a bbox has been moved into __merge_single_bbox, where the biggest difference is that I keep track of a dataframe for all regex metadata instead of lists and dicts. The main reason is that it makes it easy to group filepaths per band, as well as to keep track of which dates went into which merged raster. More about that later.
Multi-bbox queries are stacked along a new time dimension, giving (t, c, h, w), by __merge_query. We still need to agree on whether non-temporal datasets should also get a time dimension of size 1.
Apart from the tensor with timeseries imagery, I am returning sample['dates'] in datetime format too. This is a list of dates that went into every single bbox. So in the multi-bbox case this will be [[bbox1_t1, bbox1_t2, ...], [bbox2_t1]]. I chose datetime format for now since that worked well for me in practice, but this can be converted to any format by the transforms. I was thinking that instead of a list of dates, maybe we could return a daterange or something, but that mostly depends on the downstream use.
The filename_glob has been relaxed, so that all files/bands end up in the dataset index.
A new class variable nodata_value has been added, and drop_nodata has been added to the init of the class. Setting drop_nodata to True will drop any merged raster that contains nodata values. This came from using the class in practice with Sentinel-2 and seeing that some timestamps contained black (parts of) imagery, since some Sentinel tiles are not square. In theory, this could become a separate PR, but I chose to add it here because the effect of nodata pixels becomes more pronounced with timeseries.
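As a rough illustration of the stacking, dates and drop_nodata behaviour described above, here is a dependency-free sketch. It uses nested lists in place of tensors, and `contains_nodata` and `stack_time_steps` are hypothetical names, not the methods in this PR. It assumes each bbox has already been merged into one raster paired with the dates that went into it.

```python
from datetime import datetime


def contains_nodata(raster, nodata_value):
    """Return True if any pixel of a [channel][row][col] raster equals nodata_value."""
    return any(
        pixel == nodata_value for band in raster for row in band for pixel in row
    )


def stack_time_steps(merged, nodata_value=0, drop_nodata=False):
    """Stack per-bbox merged rasters along a new leading time dimension.

    `merged` is a list of (raster, dates) pairs, one per bbox, where `raster`
    is a nested list indexed as [channel][row][col] and `dates` lists the
    acquisition dates merged into it. Illustrative only.
    """
    images, dates = [], []
    for raster, raster_dates in merged:
        # Skip time steps that contain nodata pixels when requested.
        if drop_nodata and contains_nodata(raster, nodata_value):
            continue
        images.append(raster)
        dates.append(list(raster_dates))
    # "image" is indexed as [t][c][h][w]; "dates" has one list per kept time step.
    return {"image": images, "dates": dates}


step1 = ([[[1, 2], [3, 4]]], [datetime(2024, 1, 1), datetime(2024, 1, 2)])
step2 = ([[[0, 5], [6, 7]]], [datetime(2024, 2, 1)])  # contains a nodata pixel
sample = stack_time_steps([step1, step2], nodata_value=0, drop_nodata=True)
```

With drop_nodata enabled, the second time step is skipped here, so the entries in `sample["dates"]` stay aligned with the time steps that remain in `sample["image"]`.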

What is still left to do:

Implement method 3, query with tuple (or raise an error if trying to index a "single" dataset with a tuple).
Implement/copy querying strategy to other datasets.
Pass all pre-commit checks.

adamjstewart · 2024-11-13T14:37:14Z

The filename_glob has been relaxed, so that all files/bands end up in the dataset index.

Why?

Initial code on multi-bbox ts querying

57a5da5

github-actions bot added the datasets Geospatial or benchmark datasets label Nov 13, 2024

Add plotting to notebook

f26abac

adamjstewart added this to the 0.7.0 milestone Nov 13, 2024
