Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Digital typhoon dataset #1748

Merged
merged 61 commits into from
Aug 29, 2024
Merged

Conversation

nilsleh
Copy link
Collaborator

@nilsleh nilsleh commented Nov 30, 2023

This PR adds the Digital Typhoon Dataset.

The implementation allows the following features:

  • create an input sequence of single channel images concatenated along channel dimension for nowcasting task (predicting label of last image in that sequence)
  • filter samples by min or max feature values
  • datamodule that lets you split by storm id (disjoint sets over the time domain) or over the time domain (disjoint set of storm ids)

TODO:

  • Target Normalization for regression task

Sample Image:

@nilsleh nilsleh marked this pull request as draft November 30, 2023 21:48
@github-actions github-actions bot added datasets Geospatial or benchmark datasets datamodules PyTorch Lightning datamodules labels Nov 30, 2023
@adamjstewart adamjstewart added this to the 0.6.0 milestone Nov 30, 2023
@github-actions github-actions bot added the testing Continuous integration testing label Dec 1, 2023
@calebrob6
Copy link
Member

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

@nilsleh
Copy link
Collaborator Author

nilsleh commented Dec 2, 2023

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

Stay tuned:)

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 2, 2023
@nilsleh nilsleh marked this pull request as ready for review December 2, 2023 14:18
@nilsleh nilsleh added this to the 0.6.0 milestone Aug 20, 2024
docs/api/non_geo_datasets.csv Outdated Show resolved Hide resolved
tests/data/digital_typhoon/data.py Show resolved Hide resolved
tests/datamodules/test_digital_typhoon.py Show resolved Hide resolved
tests/datasets/test_digital_typhoon.py Outdated Show resolved Hide resolved
tests/datasets/test_digital_typhoon.py Outdated Show resolved Hide resolved
torchgeo/datasets/digital_typhoon.py Show resolved Hide resolved
torchgeo/datasets/digital_typhoon.py Outdated Show resolved Hide resolved
)
)

# torchgeo expects a single label
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fix this if needed. I'm trying to add support for multi-label classification in #2219

# tensor with added channel dimension
tensor = torch.from_numpy(h5f['Infrared'][:]).unsqueeze(0)

# follow normalization procedure
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this should be in the datamodule instead. Possibly also true for other stuff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has to be part of the dataset because the normalization values can change. In the datamodule it will become quiet messy, and overall I think it's much nicer if normalization happens in the dataset and any augmentations in the datamodule.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only complaint is that the user can't perform their own normalization, or if they do it is now duplicated. Definitely something worth discussing more broadly though (not just in your PR, for all datasets).

name: torch.tensor(feature_df[name].item()).float()
for name in self.features
}
# normalize the targets for regression
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good now. Only remaining comments worth addressing before we merge are:

  1. list -> tuple
  2. Dataset name

2 in particular is important to avoid API changes in the future. All other comments can be changed later.

@adamjstewart
Copy link
Collaborator

adamjstewart commented Aug 28, 2024

Inspired by this rename I wasted invested 2 hrs of my life writing https://github.com/adamjstewart/dotfiles/blob/master/bin/git-sed

adamjstewart
adamjstewart previously approved these changes Aug 28, 2024
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to get minimum tests passing now.

@github-actions github-actions bot added the dependencies Packaging and dependencies label Aug 28, 2024
@github-actions github-actions bot removed the dependencies Packaging and dependencies label Aug 29, 2024
@adamjstewart adamjstewart merged commit b9a09f5 into microsoft:main Aug 29, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants