
Geosampler prechipping #2300

Open: wants to merge 34 commits into main
Conversation

sfalkena
Contributor

I think it makes sense to be able to interact with the chips of each sampler. I therefore propose moving chip creation into the __init__ of each sampler and keeping track of the chips in a GeoDataFrame. This has the following benefits:

  • The chips can be saved to file and visualized, either in GIS software or while developing.
  • The chips can be filtered with other geometry, to remove chips that are not of interest. This is particularly useful during inference to limit the number of chips. Example: to run a building detection model, I can instantiate a grid sampler, filter my chips with a shapefile of urban areas, and run inference only on the intersection of my chips with the urban areas. See the function filter_chips.
  • The samples can be split. I have added a function that splits the samples across multiple workers. See the function set_worker_split.
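The filtering idea in the second bullet can be sketched without any GIS dependencies. This is a hypothetical pure-Python stand-in for the PR's GeoDataFrame-based filter_chips, representing chips as (minx, miny, maxx, maxy) tuples:

```python
def boxes_intersect(a, b):
    """Axis-aligned bounding-box intersection test for (minx, miny, maxx, maxy)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def filter_chips(chips, aoi):
    """Keep only the chips that intersect the area of interest."""
    return [chip for chip in chips if boxes_intersect(chip, aoi)]

chips = [(0, 0, 10, 10), (20, 20, 30, 30), (5, 5, 15, 15)]
urban_area = (0, 0, 12, 12)  # hypothetical urban-area bounding box
print(filter_chips(chips, urban_area))  # [(0, 0, 10, 10), (5, 5, 15, 15)]
```

In the real proposal the same operation would be a spatial join or clip between the chip GeoDataFrame and an arbitrary polygon layer rather than a box test.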

An abstract method needs to be implemented for every sampler:

    @abc.abstractmethod
    def get_chips(self) -> GeoDataFrame:
        """Determine the chips (sample windows) for this sampler."""
        ...

Let me know what you think!
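For illustration, a concrete get_chips for a grid sampler might enumerate fixed-size windows over the dataset bounds. This sketch returns plain tuples instead of a GeoDataFrame, and all names and parameters are hypothetical:

```python
def get_chips(bounds, size, stride):
    """Enumerate grid chips over bounds = (minx, miny, maxx, maxy).

    Each chip is a (minx, miny, maxx, maxy) window of side `size`,
    stepped by `stride` in both directions.
    """
    minx, miny, maxx, maxy = bounds
    chips = []
    y = miny
    while y + size <= maxy:
        x = minx
        while x + size <= maxx:
            chips.append((x, y, x + size, y + size))
            x += stride
        y += stride
    return chips

print(len(get_chips((0, 0, 100, 100), size=50, stride=50)))  # 4
```

In the PR the equivalent method would wrap these windows in shapely geometries and return them as a GeoDataFrame so they can be saved, filtered, and split.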

@github-actions github-actions bot added testing Continuous integration testing samplers Samplers for indexing datasets labels Sep 17, 2024
@sfalkena
Contributor Author

@sfalkena please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"


@microsoft-github-policy-service agree company="Shell"

@sfalkena
Contributor Author

I still need to add geopandas to the requirements. Do you use a dependency resolver to pick the right version, or do you simply take the version that is installed for Python 3.10?

@adamjstewart
Collaborator

The downside of this approach is that every epoch will contain the exact same set of images, whereas the current implementation samples different locations during each epoch. Not a deal breaker, but something to weigh against the proposed benefits:

It is possible to save the chips to file and to visualize them either in GIS software, or while developing.

You could also just call dataset.plot(sample) and visualize them without export. Are you seeing any issues with this? Would like to simplify this if so.

It allows to filter the chips using other geometry, to remove chips that are not of interest.

This is a valid advantage, and something we would like to support (one way or another).

It allows to split the samples.

Is there something wrong with our builtin splitters? The problem with your splitter is that there's no guarantee that two samples do not overlap across train and test.

P.S. Missing dependencies on geopandas and tqdm. Would need to be added to pyproject.toml, requirements/required.txt, and requirements/min-reqs.old. I usually use a binary search on PyPI versions until I find the minimum version that works for us.
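The version-hunting step described above can be sketched as a standard binary search over sorted releases. Here works and the version list are hypothetical stand-ins for "pin the package to version X and run the test suite":

```python
def min_working_version(versions, works):
    """Binary search a sorted version list for the first version that passes.

    Assumes the predicate is monotone: once a version works,
    all later versions work too.
    """
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works(versions[mid]):
            hi = mid  # this version works; the answer is here or earlier
        else:
            lo = mid + 1  # too old; look later
    return versions[lo]

versions = ["0.10", "0.11", "0.12", "0.13", "0.14"]
print(min_working_version(versions, lambda v: v >= "0.12"))  # 0.12
```

With n releases this takes about log2(n) install-and-test cycles instead of n.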

@sfalkena
Contributor Author

The downside of this approach is that every epoch will contain the exact same set of images, whereas the current implementation samples different locations during each epoch. Not a deal breaker, but something to weigh against the proposed benefits:

Agreed, that's a good point. Maybe I could add a refresh_samples function to the random sampler that draws a new set of chips? It could be called manually every epoch.

You could also just call dataset.plot(sample) and visualize them without export. Are you seeing any issues with this? Would like to simplify this if so.

I do use this for visualizing a single sample, but more to inspect the underlying data than the location of the sample on a map. The point of saving the chips is to get a feel for the complete set of samples that are generated. Especially in combination with multiple filtering steps during inference, I like that I can double-check the areas that will be inferred.
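The "save chips to file for GIS inspection" workflow can be sketched even without the geopandas dependency, by writing the chip boxes as a GeoJSON FeatureCollection (which QGIS and most GIS tools open directly). The function name and tuple representation are hypothetical:

```python
import json
import os
import tempfile

def chips_to_geojson(chips, path):
    """Write (minx, miny, maxx, maxy) chips as GeoJSON polygons for GIS viewing."""
    features = []
    for minx, miny, maxx, maxy in chips:
        # closed ring: first and last coordinate are identical
        ring = [[minx, miny], [maxx, miny], [maxx, maxy], [minx, maxy], [minx, miny]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {},
        })
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)

# demo: write one chip and read it back
path = os.path.join(tempfile.gettempdir(), "chips_demo.geojson")
chips_to_geojson([(0, 0, 10, 10)], path)
with open(path) as f:
    print(json.load(f)["features"][0]["geometry"]["type"])  # Polygon
```

With the actual GeoDataFrame from the PR, the same export would presumably be a one-liner via the GeoDataFrame's own file-writing support.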

Is there something wrong with our builtin splitters? The problem with your splitter is that there's no guarantee that two samples do not overlap across train and test.

Nothing is wrong with the existing splitters, although they split datasets rather than individual samples. The existing functions are preferable to what I have implemented; I added this functionality mainly for inference, so that I can split my inference area across multiple workers. In theory I could also assign a dataset per worker, but I like that I currently have one dataset and one sampler.

On the P.S.: I'm curious whether you have a particular reason not to use something like Poetry to handle package compatibility?

@adamjstewart
Collaborator

I'm curious whether you have a particular reason not to use something like Poetry to handle package compatibility?

I haven't seen a good reason to use it. Someone asks me why I'm not using flit, poetry, hatchling, etc. a couple of times per year, but so far I haven't seen any features they add that setuptools doesn't have. Let's discuss that in a separate discussion.

@github-actions github-actions bot added the dependencies Packaging and dependencies label Sep 18, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 18, 2024
@sfalkena
Contributor Author

All checks are finally passing :) I added the refresh_samples method too. Please let me know if you want to see any other changes.

@adamjstewart
Collaborator

Will refresh_samples be run automatically when using a DataLoader/DataModule? This doesn't really seem like a good solution to the problem.

@sfalkena
Contributor Author

sfalkena commented Sep 18, 2024

I have now overridden the __iter__ method specifically for the RandomGeoSampler, so indeed chips get refreshed every epoch. Additionally, added a tutorial notebook on how to visualize the samples.
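The "refresh in __iter__" approach can be sketched in isolation: since a DataLoader calls iter(sampler) once per epoch, re-drawing the chips there refreshes them automatically. This is a minimal toy illustration, not the PR's actual code:

```python
class RefreshingSampler:
    """Sketch: re-draw chips at the start of every epoch via __iter__."""

    def __init__(self, chips_fn):
        self.chips_fn = chips_fn  # callable returning a fresh list of chips
        self.chips = chips_fn()

    def __iter__(self):
        # A DataLoader calls iter() once per epoch, so chips refresh per epoch.
        self.chips = self.chips_fn()
        return iter(self.chips)

import itertools
counter = itertools.count(1)
sampler = RefreshingSampler(lambda: [next(counter)])
print(list(sampler), list(sampler))  # two iterations yield different "chips"
```

This keeps the GeoDataFrame bookkeeping (saving, filtering, splitting) while preserving the current behavior of sampling new random locations every epoch.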

@sfalkena
Contributor Author

One additional thing: in order to pass the notebook tests, I changed the workflow file to install the checked-out repo. By default, torchgeo is installed in one of the first cells of each notebook, but that installs torchgeo from main, so no notebook would pass as part of a PR. Do you agree with installing torchgeo explicitly this way?

@adamjstewart
Collaborator

One additional thing: in order to pass the notebook tests, I changed the workflow file to install the checked-out repo. By default, torchgeo is installed in one of the first cells of each notebook, but that installs torchgeo from main, so no notebook would pass as part of a PR. Do you agree with installing torchgeo explicitly this way?

Can you open a separate PR for this? I would like to get it in soon, as it's also needed by @calebrob6 in #1897. Note that we need to keep -r requirements.txt to ensure that the versions are fixed. I also don't think -e is needed; it should just be "." on its own.

@sfalkena
Contributor Author

See #2306

@github-actions github-actions bot added the datamodules PyTorch Lightning datamodules label Sep 23, 2024
@github-actions github-actions bot removed the datamodules PyTorch Lightning datamodules label Sep 23, 2024
@adamjstewart
Collaborator

I have not forgotten about this PR, just haven't had time to properly review it.
