RandomGeoSampler vs RandomBatchGeoSampler #1751

lcoandrade · 2023-12-01T16:52:32Z

lcoandrade
Dec 1, 2023

Hi again,
I have a question.

Given the dataset I'm using, what should I use to define it?

So far this is my definition:

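import os

import torch
from torchgeo.datamodules import GeoDataModule
from torchgeo.datasets import RasterDataset, random_bbox_assignment, stack_samples
from torchgeo.samplers import GridGeoSampler, RandomBatchGeoSampler

class MyRasterImage(RasterDataset):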
    filename_glob = "*.tif"
    is_image = True
    separate_files = False

class MyRasterMask(RasterDataset):
    filename_glob = "*.tif"
    is_image = False
    separate_files = False

image_set = MyRasterImage(
    paths=os.path.join(INPUT_DIR, TRAIN_DIR, IMG_DIR),
)

class ReclassTransformer:
    def __call__(self, sample):
        x = sample["mask"]
        x[x == 255] = 1
        sample["mask"] = x
        return sample

gt_set = MyRasterMask(
    paths=os.path.join(INPUT_DIR, TRAIN_DIR, LABEL_DIR),
    transforms=ReclassTransformer(),
)

dataset = image_set & gt_set

class CustomGeoDataModule(GeoDataModule):
    def setup(self, stage: str) -> None:
        """Set up datasets.

        Args:
            stage: Either 'fit', 'validate', 'test', or 'predict'.
        """
        self.dataset = self.dataset_class(**self.kwargs)
        
        generator = torch.Generator().manual_seed(0)
        (
            self.train_dataset,
            self.val_dataset,
            self.test_dataset,
        ) = random_bbox_assignment(self.dataset, [0.6, 0.2, 0.2], generator)
        
        if stage in ["fit"]:
            self.train_batch_sampler = RandomBatchGeoSampler(
                self.train_dataset, self.patch_size, self.batch_size, self.length
            )
        if stage in ["fit", "validate"]:
            self.val_sampler = GridGeoSampler(
                self.val_dataset, self.patch_size, self.patch_size
            )
        if stage in ["test"]:
            self.test_sampler = GridGeoSampler(
                self.test_dataset, self.patch_size, self.patch_size
            )
            
datamodule = CustomGeoDataModule(
    dataset_class = type(dataset), # GeoDataModule kwargs
    batch_size = BATCH_SIZE, # GeoDataModule kwargs
    patch_size = IMG_SIZE, # GeoDataModule kwargs
    length = MAX_WINDOWS, # GeoDataModule kwargs
    num_workers = WORKERS, # GeoDataModule kwargs
    dataset1 = image_set, # IntersectionDataset kwargs
    dataset2 = gt_set, # IntersectionDataset kwargs
    collate_fn = stack_samples, # IntersectionDataset kwargs
)

Printing image_set and gt_set gives:

MyRasterImage Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=-6823569.7720385045, maxx=6776.0, miny=237532.76998160873, maxy=8895035.624662481, mint=0.0, maxt=9.223372036854776e+18)
    size: 180
MyRasterMask Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=-1281274.2289648587, maxx=5353879.269352469, miny=3343500.0, maxy=12340611.831173468, mint=0.0, maxt=9.223372036854776e+18)
    size: 180

And for some reason, after making the intersection dataset, the labels have their CRS converted:

Converting MyRasterMask CRS from EPSG:26914 to EPSG:31256

Should I use RandomGeoSampler or RandomBatchGeoSampler and why?

Answered by adamjstewart

Dec 4, 2023


isaaccorley · 2023-12-01T18:05:11Z

isaaccorley
Dec 1, 2023
Maintainer

Hi @lcoandrade, not sure if you were aware, but we already have a dataset and datamodule available for the Inria Aerial Image Labeling dataset. See torchgeo.datasets.InriaAerialImageLabeling and torchgeo.datamodules.InriaAerialImageLabelingDataModule.

0 replies

lcoandrade · 2023-12-01T20:13:08Z

lcoandrade
Dec 1, 2023
Author

Hi, @isaaccorley. Thanks to @adamjstewart , I know that.

I'm using it as a custom dataset because I'm doing a comparative study of TorchGeo and Raster Vision with my students.

I already showed them how to use a custom dataset on RV and now I'll show them how to perform the same work on TG. The idea is to use our own custom dataset later.

I'm asking because I get far fewer training batches than validation batches, and this leads to bad results after training.

As you can see here, there are just 12 training batches versus 7502 validation batches.

Why is this happening?

4 replies

isaaccorley Dec 2, 2023
Maintainer

In RandomBatchGeoSampler in your notebook you have length=100 and batch_size=8 which would give you roughly 100/8~=12 train batches. If you would like to increase the number of batches per epoch you can increase the length parameter. You can see more about the samplers in the docs here.
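A minimal sketch of the arithmetic above, using only plain Python. The helper names `batches_per_epoch` and `length_for_batches` are hypothetical and not part of TorchGeo; the sketch assumes the number of mini-batches is the integer division of `length` by `batch_size`, which matches the 100/8 ≈ 12 figure quoted in the reply.

```python
def batches_per_epoch(length: int, batch_size: int) -> int:
    """Number of full mini-batches when `length` patches are drawn per epoch."""
    return length // batch_size


def length_for_batches(target_batches: int, batch_size: int) -> int:
    """Smallest `length` that yields at least `target_batches` mini-batches."""
    return target_batches * batch_size


print(batches_per_epoch(100, 8))    # 12
print(length_for_batches(1000, 8))  # 8000
```

In other words, to train on roughly 1000 mini-batches of 8 patches per epoch, `length` would need to be about 8000 rather than 100.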

lcoandrade Dec 2, 2023
Author

I thought the number of samples was determined per file in the folder. Now I understand. Thanks!

adamjstewart Dec 4, 2023
Maintainer

You can also use the default length (None) and let TorchGeo compute approximately how many patches it thinks it can load from all files. Then you should get a similar number of patches during training and validation.
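As a rough illustration of what "approximately how many patches" could mean, the sketch below counts how many non-overlapping patches fit into each file and sums the result over all files. This is an assumption about the general idea, not TorchGeo's actual code: the helper names are hypothetical, everything is measured in pixels rather than CRS units, and the file shapes (180 tiles of 5000 x 5000 pixels) and patch size (512) are made-up example values.

```python
import math


def chips_per_file(height_px: int, width_px: int, patch_px: int) -> int:
    """Count patches on a grid whose stride equals the patch size."""
    if height_px < patch_px or width_px < patch_px:
        return 0  # file is too small to hold a single patch
    rows = math.ceil((height_px - patch_px) / patch_px) + 1
    cols = math.ceil((width_px - patch_px) / patch_px) + 1
    return rows * cols


def estimate_length(file_shapes: list[tuple[int, int]], patch_px: int) -> int:
    """Sum the per-file patch counts to get an epoch length."""
    return sum(chips_per_file(h, w, patch_px) for h, w in file_shapes)


# Hypothetical example values, not taken from the thread.
file_shapes = [(5000, 5000)] * 180
default_length = estimate_length(file_shapes, patch_px=512)
print(default_length)
```

Because the same kind of count applies to a grid over the validation files, an epoch length derived this way ends up in the same order of magnitude for training and validation.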

lcoandrade Dec 4, 2023
Author

I'll try that!

lcoandrade · 2023-12-02T13:40:54Z

lcoandrade
Dec 2, 2023
Author

Another thing is bothering me. While training, I'm getting:

ERROR 1: Point outside of projection domain

2 replies

adamjstewart Dec 4, 2023
Maintainer

This sometimes happens if not all of your files can be reprojected to the CRS you are using. Usually, I suggest using a different CRS.

lcoandrade Dec 4, 2023
Author

Thanks!

lcoandrade · 2023-12-02T19:46:51Z

lcoandrade
Dec 2, 2023
Author

Changing to GridGeoSampler to train, I get 15k batches, which is expected.

2 replies

adamjstewart Dec 4, 2023
Maintainer

Note that RandomGeoSampler/RandomBatchGeoSampler is recommended during training (because it can generate infinitely many unique samples) and GridGeoSampler is recommended during validation/testing (because it can cover the entirety of all images without overlap). See https://torchgeo.readthedocs.io/en/stable/api/samplers.html

lcoandrade Dec 4, 2023
Author

I know that. I was just trying GridGeoSampler for training to check the number of batches. I had assumed that the number of samples in RandomBatchGeoSampler was drawn per file.

adamjstewart · 2023-12-04T09:19:12Z

adamjstewart
Dec 4, 2023
Maintainer

And for some reason, after making the intersection dataset, the labels have their CRS converted

See Figure 2 from our paper. Your datasets have different CRSs. If you don't reproject them to the same CRS, you won't be able to align them. Luckily, TorchGeo reprojects them for you automatically when you merge the two datasets. You can do this implicitly:

dataset_a & dataset_b  # warps b to a

or explicitly:

dataset_b = MyRasterMask(..., crs=...)

Should I use RandomGeoSampler or RandomBatchGeoSampler and why?

See Figure 3a from our paper. RandomGeoSampler samples a random patch from a random file for each sample in each mini-batch. RandomBatchGeoSampler works very similarly, but instead samples all random patches from the same random file during the entire mini-batch. The result is that GDAL's LRU cache is more likely to be hit for larger mini-batches, smaller files, or larger block sizes. You should find that RandomBatchGeoSampler is slightly faster for file I/O than RandomGeoSampler, although that may not matter if your I/O is fast and GPU is slow.
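The difference between the two strategies can be sketched with a toy simulation in plain Python. This is not TorchGeo code: the functions `random_sampler` and `random_batch_sampler` are hypothetical stand-ins that only track which file each patch comes from, they ignore patch positions, and they pick files uniformly at random, which is a simplification.

```python
import random


def random_sampler(files, length, rng):
    """Pick a file independently for every patch."""
    return [rng.choice(files) for _ in range(length)]


def random_batch_sampler(files, batch_size, length, rng):
    """Pick one file per mini-batch; every patch in that batch comes from it."""
    batches = []
    for _ in range(length // batch_size):
        chosen = rng.choice(files)
        batches.append([chosen] * batch_size)
    return batches


rng = random.Random(0)
files = [f"tile_{i}.tif" for i in range(180)]  # hypothetical file names

per_sample = random_sampler(files, length=8, rng=rng)
per_batch = random_batch_sampler(files, batch_size=8, length=100, rng=rng)

print(len(set(per_sample)), "distinct files opened for one per-sample batch")
print(len(set(per_batch[0])), "distinct file opened for one per-batch batch")
```

With per-sample selection, a mini-batch of 8 patches may touch up to 8 different files, while per-batch selection always touches one. That is why a file-level cache is more likely to be hit in the second case.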

1 reply

lcoandrade Dec 4, 2023
Author

In the beginning, I checked only a few files from my dataset and assumed that all of them were in the same CRS. Now, after reading this, I've checked all the files and found that they use multiple CRSs:

images: [CRS.from_epsg(31256), CRS.from_epsg(31254), CRS.from_epsg(26914), CRS.from_epsg(26916), CRS.from_epsg(26910)]
gt: [CRS.from_epsg(31256), CRS.from_epsg(31254), CRS.from_epsg(26914), CRS.from_epsg(26916), CRS.from_epsg(26910)]

It was wrong to assume that. Thanks for enlightening me.
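One possible way to run such a check over every file is sketched below. The helper names `read_crs_with_rasterio` and `group_by_crs` are hypothetical; the sketch assumes rasterio is installed for the real reader and that the directory path in the usage comment is replaced with your own.

```python
from collections import defaultdict


def read_crs_with_rasterio(path):
    """Return the CRS of one raster file as a string (requires rasterio)."""
    import rasterio  # imported lazily so the grouping logic works without it

    with rasterio.open(path) as src:
        return str(src.crs)


def group_by_crs(paths, read_crs=read_crs_with_rasterio):
    """Map each CRS string to the list of files that use it."""
    groups = defaultdict(list)
    for path in paths:
        groups[read_crs(path)].append(str(path))
    return dict(groups)


# Usage (the directory name is a placeholder):
#   import glob
#   for crs, members in group_by_crs(sorted(glob.glob("images/*.tif"))).items():
#       print(crs, len(members))
```

Passing the reader as a parameter keeps the grouping independent of the raster library, so it can be tried on a small lookup table before pointing it at real files.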

lcoandrade · 2023-12-04T12:04:01Z

lcoandrade
Dec 4, 2023
Author

Many thanks to @isaaccorley and @adamjstewart for the kind and precise answers. Now, I understand better how TG works. I consider that I have many answers here in the discussion.

0 replies
