Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RasterDataset: add control over resampling algorithm #2015

Merged
merged 4 commits into from
May 13, 2024

Conversation

adamjstewart
Copy link
Collaborator

@adamjstewart adamjstewart commented Apr 20, 2024

By default, rasterio.merge.merge uses "nearest" as its resampling algorithm. This is fast and works well for masks where you don't want to interpolate two categorical classes, but results in resampling artifacts for floating point images. This PR adds an attribute to control the resampling algorithm used. It defaults to "bilinear" for float (usually images) and "nearest" for int (usually masks).

I chose to base this on dtype instead of is_image because there are times when you want to interpolate floating point pixelwise regression masks.

@robmarkcole can you upload before and after plots from the example you showed me?

Closes #2012

@adamjstewart adamjstewart added this to the 0.6.0 milestone Apr 20, 2024
@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets testing Continuous integration testing labels Apr 20, 2024
.. versionadded:: 0.6
"""
# Based on torch.is_floating_point
if self.dtype in [torch.float64, torch.float32, torch.float16, torch.bfloat16]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

torch.double == torch.float64, and the tests are designed to confirm that both work.

@adamjstewart
Copy link
Collaborator Author

Doesn't seem to have as big of an impact on I/O performance as I expected:

raw (random) raw (grid) preprocessed (random) preprocessed (grid)
before 17.350 10.984 9.9158 4.6496
after 18.706 12.371 9.5933 4.5972

@adamjstewart
Copy link
Collaborator Author

This approach doesn't work well for datasets like L7 Irish or L8 Biome where a single RasterDataset is used for both image and mask. I think the solution is to instead subclass IntersectionDataset like I did in #1972.

@DimitrisMantas
Copy link
Contributor

I think cubic interpolation is not such a good idea because the corresponding convolution kernel has negative weights, meaning that the output data range is not guaranteed.

This can really mess up any normalization or scaling applied to the data before the resampling step.

Here’s a minimal example of what I’m talking about:

import numpy as np
import rasterio

dst_data = data = np.random.rand(512, 512)

dst: rasterio.io.DatasetWriter
with rasterio.open(
    "temp.tif", mode="w", count=1, dtype=np.float32, width=512, height=512
) as dst:
    dst.write(dst_data, indexes=1)

src: rasterio.io.DatasetReader
with rasterio.open("temp.tif") as src:
    src_data = src.read(
        out_shape=(src.count, int(src.height * 2), int(src.width * 2)),
        resampling=rasterio.enums.Resampling.cubic,
    )
    assert (
        src_data.min() >= 0 and src_data.max() < 1
    ), "The cubic kernel has negative weights!"

This can also cause issues with no-data values; there may be huge inaccuracies in the final product depending on where they land in relation to the kernel.

I think bilinear interpolation is fine for 99.9% of cases.

@robmarkcole
Copy link
Contributor

robmarkcole commented Apr 22, 2024

This pair shows the original issue, addressed in this PR

image

This was created with a random sampler, checking for it, but in general the results are improved. There are however some kind of streak artefact:

image

@adamjstewart
Copy link
Collaborator Author

You can play around with other enums and see which looks best: https://rasterio.readthedocs.io/en/stable/api/rasterio.enums.html#rasterio.enums.Resampling

I agree with @DimitrisMantas that we should pick a safe/simple/fast default.

@DimitrisMantas
Copy link
Contributor

I think this is more or less the one-liner for each method:

  • Nearest/Mode: Good for masks; not suitable for continuous fields (i.e., images)
  • Bilinear: Suitable for continuous data; works best with smooth-ish fields (e.g., DEMs & DSMs)
  • Cubic/Cubic Spline/Lanczos: These are all the same as far as we are concerned; they are a bit unpredictable.
  • Average: It sounds sounds a bit like bilinear, but it's probably a kernel.
  • Gauss: Produces a probably undesirable smoothing effect when upsampling images.
  • Min/Max/etc.: They are probably useful in certain regression tasks, but I can't think of a potential application at the moment.

@DimitrisMantas
Copy link
Contributor

DimitrisMantas commented Apr 22, 2024

By the way, Lanczos is technically "the best" of the bunch...

@isaaccorley
Copy link
Collaborator

Agree with ^. It should be NN resampling for masks and Bilinear for all else by default. If a user wants something specific they can override.

@adamjstewart adamjstewart merged commit 25fb9cc into microsoft:main May 13, 2024
17 checks passed
@adamjstewart adamjstewart deleted the datasets/raster-resampling branch May 13, 2024 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overrideable resample property for IntersectionDataset
4 participants