
Querying the expression parser with geometry #226

Closed
chapmanjacobd opened this issue Jul 15, 2021 · 9 comments

Comments

@chapmanjacobd
Contributor

chapmanjacobd commented Jul 15, 2021

If I wanted to leverage some of the existing expression parsing of the compute route to create a route that produces a CSV extract from compute route parameters + a GeoJSON, what would be the best way to go about that?

Similar to #220, but #220 is about XYZ input and vector tile output, whereas this ticket is about taking a bbox extent or shapely geometry as input and getting the output of the desired terracotta function as an ndarray (and affine?), OR querying an evaluate_expression output image as a VRT (across multiple tiles).
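
Roughly the shape of what I'm after, sketched with plain rasterio (dataset path and bbox here are placeholders, not a proposed terracotta API):

import rasterio
from rasterio.windows import from_bounds

with rasterio.open("some_dataset.tif") as src:
    # bbox in the dataset's CRS: (left, bottom, right, top)
    window = from_bounds(19.19, -33.68, 19.76, -33.43, transform=src.transform)
    arr = src.read(1, window=window)           # ndarray for the requested extent
    affine = src.window_transform(window)      # affine transform for that window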

@chapmanjacobd chapmanjacobd changed the title Querying the expression parser with specific areas Querying the expression parser with geometry Jul 15, 2021
@dionhaefner
Collaborator

I think I'll have to see an example input and output here. It's not clear to me whether this is on a tile level, or whether you want to query raw GTiff data at the pixel level, and how you would expect this to work with different zoom levels.

@chapmanjacobd
Contributor Author

chapmanjacobd commented Jul 16, 2021

For my use case it would always be at or near the native resolution of the GTiff file. Maybe this wouldn't be solved by terracotta directly... I'm not sure what is possible... and maybe native resolution doesn't make sense if there are multiple resolutions for a given compute query. Tiling might be interesting in this case--and it is a more general case. I'm not against working at the tile level per se.

I'd rather not duplicate handlers/compute.py and server/compute.py, but maybe that is the only way to access data in the format needed for a compute-extract route.

For single-band, non-compute extraction I do this:

import pandas as pd
import pyproj
import rasterio
import shapely.wkt
from io import StringIO
from flask import make_response
from rasterio.mask import raster_geometry_mask
from shapely.geometry import Point
from shapely.ops import transform as shapely_transform
from terracotta import get_driver, get_settings
from terracotta.server.flask_api import convert_exceptions
from utils import key_to_filename

prj4326_3857 = pyproj.Transformer.from_crs("epsg:4326", "epsg:3857", always_xy=True)
prj3857_4326 = pyproj.Transformer.from_crs("epsg:3857", "epsg:4326", always_xy=True)

# http://localhost:5000/extract/ds1/POLYGON((19.19036 -33.674068,19.75616 -33.674068,19.75616 -33.43029,19.19036 -33.43029,19.19036 -33.674068))
# http://localhost:5000/extract/ds1/POINT(18.3863 -33.92021)

def register(server):
    @server.route("/extract/<path:keys>/<string:wkt>", methods=["GET"])
    @convert_exceptions
    def extract(keys: str, wkt: str):
        keys = keys.split("/")
        settings = get_settings()
        driver = get_driver(settings.DRIVER_PATH, provider=settings.DRIVER_PROVIDER)
        with driver.connect():
            dataset = key_to_filename(driver, keys)
            result = get_bbox_data(wkt, dataset)
        return send_as_csv(result)

def send_as_csv(data):
    str_io = StringIO()
    data.to_csv(str_io, index=False)
    output = make_response(str_io.getvalue())
    output.headers["Content-Disposition"] = f"attachment; filename=data_download.csv"
    output.headers["Content-type"] = "text/csv"
    return output


def get_bbox_data(wkt, dataset):
    geom = shapely_transform(prj4326_3857.transform, shapely.wkt.loads(wkt))
    ds_key, ds_path = dataset
    with rasterio.open(ds_path) as src:
        return raster2xy(src, geom, ds_key[0])


def raster2xy(src, geom, ds_name):
    _m, _affine, win = raster_geometry_mask(src, [geom], crop=True, all_touched=True)
    arr_2D = src.read(1, window=win)
    nodata = src.nodata

    px_list = list()
    for row_index, row in enumerate(arr_2D):
        for col_index, pixel_value in enumerate(row):
            if pixel_value == nodata:
                continue

            # rasterio's xy() expects (row, col) indices relative to the full dataset
            easting, northing = src.xy(win.row_off + row_index, win.col_off + col_index)

            try:
                px_list.append(
                    {
                        "wkt": shapely.wkt.dumps(
                            shapely_transform(
                                prj3857_4326.transform, Point(easting, northing)
                            ),
                            rounding_precision=5,
                        ),
                        ds_name: pixel_value,
                    }
                )
            except ValueError:
                # skip nan values
                continue

    return pd.DataFrame(px_list)

@dionhaefner
Collaborator

I see 2 main problems with this approach:

  • Something like this is extremely vulnerable to DoS attacks. If you pass a polygon covering the entire globe you can stall the server for ages, and the payloads will be huge. At least add some validation that aborts if the pixel count is too high.
  • URLs have a size limit of ~2000 characters, so there's no way to pass complex shapes this way. You could make this a POST route instead and send the geometry in the request body, e.g. as sketched below.
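
A rough sketch of what that could look like, reusing the imports and helpers from your snippet above (key_to_filename, get_bbox_data, send_as_csv are from that snippet, not terracotta):

from flask import request
from shapely import wkt as shapely_wkt
from shapely.geometry import shape

def register_post(server):
    @server.route("/extract/<path:keys>", methods=["POST"])
    def extract_post(keys: str):
        # geometry arrives as GeoJSON in the request body, so URL length is not a limit
        body = request.get_json(force=True)
        geom = shape(body["geometry"]) if "geometry" in body else shape(body)
        settings = get_settings()
        driver = get_driver(settings.DRIVER_PATH, provider=settings.DRIVER_PROVIDER)
        with driver.connect():
            dataset = key_to_filename(driver, keys.split("/"))
            result = get_bbox_data(shapely_wkt.dumps(geom), dataset)
        return send_as_csv(result)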

Now for compute expression support, I guess all it would take would be to read multiple datasets and pass them to evaluate_expression... but you would have to require that they are aligned, or do some more gymnastics to reproject.

This is bound to get ugly. Honestly I would probably try to use geotiff.js to read the COG data on the client side directly from your data storage. You can still use Terracotta for indexing and tiling, but when it comes to reading raw data I think we're straying a bit far from the intended use case.

@j08lue
Collaborator

j08lue commented Jul 16, 2021

Doing more to the COGs registered in Terracotta than just PNG tile rendering is definitely relevant. There are several features that could be worth considering:

  1. Compute zonal statistics over pixels within a user-provided polygon
  2. Return the pixels in a user-provided polygon as PNG (very useful for creating animations on a map) (like in TiTiler for bbox)
  3. Extract pixel values for a single point / nearest pixel (TiTiler has that, too)

I think any functionality that A) takes a well-defined user input to work on the GeoTIFF and B) returns a condensed / reduced / small response derived from the GeoTIFF would make sense to add to Terracotta.

Getting the values for all pixels within the polygon as CSV, like you are considering @chapmanjacobd, is perhaps a bit too far from the second criterion I listed - that we should not transfer raw data from the COG in storage, but reduce it server-side to something web-friendly.

As for the valid concern you voiced, @dionhaefner, that large requests could stall the server - I think that could be prevented by validating the complexity of the polygon (-> limits the cost of the mask calculation) and then validating the number of affected pixels by summing over the mask (-> limits the amount of data returned / processed).
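
A rough sketch of that two-step validation (the thresholds are made up, and it assumes Polygon / MultiPolygon input):

import numpy as np
from rasterio.mask import raster_geometry_mask

MAX_VERTICES = 10_000     # caps the cost of the mask calculation
MAX_PIXELS = 4_000_000    # caps the amount of data processed / returned

def validate_extract_request(src, geom):
    if geom.geom_type == "Polygon":
        n_vertices = len(geom.exterior.coords)
    else:  # MultiPolygon
        n_vertices = sum(len(g.exterior.coords) for g in geom.geoms)
    if n_vertices > MAX_VERTICES:
        raise ValueError("polygon too complex")

    mask, _, window = raster_geometry_mask(src, [geom], crop=True, all_touched=True)
    # mask is True outside the geometry, so ~mask counts the affected pixels
    if int(np.sum(~mask)) > MAX_PIXELS:
        raise ValueError("too many pixels requested")
    return window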

@dionhaefner
Collaborator

dionhaefner commented Jul 16, 2021

I'm not against adding a route mapping [bbox, zoom level] -> png and [bbox, zoom level] -> pixel data. Something like

# vanilla singleband tile
/singleband/{keys}/{x}/{y}/{z}.png

# untiled access with bbox
/singleband/{keys}/{z}.png?bbox={wkt}

# equivalent to current /preview.png if z == native zoom level (bbox = extent)
/singleband/{keys}/{z}.png

# raw pixel data of tile instead of png image
/singleband/{keys}/{x}/{y}/{z}.json

# raw pixel data in bbox
/singleband/{keys}/{z}.json?bbox={bbox}

# raw pixel data via compute, as requested in this issue
/compute/{keys}/{z}.json?expression={expression}&bbox={bbox}

# also supports reductions like zonal mean
/compute/{keys}/{z}.json?expression=mean(v1)&bbox={bbox}&v1=foo

With lots of validation on the passed bbox. In this proposal all raw pixel data would be aligned with the XYZ grid, so there would be no way to read the unprojected raw data as in the COG. But this also allows for efficient usage of overviews and makes it so we don't have to inform the user of which CRS the data is in, and ensures that tiles and raw data match.

So the .json routes just give you whatever the .png would give without the final step of rendering the data as an image (basically the raw 2D or 3D NumPy array as text).
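
Hedged example of how a client could consume such a .json route (the route only exists in this proposal, it is not something terracotta serves today):

import numpy as np
import requests

resp = requests.get(
    "http://localhost:5000/compute/ds1/10.json",
    params={"expression": "v1 * v2", "bbox": "19.19,-33.67,19.76,-33.43", "v1": "B08", "v2": "B04"},
)
resp.raise_for_status()
data = np.asarray(resp.json())  # the raw 2D (or 3D) array, no PNG rendering step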

@chapmanjacobd
Contributor Author

# also supports reductions like zonal mean
/compute/{keys}/{z}.json?expression=mean(v1)&bbox={bbox}&v1=foo

ohhhh very cool. I hadn't even thought about that

It sounds like working at the tile level would completely solve my use case. Even if terracotta didn't do exactly what I wanted, it would be a lot more accessible with the above routes as examples.

@j08lue
Collaborator

j08lue commented Jul 16, 2021

Nice list, @dionhaefner! Seeing this in clear API specs makes the cases quite tangible.

Do we have to limit ourselves to bbox, or could we take any polygon (within the URL length limit, of course)? Then a bbox would just be a simple case of that. Especially for zonal statistics, users (like me at least) will often want to be more specific than just a box, and animations look nicer when masked to the polygon rather than just cropped to a box in WGS84 or Web Mercator coordinates.

Another consideration for returning untiled PNGs: if a user wants to display that PNG on a map, they need to know where to place the image. We have experimented with such an API - computing zonal statistics and returning cropped PNGs (as base64-encoded strings in JSON) together with anchor points: https://soilmoisturepro.z6.web.core.windows.net/ Unless we make the image fit exactly the bbox asked for, we need to provide that information somehow.
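
For illustration, the kind of response shape we used there (the field names here are made up for this sketch):

import base64
import json

def overlay_response(png_bytes, bounds, stats):
    # bounds = (west, south, east, north) of the returned image, so the client knows where to place it
    return json.dumps({
        "image": base64.b64encode(png_bytes).decode("ascii"),
        "bounds": bounds,
        "stats": stats,  # e.g. the zonal statistics computed over the polygon
    })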

@dionhaefner
Collaborator

Yes I meant arbitrary geometry.

If a user wants to display that PNG on a map

Yeah, I'm not sure; that might be taking it too far. Your soil moisture example would also work with many small rasters, right? If you want to put stuff on a map, then tiling should be what you want.

On the other hand, in my proposed API all output would be aligned with the XYZ grid, so if you really really need this you can always compute the nearest pixel to your bbox on that grid in the frontend. Or you could ignore the issue and assume that origin == upper left corner of your bbox (with some sub-pixel inaccuracy).
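
Sketch of that frontend-side correction, in Python here for brevity and assuming 256 px tiles: snap a Web Mercator bbox to the global XYZ pixel grid at zoom z.

import math

WEB_MERCATOR_HALF_WIDTH = math.pi * 6378137  # half the side length of the Web Mercator square

def snap_bbox_to_xyz_grid(bbox_3857, z, tile_size=256):
    pixel_size = 2 * WEB_MERCATOR_HALF_WIDTH / (tile_size * 2 ** z)
    snap = lambda v: round((v + WEB_MERCATOR_HALF_WIDTH) / pixel_size) * pixel_size - WEB_MERCATOR_HALF_WIDTH
    left, bottom, right, top = bbox_3857
    return snap(left), snap(bottom), snap(right), snap(top)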

@chapmanjacobd
Contributor Author

chapmanjacobd commented Dec 23, 2021

Thinking more about this, I think it will be too slow to compute geometry operations per tile...

Here is some half-baked code if someone wants to take it further:

import contextlib
import functools
import warnings
from concurrent.futures import Future
from typing import Any, BinaryIO, Dict, Mapping, Optional, Sequence, Tuple, TypeVar, Union

import geopandas as gpd
import IPython
import mercantile
import numpy as np
import terracotta
from rich import inspect
from terracotta import exceptions, get_driver, get_settings, image, update_settings, xyz
from terracotta.drivers.base import Driver, requires_connection
from terracotta.drivers.raster_base import submit_to_executor
from terracotta.handlers.compute import Number

from refresh import DB_PATH
import pyproj
from shapely.ops import transform

prj4326_3857 = pyproj.Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)  # always_xy so lon/lat order matches shapely coordinates

v = gpd.read_file("geoBoundaries-IND-ADM2.topojson")
g4326 = v[v["shapeName"] == "Tenkasi"].geometry.values[0]
IPython.embed()
g = transform(prj4326_3857.transform, g4326)


def _get_raster_tile(
    cls,
    path: str,
    *,
    reprojection_method: str,
    resampling_method: str,
    tile_bounds: Tuple[float, float, float, float] = None,
    tile_size: Tuple[int, int] = (256, 256),
    preserve_values: bool = False,
) -> np.ma.MaskedArray:
    """Load a raster dataset from a file through rasterio.

    Heavily inspired by mapbox/rio-tiler
    """
    import rasterio
    from affine import Affine
    from rasterio import transform, warp, windows
    from rasterio.vrt import WarpedVRT

    dst_bounds: Tuple[float, float, float, float]

    if preserve_values:
        reproject_enum = resampling_enum = cls._get_resampling_enum("nearest")
    else:
        reproject_enum = cls._get_resampling_enum(reprojection_method)
        resampling_enum = cls._get_resampling_enum(resampling_method)

    with contextlib.ExitStack() as es:
        es.enter_context(rasterio.Env(**cls._RIO_ENV_KEYS))
        try:
            src = es.enter_context(rasterio.open(path))
        except OSError:
            raise IOError("error while reading file {}".format(path))

        # compute bounds in target CRS
        dst_bounds = warp.transform_bounds(src.crs, cls._TARGET_CRS, *src.bounds)

        if tile_bounds is None:
            tile_bounds = dst_bounds

        # prevent loads of very sparse data
        cover_ratio = (
            (dst_bounds[2] - dst_bounds[0])
            / (tile_bounds[2] - tile_bounds[0])
            * (dst_bounds[3] - dst_bounds[1])
            / (tile_bounds[3] - tile_bounds[1])
        )

        if cover_ratio < 0.01:
            raise exceptions.TileOutOfBoundsError("dataset covers less than 1% of tile")

        # compute suggested resolution in target CRS
        dst_transform, _, _ = warp.calculate_default_transform(
            src.crs, cls._TARGET_CRS, src.width, src.height, *src.bounds
        )
        dst_res = (abs(dst_transform.a), abs(dst_transform.e))

        # make sure VRT resolves the entire tile
        tile_transform = transform.from_bounds(*tile_bounds, *tile_size)
        tile_res = (abs(tile_transform.a), abs(tile_transform.e))

        if tile_res[0] < dst_res[0] or tile_res[1] < dst_res[1]:
            dst_res = tile_res
            resampling_enum = cls._get_resampling_enum("nearest")

        # pad tile bounds to prevent interpolation artefacts
        num_pad_pixels = 2

        # compute tile VRT shape and transform
        dst_width = max(1, round((tile_bounds[2] - tile_bounds[0]) / dst_res[0]))
        dst_height = max(1, round((tile_bounds[3] - tile_bounds[1]) / dst_res[1]))
        vrt_transform = transform.from_bounds(
            *tile_bounds, width=dst_width, height=dst_height
        ) * Affine.translation(-num_pad_pixels, -num_pad_pixels)
        vrt_height, vrt_width = (
            dst_height + 2 * num_pad_pixels,
            dst_width + 2 * num_pad_pixels,
        )

        # remove padding in output
        out_window = windows.Window(
            col_off=num_pad_pixels,
            row_off=num_pad_pixels,
            width=dst_width,
            height=dst_height,
        )

        # construct VRT
        vrt = es.enter_context(
            WarpedVRT(
                src,
                crs=cls._TARGET_CRS,
                resampling=reproject_enum,
                transform=vrt_transform,
                width=vrt_width,
                height=vrt_height,
                add_alpha=not cls._has_alpha_band(src),
            )
        )

        # read data
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", message="invalid value encountered.*")
            tile_data = vrt.read(
                1, resampling=resampling_enum, window=out_window, out_shape=tile_size
            )

            # assemble alpha mask
            mask_idx = vrt.count
            mask = vrt.read(mask_idx, window=out_window, out_shape=tile_size) == 0

            if src.nodata is not None:
                mask |= tile_data == src.nodata

        r = tile_data

        IPython.embed()

    return np.ma.masked_array(tile_data, mask=mask)


# return type has to be Any until mypy supports conditional return types
@requires_connection
def get_raster_tile(
    self,
    keys: Union[Sequence[str], Mapping[str, str]],
    *,
    tile_bounds: Sequence[float] = None,
    tile_size: Sequence[int] = None,
    preserve_values: bool = False,
    asynchronous: bool = False,
) -> Any:
    # This wrapper handles cache interaction and asynchronous tile retrieval.
    # The real work is done in _get_raster_tile.

    future: Future[np.ma.MaskedArray]
    result: np.ma.MaskedArray

    settings = get_settings()
    key_tuple = tuple(self._key_dict_to_sequence(keys))
    datasets = self.get_datasets(dict(zip(self.key_names, key_tuple)))
    assert len(datasets) == 1
    path = datasets[key_tuple]

    if tile_size is None:
        tile_size = settings.DEFAULT_TILE_SIZE

    # make sure all arguments are hashable
    kwargs = dict(
        path=path,
        tile_bounds=tuple(tile_bounds) if tile_bounds else None,
        tile_size=tuple(tile_size),
        preserve_values=preserve_values,
        reprojection_method=settings.REPROJECTION_METHOD,
        resampling_method=settings.RESAMPLING_METHOD,
    )

    cache_key = hash(tuple(kwargs.items()))

    try:
        with self._cache_lock:
            result = self._raster_cache[cache_key]
    except KeyError:
        pass
    else:
        if asynchronous:
            # wrap result in a future
            future = Future()
            future.set_result(result)
            return future
        else:
            return result

    # _get_raster_tile is a module-level function here, so pass the driver class explicitly
    retrieve_tile = functools.partial(_get_raster_tile, type(self), **kwargs)

    future = submit_to_executor(retrieve_tile)

    def cache_callback(future: Future) -> None:
        # insert result into global cache if execution was successful
        if future.exception() is None:
            self._add_to_cache(cache_key, future.result())

    if asynchronous:
        future.add_done_callback(cache_callback)
        return future
    else:
        result = future.result()
        cache_callback(future)
        return result


def get_tile_data(
    driver: Driver,
    keys: Union[Sequence[str], Mapping[str, str]],
    tile_xyz: Tuple[int, int, int] = None,
    *,
    tile_size: Tuple[int, int] = (256, 256),
    preserve_values: bool = False,
    asynchronous: bool = False,
) -> Any:
    """Retrieve raster image from driver for given XYZ tile and keys"""

    if tile_xyz is None:
        # read whole dataset
        return get_raster_tile(
            driver,
            keys,
            tile_size=tile_size,
            preserve_values=preserve_values,
            asynchronous=asynchronous,
        )

    # determine bounds for given tile
    metadata = driver.get_metadata(keys)
    wgs_bounds = metadata["bounds"]

    tile_x, tile_y, tile_z = tile_xyz

    if not xyz.tile_exists(wgs_bounds, tile_x, tile_y, tile_z):
        raise exceptions.TileOutOfBoundsError(
            f"Tile {tile_z}/{tile_x}/{tile_y} is outside image bounds"
        )

    mercator_tile = mercantile.Tile(x=tile_x, y=tile_y, z=tile_z)
    target_bounds = mercantile.xy_bounds(mercator_tile)

    return get_raster_tile(
        driver,
        keys,
        tile_bounds=target_bounds,
        tile_size=tile_size,
        preserve_values=preserve_values,
        asynchronous=asynchronous,
    )


def compute(
    expression: str,
    some_keys: Sequence[str],
    operand_keys: Mapping[str, str],
    stretch_range: Tuple[Number, Number],
    tile_xyz: Tuple[int, int, int] = None,
    *,
    colormap: str = None,
    tile_size: Tuple[int, int] = None,
) -> BinaryIO:
    """Return singleband image computed from one or more images as PNG
    Expects a Python expression that returns a NumPy array. Operands in
    the expression are replaced by the images with keys as defined by
    some_keys (all but the last key) and operand_keys (last key).
    Contrary to singleband and rgb handlers, stretch_range must be given.
    Example:
        >>> operands = {
        ...     'v1': 'B08',
        ...     'v2': 'B04'
        ... }
        >>> compute('v1 * v2', ['S2', '20171101'], operands, [0, 1000])
        <binary image containing product of bands 4 and 8>
    """
    from terracotta.expressions import evaluate_expression

    if not stretch_range[1] > stretch_range[0]:
        raise exceptions.InvalidArgumentsError(
            "Upper stretch bounds must be larger than lower bounds"
        )

    settings = get_settings()

    if tile_size is None:
        tile_size_ = settings.DEFAULT_TILE_SIZE
    else:
        tile_size_ = tile_size

    driver = get_driver(settings.DRIVER_PATH, provider=settings.DRIVER_PROVIDER)

    with driver.connect():
        key_names = driver.key_names

        if len(some_keys) != len(key_names) - 1:
            raise exceptions.InvalidArgumentsError(
                "must specify all keys except last one"
            )

        def get_band_future(band_key: str) -> Future:
            band_keys = (*some_keys, band_key)
            return get_tile_data(
                driver,
                band_keys,
                tile_xyz=tile_xyz,
                tile_size=tile_size_,
                asynchronous=True,
            )

        futures = {var: get_band_future(key) for var, key in operand_keys.items()}
        operand_data = {var: future.result() for var, future in futures.items()}

        # in my mind this would be the ideal spot to do final processing (if the processing is quicker than reading data....) since the xyz data can be cached
        # but we are missing the spatial context of the data at this point
        r = operand_data["v1"]
        zonal_extract, _transform = rasterio.mask.mask(r, g, crop=True, filled=False)

        IPython.embed()

    try:
        out = evaluate_expression(expression, operand_data)
    except ValueError as exc:
        # make sure error message gets propagated
        raise exceptions.InvalidArgumentsError(
            f"error while executing expression: {exc!s}"
        )

    out = image.to_uint8(out, *stretch_range)
    return image.array_to_png(out, colormap=colormap)


if __name__ == "__main__":
    # from server import app
    # app.run(port=5000, host="localhost", threaded=False, debug=True)

    update_settings(
        DRIVER_PATH=DB_PATH,
        REPROJECTION_METHOD="nearest",
        DEFAULT_TILE_SIZE=[512, 512]
        # DEBUG=True,
    )

    db = terracotta.get_driver(DB_PATH, provider="sqlite")
    with db.connect():
        inspect(
            compute(
                "v1 * 1",
                [],
                {"v1": "India-v16-2020"},
                [0, 100],
            )
        )
