This directory contains the logic, configuration and documentation for acquiring, processing and ingesting data such that a running IRV stack may host it as tilesets.
The data processing steps are broadly as follows:
- Download data if possible
- Reproject to WGS84 if necessary
- Set zeros to no data value
- Clip to remove polar regions
- Cloud optimise
- Ingest into terracotta, creating a metadata database for the dataset if necessary
- Create dataset metadata in Postgres database
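For a single raster, the middle steps correspond roughly to the following GDAL commands. This is an illustration only, not the pipeline's actual implementation: the file names, clipping extent and compression option are assumptions, and the real logic lives in the snakemake rules.
# Reproject to WGS84 (EPSG:4326)
gdalwarp -t_srs EPSG:4326 input.tif reprojected.tif
# Mark zero as the no data value
gdal_translate -a_nodata 0 reprojected.tif no_data.tif
# Clip to remove polar regions (extent values are illustrative)
gdalwarp -te -180 -60 180 85 no_data.tif clipped.tif
# Cloud optimise (requires GDAL >= 3.1 for the COG driver)
gdal_translate -of COG -co COMPRESS=DEFLATE clipped.tif cog.tif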
Vector data are yet to be incorporated in the unified ETL workflow. See the relevant pipeline readme files for more information. The rest of this section describes the general approach.
We use tippecanoe to generate Mapbox Vector Tiles, stored in .mbtiles files. Follow the installation or build instructions in their documentation.
Step 1 is to have the features prepared in GeoJSON.
The simplest tippecanoe example command often works well enough:
tippecanoe -zg -o landslide_forest.mbtiles --drop-densest-as-needed landslide_forest.geojson
Here's a version with options that should work a little better for a larger dataset:
tippecanoe -o landslide_forest.mbtiles \
--use-attribute-for-id=feature_id \
-zg \
--minimum-zoom=4 \
--read-parallel \
--drop-densest-as-needed \
--extend-zooms-if-still-dropping \
--simplification=10 \
--simplify-only-low-zooms \
landslide_forest.geojson
N.B. --read-parallel works with GeoJSONSeq (line-delimited GeoJSON with one feature per line and no wrapping FeatureCollection).
You could use ogr2ogr or something like this Python script to convert from regular GeoJSON to a line-delimited series of features:
import json

# Read the regular GeoJSON FeatureCollection
with open('features.geojson', 'r') as fh_in:
    data = json.load(fh_in)

# Write one feature per line (line-delimited GeoJSON)
with open('features.geojsonld', 'w') as fh_out:
    for f in data['features']:
        line = json.dumps(f)
        fh_out.write(line)
        fh_out.write("\n")
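Alternatively, ogr2ogr can write line-delimited GeoJSON directly, assuming a GDAL version with the GeoJSONSeq driver (2.4 or later):
# Convert a regular GeoJSON FeatureCollection to one feature per line
ogr2ogr -f GeoJSONSeq features.geojsonld features.geojson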
Once generated, the mbtiles file needs to sit in ./tileserver/vector/data and have an entry in config.json or config-dev.json.
The file and volume mapping for vector tiles is configured in the docker-compose (e.g. here for dev).
The ETL process is currently implemented with a combination of scripts and snakemake rules. There is an effort underway to move all processing logic into snakemake rules.
The Snakefile and .smk files contain the rule definitions for deciding how to transform some input file into an output. Many datasets have some of their own specific rules, typically for downloading and initial processing, for instance reprojection.
Install the necessary dependencies:
micromamba create -f environment.yml -y
To activate the environment:
micromamba activate irv-etl
The last two stages of the ETL pipeline (ingestion and metadata creation) involve interacting with the database service defined in the parent directory to this one.
To bring up the database, refer to the readme in the parent directory for a full explanation of the required env files, etc., but briefly:
docker compose -f docker-compose-dev.yaml up db -d
Unfortunately, not all the source data is openly available on the internet. Some files must be manually copied into the appropriate location prior to running the ETL process.
The affected datasets include:
- iris
Check the snakemake rules for more information, but source raster data should typically reside in raster/raw/<dataset>/.
You may wish to remove write permissions to these files once they have been installed, e.g. chmod ug-w raster/raw/iris/*.tif. This means rm -r raster will remove files that can be replaced automatically, but not the awkward files.
The ETL pipeline is primarily configured with an environment file, located at ../envs/{dev|prod}/.etl.env. This contains details for connections to services and authentication information for pipelines which require it.
Here is an example environment file to use as a template:
# configuration for terracotta
TC_DRIVER_PATH=postgresql://global_dev:password@localhost:5432
TC_DRIVER_PROVIDER=postgresql
TC_PNG_COMPRESS_LEVEL=0
TC_RESAMPLING_METHOD=nearest
TC_REPROJECTION_METHOD=nearest
# connecting to database
PGHOST=localhost
PGDATABASE=global_dev
PGUSER=global_dev
PGPASSWORD=password
# data downloading
# https://cds.climate.copernicus.eu/api-how-to
COPERNICUS_CDS_URL="https://cds.climate.copernicus.eu/api/v2"
COPERNICUS_CDS_API_KEY= # "<uid>:<token>"
With the software environment activated (see above), one can request files via snakemake to invoke jobs. These may be processed rasters, or in the case of database operations, dummy files with the extension .flag. Requesting any missing output implies its ancestors are also required.
To request a cloud optimised raster, invoke snakemake as follows:
snakemake --cores <n_cores> -- raster/cog/<dataset>/<key>.tif
For example:
snakemake --cores <n_cores> -- raster/cog/exposure_nature/ocs_0-30cm_mean_1000.tif
N.B. The --dry-run or -n option can be used to preview which jobs snakemake has determined are necessary prior to executing.
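For example, to preview the jobs required for the raster requested above:
snakemake --cores 1 --dry-run -- raster/cog/exposure_nature/ocs_0-30cm_mean_1000.tif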
To request a raster processed to another stage in the pipeline, substitute raw, no_data or clip for cog in the above paths.
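For example, to process the same raster as above only as far as the clipping stage:
snakemake --cores <n_cores> -- raster/clip/exposure_nature/ocs_0-30cm_mean_1000.tif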
A list of datasets currently implemented in the unified workflow is kept as ALL_DATASETS in the Snakefile.
The full processing pipeline for a single raster dataset will acquire and process the rasters, ingest them and create a metadata record. Invoke it as follows:
snakemake --cores <n_cores> -- raster/metadata/<dataset_name>.flag
To run every pipeline, we do not request a file, but rather a target rule called all.
snakemake --cores <n_cores> -R all
This will create and ingest all the pertinent rasters and create metadata records for them.
To add additional raster datasets to the ETL pipeline you will need several new files in pipelines/<new_dataset_name>:
- README.md describing the dataset
- layers.csv table of layers
- metadata.json metadata to be written to the Postgres database
- rules.smk to acquire and process the data into a WGS84 raster. The rule(s) should write output files to raster/raw/<dataset>/. The shared rules will then clip, set zero to no data, and cloud optimise unless overridden by custom rules. A minimal sketch follows this list.
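As a rough illustration, a minimal rules.smk might contain little more than a download rule that writes into raster/raw/<dataset>/. The dataset name, URL and wildcard below are hypothetical; check the existing pipelines for real examples.
# pipelines/my_new_dataset/rules.smk -- hypothetical sketch
rule download_my_new_dataset:
    """
    Fetch a source raster (already in WGS84) into raster/raw/, where the
    shared rules will clip, set no data and cloud optimise it.
    """
    output:
        "raster/raw/my_new_dataset/{KEY}.tif"
    shell:
        "wget -q -O {output} https://example.com/rasters/{wildcards.KEY}.tif"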
The layers.csv file must follow the example structure:
- filename column must contain the file basename (no directory path)
- one or more columns, to be listed as keys in the metadata.json, must exist and be sufficient to uniquely identify each row
- additional columns may exist (e.g. URL) to support the raster workflow but will be ignored in the metadata processing
filename,mode
202001_Global_Motorized_Travel_Time_to_Healthcare_2019.tif,motorized
202001_Global_Walking_Only_Travel_Time_To_Healthcare_2019.tif,walking
The metadata.json must follow the example structure:
- "domain" must be a short lower_snake_case string that will be exposed through the API and used by clients to request tiles. It will be prefixed by terracotta_ to give the metadata database name. A good default choice would be the same string as the directory name for the pipeline, which is picked up as the DATASET wildcard by snakemake.
- "name" should be a short, readable description
- "group" should be one of the high-level front-end groups (Hazard, Exposure, Vulnerability, Risk)
- "description" should be a short description or citation
- "license" should be a short code for the open data license
- "keys" must be the list of metadata keys to be loaded to the terracotta metadata database, identical to the variable column names used in layers.csv to identify/address each layer
{
  "domain": "traveltime_to_healthcare",
  "name": "Global maps of travel time to healthcare facilities",
  "group": "Exposure",
  "description": "Weiss, D.J., Nelson, A., Vargas-Ruiz, C.A. et al. Global maps of travel time to healthcare facilities. Nat Med 26, 1835–1838 (2020). https://doi.org/10.1038/s41591-020-1059-1",
  "license": "CC-BY 4.0",
  "keys": ["mode"]
}
The Snakefile will also require modification:
- If you have written a new rules.smk, you will need to import it as a module here. See existing datasets for more information.
- If you wish to overwrite the behaviour of a common rule, e.g. clipping, cloud optimisation, etc., you can override rules when importing using snakemake's ruleorder directive (see the sketch after this list).
- You should also add your dataset to ALL_DATASETS so that the all target rule will work as expected.
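A hedged sketch of the first two modifications, with hypothetical module and rule names (the existing Snakefile is the authoritative reference):
# In the Snakefile: import the new dataset's rules as a module
module my_new_dataset:
    snakefile: "pipelines/my_new_dataset/rules.smk"
    config: config

use rule * from my_new_dataset

# If a custom rule should take precedence over a shared one:
ruleorder: clip_my_new_dataset > clip_raster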
Connect to the database server and delete the metadata database and reference row in raster_tile_sources:
# Drop a metadata database
psql -h localhost -U global_dev -c 'DROP DATABASE terracotta_storm;'
# Delete a row from raster tile sources
psql -h localhost -U global_dev -c "DELETE FROM raster_tile_sources WHERE domain = 'storm';"
Check the current state of the local database server:
# List all databases
psql -h localhost -U global_dev -c '\l'
# List raster tile sources
psql -h localhost -U global_dev -c 'select id, domain, name, "group", keys FROM raster_tile_sources;'
To be implemented and then documented!