The livingatlas module contains a number of processes for ingesting data to support the living atlases. For details on deployment, see the architecture diagram.
The la-pipelines command line tool is the mechanism for running these pipelines, either locally or on a Spark cluster. This tool, along with the YAML configuration files and the built JAR file, is deployed to production systems using a Debian package (see the ansible scripts for details).
The la-pipelines tool uses the configuration files in the configs/ directory to run the pipelines either as local Java processes or as jobs submitted to the Spark cluster.
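Every pipeline below is invoked through the same wrapper: a sub-command, a data resource identifier, and optional flags that control whether the work runs as a local Java process or is submitted to Spark. A minimal sketch, where dr123 is a placeholder dataset id:

```bash
# Run a step as a local Java process (configuration is read from the configs/ directory)
./la-pipelines dwca-avro dr123

# Submit a step to the Spark cluster instead
./la-pipelines interpret dr123 --cluster
```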
This is the first pipeline to run for a dataset. It takes a DwCA and converts it to AVRO format using the ExtendedRecord schema. The result is a verbatim.avro file on the filesystem, which is a verbatim representation of the Darwin Core archive in AVRO.
Command to use: ./la-pipelines dwca-avro dr123
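To sanity-check the result, the verbatim.avro file can be inspected with the standard Apache Avro tools. The path below is only an example and depends on where your configuration writes pipeline output:

```bash
# Path is illustrative; check your la-pipelines configuration for the actual output location
java -jar avro-tools.jar getschema /data/pipelines-data/dr123/1/verbatim.avro         # show the ExtendedRecord schema
java -jar avro-tools.jar tojson /data/pipelines-data/dr123/1/verbatim.avro | head     # dump a few records as JSON
```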
This is the second pipeline run when ingesting a dataset. It can be run in three modes using the --mode flag (illustrated after the command below):
- local - uses the Java-only pipeline. This is best for smaller datasets.
- embedded - uses embedded Spark. This is for larger datasets when a cluster isn't available.
- cluster - uses a Spark cluster. This is best for large datasets.
This pipeline generates AVRO output for each of the transforms, stored in a directory structure on the filesystem (HDFS or a plain Unix filesystem). The pipeline runs the following transforms:
- basic
- taxonomy - uses the name matching service
- location
- attribution
- temporal
- multimedia
Pipeline class: VerbatimToInterpretedPipeline
Command to use: ./la-pipelines interpret dr123 --cluster
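A sketch of the three modes for a single dataset; the exact flag syntax can be confirmed against the tool's usage output (--cluster is the form used in the commands in this document):

```bash
# Small dataset: Java-only pipeline on the local machine
./la-pipelines interpret dr123 --mode=local

# Larger dataset with no cluster available: embedded Spark
./la-pipelines interpret dr123 --mode=embedded

# Large dataset: submit to the Spark cluster
./la-pipelines interpret dr123 --cluster
```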
This pipeline is responsible for minting UUIDs on records. UUIDs are created and stored in a subdirectory of the output for a dataset. The pipeline will:
- Read information from the collectory to determine which fields can be used to construct a unique key, e.g. occurrenceID.
- Check that all records have these fields and that the resulting keys are unique
- Load existing UUIDs for this dataset if it has previously been loaded
- Re-associate existing UUIDs with the records
- Mint UUIDs for new records
- Write the output to the filesystem
Pipeline class: UUIDMintingPipeline
Command to use: ./la-pipelines uuid dr123 --cluster
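Because existing UUIDs are reloaded and re-associated, re-running the pipelines after a dataset update keeps record identifiers stable; only genuinely new records are minted new UUIDs. A sketch of the re-ingest sequence:

```bash
# Re-ingest an updated archive: existing records keep their UUIDs,
# new records receive freshly minted ones
./la-pipelines dwca-avro dr123
./la-pipelines interpret dr123 --cluster
./la-pipelines uuid dr123 --cluster
```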
This pipeline provides the integration with the sensitive data service (SDS).
Pipeline class: InterpretedToSensitivePipeline
Command to use: ./la-pipelines sds dr123 --cluster
There are two pipelines used for the integration with the image service.
This pipeline pushes new images to the image service for storage. It can be run synchronously or asynchronously. When run asynchronously, the pipeline pushes image details to the image service and returns, leaving the image service to load the images in the background.
Pipeline class: ImageLoadPipeline
Command to use: ./la-pipelines image-load dr123 --cluster
This pipeline retrieves details of the images stored in the image service for a dataset, including their identifiers (UUIDs), so that they can be indexed with the records.
Pipeline class: ImageSyncPipeline
Command to use: ./la-pipelines image-sync dr123 --cluster
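In practice the two pipelines are run as a pair: the load pushes new images, and the sync then pulls back the image identifiers so they can be attached to the records at index time. A sketch:

```bash
# Push new images for the dataset to the image service
./la-pipelines image-load dr123 --cluster

# Retrieve the image UUIDs from the image service for indexing with the records
./la-pipelines image-sync dr123 --cluster
```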
This pipeline produces an AVRO representation of each record that will be sent to an indexing platform such as SOLR. It reads the necessary AVRO outputs from the previous pipelines and creates AVRO files using the IndexRecord schema.
Command to use: ./la-pipelines index dr123 --cluster
This pipeline provides the integration with the sampling service. It retrieves the unique set of coordinates for a dataset (or for all datasets) and samples them against the environmental (raster) and contextual (polygon) layers stored in the spatial service.
It maintains a cache of these samples which is updated on each run.
Command to use: ./la-pipelines sample dr123 --cluster
The pipeline runs environmental outlier detection.
Command to use: ./la-pipelines jackknife --cluster
The pipeline runs duplicate detection.
Command to use: ./la-pipelines clustering --cluster
The pipeline creates the SOLR index.
Command to use: ./la-pipelines solr --cluster
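Putting the steps together, a full ingest of a single dataset on the cluster runs the pipelines in the order described above; the remaining steps, which take no dataset id, are then run before the SOLR index is rebuilt. A sketch, with dr123 as a placeholder dataset id:

```bash
#!/usr/bin/env bash
set -e
DR=dr123   # example data resource id

./la-pipelines dwca-avro  "$DR"            # DwCA -> verbatim.avro
./la-pipelines interpret  "$DR" --cluster  # run the interpretation transforms
./la-pipelines uuid       "$DR" --cluster  # mint / re-associate record UUIDs
./la-pipelines sds        "$DR" --cluster  # sensitive data service pipeline
./la-pipelines image-load "$DR" --cluster  # push new images to the image service
./la-pipelines image-sync "$DR" --cluster  # pull back image UUIDs
./la-pipelines index      "$DR" --cluster  # build IndexRecord AVRO
./la-pipelines sample     "$DR" --cluster  # sample spatial layers

# Cross-dataset steps, run before building the SOLR index
./la-pipelines jackknife  --cluster        # environmental outlier detection
./la-pipelines clustering --cluster        # duplicate detection
./la-pipelines solr       --cluster        # build the SOLR index
```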