This module adds functionality required by the Living Atlases to facilitate the replacement of biocache-store for data ingress.

For details on the GBIF implementation, see the pipelines github repository. This project is focused on extensions to that architecture to support use by the Living Atlases.
Above is a representation of the data flow, from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.

Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.

GBIF's pipelines already supports a number of core transforms for handling biodiversity occurrence data. The Living Atlas pipelines extensions make use of these transforms "as-is" where possible and extend existing transforms where required.
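To make this concrete, here is a hedged sketch of what the per-transform outputs can look like on disk, assuming the `/data/pipelines-data` convention used in the setup steps below (the dataset ID and exact layout are illustrative and may differ between versions):

```bash
# Illustrative only: each transform writes its own AVRO output under the
# interpreted directory for a dataset (paths are assumptions, not verified)
ls /data/pipelines-data/dr893/1/interpreted/
# basic/  location/  taxonomy/  temporal/  ...
```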
For information on how the architectures of the legacy biocache-store system and pipelines differ, see this page.

The pipelines work has necessitated some minor API additions and changes to the following components:
A version 3.x of biocache-service is in development (see the pipelines branch). This will not use Cassandra for the storage of occurrence records, but Cassandra is still required for the storage of user assertions and query identifiers (used to store large query parameters such as WKT strings).
A simple Dropwizard wrapper around the ala-name-matching library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
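As a quick smoke test once the container is up, you can query the service over HTTP. A minimal sketch, assuming a default port of 9179 and an `/api/search` endpoint (both are assumptions; check the service's configuration and docs for the actual values):

```bash
# Port and endpoint are assumptions; verify against the name-matching service docs
curl "http://localhost:9179/api/search?q=Acacia"
```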
A simple Dropwizard wrapper around the ala-sensitive-data-service library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
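Similarly, a hedged sketch for checking that the sensitive-data container is responding, assuming a default port of 9189 (an assumption; confirm the real port mapping in the docker-compose files under pipelines/src/main/docker):

```bash
# Port is an assumption; this only checks that the service answers HTTP at all
curl -s "http://localhost:9189" -o /dev/null -w "%{http_code}\n"
```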
Ansible scripts have been developed and are available here. Below are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into a SOLR index.
- Java 8 - this is mandatory (see the GBIF pipelines documentation)
- Maven needs to run with OpenJDK 1.8: run `nano ~/.mavenrc` and add `export JAVA_HOME=[JDK 1.8 PATH]`
- Docker Desktop
- The Lombok plugin for IntelliJ needs to be installed for slf4j annotation support
- Download shape files from here and expand into the `/data/pipelines-shp` directory
- Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
- Create the following directory: `/data/pipelines-data`
- Build with maven:
```
mvn clean package
```
- Start required docker containers using:
```
docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d
docker-compose -f pipelines/src/main/docker/solr8.yml up -d
```
- `cd scripts`
- To convert DwCA to AVRO, run:
```
./la-pipelines dwca-avro dr893
```
- To interpret, run:
```
./la-pipelines interpret dr893 --embedded
```
- To mint UUIDs, run:
```
./la-pipelines uuid dr893 --embedded
```
- (Optional) To sample, run:
```
./la-pipelines sample dr893 --embedded
```
- To set up SOLR:
    - Run `cd ../solr/scripts` and then run `./update-solr-config.sh`
    - Run `cd ../../scripts`
- To index, run:
```
./la-pipelines index dr893 --embedded
```
- Run `./la-pipelines -h` for help and more steps:
```
LA-Pipelines data ingress utility.

The la-pipelines can be executed to run all the ingress steps or only a few of them:

Pipeline ingress steps:

   ┌───── do-all ───────────────────────────────────────────────┐
   │                                                             │
   dwca-avro --> interpret --> validate --> uuid --> image-sync ...
     --> image-load --> sds --> index --> sample --> jackknife --> solr
(...)
```
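Since the help output groups the ingress steps under do-all, the whole sequence can presumably be run in one command. A sketch based on the step commands above (unverified; check `./la-pipelines -h` on your build):

```bash
# Run every ingress step for the test dataset; assumes do-all is a valid
# subcommand, as the help diagram above suggests
./la-pipelines do-all dr893 --embedded
```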
Tests follow the GBIF/failsafe/surefire convention. All integration tests have a suffix of "IT". All JUnit tests are run with `mvn package`, and integration tests are run with `mvn verify`. `mvn verify` will start the docker containers in the `pre-integration-test` phase, and shut them down in the `post-integration-test` phase.
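During development it can be handy to run a single integration test rather than the full suite; the maven-failsafe-plugin supports selecting tests by class name. The class name below is purely illustrative:

```bash
# -Dit.test is a standard maven-failsafe-plugin property; the test class
# name here is hypothetical - substitute one of the project's *IT classes
mvn verify -Dit.test=NameMatchingTransformIT
```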
To start the required containers for local development purposes, install Docker Desktop and run the following:
```
docker-compose -f pipelines/src/main/docker/ala-nameservice.yml up -d
docker-compose -f pipelines/src/main/docker/solr8.yml up -d
```
To shut down, run the following:
```
docker-compose -f pipelines/src/main/docker/ala-nameservice.yml kill
docker-compose -f pipelines/src/main/docker/solr8.yml kill
```
Note: the docker containers that are run as part of the maven build use different ports to those specified in the docker compose files in `pipelines/src/main/docker`. This was a deliberate choice to allow developers to run integration tests in IDEs while developing pipelines, and then run maven builds on the same machine without port clashes.
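To see which ports the running containers are actually bound to on your machine, `docker ps` shows the live host-to-container mappings:

```bash
# List container names alongside their host->container port mappings
docker ps --format "table {{.Names}}\t{{.Ports}}"
```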
For code style and tooling, see the recommendations on the GBIF pipelines project. In particular, note the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
`avro-tools` is recommended to aid development, providing quick views of AVRO outputs. This can be installed on Macs with Homebrew like so:
```
brew install avro-tools
```
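For example, to dump an interpreted AVRO file as JSON, `tojson` is a standard avro-tools subcommand (the file path below is an assumption based on the directory layout sketched earlier):

```bash
# tojson prints each AVRO record as a line of JSON; the path is illustrative
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/taxonomy/interpret-0.avro | head
```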