This module adds functionality required by the Living Atlases to facilitate the replacement of biocache-store for data ingress.

For details on the GBIF implementation, see the pipelines github repository. This project is focused on extensions to that architecture to support use by the Living Atlases.
Above is a representation of the data flow, from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.

Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.

GBIF's pipelines already supports a number of core transforms for handling biodiversity occurrence data. The Living Atlas pipelines extensions make use of these transforms "as-is" where possible and extend existing transforms where required.
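To make this concrete, here is a hedged sketch of what the per-transform outputs can look like on disk, assuming the `/data/pipelines-data` convention used in the setup steps below (the dataset ID and exact layout are illustrative and may differ between versions):

```bash
# Illustrative only: each transform writes its own AVRO output under the
# interpreted directory for a dataset (paths are assumptions, not verified)
ls /data/pipelines-data/dr893/1/interpreted/
# basic/  location/  taxonomy/  temporal/  ...
```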
For information on how the architectures of the legacy biocache-store system and pipelines differ, see this page.

The pipelines work has necessitated some minor API additions and changes to the following components:
A version 3.x of biocache-service is in development (see the pipelines branch). This will not use Cassandra for the storage of occurrence records, but Cassandra is still required for the storage of user assertions and query identifiers (used to store large query parameters such as WKT strings).
A simple Dropwizard wrapper around the ala-name-matching library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
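As a quick smoke test once the container is up, you can query the service over HTTP. A minimal sketch, assuming a default port of 9179 and an `/api/search` endpoint (both are assumptions; check the service's configuration and docs for the actual values):

```bash
# Port and endpoint are assumptions; verify against the name-matching service docs
curl "http://localhost:9179/api/search?q=Acacia"
```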
A simple Dropwizard wrapper around the ala-sensitive-data-service library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
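Similarly, a hedged sketch for checking that the sensitive-data container is responding, assuming a default port of 9189 (an assumption; confirm the real port mapping in the docker-compose files under pipelines/src/main/docker):

```bash
# Port is an assumption; this only checks that the service answers HTTP at all
curl -s "http://localhost:9189" -o /dev/null -w "%{http_code}\n"
```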
Ansible scripts have been developed and are available here. Below are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into a SOLR index.
- Java 8 - this is mandatory (see the GBIF pipelines documentation)
- Maven needs to run with OpenJDK 1.8: run `nano ~/.mavenrc` and add `export JAVA_HOME=[JDK 1.8 PATH]`
- Docker Desktop
- The Lombok plugin for IntelliJ needs to be installed for slf4j annotation support
- Download shape files from here and expand into the `/data/pipelines-shp` directory
- Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
- Create the following directory: `/data/pipelines-data`
- Build with maven:
```
mvn clean package
```
- Start required docker containers using:
```
docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d
docker-compose -f pipelines/src/main/docker/solr8.yml up -d
```
- `cd scripts`
- To convert DwCA to AVRO, run:
```
./la-pipelines dwca-avro dr893
```
- To interpret, run:
```
./la-pipelines interpret dr893 --embedded
```
- To mint UUIDs, run:
```
./la-pipelines uuid dr893 --embedded
```
- (Optional) To sample, run:
```
./la-pipelines sample dr893 --embedded
```
- To set up SOLR:
    - Run `cd ../solr/scripts` and then run `./update-solr-config.sh`
    - Run `cd ../../scripts`
- To index, run:
```
./la-pipelines index dr893 --embedded
```
- Run `./la-pipelines -h` for help and more steps:
```
LA-Pipelines data ingress utility.

The la-pipelines can be executed to run all the ingress steps or only a few of them:

Pipeline ingress steps:

   ┌───── do-all ───────────────────────────────────────────────┐
   │                                                             │
   dwca-avro --> interpret --> validate --> uuid --> image-sync ...
     --> image-load --> sds --> index --> sample --> jackknife --> solr
(...)
```
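Since the help output groups the ingress steps under do-all, the whole sequence can presumably be run in one command. A sketch based on the step commands above (unverified; check `./la-pipelines -h` on your build):

```bash
# Run every ingress step for the test dataset; assumes do-all is a valid
# subcommand, as the help diagram above suggests
./la-pipelines do-all dr893 --embedded
```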
Tests follow the GBIF/failsafe/surefire convention. All integration tests have a suffix of "IT". All JUnit tests are run with `mvn package`, and integration tests are run with `mvn verify`. `mvn verify` will start the docker containers in the `pre-integration-test` phase, and shut them down in the `post-integration-test` phase.
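During development it can be handy to run a single integration test rather than the full suite; the maven-failsafe-plugin supports selecting tests by class name. The class name below is purely illustrative:

```bash
# -Dit.test is a standard maven-failsafe-plugin property; the test class
# name here is hypothetical - substitute one of the project's *IT classes
mvn verify -Dit.test=NameMatchingTransformIT
```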
To start the required containers for local development purposes, install Docker Desktop and run the following:
```
docker-compose -f pipelines/src/main/docker/ala-nameservice.yml up -d
docker-compose -f pipelines/src/main/docker/solr8.yml up -d
```
To shut down, run the following:
```
docker-compose -f pipelines/src/main/docker/ala-nameservice.yml kill
docker-compose -f pipelines/src/main/docker/solr8.yml kill
```
Note: the docker containers that are run as part of the maven build use different ports to those specified in the docker compose files in `pipelines/src/main/docker`. This was a deliberate choice to allow developers to run integration tests in IDEs while developing pipelines, and then run maven builds on the same machine without port clashes.
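To see which ports the running containers are actually bound to on your machine, `docker ps` shows the live host-to-container mappings:

```bash
# List container names alongside their host->container port mappings
docker ps --format "table {{.Names}}\t{{.Ports}}"
```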
For code style and tooling, see the recommendations on the GBIF pipelines project. In particular, note the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.
`avro-tools` is recommended to aid development, providing quick views of AVRO outputs. This can be installed on Macs with Homebrew like so:
```
brew install avro-tools
```
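For example, to dump an interpreted AVRO file as JSON, `tojson` is a standard avro-tools subcommand (the file path below is an assumption based on the directory layout sketched earlier):

```bash
# tojson prints each AVRO record as a line of JSON; the path is illustrative
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/taxonomy/interpret-0.avro | head
```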