rres-endpoints, writing walkthrough doc [ci skip]
marco-brandizi committed Apr 24, 2024
1 parent 6ee94a0 commit 62227d1
Showing 3 changed files with 64 additions and 0 deletions.
23 changes: 23 additions & 0 deletions rres-endpoints/config/datasets/cereals-dummy-1-cfg.sh
@@ -0,0 +1,23 @@
# Used as an example for the walkthrough at
# /pipeline-walkthrough.md

# /home/data/knetminer/etl-test/cereals-dummy/cereals-dummy-1.oxl

# Unfortunately, the naming isn't consistent, so we can't use KETL_DATASET_ID here
oxl_home="$KNET_HOME/etl-test/poaceae/$KETL_DATASET_VERSION"

export KETL_SRC_OXL="$oxl_home/generic/knowledge-network-free.oxl"

export KETL_OUT="$KETL_OUT_HOME/$KETL_DATASET_ID/$KETL_DATASET_VERSION"

## Neo 
#
export KETL_HAS_NEO4J=true
export KETL_NEO_VERSION='5.16.0'
export NEO4J_HOME="$KNET_SOFTWARE/neo4j-community-$KETL_NEO_VERSION-etl"

## Knet Initialiser
#
# The name within the code base, which identifies the config dir to be
# used for the KnetMiner initialiser
export KNET_INIT_DATASET_ID="poaceae-test"
7 changes: 7 additions & 0 deletions rres-endpoints/config/datasets/poaceae-sample-1-cfg.sh
@@ -12,6 +12,13 @@ export NEO4J_HOME="/tmp/neo4j-community-$KETL_NEO_VERSION"
# Knet Initialiser
export KNET_INIT_DATASET_ID="poaceae-test"

# This is usually not done for a real dataset, since
# the OXL comes from another workflow (based on Ondex Mini) and
# it's already in place at ${KETL_SRC_OXL} (see eg, poaceae-free-57-cfg.sh).
#
# In this dummy test, the dummy OXL is downloaded from the place
# where we make it available to all the software components that need it.
#
if [[ ! -e "${KETL_SRC_OXL}" ]]; then

echo -e "\n\tDownloading $KETL_DATASET_ID.oxl"
34 changes: 34 additions & 0 deletions rres-endpoints/pipeline-walkthrough.md
@@ -0,0 +1,34 @@
## Configuration

As explained in the main [README](README.md), the pipeline is configured for a given dataset ID and dataset version (eg, cereals 57), and a given dataset+version combination can be run in a given environment.

The configuration is hierarchical. Defaults are set by [config/default-cfg.sh](config/default-cfg.sh), which first invokes `config/environments/$envName-env.sh` (`$envName` is a command line parameter) and then invokes `config/datasets/$datasetId-$version-cfg.sh` (`$datasetId` and `$version` are command line parameters too). This means that environment-specific settings can override or extend the defaults (by referring to them), and dataset-specific settings can in turn override or extend either the defaults or the environment settings.

Most pipeline scripts invoke (via the Bash [source command](https://www.baeldung.com/linux/source-include-files)) `default-cfg.sh` as a first step. This script also has a special behaviour: it checks the three command line arguments, which must be `datasetId`, `datasetVersion` and an optional `environmentId`. These parameters are used to find the specific config scripts, as described above.
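
For orientation, here is a minimal sketch of what this argument checking and hierarchical sourcing could look like (an illustration of the mechanism only, not the actual contents of `default-cfg.sh`):

```bash
# Illustrative sketch of default-cfg.sh's mechanism, NOT its actual code.
# Expected arguments: <datasetId> <datasetVersion> [environmentId]
export KETL_DATASET_ID="$1"
export KETL_DATASET_VERSION="$2"
envName="$3"

# ... defaults are set here ...

# Environment-specific settings can override/extend the defaults
[[ -n "$envName" ]] && . "config/environments/$envName-env.sh"

# Dataset-specific settings can override/extend both
. "config/datasets/$KETL_DATASET_ID-$KETL_DATASET_VERSION-cfg.sh"
```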

So, for instance, the dataset building pipeline can be launched this way:

```bash
# This is where we have the pipeline scripts deployed; we won't repeat this in the examples below
cd /home/data/knetminer/software/knetminer-backend/rres-endpoints
git pull # Optional, this is a mirror of the knetminer-backend repo and you might want to update it
./build-endpoint.sh 'cereals-free' 1 rres # quote datasetId if it contains punctuation
```
All the scripts that need it will call `default-cfg.sh`, which will check the CLI arguments and invoke the specific config scripts as described above.

**Tip**: a quick way to see the same variables that the pipeline scripts see is to source the default config yourself with the same arguments and then inspect the environment (a minimal sketch; the `KETL_`/`KNET_` prefixes to grep for are an assumption based on the config files above):

```bash
# Source the hierarchical config exactly as the pipeline scripts do
. config/default-cfg.sh 'cereals-dummy' 1 rres
# Then list the resulting variables (KETL_/KNET_ prefixes assumed)
printenv | grep -E '^(KETL|KNET)' | sort
```

### Dataset config
As explained in the README, for a new dataset you should define a dataset+version-specific config and place it in `config/datasets/$datasetId-$version-cfg.sh`. In our walkthrough example, this is [config/datasets/cereals-dummy-1-cfg.sh](config/datasets/cereals-dummy-1-cfg.sh).

In this file, `KETL_OUT` defines the root under which all the pipeline output files are placed: `/home/data/knetminer/pub/endpoints/cereals-dummy/1/`. Its value depends on the earlier definition of `KETL_OUT_HOME`, which in turn depends on `KNET_HOME`. Both of these variables are defined in the environment config, at [config/environments/rres-env.sh](config/environments/rres-env.sh).
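
Concretely, the layering works roughly like this (the dataset line is taken from the config above; the environment values are inferred from the paths in this walkthrough, not copied from `rres-env.sh`):

```bash
# config/environments/rres-env.sh -- sketch, values inferred from the paths above
export KNET_HOME=/home/data/knetminer
export KETL_OUT_HOME="$KNET_HOME/pub/endpoints"

# config/datasets/cereals-dummy-1-cfg.sh -- as in this commit
export KETL_OUT="$KETL_OUT_HOME/$KETL_DATASET_ID/$KETL_DATASET_VERSION"
# => /home/data/knetminer/pub/endpoints/cereals-dummy/1
```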

### Environment configuration
For this walkthrough, we'll use the RRes environment: its shared directories, our deployments on them, and SLURM, the cluster framework used to send batch jobs to high-performance computing hosts in parallel (more below).

As per the main README, the config for this environment is at [config/environments/rres-env.sh](config/environments/rres-env.sh). As noted above, this defines the pipeline working directory and the path of the input OXL. It also has pointers to software tools such as the OXL-to-RDF exporter or the Neo4j server that the pipeline uses to prepare a Neo dump from the OXL (more below). These tools are pre-installed before running the pipeline.
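
For instance, a heavy pipeline step could be sent to the cluster with something like the following (a hypothetical `sbatch` invocation; the job name, resources and wrapped command are illustrative assumptions, not necessarily how the pipeline submits its jobs):

```bash
# Hypothetical SLURM submission; job name, resources and command are assumptions.
sbatch --job-name=ketl-build --cpus-per-task=4 --mem=16G \
  --wrap "./build-endpoint.sh 'cereals-dummy' 1 rres"
```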
