rres-endpoints, writing walkthrough doc [ci skip]

Rothamsted · Apr 24, 2024 · 62227d1
1 parent 6ee94a0
commit 62227d1

Showing 3 changed files with 64 additions and 0 deletions.
diff --git a/rres-endpoints/config/datasets/cereals-dummy-1-cfg.sh b/rres-endpoints/config/datasets/cereals-dummy-1-cfg.sh
@@ -0,0 +1,23 @@
+# Used as the example dataset in the walkthrough at
+# /pipeline-walkthrough.md
+
+# /home/data/knetminer/etl-test/cereals-dummy/cereals-dummy-1.oxl
+
+# Unfortunately, the directory naming isn't consistent, so we can't use KETL_DATASET_ID here
+oxl_home="$KNET_HOME/etl-test/poaceae/$KETL_DATASET_VERSION"
+
+export KETL_SRC_OXL="$oxl_home/generic/knowledge-network-free.oxl"
+
+export KETL_OUT="$KETL_OUT_HOME/$KETL_DATASET_ID/$KETL_DATASET_VERSION"
+
+## Neo 
+#
+export KETL_HAS_NEO4J=true
+export KETL_NEO_VERSION='5.16.0'
+export NEO4J_HOME="$KNET_SOFTWARE/neo4j-community-$KETL_NEO_VERSION-etl"
+
+## Knet Initialiser
+#
+# The name within the code base, which identifies the config dir to be
+# used for the KnetMiner initialiser
+export KNET_INIT_DATASET_ID="poaceae-test"
diff --git a/rres-endpoints/config/datasets/poaceae-sample-1-cfg.sh b/rres-endpoints/config/datasets/poaceae-sample-1-cfg.sh
@@ -12,6 +12,13 @@ export NEO4J_HOME="/tmp/neo4j-community-$KETL_NEO_VERSION"
 # Knet Initialiser
 export KNET_INIT_DATASET_ID="poaceae-test"
 
+# This is usually not done for a real dataset, since
+# the OXL comes from another workflow (based on Ondex Mini) and
+# is already in place at ${KETL_SRC_OXL} (see, e.g., poaceae-free-57-cfg.oxl).
+#
+# In this dummy test, the dummy OXL is downloaded from the place
+# where we make it available to all software components that need it.
+#
 if [[ ! -e "${KETL_SRC_OXL}" ]]; then
 
 	echo -e "\n\tDownloading $KETL_DATASET_ID.oxl"

diff --git a/rres-endpoints/pipeline-walkthrough.md b/rres-endpoints/pipeline-walkthrough.md
@@ -0,0 +1,34 @@
+## Configuration
+
+As explained in the main [README](README.md), the pipeline can be configured to work with a given dataset ID and a given dataset version (eg, cereals 57), and a given dataset+version can work with a given environment.
+
+The configuration is hierarchical. Defaults are set by [config/default-cfg.sh](config/default-cfg.sh), which invokes `config/environments/$envName-env.sh` (`$envName` is a command line parameter), and then invokes `config/datasets/$datasetId-$version-cfg.sh` (`$datasetId` and `$version` are command line parameters too). This means that environment-specific settings can override or extend the defaults (extending means reusing a default value in a new definition), and dataset-specific settings can then override or extend both the defaults and the environment settings.
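+
+As a rough illustration of this layering, the sketch below builds three throwaway config files in a temporary directory and sources them in the same order. It is not the real pipeline code: the `DEMO_*` variables, the `cfg_dir`, `dataset_id`, `version` and `env_name` names, and the file contents are all made up for the example.
+
+```bash
+# Hypothetical three-layer config in a temp dir; all DEMO_* names are invented.
+cfg_dir="$(mktemp -d)"
+mkdir -p "$cfg_dir/environments" "$cfg_dir/datasets"
+
+# Layer 2: environment settings, extending a default (DEMO_HOME)
+cat > "$cfg_dir/environments/rres-env.sh" <<'EOF'
+export DEMO_OUT_HOME="$DEMO_HOME/pub/endpoints"
+EOF
+
+# Layer 3: dataset settings, overriding a default and reusing an environment value
+cat > "$cfg_dir/datasets/cereals-dummy-1-cfg.sh" <<'EOF'
+export DEMO_HAS_NEO4J=true
+export DEMO_OUT="$DEMO_OUT_HOME/$dataset_id/$version"
+EOF
+
+# Layer 1: defaults, which then source the two specific layers in order
+cat > "$cfg_dir/default-cfg.sh" <<'EOF'
+export DEMO_HOME="/tmp/demo"
+export DEMO_HAS_NEO4J=false
+. "$cfg_dir/environments/$env_name-env.sh"
+. "$cfg_dir/datasets/$dataset_id-$version-cfg.sh"
+EOF
+
+# Stand-ins for the command line parameters
+dataset_id='cereals-dummy' version=1 env_name='rres'
+. "$cfg_dir/default-cfg.sh"
+
+echo "DEMO_HAS_NEO4J=$DEMO_HAS_NEO4J"  # the dataset layer wins over the default
+echo "DEMO_OUT=$DEMO_OUT"              # built from default + environment + dataset values
+rm -rf "$cfg_dir"
+```
+
+Because each layer is sourced into the same shell, a later file sees everything defined before it, which is what makes both overriding and extending possible.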
+
+Most pipeline scripts invoke `default-cfg.sh` as a first step, using the Bash [source command](https://www.baeldung.com/linux/source-include-files). This script also has a special behaviour: it checks the command line arguments, which must be `datasetId`, `datasetVersion` and an optional `environmentId`. These parameters are used to find the specific config scripts, as described above.
+
+So, for instance, the dataset building pipeline can be launched this way:
+
+```bash
+# This is where we have the pipeline scripts deployed, we won't repeat this in the examples below
+cd /home/data/knetminer/software/knetminer-backend/rres-endpoints
+git pull # Optional, this is a mirror of the knetminer-backend repo and you might want to update it
+./build-endpoint.sh 'cereals-free' 1 rres # quote datasetId if it contains punctuation
+```
+All the scripts that need it call `default-cfg.sh`, which checks the CLI arguments and invokes the specific config scripts, as described above.
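+
+The kind of argument check described here could look like the following minimal sketch. It is an assumption about the general shape only, not the actual code in `default-cfg.sh`: the function name `parse_cfg_args` and the variables it sets are hypothetical.
+
+```bash
+# Hypothetical helper, not taken from the real default-cfg.sh
+parse_cfg_args() {
+  # Two mandatory arguments, plus an optional third one
+  if [[ $# -lt 2 || $# -gt 3 ]]; then
+    echo "Usage: <script> <datasetId> <datasetVersion> [environmentId]" >&2
+    return 1
+  fi
+  dataset_id="$1"
+  dataset_version="$2"
+  env_id="${3:-}"  # left empty when no environment is given
+}
+
+# Usage: derive the dataset config file name from the arguments
+parse_cfg_args 'cereals-dummy' 1 rres \
+  && echo "config/datasets/$dataset_id-$dataset_version-cfg.sh"
+```
+
+Returning a non-zero status rather than calling `exit` matters for a sourced script, since `exit` would also terminate the calling shell.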
+
+**Tip**: a quick way to see the same variables that the pipeline scripts see is:
+
+```bash
+
+```
+
+### Dataset config
+As explained in the README, for a new dataset, you should define a dataset+version specific config and place it in `config/datasets/$datasetId-$version-cfg.sh`. In our walkthrough example, this is [config/datasets/cereals-dummy-1-cfg.sh](config/datasets/cereals-dummy-1-cfg.sh).
+
+In this file, `KETL_OUT` specifies that all the pipeline output files are rooted at `/home/data/knetminer/pub/endpoints/cereals-dummy/1/`. Its value depends on the earlier definition of `KETL_OUT_HOME`, which in turn depends on `KNET_HOME`. Both variables are defined in the environment config, at [config/environments/rres-env.sh](config/environments/rres-env.sh).
+
+### Environment configuration
+For this walkthrough, we'll use the RRes environment: its shared directories, our deployments on them, and SLURM, the cluster framework used to send batch jobs to high-performance computing hosts and run them in parallel (more below).
+
+As per the main README, the config for this environment is at [config/environments/rres-env.sh](config/environments/rres-env.sh). As said above, this defines the pipeline working directory and the path of the input OXL. It also has pointers to software tools such as the OXL-to-RDF exporter or the Neo4j server that the pipeline uses to prepare a Neo dump from the OXL (more below). These tools are pre-installed before running the pipeline.
+