This repository moved to OpenCoDE: https://gitlab.opencode.de/umwelt-info/metadaten
This project is a prototype for a metadata index for the umwelt.info project. It aims for efficient operation by using the Rust programming language and storing the datasets and a search index directly in the file system to avoid dependencies on additional services like databases or search engines. It does not aim to be generic, configurable or programmable, especially where that would conflict with efficiency.
The system is implemented as three separate programs that access a common file system directory at $DATA_PATH:

- The harvester periodically harvests/crawls/scrapes the sources defined in $DATA_PATH/harvester.toml, writes all datasets to $DATA_PATH/datasets with one directory per source and one file per dataset, and stores summary metrics in $DATA_PATH/metrics.
- The indexer usually runs after the harvester and reads all datasets to produce a search index over their properties in $DATA_PATH/index using the Tantivy library (see the sketch after this list).
- The server provides an HTTP-based API to query the search index and retrieve individual datasets. It also collects access statistics about each dataset in $DATA_PATH/stats. It is the only continuously running component and can be scaled out by exporting $DATA_PATH via a networked file system like NFS or SMB.
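To illustrate the indexing step, a minimal Tantivy round trip looks roughly like the following. The schema here is hypothetical with a single field; the actual index covers many more dataset properties.

use tantivy::{
    doc,
    schema::{Schema, STORED, TEXT},
    Index,
};

fn build_index() -> tantivy::Result<()> {
    // Hypothetical schema; the real one has more fields per dataset.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // The indexer writes the Tantivy index into $DATA_PATH/index.
    let index = Index::create_in_dir("data/index", schema)?;

    // 50 MB of heap for the indexing threads.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "Example dataset"))?;
    writer.commit()?;

    Ok(())
}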
The code is organised as a single library with three entry points for the above-mentioned programs. A fourth binary named xtask is used to automate the development workflow.
The CI pipeline checks formatting via Rustfmt, ensures a warning-free build using Clippy, runs the unit and integration tests, and builds and collects optimized binaries.
The system is deployed using a set of sandboxed systemd units, both for periodically running the harvester and indexer and for continuously running the server.
To format, lint and test the code, run
> cargo xtask
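The xtask binary follows the common cargo-xtask pattern of plain Rust shelling out to cargo. A minimal sketch of such a dispatcher, with task names mirroring the commands shown in this README (the repository's actual tasks may differ):

use std::{env, process::Command};

fn main() {
    // Dispatch on the first argument after `cargo xtask`.
    match env::args().nth(1).as_deref() {
        // Plain `cargo xtask`: format, lint and test.
        None => {
            run("cargo", &["fmt"]);
            run("cargo", &["clippy", "--all-targets", "--", "-D", "warnings"]);
            run("cargo", &["test"]);
        }
        Some("harvester") => run("cargo", &["run", "--bin", "harvester"]),
        Some("server") => run("cargo", &["run", "--bin", "server"]),
        Some(task) => panic!("unknown task: {task}"),
    }
}

fn run(program: &str, args: &[&str]) {
    let status = Command::new(program)
        .args(args)
        .status()
        .expect("failed to spawn command");
    assert!(status.success(), "{program} {args:?} failed");
}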
deployment/harvester.toml tracks all relevant sources. Based on that, a configuration like
[[sources]]
name = "uba-gdi"
type = "csw"
url = "https://gis.uba.de/smartfinder-csw/api/"
should be created at data/harvester.toml, so that the harvester and indexer can be invoked by
> cargo xtask harvester
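For illustration, such a configuration could be deserialized with serde and the toml crate along the following lines; the field set is a guess based on the example above, and the real configuration likely carries more fields per source.

use serde::Deserialize;

// Illustrative mirror of the [[sources]] layout shown above.
#[derive(Debug, Deserialize)]
struct Config {
    sources: Vec<Source>,
}

#[derive(Debug, Deserialize)]
struct Source {
    name: String,
    r#type: String,
    url: String,
}

fn parse(text: &str) -> Result<Config, toml::de::Error> {
    toml::from_str(text)
}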
Finally, executing
> cargo xtask server
will make the server listen on 127.0.0.1:8081.
Iteratively developing harvesters can be time-consuming and place undue load on the source due to large responses being transmitted over the network. To mitigate this issue, each request must be identified using a key
let response = client.make_request(&format!("{}-{}", source.name, record_number), |client| ...).await?;
under which its response is stored on disk. Once development has reached a state where the set of requests is stable, their responses can be replayed by setting $REPLAY_RESPONSES, e.g.
> REPLAY_RESPONSES= cargo xtask harvester
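A sketch of how such a keyed request cache could work is shown below. All names and signatures are illustrative, not the repository's actual API, and the closure argument from the snippet above is simplified away here.

use std::{env, path::PathBuf};

// Hypothetical cache-aware client, assuming tokio and anyhow as dependencies.
struct Client {
    cache_dir: PathBuf, // e.g. somewhere below $DATA_PATH
}

impl Client {
    async fn make_request<F, Fut>(&self, key: &str, send: F) -> anyhow::Result<String>
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = anyhow::Result<String>>,
    {
        let path = self.cache_dir.join(key);

        // With $REPLAY_RESPONSES set, serve the stored response
        // instead of touching the network.
        if env::var_os("REPLAY_RESPONSES").is_some() {
            return Ok(tokio::fs::read_to_string(&path).await?);
        }

        // Otherwise perform the request and persist the response under its key.
        let response = send().await?;
        tokio::fs::write(&path, &response).await?;
        Ok(response)
    }
}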
The HTTP routes /search and /dataset support content negotiation insofar as they yield either rendered HTML pages or the underlying JSON data, depending on the Accept header transmitted by the HTTP client.
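Assuming an axum-style server (the framework actually used by the repository is not stated here), such a handler could branch on the Accept header roughly as follows:

use axum::{
    http::{header, HeaderMap},
    response::{Html, IntoResponse, Response},
    Json,
};
use serde_json::json;

// Hypothetical handler sketch; the real /search handler queries the index.
async fn search(headers: HeaderMap) -> Response {
    let results = json!({ "datasets": [] }); // placeholder payload

    let wants_html = headers
        .get(header::ACCEPT)
        .and_then(|value| value.to_str().ok())
        .map_or(false, |accept| accept.contains("text/html"));

    if wants_html {
        // Browsers asking for text/html get a rendered page.
        Html(format!("<pre>{results}</pre>")).into_response()
    } else {
        // API clients get the underlying JSON data.
        Json(results).into_response()
    }
}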