This repository moved to OpenCoDE: https://gitlab.opencode.de/umwelt-info/metadaten

umwelt.info metadata index


This project is a prototype for a metadata index for the umwelt.info project. It aims for efficient operation by using the Rust programming language and storing the datasets and a search index directly in the file system to avoid dependencies on additional services like databases or search engines. It does not aim to be generic, configurable or programmable, especially where that would conflict with efficiency.

The system is implemented as three separate programs that access a common file system directory at $DATA_PATH.

  • The harvester periodically harvests/crawls/scrapes the sources defined in $DATA_PATH/harvester.toml, writing all datasets to $DATA_PATH/datasets with one directory per source and one file per dataset, and storing summary metrics in $DATA_PATH/metrics.

  • The indexer usually runs after the harvester and reads all datasets to produce a search index over their properties in $DATA_PATH/index using the Tantivy library (see the sketch after this list).

  • The server provides an HTTP-based API to query the search index and retrieve individual datasets. It also collects access statistics about each dataset in $DATA_PATH/stats. It is the only continuously running component and can be scaled out by exporting $DATA_PATH via a networked file system like NFS or SMB.
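
For orientation, the resulting layout of $DATA_PATH looks roughly like this (the source name matches the example configuration shown below; file names are illustrative):

data/
├── harvester.toml      configuration of the sources to harvest
├── datasets/
│   └── uba-gdi/        one directory per source
│       └── ...         one file per dataset
├── index/              search index produced by the indexer
├── metrics/            summary metrics written by the harvester
└── stats/              access statistics collected by the server

The indexing step itself follows the usual Tantivy pattern. The following is only a minimal sketch assuming a hypothetical two-field schema; the actual fields are derived from the dataset properties:

use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Define which dataset properties become searchable fields.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let description = schema_builder.add_text_field("description", TEXT);
    let schema = schema_builder.build();

    // The index lives directly in the file system, e.g. under $DATA_PATH/index.
    let index = Index::create_in_dir("data/index", schema)?;
    let mut writer = index.writer(50_000_000)?; // 50 MB indexing buffer

    // One document per harvested dataset.
    writer.add_document(doc!(
        title => "Example dataset",
        description => "An illustrative description.",
    ))?;
    writer.commit()?;

    Ok(())
}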

Development and operation

The code is organised as a single library with three entry points for the above-mentioned programs. A fourth binary named xtask is used to automate the development workflow.

The CI pipeline checks formatting via Rustfmt, ensures a warning-free build using Clippy, runs the unit and integration tests, and builds and collects optimized binaries.

The system is deployed using a set of sandboxed systemd units, both for periodically running the harvester and indexer and for continuously running the server.
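
As an illustration only, a hardened oneshot unit for the harvester could look like the following; the unit name, binary path and directories are assumptions, not the actual deployment:

# umwelt-info-harvester.service (hypothetical)
[Unit]
Description=umwelt.info harvester

[Service]
Type=oneshot
ExecStart=/usr/local/bin/harvester
Environment=DATA_PATH=/var/lib/umwelt-info
# Sandboxing: ephemeral user, writable state directory, read-only rest of the system.
DynamicUser=yes
StateDirectory=umwelt-info
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes

A matching timer unit (e.g. with OnCalendar=daily) would trigger the periodic harvester and indexer runs, while the server runs as an ordinary long-running service.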

How to get started

To format, lint and test the code, run

> cargo xtask

deployment/harvester.toml tracks all relevant sources. Based on that, a configuration like

[[sources]]
name = "uba-gdi"
type = "csw"
url = "https://gis.uba.de/smartfinder-csw/api/"

should be created at data/harvester.toml, so that the harvester and indexer can be invoked by

> cargo xtask harvester
> cargo xtask indexer

Finally, executing

> cargo xtask server

will make the server listen on 127.0.0.1:8081.

Replaying responses

Iteratively developing harvesters can be time-consuming and place undue load on the source due to large responses being transmitted over the network. To mitigate this issue, each request must be identified using a key

let response = client.make_request(&format!("{}-{}", source.name, record_number), |client| ...).await?;

under which its response is stored on disk. Once development has reached a state where the set of requests is stable, their responses can be replayed by setting $REPLAY_RESPONSES, e.g.

> REPLAY_RESPONSES= cargo xtask harvester
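
Conceptually, this boils down to a response cache keyed per request. The following stand-alone sketch illustrates the idea; the helper and its signature are hypothetical and not the project's actual client API:

use std::future::Future;
use std::path::Path;
use std::{env, fs, io};

// Hypothetical helper: replay a stored response if $REPLAY_RESPONSES is set,
// otherwise perform the request and record its response under the given key.
async fn cached_request<F, Fut>(dir: &Path, key: &str, fetch: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> Fut,
    Fut: Future<Output = io::Result<Vec<u8>>>,
{
    let path = dir.join(key);

    // Replay mode: serve the response recorded during an earlier run.
    if env::var_os("REPLAY_RESPONSES").is_some() {
        return fs::read(&path);
    }

    // Normal mode: perform the request and store the response for later replay.
    let body = fetch().await?;
    fs::write(&path, &body)?;
    Ok(body)
}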

Content negotiation

The HTTP routes /search and /dataset support content negotiation insofar as they yield either rendered HTML pages or the underlying JSON data, depending on the Accept header transmitted by the HTTP client.
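
For example, assuming the server from above is running, the same route can be queried in both flavours with curl; the query parameter name used here is an assumption for illustration:

> curl -H 'Accept: application/json' 'http://127.0.0.1:8081/search?query=wasser'
> curl -H 'Accept: text/html' 'http://127.0.0.1:8081/search?query=wasser'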