CaLiGraph: A Large Semantic Knowledge Graph from Wikipedia Categories and Listings
For information about the general idea, extraction statistics, and resources of CaLiGraph, visit the CaLiGraph website.
- At least 300 GB of RAM, as most of DBpedia is loaded into memory to speed up the extraction
- At least one GPU to run the transformer models
- A stable internet connection during the first execution of an extraction, as the required DBpedia files are downloaded automatically
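To check whether a machine meets these requirements, a short script can help (a minimal sketch; it assumes psutil and PyTorch are installed, which the setup below covers for PyTorch):

```python
# Rough hardware check against the requirements above.
import psutil
import torch

ram_gb = psutil.virtual_memory().total / 1024 ** 3
print(f'Total RAM: {ram_gb:.0f} GB (at least 300 GB recommended)')
print(f'CUDA-capable GPUs available: {torch.cuda.device_count()}')
```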
- In the project root, create a conda environment with `conda env create -f environment.yaml`
- Activate the environment with `conda activate caligraph`
- Install dependencies with `poetry install`
- Install PyTorch for your specific CUDA version with `poetry run poe autoinstall-torch-cuda`
- If you have not downloaded them already, fetch the latest corpora for spaCy and nltk (run in a terminal):

```shell
# download the most recent corpus of spaCy
python -m spacy download en_core_web_lg
# download the wordnet & words corpora of nltk
python -c 'import nltk; nltk.download("wordnet"); nltk.download("words"); nltk.download("omw-1.4")'
```
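To confirm that the model and corpora are in place, you can run a quick sanity check like the following (a minimal sketch; the example sentence is arbitrary):

```python
# Sanity check: each call raises an error if the respective resource is missing.
import spacy
from nltk.corpus import wordnet

nlp = spacy.load('en_core_web_lg')  # raises OSError if the spaCy model is missing
doc = nlp('CaLiGraph is extracted from Wikipedia categories and listings.')
print([token.lemma_ for token in doc])

print(wordnet.synsets('graph')[0].definition())  # raises LookupError if wordnet is missing
```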
You can configure the application-specific parameters as well as the logging- and file-related parameters in `config.yaml`.
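If you want to see which sections the configuration provides before editing it, a few lines of Python suffice (a minimal sketch; it assumes PyYAML is available in the environment):

```python
# List the top-level sections of the configuration file.
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

for section in config:
    print(section)
```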
Make sure that the virtual environment `caligraph` is activated. Then you can run the extraction in the project root folder with `python .`
All the required resources, like DBpedia files, will be downloaded automatically during execution.
CaLiGraph is serialized in N-Triples format. The resulting files are placed in the `results` folder.
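To work with the output programmatically, the serialized files can be loaded with rdflib (a minimal sketch; the file name is hypothetical, and rdflib is assumed to be installed):

```python
# Load one of the serialized CaLiGraph files and inspect a few triples.
from rdflib import Graph

g = Graph()
# The file name below is a placeholder; use an actual .nt file from the results folder.
g.parse('results/example.nt', format='nt')
print(f'Loaded {len(g)} triples')

for s, p, o in list(g)[:5]:
    print(s, p, o)
```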
Use the script `evaluate_mention_detection.py` to evaluate a specific configuration for subject entity detection.
Make sure that there is a free GPU on your system and that the environment `caligraph` is activated. Then you can run an evaluation as follows:

```shell
python evaluate_mention_detection.py <GPU-ID> <HUGGINGFACE-MODEL> <OPTIONAL-CONFIG-PARAMS>
```

Have a look at the evaluation script for a description of the optional configuration parameters.
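For example, to run an evaluation on GPU 0 with a standard BERT model (the model name is only an illustration; any Hugging Face model identifier works):

```shell
python evaluate_mention_detection.py 0 bert-base-cased
```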
In the project root, run the tests with `pytest`.