A Large Semantic Knowledge Graph from Wikipedia Categories and Listings
For information about the general idea, extraction statistics, and resources of CaLiGraph, visit the CaLiGraph website.
To run the extraction, you need:
- At least 300 GB of RAM, as most of DBpedia is loaded into memory to speed up the extraction
- At least one GPU to run the transformer models
- A stable internet connection during the first execution of an extraction, as the required DBpedia files are downloaded automatically
- In the project root, create a conda environment with:

```
conda env create -f environment.yaml
```
- Activate the environment with:

```
conda activate caligraph
```
- Install dependencies with:

```
poetry install
```
- Install PyTorch for your specific CUDA version with the command below (a sanity check for the finished setup is sketched after this list):

```
poetry run poe autoinstall-torch-cuda
```
- If you have not downloaded them already, fetch the latest corpora for spaCy and NLTK (run in a terminal):

```
# download the most recent corpus of spaCy
python -m spacy download en_core_web_lg
# download wordnet & words corpora of nltk
python -c 'import nltk; nltk.download("wordnet"); nltk.download("words"); nltk.download("omw-1.4")'
```
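After completing these steps, a quick sanity check can confirm that PyTorch sees a GPU and that the corpora are usable. A minimal sketch, assuming all steps above succeeded (this script is not part of the repository):

```python
# Sanity check for the setup above (a sketch, not part of the repository)
import torch
import spacy
from nltk.corpus import wordnet, words

# PyTorch should see at least one CUDA device
print(f"CUDA available: {torch.cuda.is_available()} ({torch.cuda.device_count()} device(s))")

# The spaCy model and the NLTK corpora should load without errors
nlp = spacy.load("en_core_web_lg")
doc = nlp("CaLiGraph is built from Wikipedia categories and listings.")
print([token.pos_ for token in doc][:5])
print(wordnet.synsets("graph")[0].name())   # e.g. 'graph.n.01'
print("graph" in set(words.words()))
```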
You can configure the application-specific parameters as well as logging- and file-related parameters in `config.yaml`.
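To see which parameters are currently defined, you can inspect the file directly, for instance with PyYAML (a minimal sketch; the available keys are whatever `config.yaml` actually contains):

```python
import yaml  # PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# List the top-level sections and their parameters
for section, params in config.items():
    print(section, "->", params)
```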
Make sure that the virtual environment `caligraph` is activated. Then you can run the extraction in the project root folder with:

```
python .
```

All the required resources, like DBpedia files, will be downloaded automatically during execution.
CaLiGraph is serialized in N-Triples format. The resulting files are placed in the `results` folder.
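The output can be loaded with any N-Triples-capable tool, for example with rdflib. A minimal sketch (the file name below is illustrative; use an actual `.nt` file from the `results` folder):

```python
from rdflib import Graph

g = Graph()
# Illustrative file name -- pick an actual .nt file from the results folder
g.parse("results/caligraph-ontology.nt", format="nt")

print(f"Loaded {len(g)} triples")
for subj, pred, obj in list(g)[:3]:
    print(subj, pred, obj)
```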
Use the script `evaluate_mention_detection.py` to evaluate a specific configuration for subject entity detection. Make sure that there is a free GPU on your system and that the environment `caligraph` is activated. Then you can run an evaluation as follows:

```
python evaluate_mention_detection.py <GPU-ID> <HUGGINGFACE-MODEL> <OPTIONAL-CONFIG-PARAMS>
```
Have a look at the evaluation script for a description of the optional configuration parameters.
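For example, `python evaluate_mention_detection.py 0 bert-base-cased` would run the evaluation on GPU 0 with the `bert-base-cased` model; the model identifier here is only an illustrative assumption, and any suitable Hugging Face model name should work.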
In the project root, run tests with `pytest`.