Ascent: Advanced Semantics for Commonsense Knowledge Extraction


ASCENT is a pipeline for extracting and consolidating commonsense knowledge from the World Wide Web. It extracts facet-enriched assertions, for example, lawyer; represents; clients; [LOCATION] in courts, or elephant; uses; its trunk; [PURPOSE] to suck up water. A web interface to the ASCENT knowledge base for 10,000 popular concepts can be found at
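To make the assertion notation concrete, here is a minimal sketch of how a facet-enriched assertion could be held in memory. The `Assertion` class and its field names are hypothetical and chosen for illustration; they are not taken from the ASCENT code base or its output schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Assertion:
    """Hypothetical container: a triple plus optional (label, text) facets."""
    subject: str
    predicate: str
    obj: str
    facets: List[Tuple[str, str]] = field(default_factory=list)

    def render(self) -> str:
        # Mirror the "subject; predicate; object; [LABEL] text" notation.
        parts = [self.subject, self.predicate, self.obj]
        parts += [f"[{label}] {text}" for label, text in self.facets]
        return "; ".join(parts)


lawyer = Assertion("lawyer", "represents", "clients", [("LOCATION", "in courts")])
elephant = Assertion("elephant", "uses", "its trunk", [("PURPOSE", "to suck up water")])
print(lawyer.render())
print(elephant.render())
```

Keeping facets as a list of labeled pairs lets one triple carry several qualifiers without changing the core subject, predicate, and object.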


Setting up environment

You need Python 3.7+ to run the pipeline.

First, create and activate a virtual environment using your preferred tool, e.g., python3-venv:

python -m venv .env
source .env/bin/activate

Then, install required packages:

pip install -r requirements.txt

Next, download the following spaCy model:

python -m spacy download en_core_web_md

Then, download the wordnet corpus for the nltk package:

python -c 'import nltk; nltk.download("wordnet")'
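After these steps, a quick sanity check can confirm that the interpreter and packages are in place. The sketch below is not part of ASCENT; `check_environment` is a hypothetical helper that relies only on the standard library. It assumes the spaCy model is importable as the package `en_core_web_md`, which is how `spacy download` normally installs it. It does not verify that the WordNet corpus itself was downloaded.

```python
import sys
from importlib.util import find_spec

# Top-level modules the setup steps above are expected to provide.
REQUIRED_MODULES = ["spacy", "nltk", "en_core_web_md"]


def check_environment(min_version=(3, 7), modules=REQUIRED_MODULES):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d+ is required" % min_version)
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if find_spec(name) is None:
            problems.append("module not importable: " + name)
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, the Python version and the listed modules were found in the active environment.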

RoBERTa models

Download our pretrained models for triple clustering and facet type labeling from then extract them to the project's root folder.

Bing API Key

Edit the file config.ini and provide your Bing API key and Bing Custom Search configuration under the section [bing_search]. Documentation for the Bing Custom Search API:
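Before starting a long run, it can help to confirm that the two required Bing fields are actually filled in. The following sketch uses Python's standard `configparser`; the helper `missing_bing_fields` is hypothetical, and the inline sample stands in for the real config.ini (the value `my-config-id` is a placeholder).

```python
import configparser

REQUIRED_BING_FIELDS = ("subscription_key", "custom_config")


def missing_bing_fields(config):
    """Return the required [bing_search] fields that are absent or blank."""
    if not config.has_section("bing_search"):
        return list(REQUIRED_BING_FIELDS)
    section = config["bing_search"]
    return [
        name
        for name in REQUIRED_BING_FIELDS
        if not section.get(name, fallback="").strip()
    ]


# Inline sample used for illustration instead of the real config.ini.
sample = configparser.ConfigParser()
sample.read_string("""
[bing_search]
subscription_key =
custom_config = my-config-id
""")
print(missing_bing_fields(sample))
```

To check the real file, replace `read_string(...)` with `sample.read("config.ini")`.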


To run the ASCENT pipeline, navigate to the src/ folder and execute the script:

cd src/
python --config ../config.ini

You will be prompted to enter one or more subjects, which should be WordNet concepts. You can provide a single subject:

Enter subjects: lion.n.01

or a list of comma-separated subjects:

Enter subjects: lion.n.01,lynx.n.02,elephant.n.01

or the path to a file containing one subject per line:

Enter subjects: /path/to/your/subjects.txt
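The three input forms can be handled by one small function. This is a sketch of one plausible interpretation, not the pipeline's actual parsing code; `parse_subjects` is a hypothetical name, and the rule "treat the input as a file if such a file exists" is an assumption.

```python
import os


def parse_subjects(raw):
    """Interpret prompt input as a file path or a comma-separated list."""
    raw = raw.strip()
    if os.path.isfile(raw):
        # One subject per line, as in the file-based form.
        with open(raw, encoding="utf-8") as handle:
            candidates = handle.read().splitlines()
    else:
        # A single subject is just a one-element comma-separated list.
        candidates = raw.split(",")
    # Drop surrounding whitespace and empty entries.
    return [item.strip() for item in candidates if item.strip()]


print(parse_subjects("lion.n.01"))
print(parse_subjects("lion.n.01,lynx.n.02,elephant.n.01"))
```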

Then, enter indices of the modules you want to execute:

[0] Bing Search
[1] Crawl articles
[2] Filter irrelevant articles
[3] Extract knowledge
[4] Cluster similar triples
[5] Label facets
[6] Group similar facets

For example, to run the complete pipeline:

From module: 0
  To module: 6
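The "from" and "to" indices select a contiguous, inclusive range of modules. A minimal sketch of that selection follows; `select_modules` is a hypothetical helper, and the module names are copied from the menu above.

```python
MODULES = [
    "Bing Search",
    "Crawl articles",
    "Filter irrelevant articles",
    "Extract knowledge",
    "Cluster similar triples",
    "Label facets",
    "Group similar facets",
]


def select_modules(first, last):
    """Return the modules with indices first..last, both inclusive."""
    if not 0 <= first <= last < len(MODULES):
        raise ValueError("expected 0 <= first <= last <= %d" % (len(MODULES) - 1))
    return MODULES[first:last + 1]


print(select_modules(0, 6))  # the complete pipeline
print(select_modules(3, 4))  # only extraction and triple clustering
```

Running a partial range presumably requires that the intermediate results of the preceding modules already exist in the output folder.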

Final results will be written to output/kb/<subject>/final.json. Intermediate results of every module can be found in the output folder as well.
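The final file can be loaded with the standard `json` module. The sketch below only builds the path and parses the file; it makes no assumption about the JSON schema, which is not described here. The helper names are hypothetical, and whether `<subject>` is the full WordNet identifier (such as `lion.n.01`) is an assumption.

```python
import json
from pathlib import Path


def final_result_path(subject, output_dir="output"):
    """Build output/kb/<subject>/final.json relative to the output folder."""
    return Path(output_dir) / "kb" / subject / "final.json"


def load_final_result(subject, output_dir="output"):
    """Parse the final JSON result for one subject."""
    path = final_result_path(subject, output_dir)
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


print(final_result_path("lion.n.01").as_posix())
```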


The config.ini file serves as an example configuration; only the Bing API-related fields are left blank. The configuration fields are described below:

  • [default]

    • res_dir: resource folder
    • output: output folder
    • gpu: comma-separated list of GPU indices to use; -1 means the CPU will be used. E.g., gpu = 0,3 uses GPUs 0 and 3 of the machine.
  • [bing_search]

    • subscription_key: Bing API subscription key (required)
    • custom_config: Bing API custom config (required)
    • num_urls: number of URLs to be fetched by the Bing API
    • host =
    • path = /bingcustomsearch/v7.0/search
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
    • num_processes: number of processes for this module
  • [article_grab]

    • num_crawlers: number of parallel crawlers; each crawler works on one subject at a time
    • processes_per_crawler: number of processes per crawler
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  • [filter]

    • num_processes: number of processes for this module
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  • [extraction]

    • doc_threshold: document cosine-similarity threshold. Documents scoring below this threshold are filtered out (default: 0.55)
    • num_processes: number of processes for this module
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  • [triple_clustering]

    • model: path to the triple clustering model
    • threshold: threshold for the HAC algorithm (default: 0.005)
    • batch_size: number of triple pairs processed per batch (default: 1024)
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  • [facet_labeling]

    • model: path to the facet labeling model
    • batch_size: number of faceted triples processed per batch (default: 1024)
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
  • [facet_grouping]

    • num_processes: number of processes for this module
    • overwrite: (true|false) whether to overwrite this module's result if it already exists in the output folder
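As an illustration of how these fields could be read with their intended types, here is a sketch using the standard `configparser`. It is not ASCENT's own loading code: `parse_gpus` is a hypothetical helper, and the inline sample reproduces only a few of the fields listed above, using their documented defaults.

```python
import configparser


def parse_gpus(value):
    """Turn 'gpu = 0,3' into [0, 3]; return [] when -1 (CPU) is requested."""
    ids = [int(part) for part in value.split(",") if part.strip()]
    return [] if -1 in ids else ids


config = configparser.ConfigParser()
config.read_string("""
[default]
gpu = 0,3

[extraction]
doc_threshold = 0.55
overwrite = false

[triple_clustering]
threshold = 0.005
batch_size = 1024
""")

# Typed accessors convert the raw strings from the INI file.
gpus = parse_gpus(config["default"]["gpu"])
doc_threshold = config.getfloat("extraction", "doc_threshold")
overwrite = config.getboolean("extraction", "overwrite")
batch_size = config.getint("triple_clustering", "batch_size")
print(gpus, doc_threshold, overwrite, batch_size)
```

Note that a lowercase `[default]` section is an ordinary section for `configparser`; only the uppercase `DEFAULT` name has special inheritance behavior.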


If you use Ascent, please cite the following paper:

  author = {Nguyen, Tuan-Phong and Razniewski, Simon and Weikum, Gerhard},
  title = {Advanced Semantics for Commonsense Knowledge Extraction},
  year = {2021},
  isbn = {9781450383127},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {},
  doi = {10.1145/3442381.3449827},
  booktitle = {Proceedings of the Web Conference 2021},
  pages = {2636–2647},
  numpages = {12},
  location = {Ljubljana, Slovenia},
  series = {WWW '21}