Overview | Installation | Datasets | Examples | How You Can Help | Further Learning
TypeDB Bio is an open-source biomedical knowledge graph that enables research in areas such as drug discovery, precision medicine and drug repurposing. It provides biomedical researchers with an intuitive way to query interconnected, heterogeneous biomedical data in a single place.
For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome that is implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with COVID-19.
By examining these specific relationships and their attributes, we can further investigate any connected biological components and better understand their interrelations. This helps researchers study the mechanisms of protein interactions, infections, and the immune response, and find targets for the development of treatments or drugs more efficiently. We can also expand our search to include contextual information, as shown below:
The team behind TypeDB Bio is a partnership between GSK, Oxford PharmaGenesis and Vaticle.
The schema that models the underlying knowledge graph, together with the descriptive query language TypeQL, makes writing complex queries straightforward and intuitive. Furthermore, TypeDB's automated reasoning allows TypeDB Bio to act as an intelligent biomedical database that infers implicit knowledge from the explicitly stored data. TypeDB Bio can understand biological facts, draw inferences from new findings, and enforce research constraints, all at query time.
Prerequisites: Python >= 3.10, JDK >= 11, TypeDB Core >= 2.18.0, TypeDB Python Driver >= 2.18.0, TypeDB Studio >= 2.18.0
Clone this repo:
git clone https://github.com/vaticle/typedb-bio.git
Download the CORD-NER data set from this link and add it to this directory: dataset/cordner
Set up a virtual environment and install the dependencies:
cd <path/to/typedb-bio>/
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Start TypeDB:
typedb server
Start the loader script:
python loader.py
Config options can be set in: config.ini
Some options can be overridden with command line arguments. For help with those arguments:
python loader.py -h
If using TypeDB Enterprise or Cloud, the connection password can only be supplied via command line for security:
python loader.py -p my-password
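The precedence described above (defaults from config.ini, overridden by command-line arguments, with the password supplied only on the command line) can be sketched in Python as follows. This is an illustration of the pattern, not the loader's actual code: the `[typedb]` section, the `address` and `username` keys, and the `--address`, `--username` and `--password` flags are hypothetical. Only the short `-p` flag comes from the instructions above.

```python
import argparse
import configparser


def resolve_settings(config_text, argv):
    """Merge settings from INI text with command-line overrides."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    settings = dict(config["typedb"])  # hypothetical section name

    cli = argparse.ArgumentParser(description="Hypothetical loader options")
    cli.add_argument("--address")         # hypothetical flag
    cli.add_argument("--username")        # hypothetical flag
    cli.add_argument("-p", "--password")  # "-p" as above; long form is hypothetical
    args = cli.parse_args(argv)

    # A command-line value wins only when it was actually supplied.
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings


# Hypothetical config contents; the password is deliberately absent.
EXAMPLE_CONFIG = """
[typedb]
address = localhost:1729
username = admin
"""

settings = resolve_settings(EXAMPLE_CONFIG, ["-p", "my-password"])
print(settings["address"], settings["username"])
```

Keeping the password out of the config file means it is never written to disk alongside the repository. The real option names are listed by `python loader.py -h`.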
Now grab a coffee (or two) while the loader builds the schema and data for you!
Install the test dependencies:
pip install -r requirements_test.txt
Run the tests:
python -m pytest -v -s tests
Install the development dependencies:
pip install -r requirements_dev.txt
pre-commit install
TypeQL queries can be run in TypeDB Studio, in TypeDB Console, or through the driver APIs. We encourage running the queries in TypeDB Studio for the best visual experience.
# Which drugs interact with the genes associated with the virus SARS?
match
$virus isa virus, has virus-name "SARS";
$gene isa gene;
$drug isa drug;
$rel1 ($gene, $virus) isa gene-virus-association;
$rel2 ($gene, $drug) isa drug-gene-interaction;
offset 0; limit 20;
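When the query above is sent through a driver API rather than typed into Studio, it is often convenient to generate the query text for different virus names and page sizes. The following is a minimal Python sketch of that idea. The helper name `build_drug_query` is hypothetical and not part of this repository, and the sketch only builds the query string; the driver call that executes it is omitted because it depends on the driver version.

```python
def build_drug_query(virus_name: str, limit: int = 20, offset: int = 0) -> str:
    """Return the TypeQL query shown above for the given virus name."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    # Reject characters that could end the string literal early
    # (a conservative choice for this sketch, instead of escaping them).
    if '"' in virus_name or "\\" in virus_name:
        raise ValueError("virus_name must not contain quotes or backslashes")

    lines = [
        "match",
        f'$virus isa virus, has virus-name "{virus_name}";',
        "$gene isa gene;",
        "$drug isa drug;",
        "$rel1 ($gene, $virus) isa gene-virus-association;",
        "$rel2 ($gene, $drug) isa drug-gene-interaction;",
        f"offset {offset}; limit {limit};",
    ]
    return "\n".join(lines)


query = build_drug_query("SARS")
print(query)
```

The default arguments reproduce the query above; passing a different `offset` pages through larger result sets.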
Currently the datasets we've integrated include:
- CORD-NER: An annotated, publicly available version of the CORD-19 dataset released by the White House. The annotation uses various NER methods to recognise named entities in CORD-19 with distant or weak supervision.
- UniProt: We’ve downloaded the reviewed human subset, and ingested genes, transcripts and protein identifiers.
- Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
- DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
- Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
- Reactome: This dataset connects pathways and their participating proteins.
- DisGeNET: We’ve taken the curated gene-disease associations dataset, which contains associations from UniProt, CGI, ClinGen, Genomics England, CTD, PsyGeNET, and Orphanet.
- SemMed: This is a subset of the SemMed version 4.0 database.
- TissueNet: A dataset of protein-protein interactions.
In progress:
- CORD-19: We incorporate the original corpus which includes peer-reviewed publications from bioRxiv, medRxiv and others.
- TODO: write loader script
We plan to add many more datasets!
This is an ongoing project and we need your help! If you want to contribute, here are some ways to help:
- Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
- Extend the schema by adding relevant rules
- Create a website
- Write tutorials and articles for researchers to get started
If you wish to get in touch, please talk to us on the #typedb-bio channel on our Discord.
- TypeDB for Life Sciences
- Predicting Novel Disease Targets at AstraZeneca
- Accelerating Drug Discovery with a TypeDB Knowledge Graph
- Presentation of TypeDB Bio at Orbit 2021
- Drug Discovery Knowledge Graphs
- Using a Knowledge Graph for Precision Medicine
- Drug Repurposing with a TypeDB Knowledge Graph for Bioinformatics
- What is a Knowledge Graph?