Skip to content

Commit

Permalink
Merge pull request #71 from loucerac/develop
Browse files Browse the repository at this point in the history
Develop
  • Loading branch information
loucerac authored Nov 30, 2023
2 parents d3f7261 + 4658b59 commit 5f45ba5
Show file tree
Hide file tree
Showing 45 changed files with 1,787 additions and 5,122 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -293,3 +293,4 @@ dask-worker-space/
.pdm-python
examples/fanconi_anemia/results/tmp/
examples/codeocean/results/tmp/
debug/
15 changes: 14 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,12 @@ To run the program for a disease map that uses circuits from the preprocessed `K
seed_genes=2175,2176,2189
```

- using the following template if you want to use the DisGeNET [1] curated gene-disease associations as seeds.

```
disease_id="C0015625"
```

- using the following template if you know which circuits to include (the disease map):

```
Expand Down Expand Up @@ -101,12 +107,19 @@ The recommended setup is:
- setup `pipx`
- setup `miniforge`
- use `pipx` to install `pdm`
- use `pipx` to inject pd-bump into `pdm`
- ensure that `pdm` is version >=2.1, otherwise update with `pipx`
- use `pipx` to inject pdm-bump into `pdm`
- use `pipx` to install `nox`
- run `pdm config venv.backend conda`
- run `make`, if you want to use a CUDA enabled GPU, use `make gpu=1`
- (Recommended): For GPU development, clear the cache using `pdm clean cache` first

## Documentation

The documentation can be found here:

https://loucerac.github.io/drexml/


## References
[1] Janet Piñero, Juan Manuel Ramírez-Anguita, Josep Saüch-Pitarch, Francesco Ronzano, Emilio Centeno, Ferran Sanz, Laura I Furlong. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucl. Acids Res. (2019) doi:10.1093/nar/gkz1021
2 changes: 2 additions & 0 deletions docs/source/authors-reference.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,4 +8,6 @@ Here are project's authors and contributors:

* Carlos Loucera <[email protected]>
* Maria Peña-Chilet <[email protected]>
* Víctor Manuel de la Oliva Roque <[email protected]>
* Sara Herráiz-Gil <[email protected]>
* Marina Esteban-Medina <[email protected]>
52 changes: 50 additions & 2 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,29 @@ You can install drexml using:

.. code::
pip install git+https://github.com/loucerac/drexml.git@master
pip install drexml
Getting started
===============

DRExML is a Python software package that allows you to explore and contextualize the pharmacological landscape of a disease characterized by its signaling circuits.

Mechanistic modeling allows us to analyze the activity of signaling circuits that constitute biological pathways from gene expression data. This package uses machine learning models to infer the potential regulatory properties of known drug targets (KDTs) in these signaling circuits. This exploration of drug landscape from the disease map allows the identification of significantly influential KDT-targeting drugs that could play a therapeutic role in that disease.

The user can upload information about their disease to build their mechanistic map (see Input). The package contains information on the gene expression levels normalized by the edgeR package of the KDTs (from GTEx Analysis Release V8; dbGaP 110 Accession phs000424.v8.p2) ) and the drugs that interact with each KDT (from DrugBank version 05.01.10). Finally, default signal transduction values are computed using the Hipathia package (version 2.14.0).

For more information on the command line interface, check the :ref:`cli-reference`.

.. toctree::
:maxdepth: 2
:hidden:

cli-reference

Default transcriptomic profiles are obtained from the GTEx portal (GTEx Analysis Release V8; dbGaP 110 Accession phs000424.v8.p2) and normalized using the edgeR package (version 3.40.0). Default KDTs are obtained from the DrugBank database (version 05.01.10). Finally, default signal transduction values are computed using the Hipathia package (version 2.14.0).


For more information, check the :ref:`api-reference`.

.. toctree::
Expand All @@ -30,6 +47,38 @@ For more information, check the :ref:`api-reference`.

api_reference

Highlights
==========

- Contextualize the pharmacological landscape of a mechanistically characterized disease.
- Obtain a list of KDTs which show a considerable influence over a given disease map
- Quick visualization of the knowledge: generation of figures and tables of results

Input
=====

The disease definition file is the main point of entry to interact with the software. Here, variables are defined pertaining to the disease map and activity matrix, as well as to the KDTs explored.

Main variables pertaining to the construction of the disease map and activity matrix are:

- seed_genes: list of Entrez identifiers of genes suspected to be involved in the disease. The algorithm returns KEGG subpathways which contain at least on of these genes as the disease map
- disease_id: UMLS CUI disease identifiers. The algorithm returns genes associated to this disease using the curated DISGENET database and returns the KEGG subpathways which contain at least one of these genes as the disease map
- circuits: .tsv file with a list of circuits composing the target disease map, identified by their KEGG ID.

The path to each of these three files must be defined in the environment file (e.g. disease.env). Note that users can use different signal transduction algorithms and/or different KDT transcriptomic profiles to explore. These must be given as TSV files by providing a path to the activity matrix through the "pathvals" variable and to the KDTs through the "gene_exp" variable.

Output
======
Software outputs are stored as TSV files:

- shap_summary_symbol.tsv: final relevance matrix (SHAP values attributed to KDTs)
- shap_selection_symbol.tsv: binary matrix indicating the KDTs have been selected for a given circuit or not
- stability_results_symbol.tsv: R2 and stability estimates of circuits along with their confidence intervals

Visualization
=============
The plot sub-command can be used to generate a number of summary graphics. Funcionalities include "plot_metrics" through which R2 and stability estimates can be visualized and "plot_relevance_heatmap" through which a heat map of the KDT SHAP scores can be generated in order to visualize KDT relevance over circuits.

For more information on the command line interface, check the :ref:`cli-reference`.

.. toctree::
Expand All @@ -38,7 +87,6 @@ For more information on the command line interface, check the :ref:`cli-referenc

cli-reference


For more information on the authors, check the :ref:`authors-reference`.

.. toctree::
Expand Down
95 changes: 77 additions & 18 deletions drexml/datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,43 @@
import pystow
from pandas.errors import ParserError
from requests.exceptions import ConnectTimeout
from zenodo_client import Zenodo

from drexml.utils import get_resource_path, read_disease_config
from drexml.utils import ensure_zenodo, get_resource_path, read_disease_config

RECORD_ID = "6020480"


def load_drugbank():
"""Download if necessary and load the drugbank table.
Returns
-------
pd.DataFrame
Drugbank table.
"""

# TODO: read versions using config
path = ensure_zenodo("drugbank-v050110_gtex-V8_mygene-20230220.tsv.gz")

return pd.read_csv(path, sep="\t")


def load_atc():
"""Load the ATC table.
Returns
-------
pd.DataFrame
ATC table.
"""

# TODO: read versions using config

atc_path = ensure_zenodo("atc.csv.gz")

return pd.read_csv(atc_path, usecols=["atc_code", "atc_name"])


def load_disgenet():
"""Download if necessary and load the Disgenet curated list of gene-disease
associations.
Expand Down Expand Up @@ -121,8 +151,7 @@ def fetch_file(key, env, version="latest"):
if env[key + "_zenodo"]: # pragma: no cover
if version == "latest":
try:
zenodo = Zenodo()
path = zenodo.download_latest(RECORD_ID, env[key], force=False)
path = ensure_zenodo(env[key])
except ConnectTimeout as err:
print(err)
path = pathlib.Path.home().joinpath(
Expand Down Expand Up @@ -232,25 +261,37 @@ def preprocess_frame(res, env, key):
The preprocessed data frame.
"""

if key is not None:
index_name_options = get_index_name_options(key)
def preprocess_frame_(res, key):
if key is not None:
index_name_options = get_index_name_options(key)

for name in index_name_options:
if name in res.columns:
res = res.set_index(name, drop=True)
res.index = res.index.astype(str)

return res

for name in index_name_options:
if name in res.columns:
res = res.set_index(name, drop=True)
res.index = res.index.astype(str)
res = preprocess_frame_(res, key)

if key == "gene_exp":
return preprocess_gexp(res)
elif key == "pathvals":
return preprocess_activities(res)
elif key == "circuits":
# build the disease map
gene_list = []
if env["seed_genes"]:
gene_list += env["seed_genes"]
if env["disease_id"]:
gene_list += [str(gene) for gene in get_gda(env["disease_id"])]
return preprocess_map(res, gene_list, env["circuits_column"], env["use_physio"])
print("key dict")
print(env["circuits_dict"])
circuits_dict = load_df(env["circuits_dict"])
circuits_dict = preprocess_frame_(circuits_dict, key)
return preprocess_map(
res, gene_list, env["circuits_column"], env["use_physio"], circuits_dict
)
elif key == "genes":
return preprocess_genes(res, env["genes_column"])

Expand Down Expand Up @@ -321,7 +362,9 @@ def preprocess_activities(frame):
return frame


def preprocess_map(frame, disease_seed_genes, circuits_column, use_physio):
def preprocess_map(
frame, disease_seed_genes, circuits_column, use_physio, circuits_dict=None
):
"""
Preprocesses a map data frame.
Expand Down Expand Up @@ -352,19 +395,35 @@ def preprocess_map(frame, disease_seed_genes, circuits_column, use_physio):
with periods and returns the resulting list of circuits.
"""
frame.index = frame.index.str.replace("-", ".").str.replace(" ", ".")
circuit_list = []
if disease_seed_genes:
print("cdict")
print(circuits_dict)
print("genes")
print(disease_seed_genes)
disease_seed_genes = frame.columns.intersection(disease_seed_genes)
circuits = frame.index[frame[disease_seed_genes].any(axis=1)].tolist()
else:
circuits_dict.index = circuits_dict.index.str.replace("-", ".").str.replace(
" ", "."
)
disease_seed_genes = circuits_dict.columns.intersection(disease_seed_genes)
circuit_list += circuits_dict.index[
circuits_dict[disease_seed_genes].any(axis=1)
].tolist()
print("by genes")
print(circuit_list)
if circuits_column in frame:
frame[circuits_column] = frame[circuits_column].astype(bool)
circuits = frame.index[frame[circuits_column]].tolist()
circuit_list += frame.index[frame[circuits_column]].tolist()
print("by hip")
print(circuit_list)

# remove duplicated
circuit_list = list(set(circuit_list))

if use_physio:
physio_lst = load_physiological_circuits()
circuits = [c for c in circuits if c in physio_lst]
circuit_list = [c for c in circuit_list if c in physio_lst]

return circuits
return circuit_list


def preprocess_genes(frame, genes_column):
Expand Down
Loading

0 comments on commit 5f45ba5

Please sign in to comment.