Merge pull request #71 from loucerac/develop

Develop
loucerac · Nov 30, 2023 · 5f45ba5 · 5f45ba5
2 parents d3f7261 + 4658b59
commit 5f45ba5
Show file tree

Hide file tree

Showing 45 changed files with 1,787 additions and 5,122 deletions.
diff --git a/.gitignore b/.gitignore
@@ -293,3 +293,4 @@ dask-worker-space/
 .pdm-python
 examples/fanconi_anemia/results/tmp/
 examples/codeocean/results/tmp/
+debug/
diff --git a/README.md b/README.md
@@ -40,6 +40,12 @@ To run the program for a disease map that uses circuits from the preprocessed `K
 seed_genes=2175,2176,2189
 ```
 
+- using the following template if you want to use the DisGeNET [1] curated gene-disease associations as seeds.
+
+```
+disease_id="C0015625"
+```
+
 - using the following template if you know which circuits to include (the disease map):
 
 ```
@@ -101,12 +107,19 @@ The recommended setup is:
 - setup `pipx`
 - setup `miniforge`
 - use `pipx` to install `pdm`
-- use `pipx` to inject pd-bump into `pdm`
+- ensure that `pdm` is version >=2.1, otherwise update with `pipx`
+- use `pipx` to inject pdm-bump into `pdm`
 - use `pipx` to install `nox`
+- run `pdm config venv.backend conda`
 - run `make`, if you want to use a CUDA enabled GPU, use `make gpu=1`
+- (Recommended): For GPU development, clear the cache using `pdm clean cache` first
 
 ## Documentation
 
 The documentation can be found here:
 
 https://loucerac.github.io/drexml/
+
+
+## References
+[1] Janet Piñero, Juan Manuel Ramírez-Anguita, Josep Saüch-Pitarch, Francesco Ronzano, Emilio Centeno, Ferran Sanz, Laura I Furlong. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucl. Acids Res. (2019) doi:10.1093/nar/gkz1021
diff --git a/docs/source/authors-reference.rst b/docs/source/authors-reference.rst
@@ -8,4 +8,6 @@ Here are project's authors and contributors:
 
 * Carlos Loucera <[email protected]>
 * Maria Peña-Chilet <[email protected]>
+* Víctor Manuel de la Oliva Roque <[email protected]>
+* Sara Herráiz-Gil <[email protected]>
 * Marina Esteban-Medina <[email protected]>
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -16,12 +16,29 @@ You can install drexml using:
 
 .. code::
 
-   pip install git+https://github.com/loucerac/drexml.git@master
+   pip install drexml
 
 
 Getting started
 ===============
 
+DRExML is a Python software package that allows you to explore and contextualize the pharmacological landscape of a disease characterized by its signaling circuits.
+
+Mechanistic modeling allows us to analyze the activity of signaling circuits that constitute biological pathways from gene expression data. This package uses machine learning models to infer the potential regulatory properties of known drug targets (KDTs) in these signaling circuits. This exploration of drug landscape from the disease map allows the identification of significantly influential KDT-targeting drugs that could play a therapeutic role in that disease.
+
+The user can upload information about their disease to build their mechanistic map (see Input). The package contains information on the gene expression levels normalized by the edgeR package of the KDTs (from GTEx Analysis Release V8; dbGaP 110 Accession phs000424.v8.p2) ) and the drugs that interact with each KDT (from DrugBank version 05.01.10). Finally, default signal transduction values are computed using the Hipathia package (version 2.14.0).
+
+For more information on the command line interface, check the :ref:`cli-reference`.
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+
+   cli-reference
+
+Default transcriptomic profiles are obtained from the GTEx portal (GTEx Analysis Release V8; dbGaP 110 Accession phs000424.v8.p2) and normalized using the edgeR package (version 3.40.0). Default KDTs are obtained from the DrugBank database (version 05.01.10). Finally, default signal transduction values are computed using the Hipathia package (version 2.14.0).
+
+
 For more information, check the :ref:`api-reference`.
 
 .. toctree::
@@ -30,6 +47,38 @@ For more information, check the :ref:`api-reference`.
 
    api_reference
 
+Highlights
+==========
+
+- Contextualize the pharmacological landscape of a mechanistically characterized disease.
+- Obtain a list of KDTs which show a considerable influence over a given disease map
+- Quick visualization of the knowledge: generation of figures and tables of results
+
+Input
+=====
+
+The disease definition file is the main point of entry to interact with the software. Here, variables are defined pertaining to the disease map and activity matrix, as well as to the KDTs explored. 
+
+Main variables pertaining to the construction of the disease map and activity matrix are:
+
+- seed_genes: list of Entrez identifiers of genes suspected to be involved in the disease. The algorithm returns KEGG subpathways which contain at least on of these genes as the disease map
+- disease_id: UMLS CUI disease identifiers. The algorithm returns genes associated to this disease using the curated DISGENET database and returns the KEGG subpathways which contain at least one of these genes as the disease map
+- circuits: .tsv file with a list of circuits composing the target disease map, identified by their KEGG ID.
+
+The path to each of these three files must be defined in the environment file (e.g. disease.env). Note that users can use different signal transduction algorithms and/or different KDT transcriptomic profiles to explore. These must be given as TSV files by providing a path to the activity matrix through the "pathvals" variable and to the KDTs through the "gene_exp" variable. 
+
+Output
+======
+Software outputs are stored as TSV files:
+
+- shap_summary_symbol.tsv: final relevance matrix (SHAP values attributed to KDTs)
+- shap_selection_symbol.tsv: binary matrix indicating the KDTs have been selected for a given circuit or not
+- stability_results_symbol.tsv: R2 and stability estimates of circuits along with their confidence intervals
+
+Visualization
+=============
+The plot sub-command can be used to generate a number of summary graphics. Funcionalities include "plot_metrics" through which R2 and stability estimates can be visualized and "plot_relevance_heatmap" through which a heat map of the KDT SHAP scores can be generated in order to visualize KDT relevance over circuits. 
+
 For more information on the command line interface, check the :ref:`cli-reference`.
 
 .. toctree::
@@ -38,7 +87,6 @@ For more information on the command line interface, check the :ref:`cli-referenc
 
    cli-reference
 
-
 For more information on the authors, check the :ref:`authors-reference`.
 
 .. toctree::

diff --git a/drexml/datasets.py b/drexml/datasets.py
@@ -8,13 +8,43 @@
 import pystow
 from pandas.errors import ParserError
 from requests.exceptions import ConnectTimeout
-from zenodo_client import Zenodo
 
-from drexml.utils import get_resource_path, read_disease_config
+from drexml.utils import ensure_zenodo, get_resource_path, read_disease_config
 
 RECORD_ID = "6020480"
 
 
+def load_drugbank():
+    """Download if necessary and load the drugbank table.
+
+    Returns
+    -------
+    pd.DataFrame
+        Drugbank table.
+    """
+
+    # TODO: read versions using config
+    path = ensure_zenodo("drugbank-v050110_gtex-V8_mygene-20230220.tsv.gz")
+
+    return pd.read_csv(path, sep="\t")
+
+
+def load_atc():
+    """Load the ATC table.
+
+    Returns
+    -------
+    pd.DataFrame
+        ATC table.
+    """
+
+    # TODO: read versions using config
+
+    atc_path = ensure_zenodo("atc.csv.gz")
+
+    return pd.read_csv(atc_path, usecols=["atc_code", "atc_name"])
+
+
 def load_disgenet():
     """Download if necessary and load the Disgenet curated list of gene-disease
     associations.
@@ -121,8 +151,7 @@ def fetch_file(key, env, version="latest"):
     if env[key + "_zenodo"]:  # pragma: no cover
         if version == "latest":
             try:
-                zenodo = Zenodo()
-                path = zenodo.download_latest(RECORD_ID, env[key], force=False)
+                path = ensure_zenodo(env[key])
             except ConnectTimeout as err:
                 print(err)
                 path = pathlib.Path.home().joinpath(
@@ -232,25 +261,37 @@ def preprocess_frame(res, env, key):
         The preprocessed data frame.
     """
 
-    if key is not None:
-        index_name_options = get_index_name_options(key)
+    def preprocess_frame_(res, key):
+        if key is not None:
+            index_name_options = get_index_name_options(key)
+
+        for name in index_name_options:
+            if name in res.columns:
+                res = res.set_index(name, drop=True)
+        res.index = res.index.astype(str)
+
+        return res
 
-    for name in index_name_options:
-        if name in res.columns:
-            res = res.set_index(name, drop=True)
-    res.index = res.index.astype(str)
+    res = preprocess_frame_(res, key)
 
     if key == "gene_exp":
         return preprocess_gexp(res)
     elif key == "pathvals":
         return preprocess_activities(res)
     elif key == "circuits":
+        # build the disease map
         gene_list = []
         if env["seed_genes"]:
             gene_list += env["seed_genes"]
         if env["disease_id"]:
             gene_list += [str(gene) for gene in get_gda(env["disease_id"])]
-        return preprocess_map(res, gene_list, env["circuits_column"], env["use_physio"])
+        print("key dict")
+        print(env["circuits_dict"])
+        circuits_dict = load_df(env["circuits_dict"])
+        circuits_dict = preprocess_frame_(circuits_dict, key)
+        return preprocess_map(
+            res, gene_list, env["circuits_column"], env["use_physio"], circuits_dict
+        )
     elif key == "genes":
         return preprocess_genes(res, env["genes_column"])
 
@@ -321,7 +362,9 @@ def preprocess_activities(frame):
     return frame
 
 
-def preprocess_map(frame, disease_seed_genes, circuits_column, use_physio):
+def preprocess_map(
+    frame, disease_seed_genes, circuits_column, use_physio, circuits_dict=None
+):
     """
     Preprocesses a map data frame.
 
@@ -352,19 +395,35 @@ def preprocess_map(frame, disease_seed_genes, circuits_column, use_physio):
       with periods and returns the resulting list of circuits.
     """
     frame.index = frame.index.str.replace("-", ".").str.replace(" ", ".")
+    circuit_list = []
     if disease_seed_genes:
+        print("cdict")
+        print(circuits_dict)
+        print("genes")
         print(disease_seed_genes)
-        disease_seed_genes = frame.columns.intersection(disease_seed_genes)
-        circuits = frame.index[frame[disease_seed_genes].any(axis=1)].tolist()
-    else:
+        circuits_dict.index = circuits_dict.index.str.replace("-", ".").str.replace(
+            " ", "."
+        )
+        disease_seed_genes = circuits_dict.columns.intersection(disease_seed_genes)
+        circuit_list += circuits_dict.index[
+            circuits_dict[disease_seed_genes].any(axis=1)
+        ].tolist()
+        print("by genes")
+        print(circuit_list)
+    if circuits_column in frame:
         frame[circuits_column] = frame[circuits_column].astype(bool)
-        circuits = frame.index[frame[circuits_column]].tolist()
+        circuit_list += frame.index[frame[circuits_column]].tolist()
+        print("by hip")
+        print(circuit_list)
+
+    # remove duplicated
+    circuit_list = list(set(circuit_list))
 
     if use_physio:
         physio_lst = load_physiological_circuits()
-        circuits = [c for c in circuits if c in physio_lst]
+        circuit_list = [c for c in circuit_list if c in physio_lst]
 
-    return circuits
+    return circuit_list
 
 
 def preprocess_genes(frame, genes_column):