Add landscape analysis (#17)
This PR adds a workflow for automating the analysis of the landscape of
a given domain, given a (mostly declarative) configuration describing
the resources in that domain. It includes five landscape analyses:

1. [Disease](https://github.com/biopragmatics/semra/blob/disease-landscape/scripts/disease/disease-landscape.ipynb)
2. [Cell & Cell Line](https://github.com/biopragmatics/semra/blob/disease-landscape/scripts/cell/cell-landscape.ipynb)
3. [Anatomy](https://github.com/biopragmatics/semra/blob/disease-landscape/scripts/anatomy/anatomy-landscape.ipynb)
4. [Protein Complex](https://github.com/biopragmatics/semra/blob/disease-landscape/scripts/complex/complex-landscape.ipynb)
5. [Gene](https://github.com/biopragmatics/semra/blob/disease-landscape/scripts/gene/gene-landscape.ipynb)

To do:

- [x] Make comparison chart between raw mappings + processed
- [x] Integrate GARD
- [x] Integrate OMIM
- [x] Integrate Orphanet
- [x] Remove HPO
- [x] Create upset plot
- [x] Slice out irrelevant hierarchies from MeSH, EFO, etc.
- [x] Create landscape histogram

This PR also makes other improvements to the underlying SeMRA pipeline
and web app, and closes #16.



![](https://raw.githubusercontent.com/biopragmatics/semra/disease-landscape/notebooks/landscape/disease/graph.svg)

We're able to automatically generate an UpSet plot like the one in [How
many rare diseases are there? (Haendel *et al.*,
2020)](https://doi.org/10.1038/d41573-019-00180-y) (a similar plot to
the following appears in the [supplementary
info](https://media.nature.com/original/magazine-assets/d41573-019-00180-y/17308594)
and an explanation appears on
[zenodo](https://zenodo.org/records/3478576)). Note that our plot is
about all diseases, not specifically rare ones:


![](https://raw.githubusercontent.com/biopragmatics/semra/disease-landscape/notebooks/landscape/disease/landscape_upset.svg)

The following histogram estimates how many diseases there are.
Importantly, it shows how many appear in a single resource, how many
appear in all resources, and how many appear in only a few:


![](https://raw.githubusercontent.com/biopragmatics/semra/disease-landscape/notebooks/landscape/disease/landscape_histogram.svg)
cthoyt authored Apr 11, 2024
1 parent 69f251c commit 0a51108
Showing 68 changed files with 154,501 additions and 152 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -901,3 +901,4 @@ FodyWeavers.xsd
scratch/
*.jnl
*.jar
scripts/Untitled.ipynb
41 changes: 41 additions & 0 deletions notebooks/landscape/README.md
@@ -0,0 +1,41 @@
# Landscape Analysis

This folder contains results from a workflow for automating the analysis of the landscape of a given domain,
given a declarative configuration describing the resources in that domain. It includes five landscape analyses:

1. [Disease](disease/disease-landscape.ipynb)
2. [Cell & Cell Line](cell/cell-landscape.ipynb)
3. [Anatomy](anatomy/anatomy-landscape.ipynb)
4. [Protein Complex](complex/complex-landscape.ipynb)
5. [Gene](gene/gene-landscape.ipynb)

## Example

Below, we highlight the disease landscape. Each analysis creates a graph of the processed mappings.

![](disease/graph.svg)

We're able to automatically generate an UpSet plot like the one in [How many rare diseases are there? (Haendel *et
al.*, 2020)](https://doi.org/10.1038/d41573-019-00180-y) (a similar plot to the following appears in
the [supplementary info](https://media.nature.com/original/magazine-assets/d41573-019-00180-y/17308594) and an
explanation appears on [zenodo](https://zenodo.org/records/3478576)). Note that our plot is about all diseases, not
specifically rare ones:

![](disease/landscape_upset.svg)
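
As a rough illustration of what the UpSet plot summarizes, the sketch below counts how many merged concepts fall into each exact combination of resources. It is not taken from the SeMRA code: the `concept_resources` dictionary, the concept names, and the `upset_counts` helper are all hypothetical, and the resource prefixes are toy data.

```python
from collections import Counter

# Hypothetical toy data (not from the PR): each merged concept is mapped to
# the set of resources that contain a term for it after processing.
concept_resources = {
    "concept-1": {"doid", "mondo", "mesh"},
    "concept-2": {"doid", "mondo"},
    "concept-3": {"mesh"},
    "concept-4": {"doid", "mondo"},
}


def upset_counts(memberships):
    """Count concepts per exact combination of resources.

    Each bar in an UpSet plot corresponds to one such combination.
    """
    return Counter(frozenset(resources) for resources in memberships.values())


counts = upset_counts(concept_resources)

# Print combinations from most to least frequent, ties broken alphabetically.
for combination, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(combination), n)
```

Using `frozenset` keys makes each combination order-independent, so `{"doid", "mondo"}` and `{"mondo", "doid"}` are counted together.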

The following histogram estimates how many diseases there are. Importantly, it shows how many appear in a single
resource, how many appear in all resources, and how many appear in only a few:

![](disease/landscape_histogram.svg)
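
The idea behind such a histogram can be sketched as follows: for every merged concept, count the number of resources it appears in, then tally concepts by that number. This is an assumption about the general approach rather than the notebook's actual code; `concept_resources` and `resource_count_histogram` are hypothetical names and the data is a toy example.

```python
from collections import Counter

# Hypothetical toy data (not from the PR): merged concept -> resources containing it.
concept_resources = {
    "concept-1": {"doid", "mondo", "mesh"},
    "concept-2": {"doid", "mondo"},
    "concept-3": {"mesh"},
    "concept-4": {"doid"},
    "concept-5": {"mondo"},
}


def resource_count_histogram(memberships):
    """Map 'number of resources' to 'number of concepts found in that many resources'."""
    return Counter(len(resources) for resources in memberships.values())


histogram = resource_count_histogram(concept_resources)

# The total number of concepts is the estimate of how many distinct entities exist.
total_concepts = sum(histogram.values())

for n_resources in sorted(histogram):
    print(f"{n_resources} resource(s): {histogram[n_resources]} concept(s)")
print(f"total: {total_concepts}")
```

In this toy example, the bar at 1 holds concepts unique to a single resource, and the bar at 3 holds concepts shared by all three resources.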

## Summary

A summary chart over all landscapes can be generated with `landscape.py`.

| name | raw_term_count | unique_term_count | reduction |
|---------|---------------:|------------------:|----------:|
| disease | 410173 | 243730 | 0.405787 |
| anatomy | 37917 | 32108 | 0.153203 |
| complex | 15869 | 7775 | 0.510051 |
| gene | 4.94578e+07 | 4.87886e+07 | 0.013529 |
| cell | 207019 | 166274 | 0.196818 |
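
The `reduction` values in the table are consistent with the fraction of raw terms eliminated by merging, that is, `(raw_term_count - unique_term_count) / raw_term_count`. The sketch below checks that reading against the rows above. It is an inference from the numbers, not the code in `landscape.py`, and the `reduction` helper is a hypothetical name. The gene row is left out because its counts are shown only in rounded scientific notation.

```python
# Counts copied from the summary table: name -> (raw_term_count, unique_term_count).
# The gene row is omitted because the table shows only rounded values for it.
counts = {
    "disease": (410173, 243730),
    "anatomy": (37917, 32108),
    "complex": (15869, 7775),
    "cell": (207019, 166274),
}


def reduction(raw_term_count, unique_term_count):
    """Fraction of raw terms removed by merging equivalent terms (assumed definition)."""
    return (raw_term_count - unique_term_count) / raw_term_count


reductions = {name: reduction(raw, unique) for name, (raw, unique) in counts.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.6f}")
```

Under this reading, a reduction of about 0.41 for disease means that roughly two in five raw disease terms were merged into a concept already represented by another term.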