Skip to content

Commit

Permalink
Merge pull request #400 from monarch-initiative/release
Browse files Browse the repository at this point in the history
Release
  • Loading branch information
ielis authored Jan 16, 2025
2 parents e14538f + 48abc8b commit 83e94b4
Show file tree
Hide file tree
Showing 35 changed files with 530 additions and 114 deletions.
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,9 @@

GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations) is a Python package for finding genotype-phenotype associations.


See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html)
and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)
for more information.
See our documentation for the [setup](https://monarch-initiative.github.io/gpsea/stable/setup.html) instructions,
a [tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) with an end-to-end genotype-phenotype association analysis,
and a comprehensive [user guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) with everything else.

The documentation comes in two flavors:

Expand Down
2 changes: 1 addition & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@
# The short X.Y version.
version = u'0.9'
# The full version, including alpha/beta/rc tags.
release = u'0.9.1'
release = u'0.9.2'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
Expand Down
Binary file modified docs/img/tutorial/tbx5_protein_diagram.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 2 additions & 2 deletions docs/setup.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@
Setup
#####

Here we show how to install GPSEA and to prepare your Python environment
for genotype-phenotype association analysis.
Here we show how to install GPSEA and prepare your Python environment
for genotype-phenotype association analyses.


.. contents:: Table of Contents
Expand Down
4 changes: 4 additions & 0 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -273,10 +273,14 @@ depending on presence of a single allele of a missense or truncating variant
>>> from gpsea.analysis.clf import monoallelic_classifier
>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
>>> truncating_effects = (
... VariantEffect.TRANSCRIPT_ABLATION,
... VariantEffect.TRANSCRIPT_TRANSLOCATION,
... VariantEffect.FRAMESHIFT_VARIANT,
... VariantEffect.START_LOST,
... VariantEffect.STOP_GAINED,
... VariantEffect.SPLICE_DONOR_VARIANT,
... VariantEffect.SPLICE_ACCEPTOR_VARIANT,
... # more effects could be listed here ...
... )
>>> is_truncating = anyof(variant_effect(e, tx_id) for e in truncating_effects)
>>> gt_clf = monoallelic_classifier(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,11 @@ The predicates operate on several lines of information:
+------------------------+-------------------------------------------------------------------------------------------------+
| Protein data | variant is located in a region encoding a protein domain, protein feature type |
+------------------------+-------------------------------------------------------------------------------------------------+
| Genome | overlap with a genomic region of interest |
+------------------------+-------------------------------------------------------------------------------------------------+


The scope of the builtin predicates is fairly narrow
and likely insufficient for real-life analyses.
However, the predicates can be chained into a compound predicate
However, several predicates can be "chained" into a compound predicate using a boolean logic,
to achive more expressivity for testing complex conditions,
such as "variant is a missense or synonymous variant located in exon 6 of `NM_013275.6`".

Expand All @@ -41,8 +39,9 @@ such as "variant is a missense or synonymous variant located in exon 6 of `NM_01
Examples
********

Here we show examples of several simple variant predicates and
how to chain them for testing complex conditions.
Here we show how to use the builtin predicates for simple tests
and how to build a compound predicate from the builtin predicates,
for testing complex conditions.


Load cohort
Expand Down Expand Up @@ -112,10 +111,10 @@ See the :mod:`gpsea.analysis.predicate` module
for a complete list of the builtin predicates.


Predicate chain
===============
Compound predicates
===================

Using the builtin predicates, we can build a logical chain to test complex conditions.
A compound predicate for testing complex conditions can be built from two or more predicates.
For instance, we can test if the variant meets any of several conditions:

>>> import gpsea.analysis.predicate as vp
Expand All @@ -130,7 +129,13 @@ or *all* conditions:
>>> missense_and_exon20.test(variant)
True

All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators, to allow chaining.
All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators,
to combine a predicate pair into a compound predicate.

.. note::

Combining three or or more predicates can be achieved with :func:`~gpsea.analysis.allof`
and :func:`~gpsea.analysis.anyof` functions.

Therefore, there is nothing that prevents us to combine the predicates into multi-level tests,
e.g. to test if the variant is a *"chromosomal deletion" or a deletion which removes at least 50 bp*:
Expand Down Expand Up @@ -180,12 +185,16 @@ The builtin predicates should cover majority of use cases.
However, if a predicate seems to be missing,
feel free to submit an issue in our
`GitHub tracker <https://github.com/monarch-initiative/gpsea/issues>`_,
or to implement a custom predicate
by extending the :class:`~gpsea.analysis.predicate.VariantPredicate` class 😎.
or implement your own predicate by following the :ref:`custom-variant-predicate`
guide.


****
Next
****

The variant predicate offers a flexible API for testing if variants meet a condition.
However, the genotype phenotype correlations are done on the individual level
and the variant predicates are used as a component of the genotype predicate.
The next sections show how to use variant predicates to assign individuals into groups.
However, the genotype phenotype correlations are studied on the level of individuals.
As described in :ref:`genotype-classifiers`, GPSEA uses the :class:`~gpsea.analysis.clf.GenotypeClassifier` API
to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier.
We explain the details in the following sections.
26 changes: 24 additions & 2 deletions docs/user-guide/analyses/phenotype-scores.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,17 +121,19 @@ In this example, the point mutation is a mutation that meets the following condi
'((change length == 0 AND reference allele length == 1) AND MISSENSE_VARIANT on NM_001042681.2)'


For the loss of function predicate, the following variant effects are considered loss of function:
For the loss-of-function predicate, the following is a non-exhausting list
of variant effects considered as a loss-of-function:

>>> lof_effects = (
... VariantEffect.TRANSCRIPT_TRANSLOCATION,
... VariantEffect.TRANSCRIPT_ABLATION,
... VariantEffect.FRAMESHIFT_VARIANT,
... VariantEffect.START_LOST,
... VariantEffect.STOP_GAINED,
... )
>>> lof_mutation = anyof(variant_effect(eff, tx_id) for eff in lof_effects)
>>> lof_mutation.description
'(TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)'
'(TRANSCRIPT_TRANSLOCATION on NM_001042681.2 OR TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)'


The genotype predicate will bin the patient into two classes: a point mutation or the loss of function:
Expand All @@ -154,6 +156,26 @@ Phenotype score
This component is responsible for computing a phenotype score for an individual.
As far as GPSEA framework is concerned, the phenotype score must be a floating point number
or a `NaN` value if the score cannot be computed for an individual.
This is the essence of the :class:`~gpsea.analysis.pscore.PhenotypeScorer` class.

GPSEA ships with several builtin phenotype scorers which can be used as

+------------------------------------------------------------+---------------------------------------------+
| Name | Description |
+============================================================+=============================================+
| | Compute the total number of occurrences |
| * :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` | of specific phenotypic features |
| | (used in this section) |
+------------------------------------------------------------+---------------------------------------------+
| | Compute the "adapted De Vries Score" |
| * :class:`~gpsea.analysis.pscore.DeVriesPhenotypeScorer` | for assessing severity |
| | of intellectual disability |
+------------------------------------------------------------+---------------------------------------------+

.. tip::

See :ref:`custom-phenotype-scorer` section to learn how to build a phenotype scorer from scratch.


Here we use the :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` for scoring
the individuals based on the number of structural defects
Expand Down
Binary file modified docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
93 changes: 93 additions & 0 deletions docs/user-guide/analyses/survival.rst
Original file line number Diff line number Diff line change
Expand Up @@ -188,3 +188,96 @@ or `None` if computing the survival was impossible (see :func:`~gpsea.analysis.t
The `Survival` reports the number of days until attaining the endpoint,
here defined as end stage renal disease (`is_censored=False`),
or until the individual dropped out of the analysis (`is_censored=True`).


Troubleshooting
===============

Sometimes the survival analysis fails and an :class:`~gpsea.analysis.AnalysisException` is raised.
For instance, the current Logrank test implementation reports a p value of `NaN`
if the survival is the same for all individuals.
This is unlikely an expected outcome, therefore GPSEA raises
an :class:`~gpsea.analysis.AnalysisException` to force the user to troubleshoot.

To help with troubleshooting, the data computed prior detecting the error is included in the exception's
:attr:`~gpsea.analysis.AnalysisException.data` attribute. In survival analysis, the data should include
the identifiers, genotype classes, and survivals of the tested individuals.

Let's show this on an example. We will create a toy cohort of 10 individuals
with onset of `Lynch syndrome I <https://hpo.jax.org/browse/disease/OMIM:120435>`_
(`OMIM:120435`) at 40 years.

>>> from gpsea.model import Cohort, Patient, Disease, Age
>>> onset = Age.from_iso8601_period("P40Y")
>>> individuals = [
... Patient.from_raw_parts(
... labels=label,
... diseases=(
... Disease.from_raw_parts(
... term_id="OMIM:120435",
... name="Lynch syndrome I",
... is_observed=True,
... onset=onset,
... ),
... ),
... )
... for label in "ABCDEFGHIJ" # 10 individuals
... ]
>>> cohort = Cohort.from_patients(individuals)

We will assign them into genotype classes on random, ...

>>> from gpsea.analysis.clf import random_classifier
>>> gt_clf = random_classifier(seed=123)
>>> gt_clf.description
'Classify the individual into random classes'

... using the Lynch syndrome I diagnosis as the endpoint ...

>>> from gpsea.analysis.temporal.endpoint import disease_onset
>>> endpoint = disease_onset(disease_id="OMIM:120435")
>>> endpoint.description
'Compute time until OMIM:120435 onset'

... and we will use Logrank test for differences in survival.

>>> from gpsea.analysis.temporal.stats import LogRankTest
>>> survival_statistic = LogRankTest()

We put together the survival analysis ...

>>> from gpsea.analysis.temporal import SurvivalAnalysis
>>> survival_analysis = SurvivalAnalysis(
... statistic=survival_statistic,
... )

... which we expect to fail with an :class:`~gpsea.analysis.AnalysisException`:

>>> result = survival_analysis.compare_genotype_vs_survival(
... cohort=cohort,
... gt_clf=gt_clf,
... endpoint=endpoint,
... )
Traceback (most recent call last):
...
gpsea.analysis._base.AnalysisException: The survival values did not meet the expectation of the statistical test!

The genotype classes and survival values can be retrieved from the exception:

>>> from gpsea.analysis import AnalysisException
>>> try:
... result = survival_analysis.compare_genotype_vs_survival(
... cohort=cohort,
... gt_clf=gt_clf,
... endpoint=endpoint,
... )
... except AnalysisException as ae:
... genotypes = ae.data["genotype"]
... survivals = ae.data["survival"]

and the values can come in handy in troubleshooting:

>>> genotypes[:3]
(0, 0, 0)
>>> survivals[:3]
(Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False))
Loading

0 comments on commit 83e94b4

Please sign in to comment.