Merge pull request #400 from monarch-initiative/release

Release
monarch-initiative · Jan 16, 2025 · 83e94b4 · 83e94b4
2 parents e14538f + 48abc8b
commit 83e94b4
Show file tree

Hide file tree

Showing 35 changed files with 530 additions and 114 deletions.
diff --git a/README.md b/README.md
@@ -5,10 +5,9 @@
 
 GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations) is a Python package for finding genotype-phenotype associations.
 
-
-See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) 
-and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)
-for more information.
+See our documentation for the [setup](https://monarch-initiative.github.io/gpsea/stable/setup.html) instructions,
+a [tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) with an end-to-end genotype-phenotype association analysis,
+and a comprehensive [user guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) with everything else.
 
 The documentation comes in two flavors:
 

diff --git a/docs/conf.py b/docs/conf.py
@@ -63,7 +63,7 @@
 # The short X.Y version.
 version = u'0.9'
 # The full version, including alpha/beta/rc tags.
-release = u'0.9.1'
+release = u'0.9.2'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/docs/img/tutorial/tbx5_protein_diagram.png b/docs/img/tutorial/tbx5_protein_diagram.png
diff --git a/docs/setup.rst b/docs/setup.rst
@@ -4,8 +4,8 @@
 Setup
 #####
 
-Here we show how to install GPSEA and to prepare your Python environment
-for genotype-phenotype association analysis.
+Here we show how to install GPSEA and prepare your Python environment
+for genotype-phenotype association analyses.
 
 
 .. contents:: Table of Contents

diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -273,10 +273,14 @@ depending on presence of a single allele of a missense or truncating variant
 >>> from gpsea.analysis.clf import monoallelic_classifier
 >>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id)
 >>> truncating_effects = (
+...    VariantEffect.TRANSCRIPT_ABLATION,
+...    VariantEffect.TRANSCRIPT_TRANSLOCATION,
 ...    VariantEffect.FRAMESHIFT_VARIANT,
+...    VariantEffect.START_LOST,
 ...    VariantEffect.STOP_GAINED,
 ...    VariantEffect.SPLICE_DONOR_VARIANT,
 ...    VariantEffect.SPLICE_ACCEPTOR_VARIANT,
+...    # more effects could be listed here ...
 ... )
 >>> is_truncating = anyof(variant_effect(e, tx_id) for e in truncating_effects)
 >>> gt_clf = monoallelic_classifier(

diff --git a/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst b/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst
@@ -26,13 +26,11 @@ The predicates operate on several lines of information:
 +------------------------+-------------------------------------------------------------------------------------------------+
 | Protein data           | variant is located in a region encoding a protein domain, protein feature type                  |
 +------------------------+-------------------------------------------------------------------------------------------------+
-| Genome                 | overlap with a genomic region of interest                                                       |
-+------------------------+-------------------------------------------------------------------------------------------------+
 
 
 The scope of the builtin predicates is fairly narrow
 and likely insufficient for real-life analyses.
-However, the predicates can be chained into a compound predicate
+However, several predicates can be "chained" into a compound predicate using a boolean logic,
 to achive more expressivity for testing complex conditions,
 such as "variant is a missense or synonymous variant located in exon 6 of `NM_013275.6`".
 
@@ -41,8 +39,9 @@ such as "variant is a missense or synonymous variant located in exon 6 of `NM_01
 Examples
 ********
 
-Here we show examples of several simple variant predicates and 
-how to chain them for testing complex conditions.
+Here we show how to use the builtin predicates for simple tests
+and how to build a compound predicate from the builtin predicates,
+for testing complex conditions.
 
 
 Load cohort
@@ -112,10 +111,10 @@ See the :mod:`gpsea.analysis.predicate` module
 for a complete list of the builtin predicates.
 
 
-Predicate chain
-===============
+Compound predicates
+===================
 
-Using the builtin predicates, we can build a logical chain to test complex conditions.
+A compound predicate for testing complex conditions can be built from two or more predicates.
 For instance, we can test if the variant meets any of several conditions:
 
 >>> import gpsea.analysis.predicate as vp
@@ -130,7 +129,13 @@ or *all* conditions:
 >>> missense_and_exon20.test(variant)
 True
 
-All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators, to allow chaining.
+All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators,
+to combine a predicate pair into a compound predicate.
+
+.. note::
+
+  Combining three or or more predicates can be achieved with :func:`~gpsea.analysis.allof`
+  and :func:`~gpsea.analysis.anyof` functions.
 
 Therefore, there is nothing that prevents us to combine the predicates into multi-level tests, 
 e.g. to test if the variant is a *"chromosomal deletion" or a deletion which removes at least 50 bp*:
@@ -180,12 +185,16 @@ The builtin predicates should cover majority of use cases.
 However, if a predicate seems to be missing,
 feel free to submit an issue in our
 `GitHub tracker <https://github.com/monarch-initiative/gpsea/issues>`_,
-or to implement a custom predicate
-by extending the :class:`~gpsea.analysis.predicate.VariantPredicate` class 😎.
+or implement your own predicate by following the :ref:`custom-variant-predicate`
+guide.
 
 
+****
+Next
+****
 
 The variant predicate offers a flexible API for testing if variants meet a condition.
-However, the genotype phenotype correlations are done on the individual level
-and the variant predicates are used as a component of the genotype predicate.
-The next sections show how to use variant predicates to assign individuals into groups.
+However, the genotype phenotype correlations are studied on the level of individuals.
+As described in :ref:`genotype-classifiers`, GPSEA uses the :class:`~gpsea.analysis.clf.GenotypeClassifier` API
+to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier.
+We explain the details in the following sections.
diff --git a/docs/user-guide/analyses/phenotype-scores.rst b/docs/user-guide/analyses/phenotype-scores.rst
@@ -121,17 +121,19 @@ In this example, the point mutation is a mutation that meets the following condi
 '((change length == 0 AND reference allele length == 1) AND MISSENSE_VARIANT on NM_001042681.2)'
 
 
-For the loss of function predicate, the following variant effects are considered loss of function:
+For the loss-of-function predicate, the following is a non-exhausting list
+of variant effects considered as a loss-of-function:
 
 >>> lof_effects = (
+...     VariantEffect.TRANSCRIPT_TRANSLOCATION,
 ...     VariantEffect.TRANSCRIPT_ABLATION,
 ...     VariantEffect.FRAMESHIFT_VARIANT,
 ...     VariantEffect.START_LOST,
 ...     VariantEffect.STOP_GAINED,
 ... )
 >>> lof_mutation = anyof(variant_effect(eff, tx_id) for eff in lof_effects)
 >>> lof_mutation.description
-'(TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)'
+'(TRANSCRIPT_TRANSLOCATION on NM_001042681.2 OR TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)'
 
 
 The genotype predicate will bin the patient into two classes: a point mutation or the loss of function:
@@ -154,6 +156,26 @@ Phenotype score
 This component is responsible for computing a phenotype score for an individual.
 As far as GPSEA framework is concerned, the phenotype score must be a floating point number
 or a `NaN` value if the score cannot be computed for an individual.
+This is the essence of the :class:`~gpsea.analysis.pscore.PhenotypeScorer` class.
+
+GPSEA ships with several builtin phenotype scorers which can be used as  
+
++------------------------------------------------------------+---------------------------------------------+
+| Name                                                       | Description                                 |
++============================================================+=============================================+
+|                                                            | Compute the total number of occurrences     |
+| * :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer`  | of specific phenotypic features             |
+|                                                            | (used in this section)                      |
++------------------------------------------------------------+---------------------------------------------+
+|                                                            | Compute the "adapted De Vries Score"        |
+| * :class:`~gpsea.analysis.pscore.DeVriesPhenotypeScorer`   | for assessing severity                      |
+|                                                            | of intellectual disability                  |
++------------------------------------------------------------+---------------------------------------------+
+
+.. tip::
+
+   See :ref:`custom-phenotype-scorer` section to learn how to build a phenotype scorer from scratch.
+
 
 Here we use the :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` for scoring
 the individuals based on the number of structural defects

diff --git a/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png b/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png
diff --git a/docs/user-guide/analyses/survival.rst b/docs/user-guide/analyses/survival.rst
@@ -188,3 +188,96 @@ or `None` if computing the survival was impossible (see :func:`~gpsea.analysis.t
 The `Survival` reports the number of days until attaining the endpoint,
 here defined as end stage renal disease (`is_censored=False`),
 or until the individual dropped out of the analysis (`is_censored=True`).
+
+
+Troubleshooting
+===============
+
+Sometimes the survival analysis fails and an :class:`~gpsea.analysis.AnalysisException` is raised.
+For instance, the current Logrank test implementation reports a p value of `NaN`
+if the survival is the same for all individuals.
+This is unlikely an expected outcome, therefore GPSEA raises
+an :class:`~gpsea.analysis.AnalysisException` to force the user to troubleshoot.
+
+To help with troubleshooting, the data computed prior detecting the error is included in the exception's
+:attr:`~gpsea.analysis.AnalysisException.data` attribute. In survival analysis, the data should include
+the identifiers, genotype classes, and survivals of the tested individuals.
+
+Let's show this on an example. We will create a toy cohort of 10 individuals
+with onset of `Lynch syndrome I <https://hpo.jax.org/browse/disease/OMIM:120435>`_
+(`OMIM:120435`) at 40 years.
+
+>>> from gpsea.model import Cohort, Patient, Disease, Age
+>>> onset = Age.from_iso8601_period("P40Y")
+>>> individuals = [
+...     Patient.from_raw_parts(
+...         labels=label,
+...         diseases=(
+...             Disease.from_raw_parts(
+...                 term_id="OMIM:120435",
+...                 name="Lynch syndrome I",
+...                 is_observed=True,
+...                 onset=onset,
+...             ),
+...         ),
+...     )
+...     for label in "ABCDEFGHIJ"  # 10 individuals
+... ]
+>>> cohort = Cohort.from_patients(individuals)
+
+We will assign them into genotype classes on random, ...
+
+>>> from gpsea.analysis.clf import random_classifier
+>>> gt_clf = random_classifier(seed=123)
+>>> gt_clf.description
+'Classify the individual into random classes'
+
+... using the Lynch syndrome I diagnosis as the endpoint ...
+
+>>> from gpsea.analysis.temporal.endpoint import disease_onset
+>>> endpoint = disease_onset(disease_id="OMIM:120435")
+>>> endpoint.description
+'Compute time until OMIM:120435 onset'
+
+... and we will use Logrank test for differences in survival.
+
+>>> from gpsea.analysis.temporal.stats import LogRankTest
+>>> survival_statistic = LogRankTest()
+
+We put together the survival analysis ...
+
+>>> from gpsea.analysis.temporal import SurvivalAnalysis
+>>> survival_analysis = SurvivalAnalysis(
+...     statistic=survival_statistic,
+... )
+
+... which we expect to fail with an :class:`~gpsea.analysis.AnalysisException`:
+
+>>> result = survival_analysis.compare_genotype_vs_survival(
+...     cohort=cohort,
+...     gt_clf=gt_clf,
+...     endpoint=endpoint,
+... )
+Traceback (most recent call last):
+  ...
+gpsea.analysis._base.AnalysisException: The survival values did not meet the expectation of the statistical test!
+
+The genotype classes and survival values can be retrieved from the exception:
+
+>>> from gpsea.analysis import AnalysisException
+>>> try:
+...     result = survival_analysis.compare_genotype_vs_survival(
+...         cohort=cohort,
+...         gt_clf=gt_clf,
+...         endpoint=endpoint,
+...     )
+... except AnalysisException as ae:
+...     genotypes = ae.data["genotype"]
+...     survivals = ae.data["survival"]
+
+and the values can come in handy in troubleshooting:
+
+>>> genotypes[:3]
+(0, 0, 0)
+>>> survivals[:3]
+(Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False))