Merge pull request #231 from monarch-initiative/form-statistical-tests

Split `analysis` package into analysis types
monarch-initiative · Aug 27, 2024 · 0ac0df9 · 0ac0df9
2 parents 1c12f31 + 6dfebf4
commit 0ac0df9
Show file tree

Hide file tree

Showing 57 changed files with 3,149 additions and 572 deletions.
diff --git a/.flake8 b/.flake8
@@ -3,7 +3,8 @@ max-line-length = 120
 
 # F405 - ignore error if class is imported in `*` import.
 # W293 - blank line contains whitespace
-ignore = F405,W293
+# W503 - line break before binary operator
+ignore = F405,W293,W503
 
 exclude =
     .git,

diff --git a/docs/img/phenotype_group_counts.png b/docs/img/phenotype_group_counts.png
diff --git a/docs/img/rere_phenotype_score_boxplot.png b/docs/img/rere_phenotype_score_boxplot.png
diff --git a/docs/report/tbx5_frameshift_vs_missense.csv b/docs/report/tbx5_frameshift_vs_missense.csv
diff --git a/docs/report/tbx5_mtc_report.html → ...x5_frameshift_vs_missense.mtc_report.html b/docs/report/tbx5_mtc_report.html → ...x5_frameshift_vs_missense.mtc_report.html
@@ -48,9 +48,9 @@
   <h1>Phenotype testing report</h1>
   <p>Phenotype MTC filter: <em>HPO MTC filter</em></p>
   <p>Multiple testing correction: <em>fdr_bh</em></p>
-  <p>Performed statistical tests for 15 out of the total of 259 HPO terms.</p>
+  <p>Performed statistical tests for 16 out of the total of 259 HPO terms.</p>
     <table>
-      <caption>Using <em>HPO MTC filter</em>, 244 term(s) were omitted from statistical analysis.</caption>
+      <caption>Using <em>HPO MTC filter</em>, 243 term(s) were omitted from statistical analysis.</caption>
         <tbody>
           <tr>
             <th>Code</th>
@@ -61,15 +61,15 @@ <h1>Phenotype testing report</h1>
           <tr>
             <!-- TODO: plug the real reason code here -->
             <td>TODO</td>
-            <td>Skipping term because all genotypes have same HPO observed proportions</td>
-            <td>42</td>
+            <td>Skipping general term</td>
+            <td>43</td>
           </tr>
 
           <tr>
             <!-- TODO: plug the real reason code here -->
             <td>TODO</td>
-            <td>Skipping general term</td>
-            <td>41</td>
+            <td>Skipping term because all genotypes have same HPO observed proportions</td>
+            <td>42</td>
           </tr>
 
           <tr>
@@ -135,13 +135,6 @@ <h1>Phenotype testing report</h1>
             <td>3</td>
           </tr>
 
-          <tr>
-            <!-- TODO: plug the real reason code here -->
-            <td>TODO</td>
-            <td>Skipping non-target term</td>
-            <td>3</td>
-          </tr>
-
           <tr>
             <!-- TODO: plug the real reason code here -->
             <td>TODO</td>

diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -9,26 +9,26 @@ The tutorial illustrates just one of the many ways GPSEA can be used to characte
 The :ref:`user-guide` contains details about additional methods and functionalities.
 
 
-The tutorial presents an analysis of a cohort of individuals with pathogenic variants in *TBX5* leading to 
-`Holt-Oram syndrome MIM:142900 <https://omim.org/entry/142900>`_. 
+The tutorial presents an analysis of a cohort of individuals with pathogenic variants in *TBX5* leading to
+`Holt-Oram syndrome MIM:142900 <https://omim.org/entry/142900>`_.
 
-Holt-Oram syndrome is an autosomal dominant disorder characterized by 
+Holt-Oram syndrome is an autosomal dominant disorder characterized by
 upper limb defects, congenital heart defects, and arrhythmias (`PMID:38336121 <https://pubmed.ncbi.nlm.nih.gov/38336121/>`_).
 It has been observed in the literature that congenital defects of the ventricular and atrial septum are more
 common in the truncating than in the missense variants (`PMID:30552424 <https://pubmed.ncbi.nlm.nih.gov/30552424/>`_).
 Additionally, upper limb defects are more frequent in patients with protein-truncating variants (`PMID:38336121 <https://pubmed.ncbi.nlm.nih.gov/38336121/>`_).
 
-To perform the analysis, we curated the literature and created on `GA4GH phenopacket <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_ for each 
+To perform the analysis, we curated the literature and created on `GA4GH phenopacket <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_ for each
 affected individual. The phenopackets are made available in `Phenopacket Store <https://github.com/monarch-initiative/phenopacket-store>`_.
 
 
 
 The analysis
 ~~~~~~~~~~~~
 
-For the analysis, the `MANE <https://www.ncbi.nlm.nih.gov/refseq/MANE/>`_ transcript 
+For the analysis, the `MANE <https://www.ncbi.nlm.nih.gov/refseq/MANE/>`_ transcript
 (i.e., the "main" biomedically relevant transcript of a gene) should be chosen unless
-there is a specific reason not to (which should occur rarely if at all). 
+there is a specific reason not to (which should occur rarely if at all).
 
 In the case of *TBX5* the MANE transcript is `NM_181486.4`. Note that the trascript identifier (`NM_181486`) and the version (`4`) are both required.
 A good way to find the MANE transcript is to search on the gene symbol (e.g., *TBX5*) in `ClinVar <https://www.ncbi.nlm.nih.gov/clinvar/>`_ and to
@@ -49,7 +49,7 @@ corresponding protein accession `NP_852259.1`.
 Load HPO
 ^^^^^^^^
 
-GPSEA needs HPO to do the analysis. 
+GPSEA needs HPO to do the analysis.
 We use HPO toolkit to load HPO version `v2023-10-09`:
 
 >>> import hpotk
@@ -75,13 +75,13 @@ and stored in `Phenopacket Store <https://github.com/monarch-initiative/phenopac
 156
 
 We loaded 156 phenopackets which need further preprocessing to prepare for the analysis.
-We will compute functional annotations for the mutations and then include the individuals into 
+We will compute functional annotations for the mutations and then include the individuals into
 a :class:`~gpsea.model.Cohort`:
 
 >>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
 >>> cohort_creator = configure_caching_cohort_creator(hpo)
 >>> cohort, validation = load_phenopackets(  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
-...     phenopackets=phenopackets, 
+...     phenopackets=phenopackets,
 ...     cohort_creator=cohort_creator,
 ... )
 Patients Created: ...
@@ -96,7 +96,7 @@ We loaded the patient data into a `cohort` which is ready for the next steps.
 
 .. seealso::
 
-  Here we show how to create a :class:`~gpsea.model.Cohort` from phenopackets. 
+  Here we show how to create a :class:`~gpsea.model.Cohort` from phenopackets.
   See :ref:`input-data` section to learn how to create a cohort from another inputs.
 
 
@@ -113,7 +113,7 @@ We can now explore the cohort to see how many patients are included.
 
 .. raw:: html
   :file: report/tbx5_cohort_info.html
-  
+
 .. note::
 
   The report can also be displayed directly in a Jupyter notebook by running::
@@ -123,14 +123,14 @@ We can now explore the cohort to see how many patients are included.
 
 Now we can show the distribution of variants with respect to the encoded protein.
 We first obtain `tx_coordinates` (:class:`~gpsea.model.TranscriptCoordinates`)
-and `protein_meta` (:class:`~gpsea.model.ProteinMetadata`) 
+and `protein_meta` (:class:`~gpsea.model.ProteinMetadata`)
 with information about the transcript and protein "anatomy":
 
 >>> from gpsea.model.genome import GRCh38
 >>> from gpsea.preprocessing import configure_protein_metadata_service, VVMultiCoordinateService
 >>> txc_service = VVMultiCoordinateService(genome_build=GRCh38)
 >>> pms = configure_protein_metadata_service()
->>> tx_coordinates = txc_service.fetch(tx_id) 
+>>> tx_coordinates = txc_service.fetch(tx_id)
 >>> protein_meta = pms.annotate(px_id)
 
 and we follow with plotting the diagram of the mutations on the protein:
@@ -158,7 +158,7 @@ Prepare genotype and phenotype predicates
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We will create a predicate to bin patients into group
-depending on presence of a missense and frameshift variant to test 
+depending on presence of a missense and frameshift variant to test
 if there is a difference between frameshift and non-frameshift variants
 in the individuals of the *TBX5* cohort.
 
@@ -176,74 +176,99 @@ in the individuals of the *TBX5* cohort.
 
 .. note::
 
-  There are many other ways to set up a predicate for testing 
+  There are many other ways to set up a predicate for testing
   for a GP correlation.
   See the :ref:`predicates` section to learn more about building
   a predicate of interest.
 
+The phenotype grouping is based on presence or absence of an HPO term.
+We take advantage of the ontology "true path rule" to infer presence
+of the ancestor terms for all present HPO terms
+(e.g. presence of `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_
+in an individual annotated with `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_)
+and exclusion of the descendant terms for all excluded terms (e.g. exclusion of
+`Motor seizure <https://hpo.jax.org/browse/term/HP:0020219>`_
+in an individual where `Seizure <https://hpo.jax.org/browse/term/HP:0001250>`_
+was excluded):
+
+>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
+>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
+...     cohort=cohort,
+...     hpo=hpo,
+...     min_n_of_patients_with_term=2,
+... )
 
-By default, GPSEA will perform one hypothesis test for each HPO term used to annotate more than one individual in the cohort.
-This also includes the terms implied by the ontology "true path rule", 
-which states that presence of a term 
-(e.g., `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_)
-implies presence of all its ancestor terms 
-(e.g., `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_, 
-`Abnormal cardiac septum morphology <https://hpo.jax.org/browse/term/HP:0001671>`_,
-`Abnormal cardiac ventricle morphology <https://hpo.jax.org/browse/term/HP:0001713>`_, ...).
-However, testing multiple hypothesis increases the chance of receiving false positive result,
-and multiple testing correction must be applied.
-See :ref:`mtc` for information about how to perform multiple testing correction with GPSEA. 
+By default, GPSEA will perform one hypothesis test for each HPO term used to annotate two or more individuals in the cohort
+(see ``min_n_of_patients_with_term=2`` above).
+Testing multiple hypothesis on the same dataset increases the chance of receiving false positive result.
+However, GPSEA simplifies the application of an appropriate multiple testing correction.
 
 For general use, we recommend using a combination
 of a *Phenotype MTC filter* (:class:`~gpsea.analysis.PhenotypeMtcFilter`) with a *multiple testing correction*.
-Phenotype MTC filter chooses the HPO terms to test according to several heuristics, which 
+Phenotype MTC filter chooses the HPO terms to test according to several heuristics, which
 reduce the multiple testing burden and focus the analysis
 on the most interesting terms (see :ref:`HPO MTC filter <hpo-mtc-filter-strategy>` for more info).
 Then the multiple testing correction, such as Bonferroni or Benjamini-Hochberg,
 is used to control the family-wise error rate or the false discovery rate.
+See :ref:`mtc` for more information.
 
-Here we use HPO MTC filter (:meth:`~gpsea.analysis.CohortAnalysisConfiguration.hpo_mtc_strategy`)
-along with Benjamini-Hochberg procedure (:meth:`~gpsea.analysis.CohortAnalysisConfiguration.pval_correction`):
+In this example, we will use a combination of the HPO MTC filter (:class:`~gpsea.analysis.mtc_filter.HpoMtcFilter`)
+with Benjamini-Hochberg procedure (``mtc_correction='fdr_bh'``)
+with a false discovery control level at (``mtc_alpha=0.05``):
 
->>> from gpsea.analysis import configure_cohort_analysis, CohortAnalysisConfiguration
->>> config = CohortAnalysisConfiguration()
->>> config.hpo_mtc_strategy()
->>> config.pval_correction = 'fdr_bh'
->>> analysis = configure_cohort_analysis(
-...     cohort=cohort,
-...     hpo=hpo,
-...     config=config,
+>>> from gpsea.analysis.mtc_filter import HpoMtcFilter
+>>> mtc_filter = HpoMtcFilter.default_filter(hpo, term_frequency_threshold=0.2)
+>>> mtc_correction = 'fdr_bh'
+>>> mtc_alpha = 0.05
+
+Choosing the statistical procedure for assessment of association between genotype and phenotype
+groups is the last missing piece of the analysis. We will use Fisher Exact Test:
+
+>>> from gpsea.analysis.pcats.stats import ScipyFisherExact
+>>> count_statistic = ScipyFisherExact()
+
+and we finalize the analysis setup by putting all components together
+into :class:`~gpsea.analysis.pcats.HpoTermAnalysis`:
+
+>>> from gpsea.analysis.pcats import HpoTermAnalysis
+>>> analysis = HpoTermAnalysis(
+...     count_statistic=count_statistic,
+...     mtc_filter=mtc_filter,
+...     mtc_correction=mtc_correction,
+...     mtc_alpha=mtc_alpha,
 ... )
 
 Now we can perform the analysis and investigate the results.
 
->>> result = analysis.compare_genotype_vs_cohort_phenotypes(gt_predicate)
+>>> result = analysis.compare_genotype_vs_phenotypes(
+...     cohort=cohort,
+...     gt_predicate=gt_predicate,
+...     pheno_predicates=pheno_predicates,
+... )
 >>> result.total_tests
 16
 
 We only tested 16 HPO terms. This is despite the individuals being collectively annotated with
 259 direct and indirect HPO terms
 
->>> from gpsea.analysis import prepare_hpo_terms_of_interest
->>> terms = prepare_hpo_terms_of_interest(hpo, cohort.all_patients, min_n_of_patients_with_term=2)
->>> len(terms)
+>>> len(result.phenotypes)
 259
 
 We can show the reasoning behind *not* testing 243 (`259 - 16`) HPO terms
 by exploring the phenotype MTC filtering report.
 
 >>> from gpsea.view import MtcStatsViewer
->>> mtc_viewer = MtcStatsViewer() 
->>> mtc_report = mtc_viewer.process(result.mtc_filter_report)
->>> with open('docs/report/tbx5_mtc_report.html', 'w') as fh:  # doctest: +SKIP
+>>> mtc_viewer = MtcStatsViewer()
+>>> mtc_report = mtc_viewer.process(result)
+>>> with open('docs/report/tbx5_frameshift_vs_missense.mtc_report.html', 'w') as fh:  # doctest: +SKIP
 ...     _ = fh.write(mtc_report)
 
 .. raw:: html
-  :file: report/tbx5_mtc_report.html
+  :file: report/tbx5_frameshift_vs_missense.mtc_report.html
 
 ..
-  
-  TODO: 
+
+  TODO:
   Show how to write out the tested HPO terms.
 
 and these are the HPO terms ordered by the p value corrected with the Benjamini-Hochberg procedure:
@@ -259,8 +284,8 @@ and these are the HPO terms ordered by the p value corrected with the Benjamini-
 We see that several HPO terms are significantly associated
 with presence of a frameshift variant in *TBX5*.
 For example, `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_
-was observed in 31/60 (52%) patients with a missense variant 
+was observed in 31/60 (52%) patients with a missense variant
 but it was observed in 19/19 (100%) patients with a frameshift variant.
-Fisher exact test computed a p value of `~0.0000562` 
-and the p value corrected by Benjamini-Hochberg procedure 
+Fisher exact test computed a p value of `~0.0000562`
+and the p value corrected by Benjamini-Hochberg procedure
 is `~0.00112`.
diff --git a/docs/user-guide/glossary.rst b/docs/user-guide/glossary.rst
@@ -0,0 +1,89 @@
+.. _glossary:
+
+========
+Glossary
+========
+
+The glossary summarizes several frequently used concepts.
+
+.. _true-path-rule:
+
+True path rule
+~~~~~~~~~~~~~~
+
+The true path rule of ontologies states that an item (e.g. an individual) annotated with an ontology term
+(e.g. `Focal-onset seizure <https://hpo.jax.org/browse/term/HP:0007359>`_)
+is implicitly annotated with all its *ancestor* terms
+(`Seizure <https://hpo.jax.org/browse/term/HP:0001250>`_,
+`Abnormal nervous system physiology <https://hpo.jax.org/browse/term/HP:0012638>`_, ...).
+Conversely, exclusion of a term (e.g. `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_)
+implies exclusion of all its *descendants*
+(`Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_,
+`Ventricular septal aneurysm <https://hpo.jax.org/browse/term/HP:0030957>`_, ...).
+
+
+.. _length-of-the-reference-allele:
+
+Length of the reference allele
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The length of the REF allele corresponds to the length of the genomic region affected by the variant.
+
+Let's show a few examples.
+
+>>> from gpsea.model import VariantCoordinates
+>>> from gpsea.model.genome import GRCh38
+>>> chr1 = GRCh38.contig_by_name("chr1")
+
+The length of the reference allele of a missense variant is 1
+because the variant affects a 1-bp region spanned by the ``C`` nucleotide:
+
+>>> missense = VariantCoordinates.from_vcf_literal(chr1, 1001, 'C', 'T')
+>>> len(missense)
+1
+
+The length of a "small" deletion is the same as the length of the ref allele `str`:
+(``'CCC'`` in the example below):
+
+>>> deletion = VariantCoordinates.from_vcf_literal(chr1, 1001, 'CCC', 'C')
+>>> len(deletion)
+3
+
+This is because the literal notation spells out the alleles.
+However, this simple rule does not apply in symbolic notation.
+Here, the REF length corresponds to the length of the allele region.
+
+For instance, for the following structural variant::
+
+    #CHROM   POS    ID   REF  ALT     QUAL   FILTER   INFO
+    1        1001   .    C    <DEL>   6      PASS     SVTYPE=DEL;END=1003;SVLEN=-3
+
+the length of the REF allele is `3`:
+
+>>> sv_deletion = VariantCoordinates.from_vcf_symbolic(
+...     chr1, pos=1001, end=1003,
+...     ref='C', alt='<DEL>', svlen=-3,
+... )
+>>> len(sv_deletion)
+3
+
+because the deletion removes 3 base pairs at the coordinates :math:`[1001, 1003]`.
+
+
+.. _change-length-of-an-allele:
+
+Change length of an allele
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Change length is defined as the difference between lengths of the *alt* and *ref* alleles.
+SNVs lead to change length of zero, deletions and insertions/duplications lead to negative
+and positive change lengths, respectively. See the table below for examples.
+
+==================  ==================  ===============
+ Reference allele    Alternate allele    Change length
+==================  ==================  ===============
+ C                   T                    0
+ CG                  AT                   0
+ ACTG                A                   -3
+ A                   AAT                  2
+==================  ==================  ===============