Skip to content

Commit

Permalink
Merge pull request #231 from monarch-initiative/form-statistical-tests
Browse files Browse the repository at this point in the history
Split `analysis` package into analysis types
  • Loading branch information
ielis authored Aug 27, 2024
2 parents 1c12f31 + 6dfebf4 commit 0ac0df9
Show file tree
Hide file tree
Showing 57 changed files with 3,149 additions and 572 deletions.
3 changes: 2 additions & 1 deletion .flake8
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@ max-line-length = 120

# F405 - ignore error if class is imported in `*` import.
# W293 - blank line contains whitespace
ignore = F405,W293
# W503 - line break before binary operator
ignore = F405,W293,W503

exclude =
.git,
Expand Down
Binary file removed docs/img/phenotype_group_counts.png
Binary file not shown.
Binary file added docs/img/rere_phenotype_score_boxplot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
275 changes: 259 additions & 16 deletions docs/report/tbx5_frameshift_vs_missense.csv

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
Expand Up @@ -48,9 +48,9 @@
<h1>Phenotype testing report</h1>
<p>Phenotype MTC filter: <em>HPO MTC filter</em></p>
<p>Multiple testing correction: <em>fdr_bh</em></p>
<p>Performed statistical tests for 15 out of the total of 259 HPO terms.</p>
<p>Performed statistical tests for 16 out of the total of 259 HPO terms.</p>
<table>
<caption>Using <em>HPO MTC filter</em>, 244 term(s) were omitted from statistical analysis.</caption>
<caption>Using <em>HPO MTC filter</em>, 243 term(s) were omitted from statistical analysis.</caption>
<tbody>
<tr>
<th>Code</th>
Expand All @@ -61,15 +61,15 @@ <h1>Phenotype testing report</h1>
<tr>
<!-- TODO: plug the real reason code here -->
<td>TODO</td>
<td>Skipping term because all genotypes have same HPO observed proportions</td>
<td>42</td>
<td>Skipping general term</td>
<td>43</td>
</tr>

<tr>
<!-- TODO: plug the real reason code here -->
<td>TODO</td>
<td>Skipping general term</td>
<td>41</td>
<td>Skipping term because all genotypes have same HPO observed proportions</td>
<td>42</td>
</tr>

<tr>
Expand Down Expand Up @@ -135,13 +135,6 @@ <h1>Phenotype testing report</h1>
<td>3</td>
</tr>

<tr>
<!-- TODO: plug the real reason code here -->
<td>TODO</td>
<td>Skipping non-target term</td>
<td>3</td>
</tr>

<tr>
<!-- TODO: plug the real reason code here -->
<td>TODO</td>
Expand Down
125 changes: 75 additions & 50 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,26 +9,26 @@ The tutorial illustrates just one of the many ways GPSEA can be used to characte
The :ref:`user-guide` contains details about additional methods and functionalities.


The tutorial presents an analysis of a cohort of individuals with pathogenic variants in *TBX5* leading to
`Holt-Oram syndrome MIM:142900 <https://omim.org/entry/142900>`_.
The tutorial presents an analysis of a cohort of individuals with pathogenic variants in *TBX5* leading to
`Holt-Oram syndrome MIM:142900 <https://omim.org/entry/142900>`_.

Holt-Oram syndrome is an autosomal dominant disorder characterized by
Holt-Oram syndrome is an autosomal dominant disorder characterized by
upper limb defects, congenital heart defects, and arrhythmias (`PMID:38336121 <https://pubmed.ncbi.nlm.nih.gov/38336121/>`_).
It has been observed in the literature that congenital defects of the ventricular and atrial septum are more
common in the truncating than in the missense variants (`PMID:30552424 <https://pubmed.ncbi.nlm.nih.gov/30552424/>`_).
Additionally, upper limb defects are more frequent in patients with protein-truncating variants (`PMID:38336121 <https://pubmed.ncbi.nlm.nih.gov/38336121/>`_).

To perform the analysis, we curated the literature and created on `GA4GH phenopacket <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_ for each
To perform the analysis, we curated the literature and created on `GA4GH phenopacket <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_ for each
affected individual. The phenopackets are made available in `Phenopacket Store <https://github.com/monarch-initiative/phenopacket-store>`_.



The analysis
~~~~~~~~~~~~

For the analysis, the `MANE <https://www.ncbi.nlm.nih.gov/refseq/MANE/>`_ transcript
For the analysis, the `MANE <https://www.ncbi.nlm.nih.gov/refseq/MANE/>`_ transcript
(i.e., the "main" biomedically relevant transcript of a gene) should be chosen unless
there is a specific reason not to (which should occur rarely if at all).
there is a specific reason not to (which should occur rarely if at all).

In the case of *TBX5* the MANE transcript is `NM_181486.4`. Note that the trascript identifier (`NM_181486`) and the version (`4`) are both required.
A good way to find the MANE transcript is to search on the gene symbol (e.g., *TBX5*) in `ClinVar <https://www.ncbi.nlm.nih.gov/clinvar/>`_ and to
Expand All @@ -49,7 +49,7 @@ corresponding protein accession `NP_852259.1`.
Load HPO
^^^^^^^^

GPSEA needs HPO to do the analysis.
GPSEA needs HPO to do the analysis.
We use HPO toolkit to load HPO version `v2023-10-09`:

>>> import hpotk
Expand All @@ -75,13 +75,13 @@ and stored in `Phenopacket Store <https://github.com/monarch-initiative/phenopac
156

We loaded 156 phenopackets which need further preprocessing to prepare for the analysis.
We will compute functional annotations for the mutations and then include the individuals into
We will compute functional annotations for the mutations and then include the individuals into
a :class:`~gpsea.model.Cohort`:

>>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
>>> cohort_creator = configure_caching_cohort_creator(hpo)
>>> cohort, validation = load_phenopackets( # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
... phenopackets=phenopackets,
... phenopackets=phenopackets,
... cohort_creator=cohort_creator,
... )
Patients Created: ...
Expand All @@ -96,7 +96,7 @@ We loaded the patient data into a `cohort` which is ready for the next steps.

.. seealso::

Here we show how to create a :class:`~gpsea.model.Cohort` from phenopackets.
Here we show how to create a :class:`~gpsea.model.Cohort` from phenopackets.
See :ref:`input-data` section to learn how to create a cohort from another inputs.


Expand All @@ -113,7 +113,7 @@ We can now explore the cohort to see how many patients are included.

.. raw:: html
:file: report/tbx5_cohort_info.html

.. note::

The report can also be displayed directly in a Jupyter notebook by running::
Expand All @@ -123,14 +123,14 @@ We can now explore the cohort to see how many patients are included.

Now we can show the distribution of variants with respect to the encoded protein.
We first obtain `tx_coordinates` (:class:`~gpsea.model.TranscriptCoordinates`)
and `protein_meta` (:class:`~gpsea.model.ProteinMetadata`)
and `protein_meta` (:class:`~gpsea.model.ProteinMetadata`)
with information about the transcript and protein "anatomy":

>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import configure_protein_metadata_service, VVMultiCoordinateService
>>> txc_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> pms = configure_protein_metadata_service()
>>> tx_coordinates = txc_service.fetch(tx_id)
>>> tx_coordinates = txc_service.fetch(tx_id)
>>> protein_meta = pms.annotate(px_id)

and we follow with plotting the diagram of the mutations on the protein:
Expand Down Expand Up @@ -158,7 +158,7 @@ Prepare genotype and phenotype predicates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will create a predicate to bin patients into group
depending on presence of a missense and frameshift variant to test
depending on presence of a missense and frameshift variant to test
if there is a difference between frameshift and non-frameshift variants
in the individuals of the *TBX5* cohort.

Expand All @@ -176,74 +176,99 @@ in the individuals of the *TBX5* cohort.

.. note::

There are many other ways to set up a predicate for testing
There are many other ways to set up a predicate for testing
for a GP correlation.
See the :ref:`predicates` section to learn more about building
a predicate of interest.

The phenotype grouping is based on presence or absence of an HPO term.
We take advantage of the ontology "true path rule" to infer presence
of the ancestor terms for all present HPO terms
(e.g. presence of `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_
in an individual annotated with `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_)
and exclusion of the descendant terms for all excluded terms (e.g. exclusion of
`Motor seizure <https://hpo.jax.org/browse/term/HP:0020219>`_
in an individual where `Seizure <https://hpo.jax.org/browse/term/HP:0001250>`_
was excluded):

>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
... cohort=cohort,
... hpo=hpo,
... min_n_of_patients_with_term=2,
... )

By default, GPSEA will perform one hypothesis test for each HPO term used to annotate more than one individual in the cohort.
This also includes the terms implied by the ontology "true path rule",
which states that presence of a term
(e.g., `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_)
implies presence of all its ancestor terms
(e.g., `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_,
`Abnormal cardiac septum morphology <https://hpo.jax.org/browse/term/HP:0001671>`_,
`Abnormal cardiac ventricle morphology <https://hpo.jax.org/browse/term/HP:0001713>`_, ...).
However, testing multiple hypothesis increases the chance of receiving false positive result,
and multiple testing correction must be applied.
See :ref:`mtc` for information about how to perform multiple testing correction with GPSEA.
By default, GPSEA will perform one hypothesis test for each HPO term used to annotate two or more individuals in the cohort
(see ``min_n_of_patients_with_term=2`` above).
Testing multiple hypothesis on the same dataset increases the chance of receiving false positive result.
However, GPSEA simplifies the application of an appropriate multiple testing correction.

For general use, we recommend using a combination
of a *Phenotype MTC filter* (:class:`~gpsea.analysis.PhenotypeMtcFilter`) with a *multiple testing correction*.
Phenotype MTC filter chooses the HPO terms to test according to several heuristics, which
Phenotype MTC filter chooses the HPO terms to test according to several heuristics, which
reduce the multiple testing burden and focus the analysis
on the most interesting terms (see :ref:`HPO MTC filter <hpo-mtc-filter-strategy>` for more info).
Then the multiple testing correction, such as Bonferroni or Benjamini-Hochberg,
is used to control the family-wise error rate or the false discovery rate.
See :ref:`mtc` for more information.

Here we use HPO MTC filter (:meth:`~gpsea.analysis.CohortAnalysisConfiguration.hpo_mtc_strategy`)
along with Benjamini-Hochberg procedure (:meth:`~gpsea.analysis.CohortAnalysisConfiguration.pval_correction`):
In this example, we will use a combination of the HPO MTC filter (:class:`~gpsea.analysis.mtc_filter.HpoMtcFilter`)
with Benjamini-Hochberg procedure (``mtc_correction='fdr_bh'``)
with a false discovery control level at (``mtc_alpha=0.05``):

>>> from gpsea.analysis import configure_cohort_analysis, CohortAnalysisConfiguration
>>> config = CohortAnalysisConfiguration()
>>> config.hpo_mtc_strategy()
>>> config.pval_correction = 'fdr_bh'
>>> analysis = configure_cohort_analysis(
... cohort=cohort,
... hpo=hpo,
... config=config,
>>> from gpsea.analysis.mtc_filter import HpoMtcFilter
>>> mtc_filter = HpoMtcFilter.default_filter(hpo, term_frequency_threshold=0.2)
>>> mtc_correction = 'fdr_bh'
>>> mtc_alpha = 0.05

Choosing the statistical procedure for assessment of association between genotype and phenotype
groups is the last missing piece of the analysis. We will use Fisher Exact Test:

>>> from gpsea.analysis.pcats.stats import ScipyFisherExact
>>> count_statistic = ScipyFisherExact()

and we finalize the analysis setup by putting all components together
into :class:`~gpsea.analysis.pcats.HpoTermAnalysis`:

>>> from gpsea.analysis.pcats import HpoTermAnalysis
>>> analysis = HpoTermAnalysis(
... count_statistic=count_statistic,
... mtc_filter=mtc_filter,
... mtc_correction=mtc_correction,
... mtc_alpha=mtc_alpha,
... )

Now we can perform the analysis and investigate the results.

>>> result = analysis.compare_genotype_vs_cohort_phenotypes(gt_predicate)
>>> result = analysis.compare_genotype_vs_phenotypes(
... cohort=cohort,
... gt_predicate=gt_predicate,
... pheno_predicates=pheno_predicates,
... )
>>> result.total_tests
16

We only tested 16 HPO terms. This is despite the individuals being collectively annotated with
259 direct and indirect HPO terms

>>> from gpsea.analysis import prepare_hpo_terms_of_interest
>>> terms = prepare_hpo_terms_of_interest(hpo, cohort.all_patients, min_n_of_patients_with_term=2)
>>> len(terms)
>>> len(result.phenotypes)
259

We can show the reasoning behind *not* testing 243 (`259 - 16`) HPO terms
by exploring the phenotype MTC filtering report.

>>> from gpsea.view import MtcStatsViewer
>>> mtc_viewer = MtcStatsViewer()
>>> mtc_report = mtc_viewer.process(result.mtc_filter_report)
>>> with open('docs/report/tbx5_mtc_report.html', 'w') as fh: # doctest: +SKIP
>>> mtc_viewer = MtcStatsViewer()
>>> mtc_report = mtc_viewer.process(result)
>>> with open('docs/report/tbx5_frameshift_vs_missense.mtc_report.html', 'w') as fh: # doctest: +SKIP
... _ = fh.write(mtc_report)

.. raw:: html
:file: report/tbx5_mtc_report.html
:file: report/tbx5_frameshift_vs_missense.mtc_report.html

..
TODO:
TODO:
Show how to write out the tested HPO terms.

and these are the HPO terms ordered by the p value corrected with the Benjamini-Hochberg procedure:
Expand All @@ -259,8 +284,8 @@ and these are the HPO terms ordered by the p value corrected with the Benjamini-
We see that several HPO terms are significantly associated
with presence of a frameshift variant in *TBX5*.
For example, `Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_
was observed in 31/60 (52%) patients with a missense variant
was observed in 31/60 (52%) patients with a missense variant
but it was observed in 19/19 (100%) patients with a frameshift variant.
Fisher exact test computed a p value of `~0.0000562`
and the p value corrected by Benjamini-Hochberg procedure
Fisher exact test computed a p value of `~0.0000562`
and the p value corrected by Benjamini-Hochberg procedure
is `~0.00112`.
89 changes: 89 additions & 0 deletions docs/user-guide/glossary.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
.. _glossary:

========
Glossary
========

The glossary summarizes several frequently used concepts.

.. _true-path-rule:

True path rule
~~~~~~~~~~~~~~

The true path rule of ontologies states that an item (e.g. an individual) annotated with an ontology term
(e.g. `Focal-onset seizure <https://hpo.jax.org/browse/term/HP:0007359>`_)
is implicitly annotated with all its *ancestor* terms
(`Seizure <https://hpo.jax.org/browse/term/HP:0001250>`_,
`Abnormal nervous system physiology <https://hpo.jax.org/browse/term/HP:0012638>`_, ...).
Conversely, exclusion of a term (e.g. `Abnormal ventricular septum morphology <https://hpo.jax.org/browse/term/HP:0010438>`_)
implies exclusion of all its *descendants*
(`Ventricular septal defect <https://hpo.jax.org/browse/term/HP:0001629>`_,
`Ventricular septal aneurysm <https://hpo.jax.org/browse/term/HP:0030957>`_, ...).


.. _length-of-the-reference-allele:

Length of the reference allele
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The length of the REF allele corresponds to the length of the genomic region affected by the variant.

Let's show a few examples.

>>> from gpsea.model import VariantCoordinates
>>> from gpsea.model.genome import GRCh38
>>> chr1 = GRCh38.contig_by_name("chr1")

The length of the reference allele of a missense variant is 1
because the variant affects a 1-bp region spanned by the ``C`` nucleotide:

>>> missense = VariantCoordinates.from_vcf_literal(chr1, 1001, 'C', 'T')
>>> len(missense)
1

The length of a "small" deletion is the same as the length of the ref allele `str`:
(``'CCC'`` in the example below):

>>> deletion = VariantCoordinates.from_vcf_literal(chr1, 1001, 'CCC', 'C')
>>> len(deletion)
3

This is because the literal notation spells out the alleles.
However, this simple rule does not apply in symbolic notation.
Here, the REF length corresponds to the length of the allele region.

For instance, for the following structural variant::

#CHROM POS ID REF ALT QUAL FILTER INFO
1 1001 . C <DEL> 6 PASS SVTYPE=DEL;END=1003;SVLEN=-3

the length of the REF allele is `3`:

>>> sv_deletion = VariantCoordinates.from_vcf_symbolic(
... chr1, pos=1001, end=1003,
... ref='C', alt='<DEL>', svlen=-3,
... )
>>> len(sv_deletion)
3

because the deletion removes 3 base pairs at the coordinates :math:`[1001, 1003]`.


.. _change-length-of-an-allele:

Change length of an allele
~~~~~~~~~~~~~~~~~~~~~~~~~~

Change length is defined as the difference between lengths of the *alt* and *ref* alleles.
SNVs lead to change length of zero, deletions and insertions/duplications lead to negative
and positive change lengths, respectively. See the table below for examples.

================== ================== ===============
Reference allele Alternate allele Change length
================== ================== ===============
C T 0
CG AT 0
ACTG A -3
A AAT 2
================== ================== ===============
Loading

0 comments on commit 0ac0df9

Please sign in to comment.