Commit

Merge pull request #251 from monarch-initiative/documentation
updating documentation (WIP)
pnrobinson authored Sep 6, 2024
2 parents 817b64e + 40f3d91 commit a314f26
Showing 25 changed files with 588 additions and 523 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -3,7 +3,8 @@
![PyPi downloads](https://img.shields.io/pypi/dm/gpsea.svg?label=Pypi%20downloads)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gpsea)

GPSEA is a Python library for discovery of genotype-phenotype associations.
GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations, pronounced "G"-"P"-"C") is a Python package designed to support genotype-phenotype correlation analysis.


See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html)
and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -10,8 +10,8 @@ A key question in biology and human genetics concerns the relationships between
genetics, the focus is generally placed on the study of whether specific disease-causing alleles are associated with specific phenotypic
manifestations of the disease.

`GPSEA` (genotypes and phenotypes - study and evaluation of associations) is a Python package designed to support genotype-phenotype correlation analysis.
We pronounce GPSEA as "G"-"P"-"C". The input to `GPSEA` is a collection of `Global Alliance for Genomics and Health (GA4GH) Phenopackets <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_.
`GPSEA` (Genotypes and Phenotypes - Statistical Evaluation of Associations, pronounced "G"-"P"-"C") is a Python package designed to support genotype-phenotype correlation analysis.
The input to `GPSEA` is a collection of `Global Alliance for Genomics and Health (GA4GH) Phenopackets <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_.
`gpsea` ingests data from these phenopackets and performs analysis of the correlation of specific variants,
variant types (e.g., missense vs. premature termination codon), or variant location in protein motifs or other features.
The phenotypic abnormalities are represented by `Human Phenotype Ontology (HPO) <https://hpo.jax.org/app/>`_ terms.
12 changes: 6 additions & 6 deletions docs/report/tbx5_frameshift_vs_missense.csv
@@ -1,4 +1,4 @@
"Genotype group: Missense, Frameshift",Missense,Missense,Frameshift,Frameshift,,
Genotype group,Missense,Missense,Frameshift,Frameshift,,
,Count,Percent,Count,Percent,Corrected p values,p values
Ventricular septal defect [HP:0001629],31/60,52%,19/19,100%,0.0009552459156234353,5.6190936213143254e-05
Abnormal atrioventricular conduction [HP:0005150],0/22,0%,3/3,100%,0.003695652173913043,0.00043478260869565214
@@ -13,10 +13,10 @@ Muscular ventricular septal defect [HP:0011623],6/59,10%,6/25,24%,0.286867598598
Pulmonary arterial hypertension [HP:0002092],4/6,67%,0/2,0%,0.6623376623376622,0.42857142857142855
Hypoplasia of the ulna [HP:0003022],1/12,8%,2/10,20%,0.8095238095238093,0.5714285714285713
Hypoplasia of the radius [HP:0002984],30/62,48%,6/14,43%,1.0,0.7735491022101784
Short thumb [HP:0009778],11/41,27%,8/30,27%,1.0,1.0
Absent radius [HP:0003974],7/32,22%,6/25,24%,1.0,1.0
Short humerus [HP:0005792],7/17,41%,4/9,44%,1.0,1.0
Short thumb [HP:0009778],11/41,27%,8/30,27%,1.0,1.0
Atrial septal defect [HP:0001631],42/44,95%,20/20,100%,1.0,1.0
Abnormal ventricular septum morphology [HP:0010438],31/31,100%,19/19,100%,,
Abnormal cardiac ventricle morphology [HP:0001713],31/31,100%,19/19,100%,,
Abnormal heart morphology [HP:0001627],62/62,100%,30/30,100%,,
Absent radius [HP:0003974],7/32,22%,6/25,24%,1.0,1.0
Aplasia/Hypoplasia of the thumb [HP:0009601],20/20,100%,19/19,100%,,
Aplasia/Hypoplasia of fingers [HP:0006265],22/22,100%,19/19,100%,,
Aplasia/hypoplasia involving bones of the hand [HP:0005927],22/22,100%,19/19,100%,,
9 changes: 7 additions & 2 deletions docs/user-guide/index.rst
@@ -4,7 +4,13 @@
User guide
==========

TODO - write high level overview and bridge to individual sections.
GPSEA allows users to perform many different kinds of genotype-phenotype correlation (GPC) analysis. See the :ref:`tutorial` for an introduction.
In general, an analysis includes steps for data input, exploration of the cohort to help generate hypotheses about potential GPCs to test,
the corresponding choice of genotype and phenotype predicates, the choice of statistical test, and the approach to multiple-testing correction (MTC).
The pages listed in the table of contents provide more information about each step.




.. toctree::
   :maxdepth: 1
@@ -13,7 +19,6 @@ TODO - write high level overview and bridge to individual sections.
   input-data
   exploratory
   predicates
   phenotype_predicates
   stats
   mtc
   glossary
497 changes: 15 additions & 482 deletions docs/user-guide/predicates.rst

Large diffs are not rendered by default.

File renamed without changes.
21 changes: 21 additions & 0 deletions docs/user-guide/predicates/diagnosis_predicate.rst
@@ -0,0 +1,21 @@
.. _diagnosis-predicate:

========================
Partition by a diagnosis
========================

It is also possible to bin the individuals based on a diagnosis.
The :func:`~gpsea.analysis.predicate.genotype.diagnosis_predicate`
prepares a genotype predicate for assigning an individual into a diagnosis group:

>>> from gpsea.analysis.predicate.genotype import diagnosis_predicate
>>> gt_predicate = diagnosis_predicate(
...     diagnoses=('OMIM:154700', 'OMIM:129600'),
...     labels=('Marfan syndrome', 'Ectopia lentis, familial'),
... )
>>> gt_predicate.display_question()
'What disease was diagnosed: OMIM:154700, OMIM:129600'

Note that an individual must match exactly one diagnosis group. Any individuals labeled with two or more diagnoses
(e.g. an individual with both *Marfan syndrome* and *Ectopia lentis, familial*)
will be automatically omitted from the analysis.
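
The matching rule can be illustrated with a minimal, standalone sketch. It is not GPSEA's implementation: the ``assign_diagnosis_group`` helper is hypothetical, and the sketch assumes that an individual who matches none of the diagnoses is omitted as well.

```python
from typing import Optional, Sequence, Set


def assign_diagnosis_group(
    individual_diagnoses: Set[str],
    diagnoses: Sequence[str],
    labels: Sequence[str],
) -> Optional[str]:
    """Return the label of the single matching diagnosis, or None to omit the individual.

    Hypothetical helper for illustration only; not part of the GPSEA API.
    """
    matches = [
        label
        for diagnosis, label in zip(diagnoses, labels)
        if diagnosis in individual_diagnoses
    ]
    # Exactly one match is required. Zero matches (an assumption) or several
    # matches both lead to the individual being left out.
    return matches[0] if len(matches) == 1 else None


diagnoses = ('OMIM:154700', 'OMIM:129600')
labels = ('Marfan syndrome', 'Ectopia lentis, familial')

marfan_only = assign_diagnosis_group({'OMIM:154700'}, diagnoses, labels)
both = assign_diagnosis_group({'OMIM:154700', 'OMIM:129600'}, diagnoses, labels)
```

Here ``marfan_only`` holds the *Marfan syndrome* label, while ``both`` is ``None`` because the individual matches two groups.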
56 changes: 56 additions & 0 deletions docs/user-guide/predicates/filtering_predicate.rst
@@ -0,0 +1,56 @@
.. _filtering-predicate:


===================
Filtering predicate
===================

Sometimes a predicate bins individuals into more genotype groups than necessary, and we need
to consider only a subset of the groups. A :class:`~gpsea.analysis.predicate.genotype.GenotypePolyPredicate`
created by :func:`~gpsea.analysis.predicate.genotype.filtering_predicate` retains only
the categorizations of interest.

Example
-------

Let's suppose we want to test the genotype-phenotype association for variants
that lead to a frameshift or a stop gain in a fictional transcript `NM_1234.5`,
and we are specifically interested in comparing the heterozygous genotype
with the biallelic alternative allele genotypes (homozygous alternate and compound heterozygous).

First, we set up a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
for testing if a variant introduces a premature stop codon or leads to the shift of the reading frame:

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> tx_id = 'NM_1234.5'
>>> is_frameshift_or_stop_gain = VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id) \
...     | VariantPredicates.variant_effect(VariantEffect.STOP_GAINED, tx_id)
>>> is_frameshift_or_stop_gain.get_question()
'(FRAMESHIFT_VARIANT on NM_1234.5 OR STOP_GAINED on NM_1234.5)'

Then, we use :meth:`~gpsea.analysis.predicate.genotype.ModeOfInheritancePredicate.autosomal_recessive`
to create a predicate that bins the individuals by genotype group:

>>> from gpsea.analysis.predicate.genotype import ModeOfInheritancePredicate
>>> gt_predicate = ModeOfInheritancePredicate.autosomal_recessive(is_frameshift_or_stop_gain)
>>> gt_predicate.display_question()
'What is the genotype group: HOM_REF, HET, BIALLELIC_ALT'

We see that the `gt_predicate` bins the patients into three groups:

>>> cats = gt_predicate.get_categorizations()
>>> cats
(Categorization(category=HOM_REF), Categorization(category=HET), Categorization(category=BIALLELIC_ALT))

We pass the `gt_predicate` along with the categorizations of interest to the `filtering_predicate` function,
and we get a :class:`~gpsea.analysis.predicate.genotype.GenotypePolyPredicate`
that includes only the categories of interest:

>>> from gpsea.analysis.predicate.genotype import filtering_predicate
>>> fgt_predicate = filtering_predicate(
...     predicate=gt_predicate,
...     targets=(cats[1], cats[2]),
... )
>>> fgt_predicate.display_question()
'What is the genotype group: HET, BIALLELIC_ALT'
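
The effect of filtering can be pictured with a small standalone sketch. It does not use the GPSEA API: ``filter_category`` is a hypothetical helper, categories are plain strings, and the sketch assumes that individuals falling outside the target categories receive no category and are therefore left out of the analysis.

```python
from typing import Iterable, List, Optional


def filter_category(category: Optional[str], targets: Iterable[str]) -> Optional[str]:
    """Keep the category only if it is one of the targets (hypothetical helper)."""
    return category if category in set(targets) else None


all_categories = ('HOM_REF', 'HET', 'BIALLELIC_ALT')
# Same selection as `targets=(cats[1], cats[2])` in the example above.
targets = (all_categories[1], all_categories[2])

# Toy genotype groups of four individuals.
assignments = ['HOM_REF', 'HET', 'BIALLELIC_ALT', 'HET']
kept: List[Optional[str]] = [filter_category(c, targets) for c in assignments]
```

In this toy cohort, the first individual (``HOM_REF``) ends up with ``None``, and the remaining three keep their groups.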
33 changes: 33 additions & 0 deletions docs/user-guide/predicates/genotype_predicates.rst
@@ -0,0 +1,33 @@
.. _genotype-predicates:

===================
Genotype Predicates
===================


A genotype predicate seeks to divide the individuals along an axis that is orthogonal to phenotypes.
Typically, this involves using the genotype data, such as the presence of a missense variant
in a heterozygous genotype. However, other categorical variables,
such as diagnoses (see :ref:`diagnosis-predicate`) or cluster ids, can also be used.

The genotype predicates test the individual for the presence of variants that meet certain inclusion criteria.
The testing is done in two steps. First, we count the alleles
of the matching variants and then we interpret the count, possibly including factors
such as the expected mode of inheritance and sex, to assign the individual into a group.
Finding the matching variants is what
the :class:`~gpsea.analysis.predicate.genotype.VariantPredicate` is all about.
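
The two steps can be sketched in plain Python. This is an illustration under stated assumptions, not GPSEA's code: a variant is modeled as a toy ``(effect, allele_count)`` tuple, both helper functions are hypothetical, and the mapping of allele counts 0, 1, and 2 to ``HOM_REF``, ``HET``, and ``BIALLELIC_ALT`` is the assumed interpretation for an autosomal recessive mode of inheritance.

```python
from typing import Callable, Iterable, Optional, Tuple

# Toy representation of a variant: (variant effect, alternate allele count in the individual).
ToyVariant = Tuple[str, int]


def count_matching_alleles(
    variants: Iterable[ToyVariant],
    is_match: Callable[[str], bool],
) -> int:
    """Step 1: add up the alleles of the variants that satisfy the predicate."""
    return sum(allele_count for effect, allele_count in variants if is_match(effect))


def interpret_autosomal_recessive(allele_count: int) -> Optional[str]:
    """Step 2: turn the allele count into a genotype group (assumed mapping)."""
    groups = {0: 'HOM_REF', 1: 'HET', 2: 'BIALLELIC_ALT'}
    # Any other count is unexpected in this sketch, so no group is assigned.
    return groups.get(allele_count)


# An individual with two different matching variants (compound heterozygous)
# and one variant that does not match.
variants = [('FRAMESHIFT_VARIANT', 1), ('STOP_GAINED', 1), ('MISSENSE_VARIANT', 1)]
target_effects = {'FRAMESHIFT_VARIANT', 'STOP_GAINED'}

n_alleles = count_matching_alleles(variants, lambda effect: effect in target_effects)
group = interpret_autosomal_recessive(n_alleles)
```

Keeping the two steps separate means the same variant-matching logic can be reused with a different interpretation of the count.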


.. toctree::
   :maxdepth: 1
   :caption: Contents:

   variant_predicates
   mode_of_inheritance_predicate
   filtering_predicate
   male_female_predicate
   diagnosis_predicate
   groups_predicate



41 changes: 41 additions & 0 deletions docs/user-guide/predicates/groups_predicate.rst
@@ -0,0 +1,41 @@
.. _groups-predicate:

================
Groups Predicate
================



Sometimes, all we want is to test whether there is a difference between individuals
who carry one or more alleles of variants `X` vs. individuals with variants `Y`,
vs. individuals with variants `Z`, where `X`, `Y` and `Z` are variant predicates.
We can do this with a *groups* predicate.

The :func:`~gpsea.analysis.predicate.genotype.groups_predicate`
takes *n* variant predicates and *n* group labels, and it will assign the patients
into the respective groups if one or more matching alleles are found.
However, only one predicate is allowed to return a non-zero allele count.
Otherwise, the patient is assigned ``None`` and excluded from the analysis.

Example
-------

Here we show how to build a :class:`~gpsea.analysis.predicate.genotype.GenotypePolyPredicate`
for testing if the individual has at least one missense vs. frameshift vs. synonymous variant.

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates, groups_predicate
>>> tx_id = 'NM_1234.5'
>>> gt_predicate = groups_predicate(
...     predicates=(
...         VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id),
...         VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id),
...         VariantPredicates.variant_effect(VariantEffect.SYNONYMOUS_VARIANT, tx_id),
...     ),
...     group_names=('Missense', 'Frameshift', 'Synonymous'),
... )
>>> gt_predicate.display_question()
'Genotype group: Missense, Frameshift, Synonymous'
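
The rule that exactly one variant predicate may report a non-zero allele count can be sketched as follows. The ``assign_group`` helper is hypothetical and takes precomputed allele counts (one per predicate) instead of GPSEA objects, so it only illustrates the assignment logic.

```python
from typing import Optional, Sequence


def assign_group(
    allele_counts: Sequence[int],
    group_names: Sequence[str],
) -> Optional[str]:
    """Assign the group whose predicate is the only one with a non-zero allele count.

    Hypothetical helper for illustration; `allele_counts[i]` is the number of
    alleles matched by the i-th variant predicate.
    """
    hits = [name for count, name in zip(allele_counts, group_names) if count > 0]
    # Zero hits or more than one hit: the patient gets None and is excluded.
    return hits[0] if len(hits) == 1 else None


group_names = ('Missense', 'Frameshift', 'Synonymous')

only_frameshift = assign_group((0, 2, 0), group_names)
missense_and_synonymous = assign_group((1, 0, 1), group_names)
```

A patient with frameshift alleles only lands in the ``Frameshift`` group, whereas a patient with both a missense and a synonymous variant is assigned ``None``.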



87 changes: 87 additions & 0 deletions docs/user-guide/predicates/hpo_predicate.rst
@@ -0,0 +1,87 @@
.. _hpo-predicate:


HPO predicate
=============

When testing for presence or absence of an HPO term, the :class:`~gpsea.analysis.predicate.phenotype.HpoPredicate`
leverages the :ref:`true-path-rule` to take advantage of the HPO hierarchy.
As a result, an individual annotated with a term is implicitly annotated with all its ancestors.
For instance, an individual annotated with `Ectopia lentis <https://hpo.jax.org/browse/term/HP:0001083>`_
is also annotated with `Abnormal lens morphology <https://hpo.jax.org/browse/term/HP:0000517>`_,
`Abnormal anterior eye segment morphology <https://hpo.jax.org/browse/term/HP:0004328>`_,
`Abnormal eye morphology <https://hpo.jax.org/browse/term/HP:0012372>`_, ...

Similarly, all descendants of a term, whose presence was specifically excluded in an individual,
are implicitly excluded.
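
A toy example may help to see both directions of the rule. The sketch below does not use `hpotk` or GPSEA: the ``PARENTS`` map is a simplified, hand-written chain of the terms mentioned above (the real HPO hierarchy may contain additional terms and parents), and all function names are hypothetical.

```python
from typing import Dict, Set

# Simplified child -> parents map (an assumption for illustration only).
PARENTS: Dict[str, Set[str]] = {
    'HP:0001083': {'HP:0000517'},  # Ectopia lentis -> Abnormal lens morphology
    'HP:0000517': {'HP:0004328'},  # -> Abnormal anterior eye segment morphology
    'HP:0004328': {'HP:0012372'},  # -> Abnormal eye morphology
    'HP:0012372': set(),
}


def get_ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect all ancestors of a term by walking the parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_present(query: str, observed: Set[str], parents: Dict[str, Set[str]]) -> bool:
    """The query is present if it was observed or is an ancestor of an observed term."""
    return any(query == term or query in get_ancestors(term, parents) for term in observed)


def is_excluded(query: str, excluded: Set[str], parents: Dict[str, Set[str]]) -> bool:
    """The query is excluded if it was excluded or descends from an excluded term."""
    return any(query == term or term in get_ancestors(query, parents) for term in excluded)


# Observed Ectopia lentis implies Abnormal lens morphology.
lens_present = is_present('HP:0000517', {'HP:0001083'}, PARENTS)
# Excluded Abnormal lens morphology implies excluded Ectopia lentis.
ectopia_excluded = is_excluded('HP:0001083', {'HP:0000517'}, PARENTS)
```

Note the asymmetry: presence propagates upwards to the ancestors, while exclusion propagates downwards to the descendants.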

Example
-------

Here we show how to set up an :class:`~gpsea.analysis.predicate.phenotype.HpoPredicate`
to test for the presence of `Abnormal lens morphology <https://hpo.jax.org/browse/term/HP:0000517>`_.

We need to load :class:`~hpotk.MinimalOntology` with HPO data to access the HPO hierarchy:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

and now we can set up a predicate to test for presence of *Abnormal lens morphology*:

>>> from gpsea.analysis.predicate.phenotype import HpoPredicate
>>> query = hpotk.TermId.from_curie('HP:0000517')
>>> pheno_predicate = HpoPredicate(
...     hpo=hpo,
...     query=query,
... )
>>> pheno_predicate.display_question()
'Is Abnormal lens morphology present in the patient: Yes, No'



missing_implies_phenotype_excluded
----------------------------------

In many cases, published reports of clinical data about individuals with rare diseases describe phenotypic features that were observed, but do not
provide a comprehensive list of features that were explicitly excluded. By default, GPSEA will only include features that are recorded as observed or excluded in a phenopacket.
Setting this argument to ``True`` will cause "n/a" entries to be set to "excluded". We provide this option for exploration but do not recommend its use for the
final analysis unless the assumption behind it is known to be true.
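
The difference between the two settings can be sketched with a three-state status. The ``phenotype_status`` function is hypothetical and ignores the ontology hierarchy for brevity; only the argument name ``missing_implies_phenotype_excluded`` is taken from the text above.

```python
from typing import Optional, Set


def phenotype_status(
    query: str,
    observed: Set[str],
    excluded: Set[str],
    missing_implies_phenotype_excluded: bool = False,
) -> Optional[bool]:
    """Return True (observed), False (excluded), or None (not recorded, "n/a").

    Hypothetical sketch; not GPSEA's implementation.
    """
    if query in observed:
        return True
    if query in excluded:
        return False
    # The feature is not mentioned at all. It stays "n/a" by default,
    # or it is treated as excluded when the flag is set.
    return False if missing_implies_phenotype_excluded else None


observed = {'HP:0001083'}
excluded = {'HP:0001631'}

default_status = phenotype_status('HP:0001629', observed, excluded)
relaxed_status = phenotype_status(
    'HP:0001629', observed, excluded, missing_implies_phenotype_excluded=True,
)
```

With the default setting the unrecorded feature stays ``None`` and does not contribute to the counts, whereas the relaxed setting counts it as excluded.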



Predicates for all cohort phenotypes
====================================

Constructing phenotype predicates for all HPO terms of a cohort sounds a bit tedious.
The :func:`~gpsea.analysis.predicate.phenotype.prepare_predicates_for_terms_of_interest`
function cuts down the tedium.

For a given phenopacket collection (e.g. 156 patients with mutations in the *TBX5* gene included in Phenopacket Store version `0.1.18`)

>>> from ppktstore.registry import configure_phenopacket_registry
>>> registry = configure_phenopacket_registry()
>>> with registry.open_phenopacket_store(release='0.1.18') as ps:
...     phenopackets = tuple(ps.iter_cohort_phenopackets('TBX5'))
>>> len(phenopackets)
156

processed into a cohort

>>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
>>> cohort_creator = configure_caching_cohort_creator(hpo)
>>> cohort, _ = load_phenopackets(phenopackets, cohort_creator) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
Patients Created: ...


we can create HPO predicates for testing all 260 HPO terms used in the cohort

>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
...     cohort=cohort,
...     hpo=hpo,
... )
>>> len(pheno_predicates)
260

and subject the predicates to further analysis, such as :class:`~gpsea.analysis.pcats.HpoTermAnalysis`.
21 changes: 21 additions & 0 deletions docs/user-guide/predicates/male_female_predicate.rst
@@ -0,0 +1,21 @@
.. _male-female-predicate:

Partition by the sex of the individual
======================================

It is easy to investigate the phenotypic differences between females and males.
The :func:`~gpsea.analysis.predicate.genotype.sex_predicate` provides a predicate
for partitioning based on the sex of the individual:

>>> from gpsea.analysis.predicate.genotype import sex_predicate
>>> gt_predicate = sex_predicate()
>>> gt_predicate.display_question()
'Sex of the individual: FEMALE, MALE'

Individuals with :attr:`~gpsea.model.Sex.UNKNOWN_SEX` will be omitted from the analysis.

Note that we have implemented this predicate as a genotype predicate, because it is used in
place of other genotype predicates. Currently, it is not possible to compare the distribution of genotypes across sexes.


