Merge pull request #251 from monarch-initiative/documentation

updating documentation (WIP)
monarch-initiative · Sep 6, 2024 · a314f26 · a314f26
2 parents 817b64e + 40f3d91
commit a314f26

Showing 25 changed files with 588 additions and 523 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,8 @@
 ![PyPi downloads](https://img.shields.io/pypi/dm/gpsea.svg?label=Pypi%20downloads)
 ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/gpsea)
 
-GPSEA is a Python library for discovery of genotype-phenotype associations.
+GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations, pronounced "G"-"P"-"C") is a Python package designed to support genotype-phenotype correlation analysis.
+
 
 See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) 
 and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)

diff --git a/docs/index.rst b/docs/index.rst
@@ -10,8 +10,8 @@ A key question in biology and human genetics concerns the relationships between
 genetics, the focus is generally placed on the study of whether specific disease-causing alleles are associated with specific phenotypic 
 manifestations of the disease. 
 
-`GPSEA`  (genotypes and phenotypes - study and evaluation of associations) is a Python package designed to support genotype-phenotype correlation analysis.
-We pronounce GPSEA as "G"-"P"-"C". The input to `GPSEA` is a collection of `Global Alliance for Genomics and Health (GA4GH) Phenopackets <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_.
+`GPSEA`  (Genotypes and Phenotypes - Statistical Evaluation of Associations, pronounced "G"-"P"-"C") is a Python package designed to support genotype-phenotype correlation analysis.
+The input to `GPSEA` is a collection of `Global Alliance for Genomics and Health (GA4GH) Phenopackets <https://pubmed.ncbi.nlm.nih.gov/35705716/>`_.
 `gpsea` ingests data from these phenopackets and performs analysis of the correlation of specific variants,
 variant types (e.g., missense vs. premature termination codon), or variant location in protein motifs or other features.
 The phenotypic abnormalities are represented by `Human Phenotype Ontology (HPO) <https://hpo.jax.org/app/>`_ terms.

diff --git a/docs/report/tbx5_frameshift_vs_missense.csv b/docs/report/tbx5_frameshift_vs_missense.csv
@@ -1,4 +1,4 @@
-"Genotype group: Missense, Frameshift",Missense,Missense,Frameshift,Frameshift,,
+Genotype group,Missense,Missense,Frameshift,Frameshift,,
 ,Count,Percent,Count,Percent,Corrected p values,p values
 Ventricular septal defect [HP:0001629],31/60,52%,19/19,100%,0.0009552459156234353,5.6190936213143254e-05
 Abnormal atrioventricular conduction [HP:0005150],0/22,0%,3/3,100%,0.003695652173913043,0.00043478260869565214
@@ -13,10 +13,10 @@ Muscular ventricular septal defect [HP:0011623],6/59,10%,6/25,24%,0.286867598598
 Pulmonary arterial hypertension [HP:0002092],4/6,67%,0/2,0%,0.6623376623376622,0.42857142857142855
 Hypoplasia of the ulna [HP:0003022],1/12,8%,2/10,20%,0.8095238095238093,0.5714285714285713
 Hypoplasia of the radius [HP:0002984],30/62,48%,6/14,43%,1.0,0.7735491022101784
-Short thumb [HP:0009778],11/41,27%,8/30,27%,1.0,1.0
-Absent radius [HP:0003974],7/32,22%,6/25,24%,1.0,1.0
 Short humerus [HP:0005792],7/17,41%,4/9,44%,1.0,1.0
+Short thumb [HP:0009778],11/41,27%,8/30,27%,1.0,1.0
 Atrial septal defect [HP:0001631],42/44,95%,20/20,100%,1.0,1.0
-Abnormal ventricular septum morphology [HP:0010438],31/31,100%,19/19,100%,,
-Abnormal cardiac ventricle morphology [HP:0001713],31/31,100%,19/19,100%,,
-Abnormal heart morphology [HP:0001627],62/62,100%,30/30,100%,,
+Absent radius [HP:0003974],7/32,22%,6/25,24%,1.0,1.0
+Aplasia/Hypoplasia of the thumb [HP:0009601],20/20,100%,19/19,100%,,
+Aplasia/Hypoplasia of fingers [HP:0006265],22/22,100%,19/19,100%,,
+Aplasia/hypoplasia involving bones of the hand [HP:0005927],22/22,100%,19/19,100%,,
diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
@@ -4,7 +4,13 @@
 User guide
 ==========
 
-TODO - write high level overview and bridge to individual sections.
+GPSEA allows users to perform many different kinds of genotype-phenotype correlation (GPC) analyses. See the :ref:`tutorial` for an introduction.
+In general, an analysis includes steps for data input, exploration of the cohort to help generate hypotheses about potential GPCs to test,
+the corresponding choice of genotype and phenotype predicates, the choice of statistical test, and the approach to multiple-testing correction (MTC).
+The pages shown in the table of contents provide more information about each step.
+
+
+
 
 .. toctree::
   :maxdepth: 1
@@ -13,7 +19,6 @@ TODO - write high level overview and bridge to individual sections.
   input-data
   exploratory
   predicates
-  phenotype_predicates
   stats
   mtc
   glossary

diff --git a/docs/user-guide/predicates.rst b/docs/user-guide/predicates.rst
diff --git a/docs/user-guide/devries.rst → docs/user-guide/predicates/devries.rst b/docs/user-guide/devries.rst → docs/user-guide/predicates/devries.rst
diff --git a/docs/user-guide/predicates/diagnosis_predicate.rst b/docs/user-guide/predicates/diagnosis_predicate.rst
@@ -0,0 +1,21 @@
+.. _diagnosis-predicate:
+
+========================
+Partition by a diagnosis
+========================
+
+It is also possible to bin the individuals based on a diagnosis.
+The :func:`~gpsea.analysis.predicate.genotype.diagnosis_predicate` 
+prepares a genotype predicate for assigning an individual into a diagnosis group:
+
+>>> from gpsea.analysis.predicate.genotype import diagnosis_predicate
+>>> gt_predicate = diagnosis_predicate(
+...     diagnoses=('OMIM:154700', 'OMIM:129600'),
+...     labels=('Marfan syndrome', 'Ectopia lentis, familial'),
+... )
+>>> gt_predicate.display_question()
+'What disease was diagnosed: OMIM:154700, OMIM:129600'
+
+Note that an individual must match only one diagnosis group. Individuals labeled with two or more of the diagnoses
+(e.g. an individual with both *Marfan syndrome* and *Ectopia lentis, familial*)
+are automatically omitted from the analysis.
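
The omission rule described in this file can be illustrated with a minimal, library-free sketch. The function `assign_diagnosis_group` is hypothetical and is not part of GPSEA. Treating an individual with none of the target diagnoses as omitted is our assumption, since the text only covers individuals with two or more.

```python
from typing import Optional, Sequence


def assign_diagnosis_group(
    individual_diagnoses: Sequence[str],
    target_diagnoses: Sequence[str],
    labels: Sequence[str],
) -> Optional[str]:
    """Return the label of the single matching diagnosis, or None (hypothetical helper)."""
    matches = [
        label
        for diagnosis, label in zip(target_diagnoses, labels)
        if diagnosis in individual_diagnoses
    ]
    # Two or more matches: omitted, as the documentation states.
    # Zero matches: omitted as well (an assumption of this sketch).
    return matches[0] if len(matches) == 1 else None


targets = ("OMIM:154700", "OMIM:129600")
labels = ("Marfan syndrome", "Ectopia lentis, familial")

marfan_only = assign_diagnosis_group(["OMIM:154700"], targets, labels)
both = assign_diagnosis_group(["OMIM:154700", "OMIM:129600"], targets, labels)
```

Here `marfan_only` holds the *Marfan syndrome* label, while `both` is `None`, mirroring the individual who would be dropped from the analysis.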
diff --git a/docs/user-guide/predicates/filtering_predicate.rst b/docs/user-guide/predicates/filtering_predicate.rst
@@ -0,0 +1,56 @@
+.. _filtering-predicate:
+
+
+===================
+Filtering predicate
+===================
+
+Sometimes a predicate bins individuals into more genotype groups than necessary, and there may be a need
+to consider only a subset of the groups. A `GenotypePolyPredicate`
+created by :func:`~gpsea.analysis.predicate.genotype.filtering_predicate` retains only
+the target categorizations of interest.
+
+Example
+-------
+
+Let's suppose we want to test the genotype-phenotype association between variants
+that lead to a frameshift or a stop gain in a fictional transcript `NM_1234.5`,
+and we are specifically interested in comparing the heterozygous genotype
+with the biallelic alternative allele genotypes (homozygous alternate and compound heterozygous).
+
+First, we set up a :class:`~gpsea.analysis.predicate.genotype.VariantPredicate`
+for testing if a variant introduces a premature stop codon or leads to the shift of the reading frame:
+
+>>> from gpsea.model import VariantEffect
+>>> from gpsea.analysis.predicate.genotype import VariantPredicates
+>>> tx_id = 'NM_1234.5'
+>>> is_frameshift_or_stop_gain = VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id) \
+...     | VariantPredicates.variant_effect(VariantEffect.STOP_GAINED, tx_id)
+>>> is_frameshift_or_stop_gain.get_question()
+'(FRAMESHIFT_VARIANT on NM_1234.5 OR STOP_GAINED on NM_1234.5)'
+
+Then, we use :meth:`~gpsea.analysis.predicate.genotype.ModeOfInheritancePredicate.autosomal_recessive`
+to create a predicate that bins the individuals by genotype group:
+
+>>> from gpsea.analysis.predicate.genotype import ModeOfInheritancePredicate
+>>> gt_predicate = ModeOfInheritancePredicate.autosomal_recessive(is_frameshift_or_stop_gain)
+>>> gt_predicate.display_question()
+'What is the genotype group: HOM_REF, HET, BIALLELIC_ALT'
+
+We see that the `gt_predicate` bins the patients into three groups:
+
+>>> cats = gt_predicate.get_categorizations()
+>>> cats
+(Categorization(category=HOM_REF), Categorization(category=HET), Categorization(category=BIALLELIC_ALT))
+
+We pass the categorizations of interest along with the `gt_predicate` to the `filtering_predicate` function
+to get a :class:`~gpsea.analysis.predicate.genotype.GenotypePolyPredicate`
+that includes only the categories of interest:
+
+>>> from gpsea.analysis.predicate.genotype import filtering_predicate
+>>> fgt_predicate = filtering_predicate(
+...     predicate=gt_predicate,
+...     targets=(cats[1], cats[2]),
+... )
+>>> fgt_predicate.display_question()
+'What is the genotype group: HET, BIALLELIC_ALT'
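
The filtering idea in this file can be sketched without GPSEA as a wrapper around any categorizing function. This is only an illustration: `filter_categories` and `by_allele_count` are hypothetical names, and returning `None` for individuals outside the target categories is an assumption about how such individuals are excluded.

```python
from typing import Callable, Collection, Optional


def filter_categories(
    categorize: Callable[[int], Optional[str]],
    targets: Collection[str],
) -> Callable[[int], Optional[str]]:
    """Wrap a categorizer so that only the target categories are reported."""

    def filtered(allele_count: int) -> Optional[str]:
        category = categorize(allele_count)
        # Anything outside the targets is reported as None, i.e. excluded.
        return category if category in targets else None

    return filtered


def by_allele_count(allele_count: int) -> Optional[str]:
    """Toy stand-in for a genotype predicate with three groups."""
    return {0: "HOM_REF", 1: "HET", 2: "BIALLELIC_ALT"}.get(allele_count)


het_vs_biallelic = filter_categories(by_allele_count, targets={"HET", "BIALLELIC_ALT"})
results = [het_vs_biallelic(n) for n in (0, 1, 2)]
```

The wrapper leaves the original categorizer untouched, so the same three-group predicate can be reused for other comparisons.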
diff --git a/docs/user-guide/predicates/genotype_predicates.rst b/docs/user-guide/predicates/genotype_predicates.rst
@@ -0,0 +1,33 @@
+.. _genotype-predicates:
+
+===================
+Genotype Predicates
+===================
+
+
+A genotype predicate seeks to divide the individuals along an axis that is orthogonal to phenotypes.
+Typically, this includes using the genotype data, such as presence of a missense variant
+in a heterozygous genotype. However, other categorical variables,
+such as diagnoses (see :ref:`diagnosis-predicate`) or cluster IDs, can also be used.
+
+The genotype predicates test the individual for the presence of variants that meet certain inclusion criteria.
+The testing is done in two steps. First, we count the alleles
+of the matching variants and then we interpret the count, possibly including factors
+such as the expected mode of inheritance and sex, to assign the individual into a group.
+Finding the matching variants is what
+the :class:`~gpsea.analysis.predicate.genotype.VariantPredicate` is all about.
+
+
+.. toctree::
+  :maxdepth: 1
+  :caption: Contents:
+
+  variant_predicates
+  mode_of_inheritance_predicate
+  filtering_predicate
+  male_female_predicate
+  diagnosis_predicate
+  groups_predicate
+
+
+
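
The two-step testing described in this file (count the alleles of matching variants, then interpret the count) can be sketched as follows. This is a toy model rather than GPSEA code: a variant is reduced to an `(effect, alternate allele count)` pair, and the mapping of counts 0, 1 and 2 to `HOM_REF`, `HET` and `BIALLELIC_ALT` is our assumption for an autosomal recessive interpretation.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical toy model: a variant is an (effect, alternate allele count) pair.
Variant = Tuple[str, int]


def count_matching_alleles(
    variants: Iterable[Variant],
    is_match: Callable[[Variant], bool],
) -> int:
    """Step 1: sum the alternate alleles of the variants that match."""
    return sum(variant[1] for variant in variants if is_match(variant))


def interpret_autosomal_recessive(allele_count: int) -> Optional[str]:
    """Step 2: interpret the count under an assumed autosomal recessive model."""
    groups = {0: "HOM_REF", 1: "HET", 2: "BIALLELIC_ALT"}
    # Any other count does not fit the model, so no group is assigned.
    return groups.get(allele_count)


def is_frameshift(variant: Variant) -> bool:
    """Toy variant predicate: match frameshift variants only."""
    return variant[0] == "FRAMESHIFT_VARIANT"


compound_het = [
    ("FRAMESHIFT_VARIANT", 1),
    ("FRAMESHIFT_VARIANT", 1),
    ("MISSENSE_VARIANT", 1),
]
group = interpret_autosomal_recessive(
    count_matching_alleles(compound_het, is_frameshift)
)
```

Keeping the counting and the interpretation separate is what allows the same variant predicate to be reused under different modes of inheritance.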
diff --git a/docs/user-guide/predicates/groups_predicate.rst b/docs/user-guide/predicates/groups_predicate.rst
@@ -0,0 +1,41 @@
+.. _groups-predicate:
+
+================
+Groups Predicate
+================
+
+
+
+Sometimes, all we want is to test for a difference between individuals
+who have one or more alleles of variants `X` vs. individuals with variants `Y`
+vs. individuals with variants `Z`, where `X`, `Y` and `Z` are variant predicates.
+We can do this with a *groups* predicate.
+
+The :func:`~gpsea.analysis.predicate.genotype.groups_predicate`
+takes *n* variant predicates and *n* group labels, and it will assign the patients
+into the respective groups if one or more matching alleles are found.
+However, only one predicate is allowed to return a non-zero allele count.
+Otherwise, the patient is assigned ``None`` and excluded from the analysis.
+
+Example
+-------
+
+Here we show how to build a :class:`~gpsea.analysis.predicate.genotype.GenotypePolyPredicate`
+for testing if the individual has at least one missense vs. frameshift vs. synonymous variant.
+
+>>> from gpsea.model import VariantEffect
+>>> from gpsea.analysis.predicate.genotype import VariantPredicates, groups_predicate
+>>> tx_id = 'NM_1234.5'
+>>> gt_predicate = groups_predicate(
+...     predicates=(
+...         VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id),
+...         VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id),
+...         VariantPredicates.variant_effect(VariantEffect.SYNONYMOUS_VARIANT, tx_id),
+...     ),
+...     group_names=('Missense', 'Frameshift', 'Synonymous'),
+... )
+>>> gt_predicate.display_question()
+'Genotype group: Missense, Frameshift, Synonymous'
+
+
+
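
The assignment rule of the groups predicate can be sketched in plain Python. The function `assign_group` and the `(effect, alternate allele count)` variant model are hypothetical simplifications and do not reflect GPSEA internals. Only the rule itself comes from the text: exactly one predicate may report a non-zero allele count, otherwise the result is `None`.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical toy model: a variant is an (effect, alternate allele count) pair.
Variant = Tuple[str, int]
VariantPredicate = Callable[[Variant], bool]


def assign_group(
    variants: Sequence[Variant],
    predicates: Sequence[VariantPredicate],
    group_names: Sequence[str],
) -> Optional[str]:
    """Assign a group only if exactly one predicate has a non-zero allele count."""
    if len(predicates) != len(group_names):
        raise ValueError("Need one group name per predicate")
    counts = [
        sum(variant[1] for variant in variants if predicate(variant))
        for predicate in predicates
    ]
    hits = [name for name, count in zip(group_names, counts) if count > 0]
    return hits[0] if len(hits) == 1 else None


predicates = (
    lambda v: v[0] == "MISSENSE_VARIANT",
    lambda v: v[0] == "FRAMESHIFT_VARIANT",
    lambda v: v[0] == "SYNONYMOUS_VARIANT",
)
names = ("Missense", "Frameshift", "Synonymous")

missense_only = assign_group([("MISSENSE_VARIANT", 1)], predicates, names)
ambiguous = assign_group(
    [("MISSENSE_VARIANT", 1), ("FRAMESHIFT_VARIANT", 1)], predicates, names
)
```

An individual carrying both a missense and a frameshift variant is ambiguous under this rule and ends up with `None`.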
diff --git a/docs/user-guide/predicates/hpo_predicate.rst b/docs/user-guide/predicates/hpo_predicate.rst
@@ -0,0 +1,87 @@
+.. _hpo-predicate:
+
+
+HPO predicate
+=============
+
+When testing for presence or absence of an HPO term, the :class:`~gpsea.analysis.predicate.phenotype.HpoPredicate`
+leverages the :ref:`true-path-rule` to take advantage of the HPO hierarchy.
+As a result, an individual annotated with a term is implicitly annotated with all its ancestors.
+For instance, an individual annotated with `Ectopia lentis <https://hpo.jax.org/browse/term/HP:0001083>`_
+is also annotated with `Abnormal lens morphology <https://hpo.jax.org/browse/term/HP:0000517>`_,
+`Abnormal anterior eye segment morphology <https://hpo.jax.org/browse/term/HP:0004328>`_,
+`Abnormal eye morphology <https://hpo.jax.org/browse/term/HP:0012372>`_, ...
+
+Similarly, all descendants of a term, whose presence was specifically excluded in an individual,
+are implicitly excluded.
+
+Example
+-------
+
+Here we show how to set up an :class:`~gpsea.analysis.predicate.phenotype.HpoPredicate`
+to test for the presence of `Abnormal lens morphology <https://hpo.jax.org/browse/term/HP:0000517>`_.
+
+We need to load :class:`~hpotk.MinimalOntology` with HPO data to access the HPO hierarchy:
+
+>>> import hpotk
+>>> store = hpotk.configure_ontology_store()
+>>> hpo = store.load_minimal_hpo(release='v2024-07-01')
+
+and now we can set up a predicate to test for presence of *Abnormal lens morphology*:
+
+>>> from gpsea.analysis.predicate.phenotype import HpoPredicate
+>>> query = hpotk.TermId.from_curie('HP:0000517')
+>>> pheno_predicate = HpoPredicate(
+...     hpo=hpo,
+...     query=query,
+... )
+>>> pheno_predicate.display_question()
+'Is Abnormal lens morphology present in the patient: Yes, No'
+
+
+
+missing_implies_phenotype_excluded
+----------------------------------
+
+In many cases, published reports of clinical data about individuals with rare diseases describe phenotypic features that were observed, but do not
+provide a comprehensive list of features that were explicitly excluded. By default, GPSEA will only include features that are recorded as observed or excluded in a phenopacket.
+Setting this argument to ``True`` will cause "n/a" entries to be set to "excluded". We provide this option for exploration but do not recommend its use for the
+final analysis unless the assumption behind it is known to be true.
+
+
+
+Predicates for all cohort phenotypes
+====================================
+
+Constructing phenotype predicates for all HPO terms of a cohort sounds a bit tedious.
+The :func:`~gpsea.analysis.predicate.phenotype.prepare_predicates_for_terms_of_interest`
+function cuts down the tedium.
+
+For a given phenopacket collection (e.g. 156 patients with mutations in the *TBX5* gene included in Phenopacket Store version `0.1.18`)
+
+>>> from ppktstore.registry import configure_phenopacket_registry
+>>> registry = configure_phenopacket_registry()
+>>> with registry.open_phenopacket_store(release='0.1.18') as ps:
+...     phenopackets = tuple(ps.iter_cohort_phenopackets('TBX5'))
+>>> len(phenopackets)
+156
+
+processed into a cohort
+
+>>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
+>>> cohort_creator = configure_caching_cohort_creator(hpo)
+>>> cohort, _ = load_phenopackets(phenopackets, cohort_creator)  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
+Patients Created: ...
+
+
+we can create HPO predicates for testing all 260 HPO terms used in the cohort
+
+>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
+>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
+...     cohort=cohort,
+...     hpo=hpo,
+... )
+>>> len(pheno_predicates)
+260
+
+and subject the predicates to further analysis, such as :class:`~gpsea.analysis.pcats.HpoTermAnalysis`.
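
The true path rule and the `missing_implies_phenotype_excluded` option described in this file can be illustrated with a toy ontology. This sketch uses neither `hpotk` nor GPSEA. The `PARENTS` map links the four eye terms named in the text as a simple chain, which is a simplification of the real HPO hierarchy, and `is_present` is a hypothetical helper.

```python
from typing import Dict, Optional, Set

# Toy hierarchy (child -> parents); the real HPO graph may contain intermediate terms.
PARENTS: Dict[str, Set[str]] = {
    "HP:0001083": {"HP:0000517"},  # Ectopia lentis -> Abnormal lens morphology
    "HP:0000517": {"HP:0004328"},  # -> Abnormal anterior eye segment morphology
    "HP:0004328": {"HP:0012372"},  # -> Abnormal eye morphology
    "HP:0012372": set(),
}


def ancestors_and_self(term: str) -> Set[str]:
    """Collect the term and all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_present(
    query: str,
    observed: Set[str],
    excluded: Set[str],
    missing_implies_phenotype_excluded: bool = False,
) -> Optional[bool]:
    """Return True (present), False (excluded), or None (not recorded)."""
    # An observed term implies all of its ancestors.
    if any(query in ancestors_and_self(term) for term in observed):
        return True
    # An excluded term implies exclusion of all of its descendants.
    if any(term in ancestors_and_self(query) for term in excluded):
        return False
    # No information: treat as excluded only if explicitly requested.
    return False if missing_implies_phenotype_excluded else None


lens = "HP:0000517"
implied_present = is_present(lens, observed={"HP:0001083"}, excluded=set())
implied_excluded = is_present("HP:0001083", observed=set(), excluded={lens})
unknown = is_present(lens, observed=set(), excluded=set())
forced = is_present(lens, set(), set(), missing_implies_phenotype_excluded=True)
```

Note that excluding a descendant says nothing about its ancestors: excluding *Ectopia lentis* leaves *Abnormal eye morphology* unrecorded in this sketch.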
diff --git a/docs/user-guide/predicates/male_female_predicate.rst b/docs/user-guide/predicates/male_female_predicate.rst
@@ -0,0 +1,21 @@
+.. _male-female-predicate:
+
+Partition by the sex of the individual
+======================================
+
+It is easy to investigate the phenotypic differences between females and males.
+The :func:`~gpsea.analysis.predicate.genotype.sex_predicate` provides a predicate
+for partitioning based on the sex of the individual:
+
+>>> from gpsea.analysis.predicate.genotype import sex_predicate
+>>> gt_predicate = sex_predicate()
+>>> gt_predicate.display_question()
+'Sex of the individual: FEMALE, MALE'
+
+Individuals with :attr:`~gpsea.model.Sex.UNKNOWN_SEX` will be omitted from the analysis.
+
+Note that we have implemented this predicate as a genotype predicate, because it is used in 
+place of other genotype predicates. Currently, it is not possible to compare the distribution of genotypes across sexes.
+
+
+