Skip to content

Commit

Permalink
Merge pull request #398 from monarch-initiative/fix-bugs
Browse files Browse the repository at this point in the history
Fix various bugs
  • Loading branch information
ielis authored Jan 15, 2025
2 parents dd3dcc4 + bf691ef commit d0f69d0
Show file tree
Hide file tree
Showing 9 changed files with 209 additions and 21 deletions.
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,9 @@

GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations) is a Python package for finding genotype-phenotype associations.


See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html)
and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)
for more information.
See our documentation for the [setup](https://monarch-initiative.github.io/gpsea/stable/setup.html) instructions,
a [tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) with an end-to-end genotype-phenotype association analysis,
and a comprehensive [user guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) with everything else.

The documentation comes in two flavors:

Expand Down
4 changes: 2 additions & 2 deletions docs/setup.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@
Setup
#####

Here we show how to install GPSEA and to prepare your Python environment
for genotype-phenotype association analysis.
Here we show how to install GPSEA and prepare your Python environment
for genotype-phenotype association analyses.


.. contents:: Table of Contents
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -185,12 +185,16 @@ The builtin predicates should cover majority of use cases.
However, if a predicate seems to be missing,
feel free to submit an issue in our
`GitHub tracker <https://github.com/monarch-initiative/gpsea/issues>`_,
or to implement a custom predicate
by extending the :class:`~gpsea.analysis.predicate.VariantPredicate` class 😎.
or implement your own predicate by following the :ref:`custom-variant-predicate`
guide.


****
Next
****

The variant predicate offers a flexible API for testing if variants meet a condition.
However, the genotype phenotype correlations are done on the individual level
and the variant predicates are used as a component of the genotype predicate.
The next sections show how to use variant predicates to assign individuals into groups.
However, the genotype phenotype correlations are studied on the level of individuals.
As described in :ref:`genotype-classifiers`, GPSEA uses the :class:`~gpsea.analysis.clf.GenotypeClassifier` API
to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier.
We explain the details in the following sections.
20 changes: 20 additions & 0 deletions docs/user-guide/analyses/phenotype-scores.rst
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,26 @@ Phenotype score
This component is responsible for computing a phenotype score for an individual.
As far as GPSEA framework is concerned, the phenotype score must be a floating point number
or a `NaN` value if the score cannot be computed for an individual.
This is the essence of the :class:`~gpsea.analysis.pscore.PhenotypeScorer` class.

GPSEA ships with several builtin phenotype scorers which can be used as

+------------------------------------------------------------+---------------------------------------------+
| Name | Description |
+============================================================+=============================================+
| | Compute the total number of occurrences |
| * :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` | of specific phenotypic features |
| | (used in this section) |
+------------------------------------------------------------+---------------------------------------------+
| | Compute the "adapted De Vries Score" |
| * :class:`~gpsea.analysis.pscore.DeVriesPhenotypeScorer` | for assessing severity |
| | of intellectual disability |
+------------------------------------------------------------+---------------------------------------------+

.. tip::

See :ref:`custom-phenotype-scorer` section to learn how to build a phenotype scorer from scratch.


Here we use the :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` for scoring
the individuals based on the number of structural defects
Expand Down
142 changes: 142 additions & 0 deletions docs/user-guide/custom-components.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
.. _custom-components:

#################
Custom components
#################

GPSEA aims to stay useful in the long run. Therefore, we took a great care
to adhere to "good software development practices" and we framed GPSEA functionalities
as a set of loosely coupled components. As a rule of thumb, each component corresponds
to a Python abstract base class which is then extended by the builtin components
and can also be extended by the future components, to serve both common or exotic use cases.

The abstract base classes define the component API.
Per guidelines in Python's :mod:`abc` module, the abstract classes use :class:`abc.ABCMeta` as a metaclass
and the class API consists of methods annotated with the :func:`abc.abstractmethod` decorator.
These decorated methods must be overridden in the subclasses.

The following sections provide guidance for customizing the most commonly used GPSEA components.


.. _custom-phenotype-scorer:

****************
Phenotype scorer
****************

:class:`~gpsea.analysis.pscore.PhenotypeScorer` computes a phenotype score for an individual.
The phenotype score is a `float` with range :math:`(-\infty, \infty)` where `NaN` indicates
that a score cannot be computed (e.g. the lab measurement value was not ascertained for the individual).

Here we show an example of a toy phenotype scorer
for using body mass index (BMI) as a phenotype score.

>>> import typing
>>> from gpsea.model import Patient
>>> from gpsea.analysis.pscore import PhenotypeScorer
>>> class BmiScorer(PhenotypeScorer): #
...
... def __init__( #
... self,
... id2bmi: typing.Mapping[str, float],
... ):
... self._id2bmi = id2bmi
...
... @property
... def name(self) -> str: #
... return "BMI phenotype scorer"
...
... @property
... def description(self) -> str: #
... return "Body mass index used as a phenotype score"
...
... @property
... def variable_name(self) -> str: #
... return "BMI"
...
... def score(self, patient: Patient) -> float: #
... try:
... return self._id2bmi[patient.labels.label]
... except KeyError:
... return float('nan')

❶ The ``BmiScorer`` must extend :class:`~gpsea.analysis.pscore.PhenotypeScorer`
to be used as a phenotype scorer.
❷ The scorer needs a ``dict`` with `label` → `BMI` for the analyzed individuals.
We assume the user will pre-compute the corresponding ``dict``.

Then, the scorer must expose several properties, including ❸ ``name``, ❹ ``description``,
and the ❺ ``variable_name`` it operates on.
The properties provide bookkeeping metadata to use in e.g. visualizations.
Try to choose short and concise names.

The most important part of the scorer is the ❻ `score` method
which retrieves the BMI for an individual or returns `NaN` if the value is not available
and the individual should be omitted from the analysis.

.. _custom-variant-predicate:

*****************
Variant predicate
*****************

The purpose of a :class:`~gpsea.analysis.predicate.VariantPredicate` is to test
if a variant meets a certain criterion and GPSEA ships with an array
of builtin predicates (see :mod:`gpsea.analysis.predicate` module).
However, chances are a custom predicate will be needed in future,
so we show how to how to extend
the :class:`~gpsea.analysis.predicate.VariantPredicate` class
to create one's own predicate.

Specifically, we show how to create a predicate to test if the variant affects a glycine residue
of the transcript of interest.

>>> from gpsea.model import Variant, VariantEffect
>>> from gpsea.analysis.predicate import VariantPredicate
>>> class AffectsGlycinePredicate(VariantPredicate): #
... def __init__( #
... self,
... tx_id: str,
... ):
... self._tx_id = tx_id
... self._aa_code = "Gly"
...
... @property
... def name(self) -> str: #
... return "Affects Glycine"
...
... @property
... def description(self) -> str: #
... return "affects a glycine residue"
...
... @property
... def variable_name(self) -> str: #
... return "affected aminoacid residue"
...
... def test(self, variant: Variant) -> bool: #
... tx_ann = variant.get_tx_anno_by_tx_id(self._tx_id)
... if tx_ann is not None:
... hgvsp = tx_ann.hgvsp
... if hgvsp is not None:
... return hgvsp.startswith(f"p.{self._aa_code}")
... return False
...
... def __eq__(self, value: object) -> bool: #
... return isinstance(value, AffectsGlycinePredicate) and self._tx_id == value._tx_id
...
... def __hash__(self) -> int: #
... return hash((self._tx_id,))
...
... def __repr__(self) -> str: #
... return str(self)
...
... def __str__(self) -> str: #
... return f"AffectsGlycinePredicate(tx_id={self._tx_id})"

❶ The ``AffectsGlycinePredicate`` must extend :class:`~gpsea.analysis.predicate.VariantPredicate`.
❷ We ask the user to provide the transcript accession `str` and we set the target aminoacid code to glycine ``Gly``.
Like in the :ref:`custom-phenotype-scorer` above, ❸❹❺ provide metadata required for the bookkeeping.
The ❻ ``test`` method includes the most interesting part - we retrieve the :class:`~gpsea.model.TranscriptAnnotation`
with the functional annotation data for the transcript of interest, and we test if the HGVS protein indicates
that the reference aminoacid is glycine.
Last, we override ➐ ``__eq__()`` and ❽ ``__hash__()`` (required) as well as ❾ ``__repr__()`` and ➓ ``__str__()`` (recommended).
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -28,4 +28,5 @@ in an interactive Python environment, such as Jupyter notebook.
input-data
exploratory
analyses/index
custom-components
glossary
8 changes: 7 additions & 1 deletion src/gpsea/analysis/pscore/_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -141,13 +141,15 @@ def plot_boxplots(
self,
ax,
colors=("darksalmon", "honeydew"),
median_color: str = "black",
):
"""
Draw box plot with distributions of phenotype scores for the genotype groups.
:param gt_predicate: the genotype predicate used to produce the genotype groups.
:param ax: the Matplotlib :class:`~matplotlib.axes.Axes` to draw on.
:param colors: a tuple with colors to use for coloring the box patches of the box plot.
:param colors: a sequence with colors to use for coloring the box patches of the box plot.
:param median_color: a `str` with the color for the boxplot median line.
"""
# skip the patients with unassigned genotype group
bla = self._data.notna()
Expand Down Expand Up @@ -175,9 +177,13 @@ def plot_boxplots(
tick_labels=gt_cat_names,
)

# Set face colors of the boxes
for patch, color in zip(bplot["boxes"], colors):
patch.set_facecolor(color)

for median in bplot['medians']:
median.set_color(median_color)

def __eq__(self, value: object) -> bool:
return isinstance(value, PhenotypeScoreAnalysisResult) and super(
MonoPhenotypeAnalysisResult, self
Expand Down
2 changes: 1 addition & 1 deletion src/gpsea/preprocessing/_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -495,7 +495,7 @@ def load_phenopackets(
# Keep track of the progress by wrapping the list of phenopackets
# with TQDM 😎
cohort_iter = tqdm(
phenopackets, desc="Individuals Processed", file=sys.stdout, unit="individuals"
phenopackets, desc="Individuals Processed", file=sys.stdout, unit=" individuals"
)
notepad = create_notepad(label="Phenopackets")
cohort = cohort_creator.process(cohort_iter, notepad)
Expand Down
32 changes: 24 additions & 8 deletions src/gpsea/preprocessing/_phenopacket.py
Original file line number Diff line number Diff line change
Expand Up @@ -128,6 +128,7 @@ def find_coordinates(
f"{variation_descriptor.vcf_record.genome_assembly} is not {self._build.identifier}."
)
contig = self._build.contig_by_name(variation_descriptor.vcf_record.chrom)
assert contig is not None
start = int(variation_descriptor.vcf_record.pos) - 1
ref = variation_descriptor.vcf_record.ref
alt = variation_descriptor.vcf_record.alt
Expand All @@ -142,6 +143,7 @@ def find_coordinates(
seq_location = variation.copy_number.allele.sequence_location
refseq_contig_name = seq_location.sequence_id.split(":")[1]
contig = self._build.contig_by_name(refseq_contig_name)
assert contig is not None

# Assuming SV coordinates are 1-based (VCF style),
# so we subtract 1 to transform to 0-based coordinate system
Expand Down Expand Up @@ -591,7 +593,11 @@ def _extract_age(
) -> typing.Optional[Age]:
if individual.HasField("time_at_last_encounter"):
tale = individual.time_at_last_encounter
return parse_age_element(tale, notepad)
return parse_age_element(
'time_at_last_encounter',
tale,
notepad,
)
return None

@staticmethod
Expand All @@ -614,6 +620,7 @@ def _extract_vital_status(

if vital_status.HasField("time_of_death"):
age_of_death = parse_age_element(
'time_of_death',
time_element=vital_status.time_of_death,
notepad=notepad,
)
Expand Down Expand Up @@ -904,21 +911,27 @@ def parse_onset_element(
We allow to use `GestationalAge`, `Age` or `OntologyClass` as onset.
"""
case = time_element.WhichOneof("element")
if case == "ontology_class":
if case == "age":
age = time_element.age
return Age.from_iso8601_period(value=age.iso8601duration)
elif case == "gestational_age":
age = time_element.gestational_age
return Age.gestational(weeks=age.weeks, days=age.days)
elif case == "ontology_class":
if term_onset_parser is None:
return None
else:
return term_onset_parser.process(
ontology_class=time_element.ontology_class,
notepad=notepad,
)
)
else:
return parse_age_element(
time_element=time_element,
notepad=notepad,
)
notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`")
return None


def parse_age_element(
field: str,
time_element: PPTimeElement,
notepad: Notepad,
) -> typing.Optional[Age]:
Expand All @@ -933,5 +946,8 @@ def parse_age_element(
age = time_element.age
return Age.from_iso8601_period(value=age.iso8601duration)
else:
notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`")
notepad.add_warning(
f"{case} of the {field} field cannot be parsed into age",
"Consider formatting the age as ISO8601 duration (e.g., \"P31Y2M\" for 31 years and 2 months)"
)
return None

0 comments on commit d0f69d0

Please sign in to comment.