Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logoplots #534

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ dependencies = [
'pooch>=1.7.0',
'pycairo>=1.20; sys_platform == "win32"',
'joblib>=1.3.1',
'palmotif',
'IPython',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the IPython dependency necessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Palmotif returns a SVG object, which is by default saved as such accordingly. What I did here was that I used the IPython.display's function SVG to directly display the SVG in my notebook and not always have to save it as a file. This was a workaround that I have almost forgotten, and I am not sure that this is even necessary...although I like to be able to directly investigate plots inside the notebook. However, I noticed that this is probably not the smartest way to do as I do not offer the possibility to save the SVG to a file any more and I might adapt this accordingly. Either way, I am not sure if SVG is a handy format for users to deal with... do you know if there is a possibility to save it as a png file or any other format that might be more convenient to deal with?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like palmotif can use matplotlib as a backend:
https://github.com/agartland/palmotif/blob/e228c2a9772acf1e4a2a0f3e15782b8096704cec/palmotif/mpl_plot.py#L30

This should be supported by jupyter notebooks natively, and the user can save it to any format they like as with any other matplotlib plot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have a look into this, but sounds promising :D

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just having a hard time to manipulate the plot as plt.show() is called inside mpl_plot and it returns three objects, which are somewhat confusing to me...

Do you know if a object such as
[[(Text(0, 0, 'C'), 178.8135375281251)],
[(Text(0, 0, 'A'), 178.8135375281251)],
[(Text(0, 0, 'S'), 178.8135375281251)],
[(Text(0, 0, 'S'), 178.8135375281251)],
[(Text(0, 0, 'E'), 11.495650387247952),
(Text(0, 0, 'H'), 11.495650387247952),
(Text(0, 0, 'L'), 11.495650387247952),
(Text(0, 0, 'T'), 11.495650387247952),
(Text(0, 0, 'Y'), 11.495650387247958),
(Text(0, 0, 'P'), 22.991300774495915)],
[(Text(0, 0, 'G'), 13.08748712194183),
(Text(0, 0, 'S'), 13.08748712194183),
(Text(0, 0, 'Y'), 13.08748712194183),
(Text(0, 0, 'P'), 26.17497424388366),
(Text(0, 0, 'V'), 26.17497424388366)],
[(Text(0, 0, 'G'), 11.495650387247952),
(Text(0, 0, 'I'), 11.495650387247952),
(Text(0, 0, 'L'), 11.495650387247952),
(Text(0, 0, 'S'), 11.495650387247952),
(Text(0, 0, 'V'), 11.495650387247958),
(Text(0, 0, 'W'), 22.991300774495915)],
[(Text(0, 0, 'F'), 17.862997326023464),
(Text(0, 0, 'A'), 35.72599465204693),
(Text(0, 0, 'G'), 71.45198930409386)],
[(Text(0, 0, 'A'), 13.688315950194387),
(Text(0, 0, 'E'), 13.688315950194387),
(Text(0, 0, 'L'), 13.688315950194387),
(Text(0, 0, 'V'), 13.688315950194387),
(Text(0, 0, 'G'), 41.06494785058316)],
[(Text(0, 0, 'A'), 11.49565038724795),
(Text(0, 0, 'E'), 11.49565038724795),
(Text(0, 0, 'L'), 11.49565038724795),
(Text(0, 0, 'P'), 11.49565038724795),
(Text(0, 0, 'V'), 11.495650387247956),
(Text(0, 0, 'G'), 22.9913007744959)],
[(Text(0, 0, 'T'), 14.679323856635706),
(Text(0, 0, 'G'), 29.358647713271413),
(Text(0, 0, 'L'), 29.358647713271413),
(Text(0, 0, 'P'), 29.358647713271402)],
[(Text(0, 0, 'D'), 13.08748712194183),
(Text(0, 0, 'I'), 13.08748712194183),
(Text(0, 0, 'L'), 13.08748712194183),
(Text(0, 0, 'G'), 26.17497424388366),
(Text(0, 0, 'S'), 26.17497424388366)],
[(Text(0, 0, 'N'), 15.280152684888265),
(Text(0, 0, 'I'), 15.280152684888265),
(Text(0, 0, 'L'), 30.56030536977653),
(Text(0, 0, 'S'), 45.840458054664815)],
[(Text(0, 0, 'L'), 17.472818247834695),
(Text(0, 0, 'G'), 52.41845474350409),
(Text(0, 0, 'S'), 52.41845474350409)],
[(Text(0, 0, 'N'), 13.688315950194387),
(Text(0, 0, 'Q'), 13.688315950194387),
(Text(0, 0, 'S'), 13.688315950194387),
(Text(0, 0, 'T'), 13.688315950194387),
(Text(0, 0, 'A'), 41.06494785058316)],
[(Text(0, 0, 'D'), 13.688315950194387),
(Text(0, 0, 'Q'), 13.688315950194387),
(Text(0, 0, 'E'), 13.688315950194387),
(Text(0, 0, 'Y'), 13.688315950194387),
(Text(0, 0, 'N'), 41.06494785058316)],
[(Text(0, 0, 'E'), 15.280152684888265),
(Text(0, 0, 'P'), 15.280152684888265),
(Text(0, 0, 'T'), 30.56030536977653),
(Text(0, 0, 'V'), 45.840458054664815)],
[(Text(0, 0, 'L'), 60.16698866690969), (Text(0, 0, 'Q'), 80.2226515558796)],
[(Text(0, 0, 'H'), 17.472818247834695),
(Text(0, 0, 'T'), 52.41845474350409),
(Text(0, 0, 'Y'), 52.41845474350409)],
[(Text(0, 0, 'F'), 178.8135375281251)]]
can be easily transformed into a matplotlib figure or if it can be used to customize the returned matplotlib plot?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as plt.show() is called inside mpl_plot

that's annoying.

When choosing palmotif, did you also take a look at logomaker?
https://logomaker.readthedocs.io/en/latest/

Neither palmotif nor logomaker seem very actively maintained, but logomaker seems much more popular (according to github stars). From a first glance at the docs, logomaker seems more customizable and I also like that it doesn't have parasail as a hard dependency which I'd like to get rid of soon (see #450). But I'm not sure if it has other limitations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I am not sure why I have chosen palmotif...probably because it was used in the single-cell best practice book as well. I will have a look on logomaker, but if it's even preferable to use logomaker over palmotif (to get rid of parasail) I would be happy to adapt my code accordingly unless it is for whatever reason impossible to use logomaker in our case...

]

[project.optional-dependencies]
Expand Down
1 change: 1 addition & 0 deletions src/scirpy/pl/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
from ._clonotypes import COLORMAP_EDGES, clonotype_network
from ._diversity import alpha_diversity
from ._group_abundance import group_abundance
from ._logoplots import logoplot_cdr3_motif
from ._repertoire_overlap import repertoire_overlap
from ._spectratype import spectratype
from ._vdj_usage import vdj_usage
Expand Down
177 changes: 177 additions & 0 deletions src/scirpy/pl/_logoplots.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
from collections.abc import Sequence
from typing import Literal, Union

import palmotif as palm
import pandas as pd
from IPython.display import SVG

from scirpy.get import airr as get_airr
from scirpy.util import DataHandler


@DataHandler.inject_param_docs()
def logoplot_cdr3_motif(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the function to the API documentation: https://github.com/scverse/scirpy/blob/main/docs/api.rst

adata: DataHandler.TYPE,
chains: Union[
Literal["VJ_1", "VDJ_1", "VJ_2", "VDJ_2"],
Sequence[Literal["VJ_1", "VDJ_1", "VJ_2", "VDJ_2"]],
] = "VDJ_1",
airr_mod="airr",
airr_key="airr",
chain_idx_key="chain_indices",
cdr3_col: str = "junction_aa",
*,
by: Sequence[Literal["gene_segment", "clonotype", "length"]] = "length",
target_col: Union[None, str] = None,
gene_annotation: Union[None, list] = None,
clonotype_id: Union[None, list] = None,
clonotype_key: Union[None, str] = None,
cdr_len: int,
plot: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need that option, better to always use matplotlib. If the user wants to save a logo, they can do that through matplotlib.

color_scheme: Sequence[
Literal["nucleotide", "base_pairing", "hydrophobicity", "chemistry", "charge", "taylor", "logojs", "shapely"]
] = "taylor",
):
"""
A user friendly wrapper function for the palmotif python package.
Enables the analysis of potential amino acid motifs by displaying logo plots.

Parameters
----------
{adata}
chains
One or multiple chains from which to use CDR3 sequences
MKanetscheider marked this conversation as resolved.
Show resolved Hide resolved
{airr_mod}
{airr_key}
{chain_idx_key}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typicalle allow the user to specify an ax object into which the plot is added. This allows the user to easily compose multi-panel plots, e.g.

fig, ax =plt.subplots()
ir.pl.something (..., ax=ax)

Do you think this is possible with logomaker?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
I will have a look on your other graphical implementations, but as far as I'm concerned (by looking at the logomaker documentation with working examples) this should be possible. So the idea is that the user can access the ax object after using the function right? So it has to be returned at some point in the function call, right?

cdr3_col
key inside awkward array to retrieve junction information (aa)
by
Three options for convenient customisation:
length -- compares all sequences that match the selected length
clonotype -- compares all sequences that match the selected clonotype cluster(s)
-> need to define `clonotype_id` and `clonotype_key`
gene_segment -- compares all sequences that match the selected gene segment(s)
-> need to define `gene_annotation` and `target_col`
target_col
key inside awkward array to retrieve gene annotation information e.g. v_call, j_call
gene_annotation
a list of predefined genes deemed interesting to include in a logoplot
clonotype_id
predefined clonotype cluster id to investigate as a logoplot
clonotype_key
key inside .obs column under which clonotype cluster ids are stored
cdr_len
Specify one exact sequence length t compute sequence motifs
plot
defaults to true to return a SVG logoplot for direct investigation
set to false to retrieve the raw sequence motif for customised use
color_scheme
different color schemes used by palmotif. see https://github.com/agartland/palmotif/blob/master/palmotif/aacolors.py for more details

Returns
-------
Depending on `plot` either returns a SVG object or the calculated sequence motif as a pd.DataFrame
"""
params = DataHandler(adata, airr_mod, airr_key, chain_idx_key)

if by == "length":
airr_df = get_airr(params, [cdr3_col], chains)
if type(chains) is list:
if len(chains) > 2:
raise Exception("Only two different chains are allowed e.g. VDJ_1 and VDJ_2")

else:
cdr3_list = airr_df[airr_df[chains[0] + "_" + cdr3_col].str.len() == cdr_len][
chains[0] + "_" + cdr3_col
].to_list()
cdr3_list += airr_df[airr_df[chains[1] + "_" + cdr3_col].str.len() == cdr_len][
chains[1] + "_" + cdr3_col
].to_list()
motif = palm.compute_motif(cdr3_list)
else:
motif = palm.compute_motif(
airr_df[airr_df[chains + "_" + cdr3_col].str.len() == cdr_len][chains + "_" + cdr3_col].to_list()
)

if plot:
return SVG(palm.svg_logo(motif, return_str=False, color_scheme=color_scheme))
else:
return motif

if by == "gene_segment":
if target_col is None or gene_annotation is None:
raise Exception(
"Please specify where the gene information is stored (`target_col`) and which genes to include (`gene_annotation`) as a list"
)
if type(gene_annotation) is not list:
gene_annotation = list(gene_annotation.split(" "))

airr_df = get_airr(params, [cdr3_col, target_col], chains)
if type(chains) is list:
if len(chains) > 2:
raise Exception("Only two different chains are allowed e.g. VDJ_1 and VDJ_2")

cdr3_list = airr_df[
(airr_df[chains[0] + "_" + target_col].isin(gene_annotation))
& (airr_df[chains[0] + "_" + cdr3_col].str.len() == cdr_len)
][chains[0] + "_" + cdr3_col].to_list()
cdr3_list += airr_df[
(airr_df[chains[1] + "_" + target_col].isin(gene_annotation))
& (airr_df[chains[1] + "_" + cdr3_col].str.len() == cdr_len)
][chains[1] + "_" + cdr3_col].to_list()
motif = palm.compute_motif(cdr3_list)

else:
motif = palm.compute_motif(
airr_df[
(airr_df[chains + "_" + target_col].isin(gene_annotation))
& (airr_df[chains + "_" + cdr3_col].str.len() == cdr_len)
][chains + "_" + cdr3_col].to_list()
)

if plot:
return SVG(palm.svg_logo(motif, return_str=False, color_scheme=color_scheme))
else:
return motif

if by == "clonotype":
if clonotype_id is None or clonotype_key is None:
raise Exception(
"Please select desired clonotype cluster and the name of the column where this information is stored!"
)

if type(clonotype_id) is not list:
clonotype_id = list(clonotype_id.split(" "))

if type(chains) is list:
airr_df = get_airr(params, [cdr3_col], chains)
else:
airr_df = get_airr(params, [cdr3_col], [chains])
airr_df = pd.concat([airr_df, params.get_obs(clonotype_key)])
airr_df = airr_df.loc[params.get_obs(clonotype_key).isin(clonotype_id)]

if type(chains) is list:
if len(chains) > 2:
raise Exception("Only two different chains are allowed e.g. VDJ_1 and VDJ_2")

else:
cdr3_list = airr_df[airr_df[chains[0] + "_" + cdr3_col].str.len() == cdr_len][
chains[0] + "_" + cdr3_col
].to_list()
cdr3_list += airr_df[airr_df[chains[1] + "_" + cdr3_col].str.len() == cdr_len][
chains[1] + "_" + cdr3_col
].to_list()
motif = palm.compute_motif(cdr3_list)
else:
motif = palm.compute_motif(
airr_df[airr_df[chains + "_" + cdr3_col].str.len() == cdr_len][chains + "_" + cdr3_col].to_list()
)

if plot:
return SVG(palm.svg_logo(motif, return_str=False, color_scheme=color_scheme))
else:
return motif

else:
raise Exception("Invalid input for parameter `by`!")
Loading