Mutational load function (SHM) #536

Open
wants to merge 21 commits into main

Conversation

MKanetscheider
Collaborator

Added a mutational_load function to calculate differences between the sequence and germline alignment. This is especially useful/insightful for BCRs due to SHM and helps to understand how much mutation actually occurred. However, this is a rather simple approach!

Closes #...

  • CHANGELOG.md updated
  • Tests added (For bug fixes or new features)
  • Tutorial updated (if necessary)
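For context, a hypothetical usage sketch (the function is exposed under scirpy's tl module per the diff below; the argument names frequency and inplace appear in the code, but the final signature may differ):

import scirpy as ir

# `adata` is assumed to hold AIRR rearrangement data with IMGT-aligned sequence
# and germline alignments (not constructed here).
ir.tl.mutational_load(adata, frequency=False, inplace=True)  # mutation counts per sub-region, written to adata.obs
mutation_df = ir.tl.mutational_load(adata, frequency=True, inplace=False)  # frequencies returned as a DataFrame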

src/scirpy/tl/_mutational_load.py (Outdated)
@@ -7,5 +7,6 @@
from ._diversity import alpha_diversity
from ._group_abundance import group_abundance
from ._ir_query import ir_query, ir_query_annotate, ir_query_annotate_df
from ._mutational_load import mutational_load
Collaborator

Please make sure to also add the tool to the API documentation here:
https://github.com/scverse/scirpy/blob/main/docs/api.rst#tools-tl

Collaborator Author

I added it and I think it looks quite good :)

Comment on lines 172 to 275
mutation_cdr1 = []
mutation_cdr2 = []
mutation_cdr3 = []

for row in range(len(airr_df)):
    fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78]
    cdr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][78:114]
    fwr2_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][114:165]
    cdr2_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][165:195]
    fwr3_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][195:312]
    cdr3_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][
        312 : (312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6)
    ]
    fwr4_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][
        (312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6) :
    ]

    if frequency:
        fwr1_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_fwr1"], fwr1_germline, frequency=True
        )
        cdr1_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_cdr1"], cdr1_germline, frequency=True
        )
        fwr2_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_fwr2"], fwr2_germline, frequency=True
        )
        cdr2_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_cdr2"], cdr2_germline, frequency=True
        )
        fwr3_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_fwr3"], fwr3_germline, frequency=True
        )
        cdr3_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_cdr3"], cdr3_germline, frequency=True
        )
        fwr4_mu_rel = simple_hamming_distance(
            subregion_df.iloc[row].loc[f"{chain}_fwr4"], fwr4_germline, frequency=True
        )

        mutation_fwr1.append(fwr1_mu_rel)
        mutation_fwr2.append(fwr2_mu_rel)
        mutation_fwr3.append(fwr3_mu_rel)
        mutation_fwr4.append(fwr4_mu_rel)
        mutation_cdr1.append(cdr1_mu_rel)
        mutation_cdr2.append(cdr2_mu_rel)
        mutation_cdr3.append(cdr3_mu_rel)

    else:
        fwr1_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_fwr1"], fwr1_germline)
        cdr1_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_cdr1"], cdr1_germline)
        fwr2_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_fwr2"], fwr2_germline)
        cdr2_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_cdr2"], cdr2_germline)
        fwr3_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_fwr3"], fwr3_germline)
        cdr3_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_cdr3"], cdr3_germline)
        fwr4_mu_count = simple_hamming_distance(subregion_df.iloc[row].loc[f"{chain}_fwr4"], fwr4_germline)

        mutation_fwr1.append(fwr1_mu_count)
        mutation_fwr2.append(fwr2_mu_count)
        mutation_fwr3.append(fwr3_mu_count)
        mutation_fwr4.append(fwr4_mu_count)
        mutation_cdr1.append(cdr1_mu_count)
        mutation_cdr2.append(cdr2_mu_count)
        mutation_cdr3.append(cdr3_mu_count)

if not inplace and frequency:
    mutation_df[f"{chain}_fwr1_mu_freq"] = mutation_fwr1
    mutation_df[f"{chain}_cdr1_mu_freq"] = mutation_cdr1
    mutation_df[f"{chain}_fwr2_mu_freq"] = mutation_fwr2
    mutation_df[f"{chain}_cdr2_mu_freq"] = mutation_cdr2
    mutation_df[f"{chain}_fwr3_mu_freq"] = mutation_fwr3
    mutation_df[f"{chain}_cdr3_mu_freq"] = mutation_cdr3
    mutation_df[f"{chain}_fwr4_mu_freq"] = mutation_fwr4

if inplace and frequency:
    params.set_obs(f"{chain}_fwr1_mu_freq", mutation_fwr1)
    params.set_obs(f"{chain}_cdr1_mu_freq", mutation_cdr1)
    params.set_obs(f"{chain}_fwr2_mu_freq", mutation_fwr2)
    params.set_obs(f"{chain}_cdr2_mu_freq", mutation_cdr2)
    params.set_obs(f"{chain}_fwr3_mu_freq", mutation_fwr3)
    params.set_obs(f"{chain}_cdr3_mu_freq", mutation_cdr3)
    params.set_obs(f"{chain}_fwr4_mu_freq", mutation_fwr4)

if inplace and not frequency:
    params.set_obs(f"{chain}_fwr1_mu_count", mutation_fwr1)
    params.set_obs(f"{chain}_cdr1_mu_count", mutation_cdr1)
    params.set_obs(f"{chain}_fwr2_mu_count", mutation_fwr2)
    params.set_obs(f"{chain}_cdr2_mu_count", mutation_cdr2)
    params.set_obs(f"{chain}_fwr3_mu_count", mutation_fwr3)
    params.set_obs(f"{chain}_cdr3_mu_count", mutation_cdr3)
    params.set_obs(f"{chain}_fwr4_mu_count", mutation_fwr4)

if not inplace and not frequency:
    mutation_df[f"{chain}_fwr1_mu_count"] = mutation_fwr1
    mutation_df[f"{chain}_cdr1_mu_count"] = mutation_cdr1
    mutation_df[f"{chain}_fwr2_mu_count"] = mutation_fwr2
    mutation_df[f"{chain}_cdr2_mu_count"] = mutation_cdr2
    mutation_df[f"{chain}_fwr3_mu_count"] = mutation_fwr3
    mutation_df[f"{chain}_cdr3_mu_count"] = mutation_cdr3
    mutation_df[f"{chain}_fwr4_mu_count"] = mutation_fwr4
Collaborator

This could surely be written more compactly by using a bunch of for loops...
Ideally try to extract the functionality you apply to one sequence into a smaller function and then apply it to each sequence.

Collaborator Author

I rewrote the function based on this feedback and it now looks tidier! Please let me know if I should change/adapt it further.

mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []}

for row in range(len(airr_df)):
    fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78]
Collaborator

Where do the numbers of the indices come from? Can we be sure they will remain stable?

Collaborator Author

These indices come from the IMGT unique numbering scheme (https://pubmed.ncbi.nlm.nih.gov/12477501/). This scheme is a standard approach to ensure that we can compare different V-regions of different cells. The neat thing is that the sequences are aligned such that FWR1-3 and CDR1-2 always sit at the same positions in both the germline and the sequence alignment, which is why these fixed indices work. CDR3 and FWR4 can be inferred from the junction length and the total sequence length, as is done in my code.
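For illustration, a minimal sketch of how those fixed IMGT coordinates slice a gapped alignment into sub-regions (coordinates taken from the code under review; the helper name is made up):

# Sketch: slicing an IMGT-gapped germline alignment into sub-regions.
# The fixed boundaries below are the IMGT nucleotide coordinates used in the PR code.

def split_imgt_regions(germline_alignment: str, junction_len: int) -> dict[str, str]:
    """Return the FWR/CDR sub-regions of an IMGT-gapped V(D)J alignment."""
    cdr3_end = 312 + junction_len - 6  # junction minus the conserved codons at both ends
    return {
        "fwr1": germline_alignment[:78],
        "cdr1": germline_alignment[78:114],
        "fwr2": germline_alignment[114:165],
        "cdr2": germline_alignment[165:195],
        "fwr3": germline_alignment[195:312],
        "cdr3": germline_alignment[312:cdr3_end],
        "fwr4": germline_alignment[cdr3_end:],
    }

# Example with a dummy 400-nt alignment and junction length 45:
regions = split_imgt_regions("A" * 400, junction_len=45)
print({region: len(seq) for region, seq in regions.items()})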

Comment on lines 151 to 218
fwr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][:78]
cdr1_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][78:114]
fwr2_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][114:165]
cdr2_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][165:195]
fwr3_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][195:312]
cdr3_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][
    312 : (312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6)
]
fwr4_germline = airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][
    (312 + airr_df.iloc[row].loc[f"{chain}_junction_len"] - 6) :
]

mutation_dict["fwr1"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_fwr1"],
        fwr1_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["cdr1"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_cdr1"],
        cdr1_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["fwr2"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_fwr2"],
        fwr2_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["cdr2"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_cdr2"],
        cdr2_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["fwr3"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_fwr3"],
        fwr3_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["cdr3"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_cdr3"],
        cdr3_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
mutation_dict["fwr4"].append(
    simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_fwr4"],
        fwr4_germline,
        frequency=frequency,
        ignore_chars=ignore_chars,
    )
)
Collaborator

I think this could be further simplified by defining a dict of regions

regions = {
  "fwr1": (0, 78),
  "cdr1": (78, 114),
  ...
}

and then looping through it; somewhat like

mutation_dict = {}
for region, coordinates in regions.items():
    mutation_dict[region] = simple_hamming_distance(
        subregion_df.iloc[row].loc[f"{chain}_{region}"],
        airr_df.iloc[row].loc[f"{chain}_{germline_alignment}"][slice(*coordinates)],
        ...
    )
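For reference, a self-contained sketch of this pattern, under the assumption that both the sequence and the germline alignment can be sliced with the same coordinates; hamming below is a toy stand-in for scirpy's simple_hamming_distance, and CDR3/FWR4 are omitted because they depend on the junction length:

# Toy region coordinates (a subset of the fixed IMGT boundaries above).
REGIONS = {
    "fwr1": (0, 78),
    "cdr1": (78, 114),
    "fwr2": (114, 165),
    "cdr2": (165, 195),
    "fwr3": (195, 312),
}

def hamming(seq: str, germline: str, frequency: bool = False) -> float:
    """Toy mismatch count (or frequency) between two equal-length strings."""
    mismatches = sum(a != b for a, b in zip(seq, germline))
    return mismatches / len(germline) if frequency else mismatches

def mutations_per_region(sequence_alignment: str, germline_alignment: str, frequency: bool = False) -> dict[str, float]:
    """Compare sequence vs. germline alignment region by region."""
    mutation_dict = {}
    for region, coordinates in REGIONS.items():
        mutation_dict[region] = hamming(
            sequence_alignment[slice(*coordinates)],
            germline_alignment[slice(*coordinates)],
            frequency,
        )
    return mutation_dict

# Example with two dummy 320-nt alignments:
print(mutations_per_region("ACGT" * 80, "ACGA" * 80, frequency=True))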

Collaborator Author

I see :D
Thanks for the suggestions! I will try to further simplify it!

Collaborator Author

Thanks for the hint. I just adapted the code based on this suggestion and now it looks so much better :D

What's still pending is a robust test case. I had a look at how other tests are written and I get the idea. However, I'm still not sure what the best way is to get/generate test data. Should I manually generate a small dataset (e.g. 10 sequences) inside the test function to test mutational_load, or what is the best practice here?
The dataset needs to be IMGT-numbered, so I don't think I could load any of scirpy's native datasets here...

Collaborator

Should I manually generate a small dataset (e.g. 10 sequences) inside the test function

Yes, this is common practice. You can also put a small data file in src/scirpy/tests/data.
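For example, a sketch of what such a hand-crafted test could look like; the sequences, column names, and expected values below are made up for illustration, and the real test would feed them to mutational_load via an AnnData object rather than checking the slices directly:

import pandas as pd

def test_mutational_load_toy_example():
    # IMGT-gapped toy sequences with a known number of mismatches:
    # FWR1 spans positions 0-78, CDR1 spans 78-114 (coordinates as in the PR).
    germline = "A" * 78 + "C" * 36
    sequence = "A" * 76 + "GG" + "C" * 34 + "TT"  # 2 mismatches in FWR1, 2 in CDR1

    airr_df = pd.DataFrame(
        {
            "sequence_alignment": [sequence],
            "germline_alignment": [germline],
        }
    )

    row = airr_df.iloc[0]
    fwr1_mut = sum(a != b for a, b in zip(row["sequence_alignment"][:78], row["germline_alignment"][:78]))
    cdr1_mut = sum(a != b for a, b in zip(row["sequence_alignment"][78:114], row["germline_alignment"][78:114]))

    assert fwr1_mut == 2
    assert cdr1_mut == 2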

Collaborator Author

Okay, I'll come up with something for my next push here :)

    ),
}

for v, coordinates in regions.items():
Collaborator

Suggested change:
- for v, coordinates in regions.items():
+ for region, coordinates in regions.items():

One-letter loop variables should only be used if they follow certain conventions, e.g. i/j/k for counters in for loops, or k, v for key/value pairs from dict.items().

Since you use v for the dict key, this can be confusing, and I suggest using a "proper" variable name like region here.

for chain in chains:
    airr_df[f"{chain}_junction_len"] = [len(a) for a in airr_df[f"{chain}_junction"]]

    mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []}
Collaborator

Suggested change:
- mutation_dict = {"fwr1": [], "fwr2": [], "fwr3": [], "fwr4": [], "cdr1": [], "cdr2": [], "cdr3": []}
+ mutation_dict = defaultdict(list)
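For reference, a small sketch of what defaultdict(list) does: missing keys are created on first access, so the region names don't need to be spelled out up front.

from collections import defaultdict

mutation_dict = defaultdict(list)
mutation_dict["fwr1"].append(3)      # key "fwr1" is created on first access
mutation_dict["cdr3"].append(0.05)
print(dict(mutation_dict))           # {'fwr1': [3], 'cdr3': [0.05]}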

@grst
Collaborator

grst commented Aug 19, 2024

In terms of implementation, I think we're getting there :)
Still need to try it out myself to check if I like the overall workflow/interface when using this in a Jupyter notebook.

@MKanetscheider
Collaborator Author

Hi Gregor,
I worked on the test case for the mutational_load function and finished some kind of "beta" version. I would be very grateful if you could have a look at whether I'm going in the right direction here :)
Additionally, I discovered some bugs while testing, which I quickly resolved, but maybe not elegantly, so please also have a look at that.

For some reason, pushing these changes seems to have broken something with MuData, but I have no idea why or what I could possibly have done to cause this 😢 The error message seems to be the same everywhere:
ImportError: cannot import name 'AlignedViewMixin' from 'anndata._core.aligned_mapping'
Could you help me solve this?

@grst
Collaborator

grst commented Aug 29, 2024

Breaking mudata is not your fault. It was caused by an anndata release and should be fixed by now. Just rerun the tests :)


codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.70%. Comparing base (d1db848) to head (ea80d69).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
src/scirpy/tl/_mutational_load.py 85.88% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #536      +/-   ##
==========================================
+ Coverage   80.19%   81.70%   +1.51%     
==========================================
  Files          49       50       +1     
  Lines        4079     4297     +218     
==========================================
+ Hits         3271     3511     +240     
+ Misses        808      786      -22     

☔ View full report in Codecov by Sentry.

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@grst grst mentioned this pull request Oct 17, 2024
Labels: None yet
Projects: Status: In progress
3 participants