Updates to Full Sequence Writing and Parsing of Mods #839

pcruzparri · 2025-03-10T20:16:35Z

Purpose:

ParseModifications had a bug that allowed it to output negative indices. I also updated the regex pattern to explicitly ignore cation brackets (e.g., in [Metal: Calcium[II] on E] the first closing bracket is ignored, since it is not the end of the mod representation).
Our current string representation of full sequences does not differentiate between a C-terminus mod and a mod on the side chain of the C-terminal amino acid.
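
To make the two points above concrete, here is a minimal Python sketch (not the mzLib implementation, which is C#). The pattern and the `parse_mods` helper are hypothetical: the pattern tolerates one level of nested brackets such as `[II]`, and the indexing (0 for the N-terminus, 1..n for residues, n+1 for a C-terminus mod written after a dash) is an assumption chosen so that no index can be negative.

```python
import re

# Hypothetical pattern: a bracketed mod that may contain one level of nested
# brackets (e.g. the "[II]" in a cation name). Not the regex used in the PR.
MOD_PATTERN = re.compile(r"\[((?:[^\[\]]|\[[^\[\]]*\])*)\]")


def parse_mods(full_sequence):
    """Return {index: mod} with assumed indexing:
    0 = N-terminus, 1..n = residues, n+1 = C-terminus (mod written after '-')."""
    mods = {}
    index = 0  # number of residues (and dashes) consumed so far
    pos = 0
    while pos < len(full_sequence):
        match = MOD_PATTERN.match(full_sequence, pos)
        if match:
            # The mod belongs to whatever was consumed last (or to the
            # N-terminus if nothing has been consumed yet).
            mods[index] = match.group(1)
            pos = match.end()
        else:
            # A residue advances the index; a '-' advances it once more so that
            # a C-terminus mod lands at n+1 instead of on the last side chain.
            index += 1
            pos += 1
    return mods


print(parse_mods("[Less Common:Formylation on X]PEPE[Metal: Calcium[II] on E]K"))
# {0: 'Less Common:Formylation on X', 4: 'Metal: Calcium[II] on E'}
```

With this convention, `PEK[ModS]` and `PEK-[ModC]` parse to different indices (3 and 4), which is the distinction the current string representation lacks.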

Once these changes are positively reviewed, I'll update some of the remaining tests not passing, since they are more tedious to update.

First commit message:
Changed the BioPolymerWithSetModsExtensions class to write full sequences separating the C-terminus with a dash. Updated some of the tests that failed because of the new notation of C-terminus mods. Some tests are still failing, and will be updated once happy with this general change.
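
As an illustration of the new notation, the following Python sketch writes a full sequence with the C-terminus mod separated by a dash. The function name and the dictionary indexing (0 for the N-terminus, 1..n for residues, n+1 for the C-terminus) are assumptions made for this example, not the BioPolymerWithSetModsExtensions API.

```python
def write_full_sequence(base_sequence, mods):
    """Build a full sequence string from a base sequence and a mod dictionary.

    Assumed (hypothetical) indexing: 0 = N-terminus, 1..n = residue side
    chains, n+1 = C-terminus.
    """
    n = len(base_sequence)
    parts = []
    if 0 in mods:
        parts.append(f"[{mods[0]}]")  # N-terminus mod precedes the first residue
    for i, residue in enumerate(base_sequence, start=1):
        parts.append(residue)
        if i in mods:
            parts.append(f"[{mods[i]}]")  # side-chain mod follows its residue
    if n + 1 in mods:
        parts.append(f"-[{mods[n + 1]}]")  # dash marks a C-terminus mod
    return "".join(parts)


print(write_full_sequence("PEK", {3: "ModS"}))  # PEK[ModS]
print(write_full_sequence("PEK", {4: "ModC"}))  # PEK-[ModC]
```

The dash is what lets a reader (or parser) tell a C-terminus mod apart from a mod on the last side chain.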


codecov · 2025-03-17T18:48:21Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.13%. Comparing base (9c30c4c) to head (6678f54).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #839   +/-   ##
=======================================
  Coverage   78.13%   78.13%           
=======================================
  Files         234      234           
  Lines       35014    35018    +4     
  Branches     3657     3658    +1     
=======================================
+ Hits        27358    27362    +4     
  Misses       7034     7034           
  Partials      622      622

| Files with missing lines | Coverage Δ |
|---|---|
| mzLib/MzLibUtil/ClassExtensions.cs | `100.00% <100.00%> (ø)` |
| mzLib/Omics/BioPolymerWithSetModsExtensions.cs | `95.50% <100.00%> (ø)` |
| mzLib/Omics/SpectrumMatch/SpectrumMatchFromTsv.cs | `97.02% <100.00%> (-0.33%)` ⬇️ |


pcruzparri · 2025-03-19T14:46:58Z

The only failing integration error is the following:
Error: D:\a\mzLib\mzLib\MetaMorpheus\MetaMorpheus\EngineLayer\Silac\SilacConversions.cs(659,130): error CS1061: 'IIndexedMzPeak' does not contain a definition for 'ZeroBasedMs1ScanIndex' and no accessible extension method 'ZeroBasedMs1ScanIndex' accepting a first argument of type 'IIndexedMzPeak' could be found (are you missing a using directive or an assembly reference?)

The IIndexedMzPeak interface renamed the field ZeroBasedMs1ScanIndex to ZeroBasedScanIndex but the change has not been made on MetaMorpheus. Will open a small PR to fix that.
Edit: PR #2494 on MetaMorpheus.


Alexander-Sol · 2025-03-20T20:57:42Z

> The only failing integration error is the following: Error: D:\a\mzLib\mzLib\MetaMorpheus\MetaMorpheus\EngineLayer\Silac\SilacConversions.cs(659,130): error CS1061: 'IIndexedMzPeak' does not contain a definition for 'ZeroBasedMs1ScanIndex' and no accessible extension method 'ZeroBasedMs1ScanIndex' accepting a first argument of type 'IIndexedMzPeak' could be found (are you missing a using directive or an assembly reference?)
>
> The IIndexedMzPeak interface renamed the field ZeroBasedMs1ScanIndex to ZeroBasedScanIndex but the change has not been made on MetaMorpheus. Will open a small PR to fix that. Edit: PR #2494 on MetaMorpheus.

I'm responsible for that test breaking. We need to put out an mzLib release before we can fix it in MetaMorpheus. Once we release, I'll fix that issue in MM.


pcruzparri · 2025-03-28T16:26:49Z

We'll want to address ambiguous sequences in these string parsing methods in the future. Here are some non-trivial cases I saw when trying a first implementation in this PR; they are probably best addressed later on.

Problem:
I built a method that parses ambiguous sequences by splitting them at the | character, finding the mods in each alternative, and then consolidating those mods into one Dictionary<int, string>, where an entry may look like <int PosX, string "ModY|ModZ"> if there is ambiguity at a given position. However, this is still problematic. An interesting case is P[Less Common:Proline pyrrole to pyrrolidine six member ring on P]HSEAGTAFIQTQQLHAAMADTFLEHMC[Common Fixed:Carbamidomethyl on C]R|[Less Common:Formylation on X]PHS[Less Common:Reduction on S]EAGTAFIQTQQLHAAMADTFLEHMC[Common Fixed:Carbamidomethyl on C]R, because the ambiguity spans two positions of the dictionary, namely the N-terminus (index 0) and the first side chain (index 1). Let's call this first scenario S1.
While thinking about this problem, I considered that ambiguity may also produce a full sequence such as [ModX]R[ModY]AA|R[ModZ]AA. Let's call this second scenario S2.
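
A rough Python sketch of the split-and-consolidate approach described above shows where it loses information. Both helpers are hypothetical stand-ins (the real method was C# and has been removed from this PR); indexing assumes 0 for the N-terminus and 1..n for residues.

```python
import re

# Hypothetical mod pattern allowing one level of nested brackets.
MOD_PATTERN = re.compile(r"\[((?:[^\[\]]|\[[^\[\]]*\])*)\]")


def parse_mods(sequence):
    """Return {index: mod}; assumed indexing: 0 = N-terminus, 1..n = residues."""
    mods, index, pos = {}, 0, 0
    while pos < len(sequence):
        match = MOD_PATTERN.match(sequence, pos)
        if match:
            mods[index] = match.group(1)
            pos = match.end()
        else:
            index += 1
            pos += 1
    return mods


def naive_merge(ambiguous_sequence):
    """Split at '|', parse each alternative, and join mods per position with '|'."""
    merged = {}
    for alternative in ambiguous_sequence.split("|"):
        for position, mod in parse_mods(alternative).items():
            options = merged.setdefault(position, [])
            if mod not in options:  # mods shared by all alternatives appear once
                options.append(mod)
    return {position: "|".join(options) for position, options in merged.items()}


print(naive_merge("[ModX]R[ModY]AA|R[ModZ]AA"))
# {0: 'ModX', 1: 'ModY|ModZ'}
```

For S2 the merged dictionary no longer records that ModX only occurs together with ModY: the entry at index 0 looks like an unambiguous N-terminus mod, even though the second alternative has none.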

Potential Solutions:

A new ParseModificationsWithAmbiguity() method whose output is not indexed by termini, only by amino acid position (one-based index). An N-terminus mod is localized to position 1, namely the first amino acid. Side-chain mods on the first amino acid are also localized to position 1. The mods can be added to this position in a way that represents both terminus/side-chain position and ambiguity. The ; is a separator between the terminus mod and the side-chain mod, while | represents ambiguity (logical OR, similar to parsimony). For example,
- Case with just an N-terminus mod
  - Full Sequence: [ModX]RAAAA
  - Dictionary Entry: <int 1, string "ModX;">
- Case with a mod on the first side chain
  - Full Sequence: R[ModX]AAAA
  - Dictionary Entry: <int 1, string ";ModX">
- Case with an N-terminus mod and a mod on the first side chain
  - Full Sequence: [ModX]R[ModY]AAAA
  - Dictionary Entry: <int 1, string "ModX;ModY">
- Now, S1 can be represented as <int 1, string "ModX;|;ModY">
- S2 can be represented as <int 1, string "ModX;ModY|;ModZ">
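
The position-1 encoding in the list above can be sketched in Python as follows. This is only an illustration of the proposed string format, limited to the first residue; the function name and the input shape (one already-parsed dictionary per alternative, with 0 for the N-terminus and 1 for the first side chain) are assumptions.

```python
def encode_first_residue(alternatives):
    """Encode position 1 as 'terminus;sidechain', joining alternatives with '|'.

    alternatives: list of {index: mod} dicts, one per ambiguous sequence,
    using the assumed indexing 0 = N-terminus, 1 = first side chain.
    """
    encoded = []
    for mods in alternatives:
        entry = f"{mods.get(0, '')};{mods.get(1, '')}"
        if entry not in encoded:  # identical alternatives collapse to one entry
            encoded.append(entry)
    return "|".join(encoded)


print(encode_first_residue([{0: "ModX"}]))                            # ModX;
print(encode_first_residue([{1: "ModX"}]))                            # ;ModX
print(encode_first_residue([{0: "ModX"}, {1: "ModY"}]))               # ModX;|;ModY
print(encode_first_residue([{0: "ModX", 1: "ModY"}, {1: "ModZ"}]))    # ModX;ModY|;ModZ
```

Because each alternative keeps its terminus and side-chain slots together, the S1 and S2 strings stay distinguishable without a separate index for the N-terminus.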
While this seems somewhat outside the current implementation of ProForma, a "ProForma-esque" approach could give mods unique identifiers, which might allow us to keep terminal indexing (keeping index 0 for the N-terminus, for example). In the output dictionary, we would write the mod followed by #ID, allowing us to reference mods at other positions for ambiguity. I think this would carry redundant information, which is not great but not the worst. Here, the ; represents logical AND (similar to parsimony) and | represents ambiguity (logical OR).
- S1 would look like {<int 0, string "ModX#p0|#p1">, <int 1, string "#p0|ModY#p1">}
- S2 would look like {<int 0, string "ModX#p0;ModY#p1|#p2">, <int 1, string "#p0;#p1|ModZ#p2">}
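
One possible reading of the ID-based format, sketched in Python: every mod gets an ID, each alternative is spelled out in full at the position of its first mod, and every other position refers to it by ID only. That rule is inferred from the S1 and S2 examples above and is an assumption, as are the function name and the input shape (one parsed dictionary per alternative, covering only the ambiguous region).

```python
def encode_with_ids(alternatives):
    """alternatives: list of {index: mod} dicts (0 = N-terminus, 1 = first residue)."""
    # Assign one ID per mod, in order of alternative and then position.
    groups = []
    next_id = 0
    for mods in alternatives:
        group = []
        for position in sorted(mods):
            group.append((position, mods[position], f"#p{next_id}"))
            next_id += 1
        groups.append(group)

    positions = sorted({position for mods in alternatives for position in mods})
    encoded = {}
    for position in positions:
        per_alternative = []
        for group in groups:
            if not group:
                per_alternative.append("")  # alternative without mods
            elif position == group[0][0]:
                # Position of this alternative's first mod: write names and IDs.
                per_alternative.append(";".join(f"{mod}{mod_id}" for _, mod, mod_id in group))
            else:
                # Elsewhere: reference the same mods by ID only.
                per_alternative.append(";".join(mod_id for _, _, mod_id in group))
        encoded[position] = "|".join(per_alternative)
    return encoded


print(encode_with_ids([{0: "ModX"}, {1: "ModY"}]))
# {0: 'ModX#p0|#p1', 1: '#p0|ModY#p1'}
print(encode_with_ids([{0: "ModX", 1: "ModY"}, {1: "ModZ"}]))
# {0: 'ModX#p0;ModY#p1|#p2', 1: '#p0;#p1|ModZ#p2'}
```

The redundancy mentioned above is visible here: every position in the ambiguous region repeats the full set of IDs for every alternative.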

nbollis

lgtm

pcruzparri marked this pull request as draft March 10, 2025 20:17

pcruzparri requested review from trishorts, nbollis and Alexander-Sol March 10, 2025 20:20

pcruzparri and others added 4 commits March 11, 2025 12:44

Cleaned up the ParseModification() method as well as updated it to not handle ambiguity(or multiple mods at the same position). Modified the corresponding tests or commented them out in case we want to revert.

6fd3eac

Merge branch 'smith-chem-wisc:master' into PatchFullSeqParseAndWrite

4baa201

Merge branch 'master' into PatchFullSeqParseAndWrite

429be58

updated the remaining tests that were failing.

1d82ffd

Removed two unused lines from ParseModifications

e6dd7bc

pcruzparri marked this pull request as ready for review March 19, 2025 15:27

nbollis previously approved these changes Mar 20, 2025

mzLib/Omics/SpectrumMatch/SpectrumMatchFromTsv.cs (outdated, resolved)

removing RemoveSpecialCharacter method from SpectrumMatchFromTsv

e37615f

pcruzparri dismissed nbollis’s stale review via e37615f March 24, 2025 20:02

nbollis and others added 4 commits March 25, 2025 15:08

Merge branch 'master' into PatchFullSeqParseAndWrite

0ef48e7

Merge branch 'smith-chem-wisc:master' into PatchFullSeqParseAndWrite

8df09e9

extra test for ParseModifications

9d3e033

merge master into branch

d7f8c4f

trishorts previously approved these changes Mar 28, 2025


removed the simple method I considered for parsing mods from ambiguous sequences, since it covers most but not many interesting cases. Best to remove it to maintain code coverage. I will add some notes on the issue on the PR for future reference.

6678f54

pcruzparri dismissed trishorts’s stale review via 6678f54 March 28, 2025 16:12

pcruzparri requested review from trishorts, nbollis and RayMSMS March 28, 2025 16:28

nbollis approved these changes Apr 2, 2025


pcruzparri mentioned this pull request Aug 20, 2025

Update ParseModifications output and modification regex. #939

Merged
