Skip to content

Conversation

pcruzparri
Copy link
Contributor

@pcruzparri pcruzparri commented Mar 10, 2025

Purpose:

  1. ParseModifications was bugged in that it could output negative indices. Updated the regex pattern also to explicitly ignore cation brackets (i.e. [Metal: Calcium[II] on E] will ignore the first closing bracket, since it is not the end of the mod representation).
  2. Our current string representation of full sequences does not differentiate between a C-terminus mod and a mod on the side chain of the C-terminus amino acid.

Once these changes are positively reviewed, I'll update some of the remaining tests not passing, since they are more tedious to update.

First commit message:
Changed the BioPolymerWithSetModsExtensions class to write full sequences separating the C-terminus with a dash. Updated some of the tests that failed because of the new notation of C-terminus mods. Some tests are still failing, and will be updated once happy with this general change.

…. Changed the BioPolymerWithSetModsExtensions class to write full sequences separating the C-terminus with a dash. Updated some of the tests that failed because of the new notation of C-terminus mods. Some tests are still failing, and will be updated once happy with this general change.
@pcruzparri pcruzparri marked this pull request as draft March 10, 2025 20:17
Copy link

codecov bot commented Mar 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.13%. Comparing base (9c30c4c) to head (6678f54).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #839   +/-   ##
=======================================
  Coverage   78.13%   78.13%           
=======================================
  Files         234      234           
  Lines       35014    35018    +4     
  Branches     3657     3658    +1     
=======================================
+ Hits        27358    27362    +4     
  Misses       7034     7034           
  Partials      622      622           
Files with missing lines Coverage Δ
mzLib/MzLibUtil/ClassExtensions.cs 100.00% <100.00%> (ø)
mzLib/Omics/BioPolymerWithSetModsExtensions.cs 95.50% <100.00%> (ø)
mzLib/Omics/SpectrumMatch/SpectrumMatchFromTsv.cs 97.02% <100.00%> (-0.33%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pcruzparri
Copy link
Contributor Author

pcruzparri commented Mar 19, 2025

The only failing integration error is the following:
Error: D:\a\mzLib\mzLib\MetaMorpheus\MetaMorpheus\EngineLayer\Silac\SilacConversions.cs(659,130): error CS1061: 'IIndexedMzPeak' does not contain a definition for 'ZeroBasedMs1ScanIndex' and no accessible extension method 'ZeroBasedMs1ScanIndex' accepting a first argument of type 'IIndexedMzPeak' could be found (are you missing a using directive or an assembly reference?)

The IIndexedMzPeak interface renamed the field ZeroBasedMs1ScanIndex to ZeroBasedScanIndex but the change has not been made on MetaMorpheus. Will open a small PR to fix that.
Edit: PR #2494 on MetaMorpheus.

@pcruzparri pcruzparri marked this pull request as ready for review March 19, 2025 15:27
nbollis
nbollis previously approved these changes Mar 20, 2025
@Alexander-Sol
Copy link
Contributor

The only failing integration error is the following: Error: D:\a\mzLib\mzLib\MetaMorpheus\MetaMorpheus\EngineLayer\Silac\SilacConversions.cs(659,130): error CS1061: 'IIndexedMzPeak' does not contain a definition for 'ZeroBasedMs1ScanIndex' and no accessible extension method 'ZeroBasedMs1ScanIndex' accepting a first argument of type 'IIndexedMzPeak' could be found (are you missing a using directive or an assembly reference?)

The IIndexedMzPeak interface renamed the field ZeroBasedMs1ScanIndex to ZeroBasedScanIndex but the change has not been made on MetaMorpheus. Will open a small PR to fix that. Edit: PR #2494 on MetaMorpheus.

I'm responsible for that test breaking. We need to put out a mzLib release before we fix it in MetaMorpheus. When we release, I'll fix that issue in MM

trishorts
trishorts previously approved these changes Mar 28, 2025
…s sequences, since it covers most but not many interesting cases. Best to remove it to maintain code coverage. I will add some notes on the issue on the PR for future reference.
@pcruzparri
Copy link
Contributor Author

We'll want to address ambiguous sequences in the future with these string parsing methods. Here are some non-trivial cases I saw when trying a first implementation of that in this PR, that is probably best addressed later on.

Problem:
I built a method that can parse the ambiguous sequences essentially by splitting them at the | character, finding mods for each, and then consolidating the mods found in each ambiguous sequence into one Dictionary<int, string> where an entry may look like <int PosX, string "ModY|ModZ"> if there is ambiguity at a given position. However, this is still problematic. An interesting case is P[Less Common:Proline pyrrole to pyrrolidine six member ring on P]HSEAGTAFIQTQQLHAAMADTFLEHMC[Common Fixed:Carbamidomethyl on C]R|[Less Common:Formylation on X]PHS[Less Common:Reduction on S]EAGTAFIQTQQLHAAMADTFLEHMC[Common Fixed:Carbamidomethyl on C]R, because the ambiguity spans two positions of the dictionary, namely the position of the N-terminus (index 0) and the position of the first side chain (index 1). Let's call this first scenario S1.
While thinking about this problem, I considered that when dealing with ambiguity, we may also have a situation where the full sequence is [ModX]R[ModY]AA|R[ModZ]AA. Lets call this second scenario S2.

Potential Solutions:

  1. A new ParseModificationsWithAmbiguity() method output is not indexed with termini, but just amino acid positions (one based index). So an N-terminus mod is localized to position 1, namely the first amino acid. Side chains mods on the first aminoacid are also localized to position 1. The mods can be added to this position in a way that represents both terminus/sidechain position as well as ambiguity. The ; represent a separator for the terminus while | represents ambiguity (logical OR, similar to parsimony). For example,
    • Case with just a N-terminus mod
      • Full Sequence: [ModX]RAAAA
      • Dictionary Entry: <int 1, string "ModX"; >
    • Case with mod on first side chain
      • Full Sequence: R[ModX]AAAA
      • Dictionary Entry: <int 1, string ";ModX" >
    • Case with mod on first side chain
      • Full Sequence: [ModX]R[ModY]AAAA
      • Dictionary Entry: <int 1, string "ModX;ModY ">
    • Now, S1 can be represented as <int 1, string "ModX;|;ModY">
    • S2 can be represented as <int 1, string "ModX;ModY|;ModZ">
  2. While this seems somewhat outside the current implementation of ProForma, a "ProForma-esque" approach can be used by giving mods unique identifiers, which might allow us to keep terminal indexing (keeping index 0 for the N-terminus, for example). In the output dictionary, we would write the mod, followed by #ID, allowing us to reference other mods at other positions for ambiguity. I think this would have redundant information, which is not great but not the worst. Here, the ; represents the logical AND (similar to parsimony) and | represents ambiguity (logical OR).
    • S1 would look like {<int 0, string "ModX#p0|#p1">, <int 1, string "#p0|ModY#p1>}
    • S2 would look like {<int 0, string "ModX#p0;ModY#p1|#p2">, <int 1, string "#p0;#p1|ModZ#p2>}

Copy link
Member

@nbollis nbollis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants