Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add Ability to accept multiple result files #137

Merged
merged 69 commits into from
Oct 15, 2024

Conversation

jcharkow
Copy link
Collaborator

@jcharkow jcharkow commented Sep 30, 2024

Description

  1. Loaders (MzMLLoader and SqMassLoader can accept a list of rsltsFile this is useful for automatically plotting features from a different software on the same plot.
  2. Auto detect results file based on contents (does not have to be supplied by user)
  3. Can specify a runName in loadTransitionGroup() and loadTransitionGroupFeatures() to only fetch information from that run. This is useful for plotChromatogram() method can only plot a chromatogram from a single run
  4. Update tests + docs to reflect these changes.

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

This is still a work in progress

Contents (#137)

Other

  • extend ResultsLoader class
  • Add tests for new ResultsLoader methods
  • minor fixes
  • update tests snapshots
  • fix tests
  • add scoring distribution
  • add documentation to OSWAccess getScoreTable
  • add tests for new Score distribution functions
  • add back searchResultAnalysis plots
  • SearchResultsAnalysis server
  • get GUI for search plot distributions working
  • remove include_groups flag
  • bug in transitionLoading
  • Roestlab/massdash into refactor/resultsPlotting
  • docs: Update Loading Feature Information Docs
  • update docs to reflect new behaviour
  • bug when to pick to label by software
  • update plotting gallery for new code refactoring
  • snapshots not directory specific
  • add tests of multiple result files with mzML loader
  • fix: failing tests based on OS
  • remove unneeded dependency
  • add multi-result file tests for sqmassLoader
  • add test throw error if sqMass with DIA-NN
  • fix bugs with new changes
  • fix: streamlit interface with new refactoring
  • fix result file reading with streamlit
  • feature: add custom logger
  • select specific runs to get transitions/features from
  • bug with loading transition data
  • load DIA-NN when Precursor.Mz not found
  • fix: add tests for new "select runs" functionality + bug fixes
  • update snapshots
  • typing for python 3.9
  • updates docs for new codebase
  • control when to output IM column
  • bug fixes
  • update tests
  • fix typo
  • better check for if streamlit is running
  • fix: update quickstart to showcase multiple files
  • check if running in jupytere context
  • add loaders/access to docs
  • update ResultsLoader docs
  • loadPrecursorScoreDistribution --> loadPeakGroupScoreDistribution

Uncategorised!

  • Merge branch 'test/add_tests' into refactor/resultsPlotting
  • [FEATURE] autodetect .TSV results type
  • Rename GenericLoader -> ResultsLoader
  • [FEATURE][TEST] load multiple result files in loader
  • [REFACTOR][FIX] SqMassLoader+MzMLLoader new interface
  • Add access methods getIdentifications
  • Add getExperimentSummary
  • Merge branch 'dev' into refactor/resultsPlotting
  • Merge branch 'dev' into refactor/resultsPlotting
  • Merge branch 'dev' into refactor/resultsPlotting
  • Merge branch 'dev' into refactor/resultsPlotting
  • add pep, qvalue and p value as valid columns
  • Merge branch 'dev' into refactor/resultsPlotting
  • fix tests
  • Update tutorials for refactored changes
  • Merge branch 'dev' into refactor/resultsPlotting
  • Merge branch 'dev' into refactor/resultsPlotting
  • bug fig
  • Add runNames to repr
  • remove unneeded code
  • apply more justin suggestions
  • apply more suggestions
  • apply more comments
  • add checks that expected tables are there

Results loader accepts a list of result files
- Rename generic loader to ResultsLoader, do not make it an abstract
  class, can have just result files

- Add methods for loading features to this class e.g.
  loadTopTransitionGroupFeature() which can take a list of result files
- Code cleanup
- remove (comment out for now) feature tests from lower level loaders
- Ensure tests still work
getIdentifiedPrecursors
getIdentifiedProteins
getIdentifiedPeptides

getNumIndenticiations (precursors, peptides, proteins)

getCV

add appropriate tests for these new methods
add method which gets a df of # precursors/peptides/proteins

update tests accordingly
extend ResultsLoader class to have methods to query the entire result
file and plotting methods.
also bug fixes when add tests
fix tests update snapshots resulting from merging
Add back functionality of scoring distribution plotting
Also update bugs associated with these functions
Refactor SearchResultAnalysisPlotter for plotting search results
this does not work with all versions of pandas
fix new tests snapshots everything should pass now.

Results a bit strange because it is a test file
Also fix bug fixes associated with these changes
@jcharkow jcharkow force-pushed the refactor/resultsPlotting branch 3 times, most recently from a91c35f to d278d4f Compare October 9, 2024 21:52
fix typing references for python 3.9
output consensusApexIM column in dataframe if present
minor bug fix reverse order of @Property and @AbstractMethod
bug fixes from last commit
update test snapshots to have IM dataframe
@jcharkow jcharkow marked this pull request as ready for review October 10, 2024 15:29
update quickstart to show working with multiple chromatograms in a
single loader.

fix bugs that found when editing the notebook
@jcharkow jcharkow requested a review from singjc October 10, 2024 18:18
Copy link
Collaborator

@singjc singjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates and changes. Mostly looks good, I just made a few comments/suggestions. Should be good to merge after some of them are addressed/fixed.

Comment on lines +26 to +27
def __init__(self, **kwargs):
super().__init__(**kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the doc-string need to be updated? Why are the args removed and replaced with kwargs? Does the ABSMeta take care of validating correct input kwargs? I'm not a huge fan of have only **kwargs as the only input to a function. Makes it a little complicated to track what variables are used and where they're initialized.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so GenericChromatogramLoader inherits from GenericRawDataLoader which inherits from ResultsLoader, which is where the args are explicitly set. Is the reason for setting **kwargs in the GenericLoaders to allow for more flexible args for the actual implementations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to not have to copy the arguments all the way up the chain (e.g. if ResultsLoader is edited then we do not have to edit GenericChromatogramLoader)
It definitely is harder to read though so I'm not sure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm, that makes sense. I think it's good as is, it will make maintaining easier. I would add a doc-string link to where the args are laid out though, so it's easier to track references/inheritance.

massdash/loaders/GenericChromatogramLoader.py Show resolved Hide resolved
massdash/loaders/GenericRawDataLoader.py Outdated Show resolved Hide resolved
massdash/loaders/GenericRawDataLoader.py Show resolved Hide resolved
massdash/loaders/GenericRawDataLoader.py Outdated Show resolved Hide resolved
INNER JOIN SCORE_PROTEIN ON SCORE_PROTEIN.PROTEIN_ID = PEPTIDE_PROTEIN_MAPPING.PROTEIN_ID
WHERE FEATURE.RUN_ID = {run_id} AND SCORE_MS2.QVALUE <= {qvalue} AND PRECURSOR.DECOY = 0 AND SCORE_PEPTIDE.QVALUE <= {qvalue} AND SCORE_PROTEIN.QVALUE <= {qvalue} and SCORE_MS2.RANK == 1"""
rslt = self.conn.execute(stmt)
return set([i[0] for i in rslt.fetchall()])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this faster than doing DISTINCT within the sql query?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was done this way because we want the end result to be a set

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distinct is likely faster though not sure if I should add distinct and then still just convert to set

massdash/loaders/access/OSWDataAccess.py Outdated Show resolved Hide resolved
massdash/structs/TransitionGroup.py Show resolved Hide resolved
@jcharkow
Copy link
Collaborator Author

@singjc I think this is ready for merging now?

@singjc
Copy link
Collaborator

singjc commented Oct 15, 2024

@singjc I think this is ready for merging now?

Great, thanks for the changes. Will merge now.

@singjc singjc merged commit 4ee27ce into dev Oct 15, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants