Dev #52

Merged
merged 76 commits into from
Dec 13, 2023

Conversation

@enryH enryH commented Oct 11, 2023

No description provided.

Henry Webel added 30 commits September 13, 2023 19:11
create a new intermediate dump of 7,444 HeLa runs
- KNN dumps val and test data with specified "args.model_key"
  in "config.yaml"
- update color palette for "unknown" models
- make performance_plots.py more robust
- training configs are created and saved on the fly
  (-> avoid separate model configs, collect all in one)

R methods are fixed, with no customization so far. To allow customization one
would probably need to generate separate notebooks for each method.

Missing values are simulated based on Lazar et al. (2016), see the sketch below:
- values below a quantile threshold are MNAR candidates; select from these
- the quantile is defined by the overall fraction of missing values
- mix MCAR and MNAR
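A minimal, hypothetical sketch of this masking scheme (the function name, the 25% MNAR default, and the exact selection rules are illustrative assumptions, not the project's actual code):

```python
import numpy as np
import pandas as pd


def simulate_missing(X: pd.DataFrame, frac_mnar: float = 0.25, seed: int = 42):
    """Mask observed intensities, mixing MNAR and MCAR (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    values = X.to_numpy()
    observed = ~np.isnan(values)

    # the quantile threshold is defined by the overall fraction of missing values
    frac_missing = 1.0 - observed.mean()
    threshold = np.nanquantile(values, frac_missing)

    n_to_mask = int(frac_missing * observed.sum())
    n_mnar = int(frac_mnar * n_to_mask)

    # MNAR: draw only from observed values below the intensity threshold
    low = np.argwhere(observed & (values < threshold))
    mnar = low[rng.choice(len(low), size=min(n_mnar, len(low)), replace=False)]

    # MCAR: draw the rest completely at random from all observed values
    # (a full implementation would exclude cells already masked as MNAR)
    all_obs = np.argwhere(observed)
    mcar = all_obs[rng.choice(len(all_obs), size=n_to_mask - len(mnar), replace=False)]

    mask = np.zeros(values.shape, dtype=bool)
    mask[tuple(np.concatenate([mnar, mcar]).T)] = True
    return X.mask(mask), mask
```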

- format and clean-up code in script
- refactoring error -> select correct data
- only test CF, DAE and VAE functionally
- select configs in example folder...
- both scripts (notebooks)
- and library code
- msImpute
- trKNN (from source)

Add to workflow check.
- start grouping output for an easier overview (than only alphabetical)
- update deprecated functionality in pandas

-> some scripts might have further deprecation warnings
- on-the-fly igraph installation with conda otherwise fails on Windows:
  https://stackoverflow.com/a/71711600/9684872
- reversed decoy sequence matches should be removed (there are only a few)
- grouping of plots was not reflected in Snakemake workflow
- aim: specify a long runtime for R jobs with a high maximum
- run long-running jobs in parallel on one big node
- log file paths for submitted jobs added (should be unique)
- -V: forward the current environment to the submitted job
- precursors from reversed protein sequences are removed from the evidence
  table
- adapt code to use local information (yaml files)
- Colab uses pandas 2 and pytorch 2
- the datetime_is_numeric parameter was removed from
  describe, see
  https://pandas.pydata.org/docs/whatsnew/v2.0.0.html
- append is deprecated (see the sketch below).
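A short sketch of the pandas 2.x adjustments mentioned above (the example data is illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
                   "x": [1.0, 2.0]})

# pandas < 2.0: df.describe(datetime_is_numeric=True)
# pandas >= 2.0: datetime columns are summarized numerically by default,
# so the parameter is simply dropped.
summary = df.describe()

# DataFrame.append is deprecated; use pd.concat instead.
row = pd.DataFrame({"date": [pd.Timestamp("2023-03-01")], "x": [3.0]})
df = pd.concat([df, row], ignore_index=True)
```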
In case a tool, e.g. the torque scheduler, creates log files, these
can be requested per task (job):
in the run_snakemake_cluster.sh bash script this is done
using the -e and -o options.
- submit required parameters using the -v option, e.g.
  qsub run_snakemake_cluster.sh \
   -N snakemake_exp0 \
   -v configfile=path_to/config.yaml,prefix=exp0
- also rename protein groups and precursors (evidence) dumps
- drop entries from reversed sequences in evidence files
- increase robustness of notebook, ignoring methods that predict only NAs
  (here: IMPSEQ)
- To consider:
  should 01_1_train_NAGuideR.ipynb throw an error if all predictions are NAs?
- function for loading and filtering data
- add IDs making it possible to map precursors (Evidence IDs),
  Peptide IDs and Protein Groups IDs to each other
  (see the sketch after this block).
  Within a file the id column is always "id" (e.g. the proteinGroups.txt id column
  corresponds to the Protein Groups IDs column in the other two files)
- tbc: see what works

Next: merge with the version where parameters for Python-based models
can be set in config.yaml
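A sketch of the ID mapping described above, assuming MaxQuant-style tables; the exact file and column names are assumptions, not necessarily those used in the repository:

```python
import pandas as pd

# Within each table the own identifier column is simply "id"; the other tables
# reference it under a descriptive column name.
protein_groups = pd.read_table("proteinGroups.txt", usecols=["id", "Protein IDs"])
evidence = pd.read_table("evidence.txt",
                         usecols=["id", "Peptide ID", "Protein group IDs"])

# one row per precursor (evidence id) and referenced protein group id
evidence_to_pg = (
    evidence.rename(columns={"id": "Evidence ID"})
            .assign(**{"Protein group IDs":
                       lambda df: df["Protein group IDs"].astype(str).str.split(";")})
            .explode("Protein group IDs")
)
```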
Filter reversed sequences -> parts for collecting data will be factored out
🚧 prepare cluster execution

- default: CPU execution, not accelerated (e.g. GPU)
- job script for torque cluster
- logs with notebook outputs
⬆️ remove constraints on pandas and pytorch

-> faster setup on Google Colab
- fewer constraints on versions
Henry Webel added 26 commits November 11, 2023 15:06
- create individual logs for notebook execution (see the sketch below)
  -> separate files on local execution
  -> documents how long the training step took
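One possible way to get a separate log per executed notebook with papermill; the notebook name, folder layout, and the fn_config parameter are assumptions, not the workflow's actual setup:

```python
import logging
import os

import papermill as pm

nb = "01_0_split_data.ipynb"          # hypothetical notebook name
os.makedirs("logs", exist_ok=True)
os.makedirs("runs", exist_ok=True)
log_file = f"logs/{nb.replace('.ipynb', '.log')}"

# timestamps in the per-notebook log document how long execution took
logging.basicConfig(filename=log_file, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

pm.execute_notebook(nb, f"runs/{nb}", parameters={"fn_config": "config.yaml"})
```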
- the config dict has to be copied, otherwise the value
  None is not dumped as null (see the sketch below):
  Before:
  - column_names: "None"
  Now:
  - column_names: null
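A minimal sketch of the copy-before-dump fix; the in-place stringification step stands in for whatever mutates the config downstream and is an assumption about the cause:

```python
import copy

import yaml

config = {"column_names": None, "epochs": 10}

to_dump = copy.deepcopy(config)       # dump an untouched copy

# downstream step (e.g. preparing notebook parameters) stringifies values in place
config["column_names"] = str(config["column_names"])

print(yaml.dump(to_dump))             # column_names: null
print(yaml.dump(config))              # column_names ends up as the string "None", not null
```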
- with 50 samples, one or two features have fewer than 4 intensities in
  the training data split
  -> move the validation data for these features to the training split
     (see the sketch below)
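A sketch of this fix, assuming long-format train/validation splits with sample, feature, and intensity columns; the variable names and threshold are assumptions:

```python
import pandas as pd

MIN_OBS = 4   # minimum number of observed intensities per feature in training

# placeholders for the real long-format splits (one row per sample/feature intensity)
train_long = pd.DataFrame(columns=["sample", "feature", "intensity"])
val_long = pd.DataFrame(columns=["sample", "feature", "intensity"])

counts = train_long.groupby("feature")["intensity"].count()
too_few = counts[counts < MIN_OBS].index

# move validation intensities of under-observed features back to training
to_move = val_long[val_long["feature"].isin(too_few)]
train_long = pd.concat([train_long, to_move])
val_long = val_long.drop(to_move.index)
```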
- new dataset balancing GSIMP runtime against
  SEQKNN's need for a minimum number of features
- run each method one by one (avoids race conditions when installing, only
  a problem on first-time setup)
- is GSIMP fast enough (227-> ~1h)?
- probably test GSIMP here once, then remove from "fast testing" workflow
remove warnings thrown by papermill
- update defaults to results from small grid search (smallest of top 3)
also document the qsub command and update the submission script
(add more models)
- needs to be completed and cleaned up
- rather "bigger" batches with more training steps
- update Fig. 2 plot generation to 25% MNAR
Methods:

- added GSimp.
- reduced the dimensionality of the example data in the GitHub Action so 
  GSimp finishes (~1h) -> does not scale
- MNAR algorithm of MSIMPUTE added

Data:

- ensure that training data has at least 4 samples (MSIMPUTE includes that check)
- Formatted and updated workflow configs and declarations (v1&v2). Added script for command creation
- Figure 2: add custom selection of models to aggregate best 5 models
  of several datasets (custom plotting for paper)
- rotate performance label
- add NA if model did not run (here: error or not finished within 24h)
- for the large peptide and evidence datasets, the top five are already
  the correct set
- for subselected models the colors were not reselected
- based on seaborn example of _ColorPalette
- tables for Supp. Data
- update plots (fontsize, support)
- use a share of 25% MNAR in removed data
- use a share of 25% MNAR in comparison
- update figures for publication (names, labels, fontsize, etc.)
- 🐛 remove metadata fpath from train_X.yaml
- also run the KNN comparison with workflow v2 with a share of 25% MNAR
@enryH enryH marked this pull request as ready for review December 12, 2023 15:17
@enryH enryH merged commit 0b761d2 into main Dec 13, 2023
13 checks passed