Dev #52

Merged
merged 76 commits into from
Dec 13, 2023

Conversation

@enryH enryH commented Oct 11, 2023

No description provided.

Henry Webel added 30 commits September 13, 2023 19:11
create a new intermediate dump of 7,444 HeLa runs
- KNN dumps val and test data with specified "args.model_key"
  in "config.yaml"
- update color palette for "unknown" models
- make performance_plots.py more robust
- training configs are created and saved on the fly
  (-> avoid separate model configs, collect all in one)

R methods are fixed, with no customization so far. To allow customization one
would probably need to generate separate notebooks for each method.

Missing values are simulated based on Lazar et al. (2016), see the sketch below:
- values below a quantile threshold are MNAR candidates; select from these
- the quantile is defined by the overall fraction of missing values
- mix MCAR and MNAR
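A minimal, hypothetical sketch of this masking scheme (the function name, the 25% MNAR default, and the exact selection rules are illustrative assumptions, not the project's actual code):

```python
import numpy as np
import pandas as pd


def simulate_missing(X: pd.DataFrame, frac_mnar: float = 0.25, seed: int = 42):
    """Mask observed intensities, mixing MNAR and MCAR (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    values = X.to_numpy()
    observed = ~np.isnan(values)

    # the quantile threshold is defined by the overall fraction of missing values
    frac_missing = 1.0 - observed.mean()
    threshold = np.nanquantile(values, frac_missing)

    n_to_mask = int(frac_missing * observed.sum())
    n_mnar = int(frac_mnar * n_to_mask)

    # MNAR: draw only from observed values below the intensity threshold
    low = np.argwhere(observed & (values < threshold))
    mnar = low[rng.choice(len(low), size=min(n_mnar, len(low)), replace=False)]

    # MCAR: draw the rest completely at random from all observed values
    # (a full implementation would exclude cells already masked as MNAR)
    all_obs = np.argwhere(observed)
    mcar = all_obs[rng.choice(len(all_obs), size=n_to_mask - len(mnar), replace=False)]

    mask = np.zeros(values.shape, dtype=bool)
    mask[tuple(np.concatenate([mnar, mcar]).T)] = True
    return X.mask(mask), mask
```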

- format and clean-up code in script
- refactoring error -> select correct data
- only test CF, DAE and VAE functionally
- select configs in example folder...
- both scripts (notebooks)
- and library code
- msImpute
- trKNN (from source)

Add to workflow check.
- start grouping output for an easier overview (than only alphabetical)
- update deprecated functionality in pandas

-> some scripts might have further deprecation warnings
- on-the-fly igraph installation with conda otherwise fails on Windows:
  https://stackoverflow.com/a/71711600/9684872
- reversed decoy sequence matches should be removed (there are only a few)
- grouping of plots was not reflected in Snakemake workflow
- aim: specify a long runtime for R jobs with a high maximum
- run long-running jobs in parallel on one big node
- log file paths for submitted jobs added (should be unique)
- -V: forward the current environment to the submitted job
- precursors from reversed protein sequences are removed from the evidence
  table
- adapt code to use local information (yaml files)
- Colab uses pandas 2 and pytorch 2
- the datetime_is_numeric parameter was removed from
  describe, see
  https://pandas.pydata.org/docs/whatsnew/v2.0.0.html
- append is deprecated (see the sketch below).
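A short sketch of the pandas 2.x adjustments mentioned above (the example data is illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
                   "x": [1.0, 2.0]})

# pandas < 2.0: df.describe(datetime_is_numeric=True)
# pandas >= 2.0: datetime columns are summarized numerically by default,
# so the parameter is simply dropped.
summary = df.describe()

# DataFrame.append is deprecated; use pd.concat instead.
row = pd.DataFrame({"date": [pd.Timestamp("2023-03-01")], "x": [3.0]})
df = pd.concat([df, row], ignore_index=True)
```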
In case a tool, e.g. the torque scheduler, creates log files, these
can be requested per task (job):
in the run_snakemake_cluster.sh bash script this is done
using the -e and -o options.
- submit required parameters using the -v option, e.g.
  qsub run_snakemake_cluster.sh \
   -N snakemake_exp0 \
   -v configfile=path_to/config.yaml,prefix=exp0
- also rename protein groups and precursors (evidence) dumps
- drop entries from reversed sequences in evidence files
- increase robustness of notebook, ignoring methods that predict only NAs
  (here: IMPSEQ)
- To consider:
  should 01_1_train_NAGuideR.ipynb throw an error if all predictions are NAs?
- function for loading and filtering data
- add IDs making it possible to map precursors (Evidence IDs),
  Peptide IDs and Protein Groups IDs to each other
  (see the sketch after this block).
  Within a file the id column is always "id" (e.g. the proteinGroups.txt id column
  corresponds to the Protein Groups IDs column in the other two files)
- tbc: see what works

Next: merge with the version where parameters for Python-based models
can be set in config.yaml
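A sketch of the ID mapping described above, assuming MaxQuant-style tables; the exact file and column names are assumptions, not necessarily those used in the repository:

```python
import pandas as pd

# Within each table the own identifier column is simply "id"; the other tables
# reference it under a descriptive column name.
protein_groups = pd.read_table("proteinGroups.txt", usecols=["id", "Protein IDs"])
evidence = pd.read_table("evidence.txt",
                         usecols=["id", "Peptide ID", "Protein group IDs"])

# one row per precursor (evidence id) and referenced protein group id
evidence_to_pg = (
    evidence.rename(columns={"id": "Evidence ID"})
            .assign(**{"Protein group IDs":
                       lambda df: df["Protein group IDs"].astype(str).str.split(";")})
            .explode("Protein group IDs")
)
```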
Filter reversed sequences -> parts for collecting data will be factored out
🚧 prepare cluster execution

- default: CPU execution, not accelerated (e.g. GPU)
- job script for torque cluster
- logs with notebook outputs
⬆️ remove constraints on pandas and pytorch

-> faster setup on Google Colab
- fewer constraints on versions
Henry Webel added 26 commits November 11, 2023 15:06
- create individual logs for notebook execution (see the sketch below)
  -> separate files on local execution
  -> documents how long the training step took
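One possible way to get a separate log per executed notebook with papermill; the notebook name, folder layout, and the fn_config parameter are assumptions, not the workflow's actual setup:

```python
import logging
import os

import papermill as pm

nb = "01_0_split_data.ipynb"          # hypothetical notebook name
os.makedirs("logs", exist_ok=True)
os.makedirs("runs", exist_ok=True)
log_file = f"logs/{nb.replace('.ipynb', '.log')}"

# timestamps in the per-notebook log document how long execution took
logging.basicConfig(filename=log_file, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

pm.execute_notebook(nb, f"runs/{nb}", parameters={"fn_config": "config.yaml"})
```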
- the config dict has to be copied, otherwise the value
  None is not dumped as null (see the sketch below):
  Before:
  - column_names: "None"
  Now:
  - column_names: null
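A minimal sketch of the copy-before-dump fix; the in-place stringification step stands in for whatever mutates the config downstream and is an assumption about the cause:

```python
import copy

import yaml

config = {"column_names": None, "epochs": 10}

to_dump = copy.deepcopy(config)       # dump an untouched copy

# downstream step (e.g. preparing notebook parameters) stringifies values in place
config["column_names"] = str(config["column_names"])

print(yaml.dump(to_dump))             # column_names: null
print(yaml.dump(config))              # column_names ends up as the string "None", not null
```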
- with 50 samples, one or two features have fewer than 4 intensities in
  the training data split
  -> move the validation data for these features to the training split
     (see the sketch below)
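A sketch of this fix, assuming long-format train/validation splits with sample, feature, and intensity columns; the variable names and threshold are assumptions:

```python
import pandas as pd

MIN_OBS = 4   # minimum number of observed intensities per feature in training

# placeholders for the real long-format splits (one row per sample/feature intensity)
train_long = pd.DataFrame(columns=["sample", "feature", "intensity"])
val_long = pd.DataFrame(columns=["sample", "feature", "intensity"])

counts = train_long.groupby("feature")["intensity"].count()
too_few = counts[counts < MIN_OBS].index

# move validation intensities of under-observed features back to training
to_move = val_long[val_long["feature"].isin(too_few)]
train_long = pd.concat([train_long, to_move])
val_long = val_long.drop(to_move.index)
```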
- new dataset balancing GSIMP runtime against
  SEQKNN's need for a minimum number of features
- run each method one by one (avoids race conditions when installing, only
  a problem on first-time setup)
- is GSIMP fast enough (227-> ~1h)?
- probably test GSIMP here once, then remove from "fast testing" workflow
remove warnings thrown by papermill
- update defaults to results from small grid search (smallest of top 3)
also document the qsub command and update the submission script
(add more models)
- needs to be completed and cleaned up
- rather "bigger" batches with more training steps
- update Fig. 2 plot generation to 25% MNAR
Methods:

- added GSimp.
- reduced the dimensionality of the example data in the GitHub Action so 
  GSimp finishes (~1h) -> does not scale
- MNAR algorithm of MSIMPUTE added

Data:

- ensure that training data has at least 4 samples (MSIMPUTE includes that check)
- Formatted and updated workflow configs and declarations (v1&v2). Added script for command creation
- Figure 2: add custom selection of models to aggregate best 5 models
  of several datasets (custom plotting for paper)
- rotate performance label
- add NA if model did not run (here: error or not finished within 24h)
- for the large peptide and evidence datasets, the top five are already
  the correct set
- for subselected models the colors were not reselected
- based on seaborn example of _ColorPalette
- tables for Supp. Data
- update plots (fontsize, support)
- use a share of 25% MNAR in removed data
- use a share of 25% MNAR in comparison
- update figures for publication (names, labels, fontsize, etc.)
- 🐛 remove metadata fpath from train_X.yaml
- also run the KNN comparison with workflow v2 with a share of 25% MNAR
@enryH enryH marked this pull request as ready for review December 12, 2023 15:17
@enryH enryH merged commit 0b761d2 into main Dec 13, 2023
13 checks passed