forked from primasanjaya/muat-github
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstep.txt
77 lines (47 loc) · 4.05 KB
/
step.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
Preprocessing:
A) Prepare the environment:
1) create conda env python 2.7, conda create -n <env_name> python=2.7
2) conda install -c anaconda pandas
3) conda install -c anaconda numpy
3) conda install -c bioconda bedops
4) conda install -c bioconda swalign
B) Download dataset:
5) run scripts/download_icgc.sh to download the data
6) create temporary dir (assign all temporary preprocessing directory in this dir)
6a) How to download genome tracks?
6b) How to download genome reference?
7) Install tabix > apt-get install tabix
8) copy scripts/dmmpre.sh, scripts/preprocessdmm.sh, annotate_preprocessed.sh to <data_dir>/release_28/
9) edit dmmpre.sh assigning to dependencies [check the dmmpre.sh for information]
10) run preprocessdmm.sh > if successful, this will give an output of 'somatic_mutations.<icgc_project>.dmm.k256.tsv.gz' in every icgc project folder
11) annotate_preprocessed.sh > ex: ./annotate_preprocessed.sh ./BLCA-US/somatic_mutations.BLCA-US.dmm.k256.tsv.gz BLCA-US
if successful, this file will give an output of somatic_mutations.<icgc_project>.dmm.k256.annotated.tsv.gz in every icgc project folder
C) Prepare environment for MuAt
1) conda create -n <env_name> python=3.7
2) conda install pytorch 1.8
3) conda install -c conda-forge tqdm
4) conda install -c anaconda scikit-learn
5) conda install -c conda-forge tensorboardx
6) conda install -c anaconda seaborn
D) Create dataset for MuAt
1) run preprocessing/create_PCAWG_dataset.py (adjust the dependencies accordingly)
DEMO
I) Training from the scratch
python3 main.py --dataloader 'pcawg' --block-size 5000 --n-class 24 --n-layer 1 --n-head 1 --n-emb 256 --motif --mut-type 'SNV+MNV' --fold 1 --input-data-dir 'path/to/input/data/dir/' --save-ckpt-dir 'path/to/saveckptdir/' --train
II) Predicting from I
python3 main.py --dataloader 'pcawg' --block-size 5000 --context-length 3 --n-class 24 --n-layer 1 --n-head 1 --n-emb 256 --motif --mut-type 'SNV+MNV' --fold 1 --input-data-dir 'path/to/input/data/dir/' --load-ckpt-dir 'path/to/saveckptdir/' --predict
III) Predicting from pretrained model
python3 main.py --dataloader 'pcawg' --block-size 5000 --context-length 3 --n-class 24 --n-layer 2 --n-head 1 --n-emb 512 --motif-pos --mut-type 'SNV+MNV+indel' --fold 1 --input-data-dir 'path/to/input/data/dir/' --load-ckpt-dir './bestckpt/fullpcawgfold1_11100_wpos_TripletPosition_bs5000_nl2_nh1_ne512_cl3/' --predict
IV) Predicting vcf files from pretrained model
2)
#environment
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install -c conda-forge tqdm
conda install -c anaconda scikit-learn
conda install -c conda-forge tensorboardx
conda install -c anaconda seaborn
can run this
python main.py --dataset 'finalpcawg' --arch 'TripletPosition' --block-size 5000 --context-length 3 --n-class 24 --tag 'fullpcawgfold1_11100_wpos' --n-layer 2 --n-head 1 --n-emb 512 --addpostoken --mutratio 0.4-0.3-0.3-0-0 --fold 1 --data-dir '/csc/epitkane/data/TCIA/vcf/00b9d0e6-69dc-4345-bffd-ce32880c8eef.consensus.20160830.somatic.snv_mnv.vcf' --single-pred-vcf
python main.py --dataset 'finalpcawg' --arch 'TripletPosition' --block-size 5000 --context-length 3 --n-class 24 --tag 'fullpcawgfold1_11100_wpos' --n-layer 2 --n-head 1 --n-emb 512 --addpostoken --mutratio 0.4-0.3-0.3-0-0 --fold 1 --data-dir '/csc/epitkane/data/TCIA/vcf/00b9d0e6-69dc-4345-bffd-ce32880c8eef.consensus.20160830.somatic.snv_mnv.vcf' --single-pred-vcf
#running all prediction folder (after preprocessing) - with '_new'
python main.py --dataset 'finalpcawg' --arch 'TripletPosition' --block-size 5000 --context-length 3 --n-class 24 --tag 'fullpcawgfold1_11100_wpos' --n-layer 2 --n-head 1 --n-emb 512 --addpostoken --mutratio 0.4-0.3-0.3-0-0 --fold 1 --input-data-dir '/mnt/e/icgc' --output-prefix 'fullpcawgfold1_11100_wpos_TripletPosition_bs5000_nl2_nh1_ne512_cl3' --output-dir '/mnt/g/experiment/visualizations_old/finalnew/icgc_res' --load-ckpt-dir '/mnt/g/experiment/muat/bestckpt/fullpcawgfold1_11100_wpos_TripletPosition_bs5000_nl2_nh1_ne512_cl3/' --all-pred