forked from opain/GenoPred
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Pipeline_prep.Rmd
2705 lines (2035 loc) · 116 KB
/
Pipeline_prep.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Preparing reference files for genotype-based scoring"
output:
html_document:
toc: true
theme: united
toc_depth: 3
number_sections: true
toc_float: true
fig_caption: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error=TRUE)
```
<style>
p.caption {
font-size: 1.5em;
}
</style>
```{css, echo=F}
pre code, pre, code {
white-space: pre !important;
overflow-x: auto !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```
***
Genotype-based scores are typically based on several target sample specific features of the data including:
* Linkage disequilibrium (LD) structure
* Available genetic variation
* Minor allele frequecy (MAF)
Basing these features on the target sample lead to differences in genotype-based scores across samples separately processed, meaning they cannot be directly compared. This is not a limitation when performing inference within a single dataset, however in framework of prediction it is important that the predictors are derived in a way that can be replicated in any future dataset to optimise external validity of the prediction model.
To ensure consistency in genotye-based scoring we can base each of these features on a reference dataset, such as the 1000 Genomes reference, an approach herein refered to as 'reference-standardised'. Another advantage of reference-standardised scoring is that a single sample can be realiably processed, enabling reliable prediction for n individual.
This page provides instructions for preparing the reference data required for reference-standardised genotype-based scoring. Here I provide code used when preparing on the KCL Rosalind cluster, with the intention of deriving scores in the UK Biobank sample and Twins Early Development Study (TEDS) sample.
<br/>
***
# Pre-requisites
The following software is required for the prediction pipeline:
* PLINK v1.9 (https://www.cog-genomics.org/plink2/)
* PLINK v2 (https://www.cog-genomics.org/plink/2.0/)
* SHAPEIT (https://mathgen.stats.ox.ac.uk/genetics_softwae*hapeit/shapeit.html#download)
* IMPUTE2 (https://mathgen.stats.ox.ac.uk/impute/impute_v.*ml#download)
* QCTOOL v2 (https://www.well.ox.ac.uk/~gav/qctool/documet*ion/download.html)
* R (https://www.r-project.org/)
* R packages:
```{R, echo=T, eval=F}
install.packages(c('data.table','doMC','optparse','foreach','caret','ggplot2','cowplot'))
```
NOTE: Setting environmental variables
Throughout this script I use environmental variables to indicate where files are to be stored. These variables need to be set on the command line and in R. I set the environmental variables using a config file called 'Pipeline_prep.config'. You will need to create your own version of this file (See example [here](https://github.com/opain/GenoPred/tree/master/Example.config)).
<br/>
***
# Genotypic data
In this section we will download the required reference genotype data, and format it for use.
<details><summary>Set required variables for command line</summary>
```{bash, eval=F, echo=T}
# Set variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
```
</details>
<br/>
***
## IMPUTE2 1000 Genomes Phase 3
This dataset is the 1000 Genomes Phase 3 reference data in the format requird for imputation via the IMPUTE2 software. This is only required if you want to calculate genetic scores within target samples that have not undergone genotype imputation. Imputation is important as it will ensure a sufficient overlap with the HapMap3 SNP-list which all genotype-based scores are restricted to in subsequent stages of the genotype-based prediction pipeline.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory for the data
mkdir -p ${Impute2_1KG_dir}
# Download data using wget
cd ${Impute2_1KG_dir}
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz
# Decompress data
tar -zxvf 1000GP_Phase3.tgz
tar -zxvf 1000GP_Phase3_chrX.tgz
```
</details>
<br/>
***
## HapMap3 SNP-list
Here we download a list of SNPs within the HapMap3 reference created for use in LD-score regression. We use the HapMap3 SNP-list as this is a collection of genetic variants as they are well captured by standard genotype imputation reference panels (incl. 1000 Genomes), meaning that these SNPs are typically available and accurately imputed in most GWAS and target samples. Using this common set of SNPs means that scores for individuals from different sources will be directly comparable, assuming a high proportion of these variants are available in the target sample. This relatively sparse subset of the genome provides decent coverage of genome-wide common genetic variation, however scores may explain less variation than those based on dense coverge of the genome or sequence data. Having said this, in my analyses HapMap3 restricted genotype-based scores perform similarly to those based on dense imputation.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Dowload snplist using wget and decompress
cd ${HapMap3_snplist_dir}
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
bunzip2 w_hm3.snplist.bz2
```
</details>
<br/>
***
## 1000 Genomes populations
Here we download information on which population each individual in the 1000 Genomes reference is from. Individuals are grouped into 'populations' which are typically country specific, and 'super populations' which include a collection of ancetrally similar countries. Individuals are grouped into these populations if the last few generations of their family are all from one region. We need this population data so we can select individuals in the 1000 Genomes data to match those in our target samples. This is important for providing accurate information on LD structure and minor allele frequencies.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Download the populations file
cd ${Geno_1KG_dir}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
# Create a version of population file that only contains sample ID, pop and super_pop
cut -f 1-3 ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel > ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small
# Create a keep file listing each population super population from the reference.
mkdir -p ${Geno_1KG_dir}/keep_files
module add general/R/3.5.0
R
# Read in environmental variables
source('/users/k1806347/brc_scratch/Software/MyGit/GenoPred/Pipeline_prep.config')
pop_data<-read.table(paste0(Geno_1KG_dir, '/integrated_call_samples_v3.20130502.ALL.panel'), header=T, stringsAsFactors=F)
for(i in unique(pop_data$pop)){
write.table(cbind(pop_data$sample[pop_data$pop == i],pop_data$sample[pop_data$pop == i]), paste0(Geno_1KG_dir,'/keep_files/',i,'_samples.keep'), col.names=F, row.names=F, quote=F)
}
for(i in unique(pop_data$super_pop)){
write.table(cbind(pop_data$sample[pop_data$super_pop == i],pop_data$sample[pop_data$super_pop == i]), paste0(Geno_1KG_dir,'/keep_files/',i,'_samples.keep'), col.names=F, row.names=F, quote=F)
}
# Create a file listing all ancestry groups
write.table(unique(pop_data$super_pop), paste0(Geno_1KG_dir,'/super_pop.list'), col.names=F, row.names=F, quote=F)
write.table(unique(pop_data$pop), paste0(Geno_1KG_dir,'/pop.list'), col.names=F, row.names=F, quote=F)
# Create a file listing the code of each population and the location of the keep file
pop_keep_loc<-data.frame(pop=unique(pop_data$pop))
pop_keep_loc$keep<-paste0(Geno_1KG_dir,'/keep_files/',pop_keep_loc$pop,'_samples.keep')
super_pop_keep_loc<-data.frame(pop=unique(pop_data$super_pop))
super_pop_keep_loc$keep<-paste0(Geno_1KG_dir, '/keep_files/',super_pop_keep_loc$pop,'_samples.keep')
write.table(super_pop_keep_loc, paste0(Geno_1KG_dir,'/super_pop_keep.list'), col.names=F, row.names=F, quote=F)
write.table(pop_keep_loc, paste0(Geno_1KG_dir,'/pop_keep.list'), col.names=F, row.names=F, quote=F)
write.table(rbind(super_pop_keep_loc,pop_keep_loc), paste0(Geno_1KG_dir,'/super_pop_and_pop_keep.list'), col.names=F, row.names=F, quote=F)
q()
n
```
</details>
<br/>
***
## 1000 Genomes PLINK files
Here we download the 1000 Genomes Phase 3 data in vcf format, convert into PLINK binary format, and extract variants in the HapMap3 SNP-list. This data is used to estimate LD-structure and minor allele frequencies. We also estimate genotype-based scores within the reference data to determine the mean and standard deviation of the genotype-based scores within specific ancestral groups. This can then be used to standardise genotype-based scores in the target sample which enables improved interpretation of the scores and ensure scores from diverse sources are on the same scale when performing prediction modelling.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
cd ${Geno_1KG_dir}
# Download the vcf files, convert to plink, extract hapmap3 snps, and delete vcfs
for chr in $(seq 1 22); do
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
for chr in $(seq 1 22); do
qsub ${plink1_9} \
--vcf ${Geno_1KG_dir}/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--make-bed \
--out ${Geno_1KG_dir}/1KGPhase3.chr${chr}
done
# Delete the vcf files after conversion to plink format is complete
rm ${Geno_1KG_dir}/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
# Use R to identify a list of SNPs that matches with the hapmap3 snplist
qrsh
module add general/R/3.5.0
R
# Read in environmental variables
source('/users/k1806347/brc_scratch/Software/MyGit/GenoPred/Pipeline_prep.config')
library(data.table)
hapmap3_snps<-fread(paste0(HapMap3_snplist_dir,'/w_hm3.snplist'))
# Generate IUPAC codes
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'A' & hapmap3_snps$A2 =='T' | hapmap3_snps$A1 == 'T' & hapmap3_snps$A2 =='A']<-'W'
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'C' & hapmap3_snps$A2 =='G' | hapmap3_snps$A1 == 'G' & hapmap3_snps$A2 =='C']<-'S'
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'A' & hapmap3_snps$A2 =='G' | hapmap3_snps$A1 == 'G' & hapmap3_snps$A2 =='A']<-'R'
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'C' & hapmap3_snps$A2 =='T' | hapmap3_snps$A1 == 'T' & hapmap3_snps$A2 =='C']<-'Y'
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'G' & hapmap3_snps$A2 =='T' | hapmap3_snps$A1 == 'T' & hapmap3_snps$A2 =='G']<-'K'
hapmap3_snps$IUPAC[hapmap3_snps$A1 == 'A' & hapmap3_snps$A2 =='C' | hapmap3_snps$A1 == 'C' & hapmap3_snps$A2 =='A']<-'M'
hapmap3_snps$SNP_IUPAC<-paste(hapmap3_snps$SNP,hapmap3_snps$IUPAC,sep=':')
# For each chr
for(chr in 1:22){
bim<-fread(paste0(Geno_1KG_dir,'/1KGPhase3.chr',chr,'.bim'))
bim$IUPAC[bim$V5 == 'A' & bim$V6 =='T' | bim$V5 == 'T' & bim$V6 =='A']<-'W'
bim$IUPAC[bim$V5 == 'C' & bim$V6 =='G' | bim$V5 == 'G' & bim$V6 =='C']<-'S'
bim$IUPAC[bim$V5 == 'A' & bim$V6 =='G' | bim$V5 == 'G' & bim$V6 =='A']<-'R'
bim$IUPAC[bim$V5 == 'C' & bim$V6 =='T' | bim$V5 == 'T' & bim$V6 =='C']<-'Y'
bim$IUPAC[bim$V5 == 'G' & bim$V6 =='T' | bim$V5 == 'T' & bim$V6 =='G']<-'K'
bim$IUPAC[bim$V5 == 'A' & bim$V6 =='C' | bim$V5 == 'C' & bim$V6 =='A']<-'M'
bim$SNP_IUPAC<-paste(bim$V2,bim$IUPAC,sep=':')
bim_hapmap3_snps<-merge(hapmap3_snps,bim,by='SNP_IUPAC')
sum(duplicated(bim_hapmap3_snps$V2))
bim$V2[!(bim$SNP_IUPAC %in% bim_hapmap3_snps$SNP_IUPAC)] <- paste0(bim$V2[!(bim$SNP_IUPAC %in% bim_hapmap3_snps$SNP_IUPAC)],'_excl')
bim[!(bim$SNP_IUPAC %in% bim_hapmap3_snps$SNP_IUPAC),]
fwrite(bim[,1:6], paste0(Geno_1KG_dir,'/1KGPhase3.chr',chr,'.bim'), sep='\t', col.names=F)
fwrite(as.list(bim_hapmap3_snps$V2), paste0(Geno_1KG_dir,'/1KGPhase3.chr',chr,'.extract'), sep='\n', col.names=F)
}
q()
n
for chr in $(seq 1 22); do
qsub ${plink1_9} \
--bfile ${Geno_1KG_dir}/1KGPhase3.chr${chr} \
--make-bed \
--extract ${Geno_1KG_dir}/1KGPhase3.chr$chr.extract \
--out ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr${chr}
done
rm ${Geno_1KG_dir}/1KGPhase3.chr*
# Create version that is all chromosomes merged
ls ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr*.bed | sed -e 's/\.bed//g' > ${Geno_1KG_dir}/merge_list.txt
${plink1_9} \
--merge-list ${Geno_1KG_dir}/merge_list.txt \
--make-bed \
--out ${Geno_1KG_dir}/1KGPhase3.w_hm3.GW
```
</details>
<br/>
***
## 1000 Genomes allele frequency files
Here we create files containing ancestry specific minor allele frequency estimates for the HapMap3 SNPs based on the 1000 Genomes Phase 3 data. This information is mainly used for mean-imputation of missing SNPs during genotype-based scoring. This avoids target sample specific minor allele frequencies being used for mean imputation which may not be available, and will vary between target samples.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# The following script uses .sumstats.gz files produced by the LDSC munge_sumstats.py
for pop in $(cat ${Geno_1KG_dir}/super_pop.list); do
mkdir -p ${Geno_1KG_dir}/freq_files/${pop}
for chr in $(seq 1 22); do
qsub ${plink1_9} \
--bfile ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr${chr} \
--freq \
--keep ${Geno_1KG_dir}/keep_files/${pop}_samples.keep \
--out ${Geno_1KG_dir}/freq_files/${pop}/1KGPhase3.w_hm3.${pop}.chr${chr}
done
done
for pop in $(cat ${Geno_1KG_dir}/pop.list); do
mkdir -p ${Geno_1KG_dir}/freq_files/${pop}
for chr in $(seq 1 22); do
qsub ${plink1_9} \
--bfile ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr${chr} \
--freq \
--keep ${Geno_1KG_dir}/keep_files/${pop}_samples.keep \
--out ${Geno_1KG_dir}/freq_files/${pop}/1KGPhase3.w_hm3.${pop}.chr${chr}
done
done
# Calculate MAF across all populations
mkdir ${Geno_1KG_dir}/freq_files/AllAncestry
for chr in $(seq 1 22); do
qsub ${plink1_9} \
--bfile ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr${chr} \
--freq \
--out ${Geno_1KG_dir}/freq_files/AllAncestry/1KGPhase3.w_hm3.AllAncestry.chr${chr}
done
# Delete the log and nosex files
rm ${Geno_1KG_dir}/freq_files/*/1KGPhase3.w_hm3.*.chr*.log
rm ${Geno_1KG_dir}/freq_files/*/1KGPhase3.w_hm3.*.chr*.nosex
```
</details>
<br/>
***
# Ancestry scoring
In this section we perform principal components analysis of genotypic data in the 1000 Genomes reference to identify the main axes of population structure, which typically correspond to ancestral differences. We calculate these principal components (PCs) of ancestry across the full reference, and within super populations to detect broad and ancestry-specific axes of variance. After PCA, we idenitfy which variants are associated with the PCs (SNP-weights) to calculate ancestry scores on the same axes of variance in future target samples. This can be used to infer the ancestry of an individual which is an important factor to consider when performing genotype-baed prediction.
This section uses an R script called 'ancestry_score_file_creator.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/ancestry_score_file_creator).
<details><summary>Set required variables</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Set variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
```
</details>
<details><summary>Show code for ancestry scoring</summary>
```{bash, eval=F, echo=T}
# Create score files for European specific PCs, and scaling files for Europeans
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/ancestry_score_file_creator/ancestry_score_file_creator.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--plink ${plink1_9} \
--plink2 ${plink2} \
--n_pcs 100 \
--output ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR
# Create score files for all ancestry PCs, and scaling files for each super population
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/ancestry_score_file_creator/ancestry_score_file_creator.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--plink ${plink1_9} \
--plink2 ${plink2} \
--n_pcs 100 \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
--output ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry
# For when predicting ancestry in target samples, create a file containing the path to scale files
ls ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.*.scale > ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.scalefiles.txt
```
</details>
<br/>
***
# Polygenic scoring
In this section we will create files that can be used for polygenic scoring (score files), estimate the mean and SD of polygenic scores within different ancestries for scaling target samples. This will be performed using MAF and LD in the 1000 Genomes European subset as we will include only GWAS that were performed within European individuals, and our target samples (UK Biobank and TEDS) are both of European ancestry.
Polygenic scores are derived using multiple methods to allow comparison. After the best method has been idenitified, it will ony be necessary to calculate scores using the one method.
<details><summary>Set required variables</summary>
```{bash, eval=F, echo=T}
# Set variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
```
</details>
<br/>
***
## Create repository of GWAS summary statistics
I will be using the GWAS repository on the Rosalind server at KCL, created by Helena Gaspar. Here I idenitfy GWAS that I would like to calculate polygenic scores from, selecting GWAS that do not overlap with my target samples, the largest sample size for each phenotype, and a sample size over 15K. I also remove groups of GWAS that I think are unnecessary to speed up the pipeline.
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
###
# Select which GWAS we want to include which don't include UKBB
###
module add general/R/3.5.0
R
# Read in environmental variables
source('/users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config')
pheno<-read.csv('/users/k1806347/brc_scratch/Data/GWAS_sumstats/QC_sumstats_list_031218.csv', stringsAsFactors=F)
pheno_brief<-pheno[c('Code','phenotype_category','trait','sample_size_discovery','Ncases','Ncontrols','sex','consortium','year','UK.Biobank','h2.observed','ancestry','PRS.created')]
pheno_brief$sample_size_discovery<-gsub(',','',pheno_brief$sample_size_discovery)
# Extract those that have LDSC sumstats (this removes those where sumstats are unavailable)
pheno_brief$sample_size_discovery<-as.numeric(pheno_brief$sample_size_discovery)
# Include only GWAS based on both sexes
pheno_brief<-pheno_brief[pheno_brief$sex == 'both' | pheno_brief$sex == "both?" | pheno_brief$sex == "both (children)" | pheno_brief$sex == "meta" ,]
# Exclude GWAS containing UKBB as this is our target sample
pheno_brief<-pheno_brief[which(pheno_brief$UK.Biobank == 'no'),]
pheno_brief<-pheno_brief[!is.na(as.numeric(pheno_brief$h2.observed)),]
# BLOO GWAS are mislabelled as not containing UKBB. Remove them.
pheno_brief<-pheno_brief[!(grepl('BLOO',pheno_brief$Code)),]
# INTE02, INTE03, all EDUC, GWAS are mislabelled as not containing UKBB. Remove them.
pheno_brief<-pheno_brief[!(grepl('BLOO|INTE02|INTE03|EDUC',pheno_brief$Code)),]
# COAD02 is based on a recessive model and should be excluded.
pheno_brief<-pheno_brief[!(grepl('COAD02',pheno_brief$Code)),]
# Edit DIAB06 trait info so it is different from DIAB05, and therefore retained.
pheno_brief$trait[grepl('DIAB06', pheno_brief$Code)]<-'diabetes_type_2_bmi_adj'
# Remove SCHI03 as this GWAS is not public yet
pheno_brief<-pheno_brief[!(grepl('SCHI03',pheno_brief$Code)),]
# Extract the largest GWAS for each phenotype
pheno_brief_bigest<-NULL
for(i in unique(pheno_brief$trait)){
tmp<-pheno_brief[pheno_brief$trait == i,]
tmp<-tmp[rev(order(tmp$sample_size_discovery)),]
pheno_brief_bigest<-rbind(pheno_brief_bigest,tmp[1,])
}
# Extract those with sample size >15000
pheno_brief_bigest_big<-pheno_brief_bigest[pheno_brief_bigest$sample_size_discovery > 15000,]
# Remove GWAS from MAGNETIC as we are not looking at cardiometabolic traits and is excessive.
pheno_brief_bigest_big_nomag<-pheno_brief_bigest_big[pheno_brief_bigest_big$consortium !='MAGNETIC',]
pheno_brief_bigest_big_nomag$gwas_list<-paste0(gwas_rep_cleaned,'/',pheno_brief_bigest_big_nomag$Code,'.gz')
write.table(pheno_brief_bigest_big_nomag[c('Code','gwas_list')], '~/brc_scratch/Data/GWAS_sumstats/Largest_GWAS_over15K_withoutUKBB.txt', col.names=F, row.names=F, quote=F)
q()
n
# NOTE. The overlapping with UKBB variable is unreliable and should be checked using the LDSC covariance intercept with outcome variables.
###
# Select which GWAS we want to include which don't include TEDS
###
module add general/R/3.5.0
R
# Read in environmental variables
source('/users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config')
pheno<-read.csv('/users/k1806347/brc_scratch/Data/GWAS_sumstats/QC_sumstats_list_031218.csv', stringsAsFactors=F)
pheno_brief<-pheno[c('Code','phenotype_category','trait','sample_size_discovery','Ncases','Ncontrols','sex','consortium','year','UK.Biobank','h2.observed','ancestry','PRS.created')]
pheno_brief$sample_size_discovery<-gsub(',','',pheno_brief$sample_size_discovery)
# Include only GWAS based on both sexes
pheno_brief<-pheno_brief[pheno_brief$sex == 'both' | pheno_brief$sex == "both?" | pheno_brief$sex == "both (children)" | pheno_brief$sex == "meta" ,]
# Extract those that have LDSC sumstats (this removes those where sumstats are unavailable)
pheno_brief$sample_size_discovery<-as.numeric(pheno_brief$sample_size_discovery)
# Remove GWAS containing TEDS individuals
pheno_brief<-pheno_brief[!grepl('EDUC04', pheno_brief$Code),]
# Remove BLOO GWASs as there are lot of them and unnecessary for our purposes.
pheno_brief<-pheno_brief[!grepl('BLOO', pheno_brief$Code),]
# Remove BLOP, CROH03 and DIAB01 GWASs
pheno_brief<-pheno_brief[!grepl('BLOP|CROH03|DIAB01', pheno_brief$Code),]
# Remove entries with missing sample size information
pheno_brief<-pheno_brief[!is.na(pheno_brief$sample_size_discovery),]
# Edit DIAB06 trait info so it is different from DIAB05, and therefore retained.
pheno_brief$trait[grepl('DIAB06', pheno_brief$Code)]<-'diabetes_type_2_bmi_adj'
# Remove SCHI03 as this GWAS is not public yet
pheno_brief<-pheno_brief[!(grepl('SCHI03',pheno_brief$Code)),]
# Extract the most recent and largest GWAS for each phenotype
pheno_brief_bigest<-NULL
for(i in unique(pheno_brief$trait)){
tmp<-pheno_brief[pheno_brief$trait == i,]
tmp<-tmp[rev(order(tmp$sample_size_discovery)),]
pheno_brief_bigest<-rbind(pheno_brief_bigest,tmp[1,])
}
# Extract those with sample size >15000
pheno_brief_bigest_big<-pheno_brief_bigest[pheno_brief_bigest$sample_size_discovery > 15000,]
# Remove GWAS from MAGNETIC as we are not looking at cardiometabolic traits and is excessive.
pheno_brief_bigest_big_nomag<-pheno_brief_bigest_big[pheno_brief_bigest_big$consortium !='MAGNETIC',]
pheno_brief_bigest_big_nomag$gwas_list<-paste0(gwas_rep_cleaned,'/',pheno_brief_bigest_big_nomag$Code,'.gz')
write.table(pheno_brief_bigest_big_nomag[c('Code','gwas_list')], '~/brc_scratch/Data/GWAS_sumstats/Largest_GWAS_over15K_withoutTEDS.txt', col.names=F, row.names=F, quote=F)
q()
n
```
</details>
<br/>
Now I will perform quality control of GWAS summary statistics for polygenic scoring. The QC paramters are based on those in LDSC, with addition of comparing reported MAF to reference MAF.
* Extract HapMap3 variants.
* Insert chromomsome and base pair position from reference.
* Remove variants that are not SNPs or were strand-ambiguous.
* Check allele match
* Remove SNPs with missing values.
* Remove SNPs with INFO < 0.9.
* Remove SNPs with reported MAF < 0.01.
* Remove SNPs with reference MAF < 0.01.
* Remove SNPs with discordant MAF with reference < 0.01.
* Insert reference MAF
* Remove SNPs with out-of-bounds p-values.
* Remove SNPs with duplicated rs numbers.
* Remove SNPs with N < (90th percentile N)/2.
* Recompute P value if genomic control detected
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${gwas_rep_qcd}
# Create file listing GWAS that haven't been processed.
> ${gwas_rep_qcd}/todo.txt
for gwas in $(echo DEPR06 DEPR07 COLL01 HEIG03 BODY03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 RHEU01 EDUC03 ADHD04 BODY11 PRCA01 BRCA01 INFB01 CRPR01 STRK01 HDLP01 LDLP01 CHOL35 TRGL13);do
if [ ! -f ${gwas_rep_qcd}/${gwas}.cleaned.gz ]; then
echo $gwas >> ${gwas_rep_qcd}/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${gwas_rep_qcd}/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 5G
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${gwas_rep_qcd}/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/sumstat_cleaner/sumstat_cleaner.R \
--sumstats ${gwas_rep_cleaned}/${gwas}.gz \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
--output ${gwas_rep_qcd}/${gwas}.cleaned
EOF
sbatch --array 1-$(wc -l ${gwas_rep_qcd}/todo.txt | cut -d' ' -f1)%5 ${gwas_rep_qcd}/sbatch.sh
```
</details>
<br/>
***
## Prepare score files and scaling files for polygenic scoring (pT + clump)
Here I prepare reference files for typical polygenic scores derived using the p-value thresholding and LD-based clumping procedure.
### Sparse thresholding (nested)
Here we will only use 8 p-value thresholds.
This section uses an R script called 'polygenic_score_file_creator.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/todo.txt
for gwas in $(echo DEPR06 DEPR07 COLL01 HEIG03 BODY03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 RHEU01 EDUC03 ADHD04 BODY11 PRCA01 BRCA01 INFB01);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 5G
#SBATCH -J pt_clump
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator/polygenic_score_file_creator.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--memory 3000 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/todo.txt | cut -d' ' -f1)%5 ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump/sbatch.sh
```
</details>
<br/>
### Sparse thresholding (not nested)
Here we will only use 8 p-value thresholds.
This section uses an R script called 'polygenic_score_file_creator.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 EDUC03 ADHD04 BODY11 PRCA01 BRCA01);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 5G
#SBATCH -J pt_clump_nn
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator/polygenic_score_file_creator.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--nested F \
--memory 3000 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/todo.txt | cut -d' ' -f1)%5 ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_nonnested/sbatch.sh
```
</details>
<br/>
### Dense thresholding
Here we will only use dense p-value thresholds, mimicking the PRSice approach.
This section uses an R script called 'polygenic_score_file_creator.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator_PRSice).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 EDUC03 ADHD04 BODY11 PRCA01 BRCA01);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 5G
#SBATCH -J pt_clump_dense
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator_dense/polygenic_score_file_creator_dense.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--prsice_path ${prsice_path} \
--rscript ${rscript} \
--memory 3000 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/todo.txt | cut -d' ' -f1)%5 ${Geno_1KG_dir}/Score_files_for_polygenic/pt_clump_dense/sbatch.sh
```
</details>
<br/>
***
## Prepare score and scale files for polygenic scoring using lassosum
Here we create reference files for polygenic scores calculated by lassosum, a method for performing lasso-based shrinkage to GWAS sumstats to account for LD and winners curse. More information on lassosum can be found [here](https://github.com/tshmak/lassosum). You will need to install the lassosum R package in advance.
This section uses an R script called 'polygenic_score_file_creator_lassosum.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator_lassosum).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 EDUC03 ADHD04 BODY11 PRCA01 BRCA01);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -J lassosum
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator_lassosum/polygenic_score_file_creator_lassosum.R \
--ref_plink_gw ${Geno_1KG_dir}/1KGPhase3.w_hm3.GW \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/${gwas}/1KGPhase3.w_hm3.${gwas} \
--plink ${plink1_9} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/todo.txt | cut -d' ' -f1)%5 ${Geno_1KG_dir}/Score_files_for_polygenic/lassosum/sbatch.sh
```
</details>
<br/>
***
## Prepare score and scale files for polygenic scoring using PRScs
Here we create polygenic scores calculated by PRScs, a method for performing Bayesian continuous shrinkage to GWAS sumstats to account for LD and winners curse. More information on PRScs can be found [here](https://github.com/getian107/PRScs). The PRScs developers provide reference LD data based on European individuals in the 1KG sample (and recently UK Biobank). I will use the PRScs provided LD matrices based on 1KG as this should be comparable to other methods. There is minimal documentation of how th LD reference was created. PRScs developers also report a minimal improvement when using the larger UKB reference (although their recent addition of the UKB reference to GitHub may suggest is it offers some advantages).
<br/>
### Run PRScs
This section uses an R script called 'polygenic_score_file_creator_PRScs.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator_PRScs).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 RHEU01 EDUC03 ADHD04 BODY11 PRCA01 BRCA01 INFB01 CRPR01 STRK01 HDLP01 LDLP01 CHOL35 TRGL13);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 110G
#SBATCH -n 11
#SBATCH --nodes=1
#SBATCH -t 48:00:00
#SBATCH -J PRScs
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator_PRScs/polygenic_score_file_creator_PRScs.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--memory 5000 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path ${PRScs_dir}/ldblk_1kg_eur \
--n_cores 11 \
--phi_param 1e-6,1e-4,1e-2,1,auto
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt | cut -d' ' -f1)%3 ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/sbatch.sh
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/sbatch_2.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 5G
#SBATCH -n 1
#SBATCH --nodes=1
#SBATCH -t 48:00:00
#SBATCH -J PRScs_batch
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator_PRScs/polygenic_score_file_creator_PRScs_2.R \
--ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--memory 5000 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path ${PRScs_dir}/ldblk_1kg_eur \
--n_cores 15 \
--phi_param 1e-6,1e-4,1e-2,1,auto
EOF
sbatch --array 2-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/todo.txt | cut -d' ' -f1)%2 ${Geno_1KG_dir}/Score_files_for_polygenic/PRScs/sbatch_2.sh
```
</details>
<br/>
***
## Prepare score and scale files for polygenic scoring using S-BLUP
Here we create reference files for polygenic scores calculated by SBLUP, a method for performing genomic BLUP analysis with summary data and an LD-reference. More information on SBLUP can be found [here](https://cnsgenomics.com/software/gcta/#Genomicriskprediction). You will need to download the GCTA software in advance.
This section uses an R script called 'polygenic_score_file_creator_SBLUP.R'. Further information the usage of this script can be found [here](https://github.com/opain/GenoPred/tree/master/Scripts/polygenic_score_file_creator_SBLUP).
<details><summary>Show code</summary>
```{bash, eval=F, echo=T}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# Create directory
mkdir ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP
# Create file listing GWAS that haven't been processed.
> ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY04 DIAB05 COAD01 CROH01 SCLE03 RHEU02 EDUC03 ADHD04 BODY11 PRCA01 BRCA01);do
if [ ! -f ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale ]; then
echo $gwas >> ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/todo.txt
fi
done
# Create shell script to run using sbatch
cat > ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 80G
#SBATCH -n 6
#SBATCH -J SBLUP
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
gwas=$(sed "${SLURM_ARRAY_TASK_ID}q;d" ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/todo.txt)
echo ${gwas}
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/polygenic_score_file_creator_SBLUP/polygenic_score_file_creator_SBLUP.R \
--ref_plink ${Geno_1KG_dir}/1KGPhase3.w_hm3.GW \
--ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
--sumstats ${gwas_rep_qcd}/${gwas}.cleaned.gz \
--plink ${plink1_9} \
--gcta ${gcta} \
--munge_sumstats ${munge_sumstats} \
--ldsc ${ldsc} \
--ldsc_ref ${ldsc_ref} \
--hm3_snplist ${HapMap3_snplist_dir}/w_hm3.snplist \
--memory 50000 \
--n_cores 6 \
--output ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/${gwas}/1KGPhase3.w_hm3.${gwas} \
--ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list
EOF
sbatch --array 1-$(wc -l ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/todo.txt | cut -d' ' -f1)%3 ${Geno_1KG_dir}/Score_files_for_polygenic/SBLUP/sbatch.sh
```
</details>
<br/>
***
## Prepare score and scale files for polygenic scoring using SBayesR
Here we create reference files for polygenic scores calculated by SBayesR, a bayesian shrinkage method for GWAS summary data and an LD-reference. More information on SBayesR can be found [here](http://cnsgenomics.com/software/gctb/#SummaryBayesianAlphabet). You will need to download the GCTB software in advance.
First we need to create a specially formatted LD matrix for SBayesR. The GCTB authors compare the performance of SBayesR when using different datasets for LD matrix estimation. They show that using EUR 1KG data (N=378) leads to poorer prediction accuracy from S-BayesR. They then use LD matrices based on 50,000 random European individuals from UK Biobank to provide optimal prediction, however using 5,000 indivduals gave similar results. I think we should be comparing PRS methods based on the same reference data, so we should estimate LD matrices based on the EUR 1KG reference (N=503). A subsequent study (by me) can test these PRS methods based on LD estimates from 10,000 individuals.
<details><summary>Create LD reference based on EUR 1KG Phase 3 individuals</summary>
```{bash, eval=F}
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config
# When retaining only EUR, I am getting an error saying fixed SNP. Create list of variants with MAF > 0.001.
# Download the required genetic maps
cd ${genetic_map}/CEU
for chr in $(seq 1 22); do
wget https://github.com/joepickrell/1000-genomes-genetic-maps/raw/master/interpolated_OMNI/chr${chr}.OMNI.interpolated_genetic_map.gz
done
gunzip *.gz
module add apps/R/3.6.0
R
# Read in environmental variables
source('/users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Pipeline_prep.config')
# Create list of SNPs that have MAF above 0.001 in the EUR 1KG sample
for(i in 1:22){
frq<-read.table(paste0(Geno_1KG_dir, '/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr',i,'.frq'), header=T)
frq<-frq[frq$MAF > 0.001,]
write.table(frq$SNP,paste0(Geno_1KG_dir,'/LD_matrix/EUR/SNP_maf0.001_EUR_chr',i,'.txt'), col.names=F, row.names=F, quote=F)
}
# Stop scientific notation
options(scipen=999)
# Create shrunk LD matrix in 5000 SNP pieces
for(i in 1:22){
nsnp<-system(paste0('wc -l ',Geno_1KG_dir,'/LD_matrix/EUR/SNP_maf0.001_EUR_chr',i,'.txt'), intern=T)
nsnp<-as.numeric(unlist(strsplit(nsnp, ' '))[1])
nsnp_chunk<-ceiling(nsnp/5000)
for(j in 1:nsnp_chunk){
start<-(5000*(j-1))+1