=== plink_python ===
Contributors: Jessica Bonnie (jkb4y)
Requires: Python 2.6, PLINK, locuszoom
Version: 1.1.0
Last Updated: 1/17/2012
Set of .py files to perform conditional analysis on data using PLINK.
=== Description ===
plink_python consists of a set of 8 Python program files:
plink_conditional.py
region_wrap.py
parse_log.py
meta_yank.py
assoc_yank.py
map_adapt.py
pc_workhorse.py
pc_toolbox.py
=== BASIC USAGE INFORMATION ===
plink_conditional.py
=== DESCRIPTION ===
This program takes a PLINK genotype file, runs it through PLINK,
illustrates the results via locuszoom, identifies the SNP with the
most significant (i.e. lowest) p-value, adds this SNP to a condition
list, and then repeats the process until there are no SNPs with p-values
below a given threshold of significance.
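The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in pc_workhorse.py: run_test is a hypothetical callable standing in for "run PLINK and read back the results", and conditional_loop is a name invented here.

```python
def conditional_loop(run_test, pbound, max_loops):
    """Repeat an association test, conditioning on the top SNP each round.

    run_test is a hypothetical stand-in for a PLINK run: it takes the
    current condition list and returns (snp_name, p_value) pairs.
    """
    conditions = []
    for _ in range(max_loops):
        # Ignore SNPs that are already being conditioned on.
        results = [(snp, p) for snp, p in run_test(conditions)
                   if snp not in conditions]
        if not results:
            break
        top_snp, top_p = min(results, key=lambda pair: pair[1])
        if top_p >= pbound:
            break  # nothing left below the significance threshold
        conditions.append(top_snp)
    return conditions
```

The two stopping conditions correspond to the --pbound and --loop options described later in this file.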
Required commands: --script <script path>
--outfolder <folder path>
--test <test name>
--chromosome <number>
--from-mb <start in Mb> --to-mb <end in Mb>
OR --from-kb <start in kb> --to-kb <end in kb>
OR --from-bp <start in bp> --to-bp <end in bp>
Optional commands: --chrband <chr band>
--refgene <gene name>
--flag <word>
--pbound <number>
--loop <number>
--condition-list <file path>
--pheno <file path>
=== INSTRUCTIONS ===
Argument and default information for this program can be accessed
with a call to the program using --help or -h.
=== REQUIRED COMMANDS ===
SCRIPT:
--script <script path> OR -s <script path>
Commands to the program MUST include the path location of a
script containing the following constructions directly from PLINK:
--bfile <full path name of binary data files to be fed
into PLINK (without extension)>
--covar <full path name of covariate file (include extension)>
Optionally, one can also include the following flags:
--noweb
--sex
--hide-covar
Scripts CANNOT contain:
ANY output file information, constructed in PLINK using:
--out
ANY type of regional information, constructed in PLINK using:
--from...
--to...
--chromosome
--window
--gene
Association test commands (those will come later):
For examples, see:
http://pngu.mgh.harvard.edu/~purcell/plink/anal.shtml#cc
Condition List information, constructed in PLINK using:
--condition-list
Phenotype commands, constructed in PLINK using
--pheno
--all-pheno
--pheno-name
OUTPUT FOLDER:
--outfolder <folder path> OR -o <folder path>
Commands to the program MUST include the path location of a Big Mama
results folder. Within this folder, the program will construct
additional folders to separate results by chromosome.
TEST:
--test <plink test> OR -t <plink test>
Commands to the program MUST include an indication of which PLINK test
should be run. Current program configuration will accept the following:
--test logistic
--test linear
--test assoc
--test fisher
--test model
--test mh
CHROMOSOME:
--chromosome <number> OR -c <number>
Commands to the program MUST include the number of the chromosome on
which the analysis should be run.
REGIONAL BOUNDARIES:
--from-mb <start in Mb> --to-mb <end in Mb> OR
--from-kb <start in kb> --to-kb <end in kb> OR
--from-bp <start in bp> --to-bp <end in bp>
Commands to the program MUST include range information of a region
on the chromosome. This can be given in mega-basepairs, kilo-basepairs,
or basepairs (or a combination thereof).
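Since the units can be mixed (for example --from-mb 2.1 with --to-kb 2300), it may help to think of every boundary as a base-pair position. A minimal sketch of that conversion is below; how the program converts internally is not documented here, and to_bp is a hypothetical helper name.

```python
# Multipliers for each unit accepted on the command line.
UNIT_TO_BP = {"mb": 1000000, "kb": 1000, "bp": 1}

def to_bp(value, unit):
    """Convert a position given in 'mb', 'kb', or 'bp' to whole base pairs."""
    # round() guards against floating-point noise such as 2.3 * 1000000.
    return int(round(float(value) * UNIT_TO_BP[unit]))

start = to_bp(2.1, "mb")   # --from-mb 2.1
end = to_bp(2300, "kb")    # --to-kb 2300
```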
__________________________________________
=== OPTIONAL COMMANDS ===
Without these commands, the program will use current default values,
which can be learned by using the --help command. To change the
defaults, see the advanced user section.
CHRBAND:
--chrband <chr band>
This command uses a chromosomal band name instead of range information
to name output files.
REFGENE:
--refgene <gene name>
This command uses a reference gene name instead of range information
to name output files.
PBOUND:
--pbound <highest p-value of interest>
This command sets the highest p-value that is considered significant.
When there are no more SNPs with p-values below this number,
the program terminates.
MAXLOOPS:
--loop <number> OR -l <number>
This command sets the maximum number of times the program should loop
through PLINK.
FLAG:
--flag <word> OR -f <word>
This command adds a word or phrase (NO SPACES) to the titles of all
output files produced during the run.
CONDITION-LIST:
--condition-list <file path>
This command takes the path location of a list of SNPs which should
be included in the condition list from the very first loop through
PLINK. All SNPs in the file will be copied to ~leastP_SNPs.txt, and
any additional SNPs the program identifies will be added there.
ANY output files produced through this command will contain '~'
in the name.
PHENO:
--pheno <file path>
This command takes the path location of a PLINK phenotype file. The
program will perform separate loops for each of the phenotypes in
the pheno-file.
=== EXAMPLES ===
Example Script Text:
--bfile /home/mst3k/data/soandso
--covar /home/mst3k/data/soandso.cov
--hide-covar --sex --noweb
Example Commandline Call:
python plink_conditional.py --script /home/mst3k/data/soandso.txt
--outfolder /home/mst3k/results/SoAndSo/ --test logistic --chromosome 11
--refgene INS --from-mb 2.1 --to-mb 2.3
_______________________________________________
region_wrap.py
=== DESCRIPTION ===
This program reads from a file containing regional information and
performs plink_conditional.py on each of the regions.
Required Commands: --region-list <file path>
+ ALL REQUIRED COMMANDS FOR PLINK_CONDITIONAL.PY
Optional Commands: --cfolder <folder path>
=== INSTRUCTIONS ===
Argument and default information for this program can be accessed
with a call to the program using --help or -h.
=== REQUIRED COMMANDS ===
REGION-LIST:
--region-list <file path>
Commands to this program MUST include the path to a file containing
regional information for the regions on which plink_conditional
should be run. The column headings of the file must include: "gene_chr",
"region_start", "region_end", and "gene_symbol". The positional
information MUST be given in megabase-pairs (Mb).
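As an illustration of the region-list layout, the sketch below checks for the four required column headings and reads each row. It assumes whitespace-delimited columns, which this README does not specify, and read_region_list is a hypothetical name rather than a function in the suite.

```python
REQUIRED = ("gene_chr", "region_start", "region_end", "gene_symbol")

def read_region_list(lines):
    """Parse region-list rows from an iterable of text lines.

    Assumption: columns are separated by whitespace and the first
    non-blank line holds the column headings.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    header = rows[0]
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        raise ValueError("region-list is missing columns: " + ", ".join(missing))
    regions = []
    for fields in rows[1:]:
        record = dict(zip(header, fields))
        regions.append({
            "chr": record["gene_chr"],
            "start_mb": float(record["region_start"]),  # positions are in Mb
            "end_mb": float(record["region_end"]),
            "gene": record["gene_symbol"],
        })
    return regions
```

Columns are looked up by heading, so extra columns or a different column order would not matter under this assumption.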
PLINK_CONDITIONAL COMMANDS:
Commands to this program MUST include any commands that would normally
be given to plink_conditional.
=== OPTIONAL COMMANDS ===
CONDITIONAL LIST FOLDER:
--cfolder <folder path>
This command takes the path location of a folder of condition-lists
of SNPs. For naming purposes, these should be produced by either
meta_yank.py or assoc_yank.py. Each region (e.g. chromosomal band
6q72.1 or region JKB on chromosome 6) in the region-list must have
a corresponding condition-list in the folder (i.e. 6q72.1.txt or
Chr6_JKB.txt). The SNPs in each condition-list will be copied to
~leastP_SNPs.txt and included in the condition-list from the very first
loop through PLINK for the associated region. Any additional SNPs the
program identifies will be added to the second location
(~leastP_SNPs.txt), leaving the original lists in the condition-list
folder unchanged. ANY output files produced through this command will
contain '~' in the name.
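The file naming implied by the two examples above (6q72.1.txt and Chr6_JKB.txt) could be expressed as in the sketch below. The pattern is inferred only from those examples, and condition_list_path is a hypothetical helper, so check the names meta_yank.py or assoc_yank.py actually produce.

```python
import os

def condition_list_path(cfolder, chromosome, gene_symbol=None, chrband=None):
    """Build the expected condition-list path for one region.

    Assumption: band-named regions use '<band>.txt' and all other
    regions use 'Chr<chromosome>_<name>.txt'.
    """
    if chrband:
        filename = "%s.txt" % chrband
    else:
        filename = "Chr%s_%s.txt" % (chromosome, gene_symbol)
    return os.path.join(cfolder, filename)
```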
_____________________________________________________
parse_log.py
=== DESCRIPTION ===
This program reads the log files produced during plink_conditional.py
and creates a summary file containing key information from the logs.
It also produces a second text file likely of interest only to the
programmer containing information about the first run through PLINK
from each log.
Required commands: --logfolder <folder path>
--map <path to ORIGINAL map file>
--freq <file path>
Optional commands: --summary <filename NO EXTENSION>
--runinfo <filename NO EXTENSION>
=== INSTRUCTIONS ===
Argument and default information for this program can be accessed
with a call to the program using --help or -h.
=== REQUIRED COMMANDS ===
LOG FOLDER:
--logfolder <folder path>
Commands to this program MUST include the path to the logs folder
which plink_conditional.py created within your Big Mama results folder.
So, the address of this folder will be Big_Mama's_Path/logs.
MAP FILE:
--map <file path>
Commands to this program MUST include the path to the original map file,
unaltered by map_adapt.
FREQUENCY FILE:
--freq <file path>
Commands to this program MUST include the path to a frequency file for
the control population produced by PLINK. For your convenience:
plink --bfile "data" --filter-controls --freq --make-bed --noweb --out "data"_controls
=== OPTIONAL COMMANDS ===
SUMMARY FILE NAME:
--summary <file name NO EXTENSION>
The program will write the summary file to the logs folder under the
name 'log_summary.txt', unless another basename is given.
RUN INFO NAME:
--runinfo <file name NO EXTENSION>
The program will write the run info to the logs folder under the name
'run_info.txt', unless another basename is given.
_____________________________________________________
meta_yank.py
=== DESCRIPTION ===
This program takes a region-list file and a meta-analysis file, both
with the specific column heading requirements detailed below.
It produces a file containing the information of the most significant
SNP for each of the regions in the region-list. It also produces
a folder of condition-lists, each containing the most significant
SNP in a region and named for that region, and a list for each
region whose lowest p-value is shared by multiple SNPs.
Required Commands: --outfile <filepath>
--meta <filepath>
--region-list <filepath>
=== INSTRUCTIONS ===
Argument and default information for this program can be accessed
with a call to the program using --help or -h.
=== REQUIRED COMMANDS ===
OUTPUT FILE:
--out <file path>
Commands to this program MUST include the path to the file where the
results should be written.
META-ANALYSIS FILE:
--meta <file path>
Commands to this program MUST include the path to a meta-analysis file
containing the relevant SNP information. The column headings of the file
must include: "CHR", "MarkerName", "P-value","POS" (bp-position),
"Zscore".
REGION-LIST:
--region-list <file path> or -r <file path>
Commands to this program MUST include the path to a region-list file. The
column headings of the file must include: "gene_chr", "region_start",
"region_end", and "gene_symbol". The positional information MUST be given
in megabase-pairs (Mb).
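The selection step can be pictured with the sketch below, which returns every SNP that shares the lowest p-value in a region, so that ties show up as more than one record. It assumes the meta-analysis rows have already been read into dictionaries keyed by the column headings above and that region boundaries are inclusive. most_significant is a hypothetical name, not a function from meta_yank.py.

```python
def most_significant(snps, chrom, start_mb, end_mb):
    """Return the SNP records sharing the lowest P-value inside a region.

    snps: list of dicts keyed by "CHR", "MarkerName", "P-value", "POS".
    Region boundaries are given in Mb; "POS" is in base pairs.
    """
    start_bp, end_bp = start_mb * 1e6, end_mb * 1e6
    in_region = [s for s in snps
                 if str(s["CHR"]) == str(chrom)
                 and start_bp <= float(s["POS"]) <= end_bp]
    if not in_region:
        return []
    lowest = min(float(s["P-value"]) for s in in_region)
    # More than one record here means the lowest p-value is shared.
    return [s for s in in_region if float(s["P-value"]) == lowest]
```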
_____________________________________________________
assoc_yank.py
=== DESCRIPTION ===
This program takes a region-list file and a PLINK association file, both
with the specific column heading requirements detailed below.
It produces a file containing the information of the most significant
SNP for each of the regions in the region-list. It also produces
a folder of condition-lists, each containing the most significant
SNP in a region (if it could be determined)
and named for that region, and a list for each region whose lowest
p-value is shared by multiple SNPs.
Required Commands: --outfile <filepath>
--assoc <filepath>
--region-list <filepath>
--bp-form <'mb' OR 'kb' OR 'bp'>
=== INSTRUCTIONS ===
Argument and default information for this program can be accessed
with a call to the program using --help or -h.
=== REQUIRED COMMANDS ===
OUTPUT FILE:
--out <file path>
Commands to this program MUST include the path to the file where the
results should be written.
ASSOCIATION FILE:
--assoc <file path>
Commands to this program MUST include the path to a PLINK association file
containing the relevant SNP information. The column headings of the file
must include: "CHR", "SNP", "P", "BP" (bp-position).
REGION-LIST:
--region-list <file path>
Commands to this program MUST include the path to a region-list file. The
column headings of the file must include: "gene_chr", "region_start",
"region_end", and "gene_symbol".
POSITIONAL UNITS:
--bp-form <'mb' OR 'kb' OR 'bp'>
Commands to this program MUST include one of three flags indicating
what unit is used to list positional information in the region list.
_____________________________________________________
map_adapt.py
=== DESCRIPTION ===
This program takes a map file (e.g.'.bim') and produces a copy of
the file replacing any names of non-duplicate SNPs that are not
in 'rs' form with chr<chromosome>:<basepair position> names. The
original map file is saved to another name, while the new one
assumes the name of the original. The program writes a list of the
duplicate SNPs to a new file in the same folder as the map file.
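A rough sketch of the renaming rule is given below. It relies on the standard PLINK map/bim layout, where the first column is the chromosome, the second the SNP name, and the fourth the base-pair position. What counts as a "duplicate" is not spelled out above, so the sketch assumes it means a non-rs SNP whose chr<chromosome>:<basepair position> name would collide with another one. adapt_map_rows is a hypothetical name.

```python
def adapt_map_rows(rows):
    """Rename non-'rs' SNPs to chr<chromosome>:<position>.

    rows: lists of map/bim fields (chromosome, SNP name, genetic
    distance, base-pair position, ...). Returns (new_rows, duplicates).
    Assumption: a duplicate is a non-rs SNP whose new name would
    collide with another; duplicates keep their original names.
    """
    def new_name(fields):
        return "chr%s:%s" % (fields[0], fields[3])

    # Count how many non-rs SNPs would receive each candidate name.
    counts = {}
    for fields in rows:
        if not fields[1].startswith("rs"):
            counts[new_name(fields)] = counts.get(new_name(fields), 0) + 1

    new_rows, duplicates = [], []
    for fields in rows:
        fields = list(fields)  # copy so the input rows are left untouched
        if not fields[1].startswith("rs"):
            if counts[new_name(fields)] > 1:
                duplicates.append(fields[1])
            else:
                fields[1] = new_name(fields)
        new_rows.append(fields)
    return new_rows, duplicates
```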
Required commands: --map <file path>
=== INSTRUCTIONS ===
Argument information for this program can be accessed with a call
to the program using --help or -h.
=== REQUIRED COMMANDS ===
MAP FILE:
--map <path to map file>
Commands to the program MUST include the path to a map file.
__________________________________________________
pc_workhorse.py
=== DESCRIPTION ===
This program does all the work that is instantiated by a call to
plink_conditional.py.
=== INSTRUCTIONS ===
DO NOT MAKE CALLS TO PC_WORKHORSE.PY
pc_toolbox.py
=== DESCRIPTION ===
This program contains functions which are used by multiple programs
in the suite.
=== INSTRUCTIONS ===
DO NOT MAKE CALLS TO PC_TOOLBOX.PY
=== ADVANCED USAGE INFORMATION ===
ONLY Users who know what they are doing:
CHANGING DEFAULTS
It might be easier for you to change some of the defaults at
the top of the program file than type in all of the command
flags every time. If you would like, you may do this for
plink_conditional.py or parse_log.py.
DO NOT CHANGE DEFAULTS IN PC_WORKHORSE.PY.
Open the program file in a text editor. The default values are
set beneath the list of globals. The ones in which you may be
interested are marked between rows of asterisks (*). DO NOT
alter anything (i.e. the name of the default) on the left side
of the equals sign, only change the value on the right side.
PLINK Constructions:
plink_conditional.py only cares about 2 things:
1) The column names of the PLINK results file. The program
will look for columns titled 'P' and 'SNP' in order to
perform its search for the most significant SNP and then
use those column titles to instruct locuszoom. If a PLINK
construction results in differently titled columns, there
will be problems.
2) The file extension of the PLINK results file containing the
SNP and P-value information. If a PLINK construction alters
the file extension, there will be problems.
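To make the column-name dependency concrete, here is a minimal sketch of a search for the lowest p-value in a whitespace-delimited PLINK results file. It is not the code used by the suite, and lowest_p_snp is a hypothetical name. Note how a results file without 'SNP' or 'P' headings fails immediately.

```python
def lowest_p_snp(lines):
    """Return (snp, p) for the smallest value in the 'P' column, or None."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return None
    header = rows[0]
    # list.index raises ValueError if either column title is absent.
    snp_col, p_col = header.index("SNP"), header.index("P")
    best = None
    for fields in rows[1:]:
        try:
            p = float(fields[p_col])
        except ValueError:
            continue  # PLINK writes NA when no p-value is available
        if best is None or p < best[1]:
            best = (fields[snp_col], p)
    return best
```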
=== CHANGE LOG ===
**10/26/2011:
Changes made to condition-list functionality to avoid overwriting output produced during a
"traditional" run of plink_conditional without providing a condition-list.
Bug Fixed: condition-list SNPs accidentally overwritten in leastP_SNPs file after first run.
**10/27/2011:
Changes made to logging functionality to permit condition_list existence and flag information
to be extracted during parse_log.
**10/28/2011
Changed name of worker program from plink_association.py to pc_workhorse.py to better deter
users from opening it.
**10/30/2011
Updated Advanced Usage section of README file to further explain PLINK construction options.
**11/1/2011 - 11/13/2011
Additional program, region_wrap, added to package in order to facilitate the
reading of regional information by plink_conditional from a region-list.
Changes made to plink_conditional to allow for use of phenotype related PLINK
functionality during tests.
**11/22/2011 - NOW v1.0.3
Bug Fixed: region_wrap misinterpreting command flags.
Bug Fixed: filenames of locuszoom pdfs contained '.assoc'
Additional programs, meta_yank and assoc_yank, added to package to facilitate production of
condition-lists.
**1/17/2012 - NOW v1.1.0
Additional program, pc_toolbox, added to package to improve modularity
Changes made to pc_workhorse to allow for identification of most significant SNP
after final run.
Changes made to entire suite to allow use of chromosomal band information during naming.
Changes made to pc_workhorse to include y-axis label in locuszoom plots
Changes made to meta_yank to include frequency information for selected SNPs
Changes made to parse_log to include frequency information for significant SNPs
**2/2/2012
Changes made to pc_workhorse to allow choice between two SNPs with p=0 using |t-statistic|.