High precision gene, pathway and cross scoring for GWAS summary statistics in python
- Internal quad and multi-precision arithmetic support for high precision gene scoring via exact (up to 100 digits) or approximate CDF calculations
- Fast random access to SNP reference panel genomic data with minimal memory footprint
- Parallelization over chromosomes and/or genes
- GPU acceleration support via cupy
- Gene-wise coherence test between two GWAS
- Support for SNP selection via external data
The full documentation can be found here. For quickstart, continue below.
If you make use of PascalX for your research, please cite the PascalX publication
Krefl D., Brandulas Cammarata A., Bergmann S.
PascalX: a python library for GWAS gene and pathway enrichment tests
Bioinformatics, btad296
doi.org/10.1093/bioinformatics/btad296
If you make use of the X-scorer (gene-wise coherence test between two GWAS based on the product-normal distribution), please also cite the work
Krefl D., Bergmann S.
Cross-GWAS coherence test at the gene and pathway level
PLOS Computational Biology 18(9): e1010517
doi.org/10.1371/journal.pcbi.1010517
Requirements:
- Python3 with development headers
- g++, make
- BOOST libraries
Install of requirements on Debian/Ubuntu:
sudo apt install python3 python3-dev python3-setuptools python3-pip g++ make libboost-all-dev wget
Export library path:
export LD_LIBRARY_PATH="/yourpath/PascalX/build/lib:$LD_LIBRARY_PATH"
Install PascalX via terminal:
git clone https://github.com/BergmannLab/PascalX.git
cd PascalX
make all
cd python
python3 setup.py install
Build the docker image
git clone https://github.com/BergmannLab/PascalX.git
cd PascalX
docker build . -t pascalx:latest
Run the image in interactive mode with the host directory /your/workdir
mounted as /data
using the command
docker run --mount src=/your/workdir,target=/data,type=bind -p 8888:8888 -it pascalx bash
Jupyter notebook comes pre-installed and is listening on port 8888
.
PascalX can be utilized directly from the command line via the python script pascalx
. To display available options and settings, execute
./pascalx -h
The script has global settings, and command specific settings for genescoring
and xscoring
. The latter can be displayed after specifying the required global (positional) arguments, for example
./pascalx ensemble.txt demo/EUR.simulated out.txt genescoring -h
The first global argument specifies the gene annotation file (will be downloaded automatically from ensembl biomart if the specified file does not exist), the second the reference panel (will be imported automatically from .vcf files if not imported yet), the third argument the file to store results in and the fourth the operation to perform (genescoring
or xscoring
). For example
./pascalx -c [1] -p 4 ensemble.txt demo/EUR.simulated out.txt genescoring -sh False -cr 0 -cp 4 gwasA.tsv.gz
The -c
options specifies a list of chromosomes to score (drop to score all) and -p
how many parallel compute threads to use. -sh
specifies if the GWAS file contains a header line, -cr
the column with rsids and -cp
the column with p-values.
Further examples on how to utilize the command-line interface are given in the demo/demo.sh
script.
Note: The command-line interface uses the saddle-point approximation to calculate CDFs.
Import:
from PascalX import genescorer
Scorer = genescorer.chi2sum()
Options:
window=50000
: Window to add to gene start and end position
varcutoff=0.99
: Variance to keep
MAF=0.05
: MAF threshold
Set reference panel:
1000 Genome Project reference data for the european subpopulation can be downloaded and converted via executing the script in the PascalX/misc folder as below (for GRCh37 replace 38 with 37).
bash get1KGGRCh38.sh pathtostore/ EUR 4 tped
The third parameter specifies the # of cpu cores to utilize. Note that the download and conversion can take several hours.
The reference data has to be loaded by the gene scorer using the method
Scorer.load_refpanel("path/filename",keepfile=None)
The filename is required to not contain the ending .chr#....
. If the corresponding reference data has not been imported yet, PascalX will try to import the data from filename.chr#.tped.gz
or filename.chr#.vcf.gz
files in path/
. For .tped import the genotype information has to be supplied in gzip compressed 1-2-coded plink tped files. The following plink options should do the job:
--recode 12 transpose
By default PascalX uses only one cpu core for the import. The number of cores to utilize can be set via the option parallel=
. Note that the import can take more than an hour per chromosome. A high parallel setting is therefore recommended. For import of allele information raw .vcf files have to be used. In this case the plink step can be skipped. Note that later calls of load_refpanel
will be fast as the converted reference data will be stored on disk and reloaded on the fly.
In order to import allele information into the reference panel, raw .vcf files have to be used for the import. Replace for this the tped
import script option above with vcf
. Note that to keep only a subset of samples under .vcf import, the keepfile=
option has to be set.
Load gene annotation:
Scorer.load_genome("path/filename",ccol=1,cid=5,csymb=7,cstx=3,cetx=4,cs=2,chrStart=3,NAgeneid='n/a',header=True)
filename
: Text file containing gene information. Columns need to be separated by tabs (\t
delimiter)
ccol
: Column containing chromosome
cid
: Column containing gene id
csymb
: Column containing gene symbol
cstx
: Column containing gene start position
cetx
: Column containing gene end position
cs
: Column containing strand
chrStart
: # of first characters to skip in ccol
NAgeneid
: Identifier for not available gene id
header
: First line is header ( True | False )
Note that PascalX can download human genome annotation automatically from ensembl.org BioMart.
from PascalX import genome
D = genome.genome()
D.get_ensembl_annotation(filename,genetype='protein_coding',version='GRCh38')
filename
: File to store the downloaded annotation
genetype
: String of comma separated ensembl gene types to download
version
: Annotation version to download ( GRCh37 | GRCh38 )
WARNING:
PascalX matches genes with rsids via position overlap in the loaded reference panel. Both datasets should be based on the same annotation (for instance both hg19) for consistency.
Load GWAS summary statistics to score:
Scorer.load_GWAS("path/filename",rscol=0,pcol=1,a1col=None,a2col=None)
filename
: Text file containing variant ids and p-values with columns separated by tabs (\t
delimiter)
rscol
: Column containing rsids
pcol
: Column containing p-values
a1col
: Column containing alternate allele information (None for ignoring)
a2col
: Column containing reference allele information (None for ignoring)
Exampe 1: Score all genes in annotation
R = Scorer.score_all()
Example 2: Score genes on chromosomes 21 and 22
R = Scorer.score_chr(chrs=[21,22])
Example 3: Score genes WDR12 and FARP2
R = Scorer.score(['WDR12','FARP2'])
Options:
method='saddle'
: Method to use for scoring
parallel=1
: Number of cores to use
nobar=False
: Disable progress bar
Return:
R = [R_SUCCESS,R_FAIL,R_TOTALFAIL]
R_SUCCESS = [ ['Symbol',p-value,Nsnps],...]
R_FAIL = [ ['Symbol',[infos] ] ,...]
R_TOTALFAIL = [ ['Symbol','Reason'] ,...]
R_SUCCESS
: List of successfully scored genes
R_FAIL
: List of genes for which the scoring failed due to non-convergence of the scoring algorithm
R_TOTALFAIL
: List of genes for which the scoring failed for other reasons
Re-scoring:
The genes in R_FAIL can be scored again with a manual choice of algorithm:
R = Scorer.rescore(R,method='ruben',mode='128b',reqacc=1e-32,intlimit=100000)
method= 'auto' | 'ruben' | 'davies' | 'satterthwaite' | 'saddle'
: Algorithm to use
mode='' | '128b' | '100d'
: internal precision (double | quad | 100 digits)
reqacc=float
: Requested accuracy.
intlimit=integer
: Cutoff
NOTE:
ruben
and davies
compute exactly up to requested precision (reqacc
). satterthwaite
is a second order approximation and pearson
a third order approximation. saddle
uses a saddle-point approximation and is for most use cases a good choice as it is fast and its accuracy comes close to the exact calculation. auto
tries to automatically select for given gene between davies and ruben to maximize throughput. If auto
fails, it is recommended to rescore the missing genes with ruben
and a very high intlimit
setting. saddle
is the default setting for the gene scorer.
TIP:
Ruben converges very slowly if the ratio between the largest and smallest eigenvalue is large. Try to reduce the varcutoff
parameter in this case.
Output:
The scoring results R_SUCCESS
can be saved in a tab separated file via:
scorer.save_scores('filename')
PascalX offers two different pathway scorers.
Initialization:
Define a gene scorer as above and score or load scored genes for a GWAS. The scorers are then initialized as follows:
Rank based scoring
Pscorer = pathway.chi2rank(scorer)
The rank scorer uniformizes the gene p-value distribution via ranking and aggregates p-values via inverse transform to chi^2 distributed random variables.
Monte-Carlo based scoring
Pscorer = pathway.chi2perm(scorer)
Gene p-values are chi^2 inverse transformed and the sum of chi^2 values for a given pathway is compared against a randomly generated gene sets of equal size.
Loading modules:
Sets of modules to be scored can be loaded from a tab-separated file via the command
M = P.load_modules("filename.tsv",ncol=0,fcol=2)
ncol=0
: Column of name/id
fcol=2
: Column of first gene symbol
Scoring:
RESULT = P.score(M)
RESULT = [ [[name,[symbols],p-value],...], R_FAIL ]
If the list R_FAIL is not empty, a gene re-scoring for the listed genes and re-scoring of affected pathways is recommended. Note that the genes and meta-genes with out gene score are removed from the pathway before pathway scoring.
PascalX offers two different GWAS cross scorers.
Coherence scorer:
from PascalX import xscorer
X = xscorer.zsum(leftTail=False)
X.load_genome('path/filename')
Note that the default initialization of the gene scoring above are used. leftTail=
sets the side to test. False corresponds to anti-coherence and True to coherence. A gene annotation has to be loaded as for the standard Genescorer.
X.load_GWAS('path/filenameA',name='GWAS A',rscol=0,pcol=1,bcol=2,a1col=3,a2col=4,header=False)
X.load_GWAS('path/filenameB',name='GWAS B',rscol=0,pcol=1,bcol=2,a1col=3,a2col=4,header=False)
In the GWAS data loading routine, we have to set in addition a name for each GWAS to be loaded via the name=
argument, and it is necessary to specify the column with the raw betas bcol=
.
It is recommended to perform the scoring after filtering for matching alleles
X.matchAlleles('GWAS A','GWAS B')
This requires that allele information has been loaded for both GWAS, and for the reference panel (using .vcf import).
Note that it is also recommended to jointly QQ normalize p-values:
X.jointlyRank('GWAS A','GWAS B')
The scoring is started via calling
R = X.score_all(E_A='GWAS A',E_B='GWAS B')
The return R
is as for the Genescorer class.
Ratio scorer:
As above, but with
X = xscorer.rsum(leftTail=False)
Scorer.plot_Manhattan(R[0])
Options:
sigLine=pval
: Draw significance threshold line at p-value (0 for off)
logsigThreshold=logpval
: Threshold above which to label genes (log p-value)
labelSig=True
: Plot names of genes above logsigThreshold (True | False)
labelList=[]
: List of gene symbols to label
PASCALX IS A RESEARCH LEVEL TOOL. NO WARRANTY OR GUARANTEE WHATSOEVER FOR ITS CORRECT FUNCTIONALITY IS GIVEN. YOU SHOULD PERFORM YOUR OWN CONSISTENCY CHECKS ON RESULTS PASCALX IMPLIES.