This is the memory-lite version of VarGeno.
- A modern, C++11 ready compiler, such as g++ version 4.9 or higher.
- The cmake build system (only needed to install the SDSL library; if SDSL is already installed, cmake is not needed).
- A 64-bit operating system. Mac OS X and Linux are currently supported.
Please install VarGeno before installing VarGeno-Lite.
To install VarGeno-Lite:
cd vargeno/vargeno_lite
make all
You should then see vargeno_lite and gbf_lite in the vargeno/vargeno_lite directory.
VarGeno takes as input:
- A reference genome sequence in FASTA file format.
- A list of SNPs to be genotyped, in UCSC text file format. VCF format support is coming soon.
- Sequencing reads from the donor genome in FASTQ file format. If you have multiple FASTQ files, please cat them into one file.
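As a minimal sketch of the concatenation step, the snippet below merges two FASTQ files into one and counts the reads. The file names lane1.fq and lane2.fq and their contents are hypothetical stand-ins, and the read count assumes the common layout of exactly four lines per FASTQ record.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Two tiny stand-in FASTQ files (hypothetical names and contents).
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/lane1.fq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/lane2.fq"

# Concatenate them into the single file that is passed to the geno command.
cat "$workdir/lane1.fq" "$workdir/lane2.fq" > "$workdir/reads.fq"

# Assuming 4 lines per FASTQ record, lines / 4 gives the number of reads.
n_reads=$(( $(wc -l < "$workdir/reads.fq") / 4 ))
echo "reads.fq holds $n_reads reads"
```

With real data, only the cat line is needed, with your own file names in place of the stand-ins.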
Before genotyping an individual, you must construct indices for the reference using the following commands:
vargeno_lite ucscd ref.fa snp.txt ref.dict snp.dict
vargeno_lite filt ref.dict snp_pos ref.filt.dict
gbf_lite snp ref.fa snp.txt snp.bf
This constructs the reference dictionaries ref.dict and snp.dict, the filtered dictionary ref.filt.dict together with its Bloom filter ref.filt.bf, the SNP Bloom filter snp.bf, and a file with the chromosome lengths, ref.fa.chrlens.
To perform the genotyping:
vargeno_lite geno ref.filt.dict snp.dict reads.fq ref.fa.chrlens ref.filt.bf snp.bf result.out
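Because the geno command reads six files produced by earlier steps, it can help to confirm they all exist before starting a run. The sketch below is one possible preflight check. The helper name missing_geno_inputs is hypothetical and not part of VarGeno, and the file names are simply those used in the commands above.

```shell
# Hypothetical helper (not part of VarGeno): print every geno input
# that is missing from the current directory, one name per line.
missing_geno_inputs() {
    for f in ref.filt.dict snp.dict reads.fq ref.fa.chrlens ref.filt.bf snp.bf; do
        [ -e "$f" ] || echo "$f"
    done
}

missing=$(missing_geno_inputs)
if [ -n "$missing" ]; then
    # Word splitting is intentional here: it puts all names on one line.
    echo "missing geno inputs:" $missing >&2
else
    echo "all geno inputs present"
fi
```

If your files have different names, edit the list inside the loop to match.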
The VarGeno genotyping output file contains 4 tab-separated fields for each SNP:
- chromosome id
- genome position (1-based): The first two fields together uniquely identify a SNP in the input SNP list.
- genotypes: 0/0, 0/1 or 1/1
- quality score in [0,1]: higher quality score means more confident genotyping result
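Since the output is plain tab-separated text, it can be post-processed with standard tools. The following sketch filters calls by quality score and counts the remaining genotypes. The three rows are made up purely to illustrate the 4-field layout (they are not real VarGeno output, and the exact form of the chromosome id is an assumption), and the 0.9 threshold is arbitrary.

```shell
# Scratch directory for the illustration.
outdir=$(mktemp -d)

# Made-up rows in the 4-field layout: chromosome, position, genotype, quality.
printf 'chr22\t16050075\t0/0\t0.98\n'  > "$outdir/result.out"
printf 'chr22\t16050115\t0/1\t0.42\n' >> "$outdir/result.out"
printf 'chr22\t16050213\t1/1\t0.91\n' >> "$outdir/result.out"

# Keep only calls whose quality score is at least min_q (arbitrary threshold).
awk -F'\t' -v min_q=0.9 '$4 >= min_q' "$outdir/result.out" > "$outdir/confident.out"

# Count how many of the retained calls fall into each genotype.
awk -F'\t' '{ n[$3]++ } END { for (g in n) print g, n[g] }' "$outdir/confident.out" | sort
```

The first awk command compares the fourth field numerically, so raising or lowering min_q trades the number of retained SNPs against confidence.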
In this example, we genotype 100 SNPs on human chromosome 22 with a small subset of 1000 Genomes Project Illumina sequencing reads. The whole process should finish in around a minute and requires XX GB RAM. Suppose VarGeno-Lite is installed in directory $VARGENO_LITE.
- go to test data directory
cd $VARGENO_LITE/../test
- pre-process the reference and SNP list to generate indices:
$VARGENO_LITE/vargeno_lite ucscd chr22.fa snp.txt ref.dict snp.dict
- generate lite version dictionary and Bloom filter
$VARGENO_LITE/vargeno_lite filt ref.dict snp_pos ref.filt.dict
Note this command will automatically generate the lite version Bloom filter named ref.filt.bf
- generate SNP Bloom filter
$VARGENO_LITE/gbf_lite snp chr22.fa snp.txt snp.bf
- genotype variants:
$VARGENO_LITE/vargeno_lite geno ref.filt.dict snp.dict reads.fq chr22.fa.chrlens ref.filt.bf snp.bf result.out
[Warning] The dictionaries and Bloom filters generated by VarGeno are not compatible with VarGeno-Lite.
If you use VarGeno in your research, please cite
- Chen Sun and Paul Medvedev, Accelerating SNP genotyping from whole genome sequencing data for bedside diagnostics
VarGeno's algorithm and code are built on top of LAVA's, and VarGeno reuses a lot of LAVA's code. It also uses some code from the AllSome project.
- Shajii A, Yorukoglu D, William Yu Y, Berger B, Fast genotyping of known SNPs through approximate k-mer matching, Bioinformatics. 2016 32(17):i538-i544. Code is available here.