Pipeline Somático - Anotação ensembl-vep
brew install aria2
aria2c -x 8 https://storage.googleapis.com/puga-reference/homo_sapiens_merged_110_GRCh37.zip
unzip homo_sapiens_merged_110_GRCh37.zip
- download UCSC hg19.fa.gz
aria2c -x 5 https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
- descompactar
gunzip hg19.fa.gz
- mover para o diretório homo_sapiens_merged
mv hg19.fa homo_sapiens_merged
chmod 777 homo_sapiens_merged/
docker pull ensemblorg/ensembl-vep
- WP312.filtered.vcf.gz
- WP312.filtered.vcf.gz.tbi
https://drive.google.com/drive/u/0/folders/1m2qmd0ca2Nwb7qcK58ER0zC8-1_9uAiE
mkdir -p vep_output
chmod 777 vep_output
docker run -it --rm -v $(pwd):/data ensemblorg/ensembl-vep vep \
-i /data/WP312.filtered.vcf.gz \
-o /data/vep_output/WP312.filtered.vep.tsv \
--assembly GRCh37 \
--merged -pick \
--pick_allele \
--force_overwrite \
--tab --symbol --distance 0 \
--fields "Location,SYMBOL,Consequence,Feature,Amino_acids,CLIN_SIG" \
--individual all \
--dir_cache /data/ \
--cache --offline \
--check_existing \
--fork 10 \
--fasta /data/homo_sapiens_merged/hg19.fa
Todas as colunas de anotação padrão do VEP usando --everything
, maior uso dos cpus do gitpod --fork 16
e diminuição do buffer de 5k para 1k --buffer_size 200
.
docker run -it --rm -v $(pwd):/data ensemblorg/ensembl-vep vep \
-i /data/lite.vcf.gz \
-o /data/vep_output/lite.vep.vcf \
--assembly GRCh37 \
--merged \
--fork 16 \
--buffer_size 200 \
--force_overwrite \
--dir_cache /data/ \
--offline \
--cache \
--no_intergenic \
--distance 0 \
--pick \
--individual all \
--vcf \
--symbol \
--biotype \
--hgvs \
--numbers \
--af \
--af_gnomadg \
--variant_class \
--sift b \
--polyphen b \
--check_existing \
--fields "Location,SYMBOL,Consequence,Feature,BIOTYPE,HGVSc,HGVSp,EXON,INTRON,VARIANT_CLASS,SIFT,PolyPhen,AF,gnomADg_AF,CLIN_SIG,SOMATIC,PHENO" \
--fasta /data/homo_sapiens_merged/hg19.fa
- git clone Htslib
git clone --recurse-submodules https://github.com/samtools/htslib.git
- git clone cftools
git clone https://github.com/samtools/bcftools.git
- entrar no diretório bcftools
cd bcftools/
- compilar
make
- testar
./bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.18-25-g44deedcd (using htslib 1.18-52-g2140d03e)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
head view VCF/BCF file headers
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
-- Plugins (collection of programs for calling, file manipulation & analysis)
0 plugins available, run "bcftools plugin -l" for help
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
export
- colocando o plugin do bcftools no path
export BCFTOOLS_PLUGINS=//workspace/somatico-vep/bcftools/plugins/
- testando o bcftools +split-vep
cd /workspace/somatico-vep
./bcftools/bcftools +split-vep
About: Query structured annotations such INFO/CSQ created by bcftools/csq or VEP. For more
more information and pointers see http://samtools.github.io/bcftools/howtos/plugin.split-vep.html
Usage: bcftools +split-vep [Plugin Options]
Plugin options:
-a, --annotation STR INFO annotation to parse [CSQ]
-A, --all-fields DELIM Output all fields replacing the -a tag ("%CSQ" by default) in the -f
filtering expression using the output field delimiter DELIM. This can be
"tab", "space" or an arbitrary string.
-c, --columns [LIST|-][:TYPE] Extract the fields listed either as 0-based indexes or names, "-" to extract all
fields. See --columns-types for the defaults. Supported types are String/Str,
Integer/Int and Float/Real. Unlisted fields are set to String. Existing header
definitions will not be overwritten, remove first with `bcftools annotate -x`
--columns-types -|FILE Pass "-" to print the default -c types or FILE to override the presets
-d, --duplicate Output per transcript/allele consequences on a new line rather rather than
as comma-separated fields on a single line
-f, --format STR Create non-VCF output; similar to `bcftools query -f` but drops lines w/o consequence
-g, --gene-list [+]FILE Consider only features listed in FILE, or prioritize if FILE is prefixed with "+"
--gene-list-fields LIST Fields to match against by the -g list, by default gene names [SYMBOL,Gene,gene]
-H, --print-header Print header
-l, --list Parse the VCF header and list the annotation fields
-p, --annot-prefix STR Before doing anything else, prepend STR to all CSQ fields to avoid tag name conflicts
-s, --select TR:CSQ Select transcripts to extract by type and/or consequence severity. (See also -S and -x.)
TR, transcript: worst,primary(*),all [all]
CSQ, consequence: any,missense,missense+,etc [any]
(*) Primary transcripts have the field "CANONICAL" set to "YES"
-S, --severity -|FILE Pass "-" to print the default severity scale or FILE to override
the default scale
-u, --allow-undef-tags Print "." for undefined tags
-x, --drop-sites Drop sites without consequences (the default with -f)
-X, --keep-sites Do not drop sites without consequences (the default without -f)
Common options:
-e, --exclude EXPR Exclude sites and samples for which the expression is true
-i, --include EXPR Include sites and samples for which the expression is true
--no-version Do not append version and command line to the header
-o, --output FILE Output file name [stdout]
-O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v]
-r, --regions REG Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-t, --targets REG Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--write-index Automatically index the output files [off]
Examples:
# List available fields of the INFO/CSQ annotation
bcftools +split-vep -l file.vcf.gz
# List the default severity scale
bcftools +split-vep -S -
# Extract Consequence, IMPACT and gene SYMBOL of the most severe consequence into
# INFO annotations starting with the prefix "vep". For brevity, the columns can
# be given also as 0-based indexes
bcftools +split-vep -c Consequence,IMPACT,SYMBOL -s worst -p vep file.vcf.gz
bcftools +split-vep -c 1-3 -s worst -p vep file.vcf.gz
# Same as above but use the text output of the "bcftools query" format
bcftools +split-vep -s worst -f '%CHROM %POS %Consequence %IMPACT %SYMBOL\n' file.vcf.gz
# Print all subfields (tab-delimited) in place of %CSQ, each consequence on a new line
./bcftools/bcftools +split-vep -l vep_output/lite.vep.vcf | cut -f2 | tr '\n\r' '\t' | awk '{print("CHROM\tPOS\tREF\tALT\t"$0"FILTER\tGT\tDP\tAD\tGT\tDP\tAD")}' > vep_output/lite.vep.tsv
./bcftools/bcftools +split-vep -f '%CHROM\t%POS\t%REF\t%ALT\t%CSQ\t%FILTER\t[%GT\t%DP\t%AD\t]\n' -d -A tab vep_output/lite.vep.vcf -p x >> vep_output/lite.vep.tsv