| Infrastructure | CPU | Memory (GB) | Storage (GB) | Step | Time |
|---|---|---|---|---|---|
| gitpod | 16 | 64 | ?DD 30 | FASTQ -> SAM | 55-61 min (chr9) |
| | | | | SAM -> BAM | 13-18 min |
| | | | | BAM -> SORT_BAM | ~6-12 min |
| | | | | SORT_BAM -> RMDUP | ~12 min |
| local server | 16 | 32 | SSD 30 | all steps | ~2h (hg19) |
-
The NCBI file for sample WP312 (tumor)
-
SRA project: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP190048&o=bases_l%3Aa
-
ID: SRR8856724
-
You need to install sratoolkit (inside gitpod)
-
# install using brew (gitpod)
brew install sratoolkit
# parallel-fastq-dump
# https://github.com/rvalieris/parallel-fastq-dump
pip install parallel-fastq-dump
# in case vdb-config does not work
wget -c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz
export PATH=$PATH:/workspace/somaticoEP1/sratoolkit.3.0.0-ubuntu64/bin/
echo "Aexyo" | sratoolkit.3.0.0-ubuntu64/bin/vdb-config
-
# fastq-dump: command that downloads the file using the sample's SRR ID
# --sra-id: SRR
# --threads 4: parallelizes the download into 4 parts
# --gzip: compressed FASTQ output
# --split-files: splits the FASTQ files into 1 and 2 (paired)
time parallel-fastq-dump --sra-id SRR8856724 --threads 4 --outdir ./ --split-files --gzip
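Once the download finishes, a quick sanity check is to confirm that both paired files hold the same number of reads (each FASTQ record spans 4 lines). A minimal sketch on hypothetical toy files standing in for SRR8856724_1/2.fastq.gz:

```shell
# create two toy gzipped FASTQ files (1 read each) to stand in for
# the real SRR8856724_1.fastq.gz / SRR8856724_2.fastq.gz downloads
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > toy_1.fastq.gz
printf '@r1/2\nTGCA\n+\nIIII\n' | gzip > toy_2.fastq.gz

# a FASTQ record is 4 lines, so reads = lines / 4
n1=$(( $(zcat toy_1.fastq.gz | wc -l) / 4 ))
n2=$(( $(zcat toy_2.fastq.gz | wc -l) / 4 ))
echo "R1=$n1 R2=$n2"
[ "$n1" -eq "$n2" ] && echo "paired files match"
```

If the counts differ, the download was likely truncated and should be resumed (`wget -c` / re-running parallel-fastq-dump).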
-
Google Colab sratoolkit (alternative method)
-
-
%%bash
# download the ubuntu 64-bit binary
wget -c https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
# -z filter through gzip
# -x extract
# -v verbose
# -f archive file name
tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz
# run once to validate/initialize the configuration
./sratoolkit.3.0.0-ubuntu64/bin/vdb-config
# add the programs directory to the PATH
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin/
# parallel download
parallel-fastq-dump --sra-id SRR8856724 --threads 6 --outdir ./ --split-files --gzip --tmpdir /content/
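The `export PATH` line above only lasts for the current shell, so it helps to confirm the directory really made it onto the PATH before launching a long download. A small sketch using a hypothetical `/tmp/demo-bin` directory and `demo-tool` script:

```shell
# put a tiny executable in a new directory (names are hypothetical)
mkdir -p /tmp/demo-bin
printf '#!/bin/sh\necho ok\n' > /tmp/demo-bin/demo-tool
chmod +x /tmp/demo-bin/demo-tool

# append the directory to PATH, then confirm the binary resolves
export PATH=$PATH:/tmp/demo-bin
command -v demo-tool && demo-tool
```

The same `command -v` check works for `vdb-config` or `parallel-fastq-dump` after the exports above.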
-
The hg38 genome references (FASTA, VCFs)
-
The reference files: Panel of Normals (PoN), gnomAD AF and ExAC common:
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz.tbi
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz.tbi
-
The human genome hg38 in FASTA format
-
UCSC hg38 download directory: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
-
chr9.fa.gz: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr9.fa.gz
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr9.fa.gz
-
BWA for mapping the FASTQ files (gitpod)
- Installing BWA on gitpod
brew install bwa
- Google Colab
Open a code cell
!sudo apt-get install bwa
-
BWA index of the chr9.fa.gz file
# decompress the gz file
gunzip chr9.fa.gz
# bwa index to index the .fa file (~5 min)
bwa index chr9.fa
-
Samtools faidx
- Install on Gitpod
brew install samtools
- Install on Google Colab
!sudo apt-get install samtools
- Samtools faidx chr9.fa
samtools faidx chr9.fa
-
BWA-MEM to run the alignment (FASTQ -> SAM)
NOME=WP312; Biblioteca=Nextera; Plataforma=illumina;
bwa mem -t 16 -M -R "@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma" \
chr9.fa \
SRR8856724_1.fastq.gz \
SRR8856724_2.fastq.gz > WP312.sam
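A malformed `-R` read-group string makes downstream GATK steps fail, so it is worth previewing before the long alignment run. The `\t` escapes expand to real tabs; a quick sketch using the same variables:

```shell
# same variables as the bwa mem call above
NOME=WP312; Biblioteca=Nextera; Plataforma=illumina

# printf expands the \t escapes, showing the @RG header line
# roughly as it will appear in the SAM output
RG="@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma"
printf "$RG\n"
```

The printed line should have tab-separated `ID:`, `SM:`, `LB:` and `PL:` fields with no stray spaces.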
samtools: fixmate, sort and index
# -@ number of threads used
time samtools fixmate -@10 WP312.sam WP312.bam
# sorting the fixmate output
time samtools sort -O bam -@6 -m2G -o WP312_sorted.bam WP312.bam
# indexing the sorted BAM file
time samtools index WP312_sorted.bam
# in a target sequencing approach we use rmdup to remove PCR duplicates
time samtools rmdup WP312_sorted.bam WP312_sorted_rmdup_F4.bam
# indexing the rmdup BAM file
time samtools index WP312_sorted_rmdup_F4.bam
Alternative: combine with pipes: bwa + samtools view and sort
bwa mem -t 10 -M -R "@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma" \
chr9.fa SRR8856724_1.fastq.gz SRR8856724_2.fastq.gz | \
samtools view -F4 -Sbu -@2 - | \
samtools sort -m4G -@2 -o WP312_sorted.bam
NOTE: if you use the alternative option, don't forget to run samtools for the remaining steps: rmdup and index (of the rmdup output).
- Coverage - make BED files
Installing bedtools
Gitpod
brew install bedtools
Google Colab
!sudo apt-get install bedtools
Generating a BED from the BAM file
bedtools bamtobed -i WP312_sorted_rmdup_F4.bam > WP312_sorted_rmdup.bed
bedtools merge -i WP312_sorted_rmdup.bed > WP312_sorted_rmdup_merged.bed
bedtools sort -i WP312_sorted_rmdup_merged.bed > WP312_sorted_rmdup_merged_sorted.bed
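`bedtools merge` collapses overlapping intervals from the BED into single regions. As a toy illustration of the same idea (not a replacement for bedtools), an awk sketch over a hypothetical 3-interval sorted BED where the first two intervals overlap:

```shell
# toy sorted BED: the first two intervals overlap, the third stands alone
printf 'chr9\t10\t50\nchr9\t40\t100\nchr9\t200\t250\n' > toy.bed

# merge overlapping intervals, the same idea as `bedtools merge`
awk -F'\t' 'NR==1 {c=$1; s=$2; e=$3; next}
     $1==c && $2<=e { if ($3>e) e=$3; next }
     { print c"\t"s"\t"e; c=$1; s=$2; e=$3 }
     END { print c"\t"s"\t"e }' toy.bed
```

This prints two merged intervals, `chr9 10 100` and `chr9 200 250`. Note that merging assumes coordinate-sorted input, which is why the BAM was sorted earlier in the pipeline.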
Mean coverage
Fetching the view -F4 file of the WP312 BAM.
git clone https://github.com/circulosmeos/gdown.pl.git
./gdown.pl/gdown.pl https://drive.google.com/file/d/1pTMpZ2eIboPHpiLf22gFIQbXU2Ow26_E/view?usp=drive_link WP312_sorted_rmdup_F4.bam
./gdown.pl/gdown.pl https://drive.google.com/file/d/10utrBVW-cyoFPt5g95z1gQYQYTfXM4S7/view?usp=drive_link WP312_sorted_rmdup_F4.bam.bai
bedtools coverage -a WP312_sorted_rmdup_merged_sorted.bed \
-b WP312_sorted_rmdup_F4.bam -mean \
> WP312_coverageBed.bed
Filter for mean coverage >= 20
cat WP312_coverageBed.bed | \
awk -F "\t" '{if($4>=20){print}}' \
> WP312_coverageBed20x.bed
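The awk filter keeps intervals whose 4th column (the mean coverage written by `coverage -mean`) is at least 20. A toy run on a hypothetical 3-line coverage file:

```shell
# toy coverageBed output: chrom, start, end, mean coverage (hypothetical values)
printf 'chr9\t100\t200\t35.2\nchr9\t300\t400\t12.0\nchr9\t500\t600\t20.0\n' > toy_cov.bed

# keep only intervals with mean coverage >= 20 (same filter as above)
awk -F "\t" '{if($4>=20){print}}' toy_cov.bed > toy_cov20x.bed
cat toy_cov20x.bed
```

Only the first and third intervals survive; the 12.0x region is dropped, which is what restricts the later Mutect2 call to well-covered regions.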
- GATK4 installation (Mutect2 with PoN)
Download
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.2.0/gatk-4.2.2.0.zip
Unzip
unzip gatk-4.2.2.0.zip
Generate the .dict file
./gatk-4.2.2.0/gatk CreateSequenceDictionary -R chr9.fa -O chr9.dict
Generate the chr9 interval_list
./gatk-4.2.2.0/gatk ScatterIntervalsByNs -R chr9.fa -O chr9.interval_list -OT ACGT
Convert the BED to an interval_list
./gatk-4.2.2.0/gatk BedToIntervalList -I WP312_coverageBed20x.bed \
-O WP312_coverageBed20x.interval_list -SD chr9.dict
GATK4 - CalculateContamination
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I WP312_sorted_rmdup_F4.bam \
-V af-only-gnomad.hg38.vcf.gz \
-L WP312_coverageBed20x.interval_list \
-O WP312.table
./gatk-4.2.2.0/gatk CalculateContamination \
-I WP312.table \
-O WP312.contamination.table
GATK4 - MuTect2 Call
./gatk-4.2.2.0/gatk Mutect2 \
-R chr9.fa \
-I WP312_sorted_rmdup_F4.bam \
--germline-resource af-only-gnomad.hg38.vcf.gz \
--panel-of-normals 1000g_pon.hg38.vcf.gz \
-L WP312_coverageBed20x.interval_list \
-O WP312.somatic.pon.vcf.gz
GATK4 - MuTect2 FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R chr9.fa \
-V WP312.somatic.pon.vcf.gz \
--contamination-table WP312.contamination.table \
-O WP312.filtered.pon.vcf.gz
Download the hg19 VCF files from the old analysis of the LMA Brasil project:
https://drive.google.com/drive/folders/1m2qmd0ca2Nwb7qcK58ER0zC8-1_9uAiE?usp=sharing
WP312.filtered.vcf.gz
WP312.filtered.vcf.gz.tbi
# macOS binary
wget -c https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
# linux binary
wget -c https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver
Make the liftOver file executable
chmod +x liftOver
To test it, run:
./liftOver
liftOver - Move annotations from one assembly to another
usage:
liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.
***********************************************************************
WARNING: liftOver was only designed to work between different
assemblies of the same organism. It may not do what you want
if you are lifting between different organisms. If there has
been a rearrangement in one of the species, the size of the
region being mapped may change dramatically after mapping.
***********************************************************************
options:
-minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
-gff File is in gff/gtf format. Note that the gff lines are converted
separately. It would be good to have a separate check after this
that the lines that make up a gene model still make a plausible gene
after liftOver
-genePred - File is in genePred format
-sample - File is in sample format
-bedPlus=N - File is bed N+ format (i.e. first N fields conform to bed format)
-positions - File is in browser "position" format
-hasBin - File has bin value (used only with -bedPlus)
-tab - Separate by tabs rather than space (used only with -bedPlus)
-pslT - File is in psl format, map target side only
-ends=N - Lift the first and last N bases of each record and combine the
result. This is useful for lifting large regions like BAC end pairs.
-minBlocks=0.N Minimum ratio of alignment blocks or exons that must map
(default 1.00)
-fudgeThick (bed 12 or 12+ only) If thickStart/thickEnd is not mapped,
use the closest mapped base. Recommended if using
-minBlocks.
-multiple Allow multiple output regions
-noSerial In -multiple mode, do not put a serial number in the 5th BED column
-minChainT, -minChainQ Minimum chain size in target/query, when mapping
to multiple output regions (default 0, 0)
-minSizeT deprecated synonym for -minChainT (ENCODE compat.)
-minSizeQ Min matching region size in query with -multiple.
-chainTable Used with -multiple, format is db.tablename,
to extend chains from net (preserves dups)
-errorHelp Explain error messages
Convert the positions from hg19 to hg38 using hg19ToHg38.over.chain.gz
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip hg19ToHg38.over.chain.gz
NOTE: in the old VCF files from the LMA Brasil project, the reference name does not include the chr prefix,
only the number or letter of the human chromosome; this can cause a conflict when GATK checks whether your reference matches the VCF's reference.
- Let's add the chr prefix to the old VCF file and save a new one.
# grab only the header
zgrep "\#" WP312.filtered.vcf.gz > header.txt
zgrep -v "\#" WP312.filtered.vcf.gz | awk '{print("chr"$0)}' > variants.txt
cat header.txt variants.txt > WP312.filtered.chr.vcf
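The same header/variants split can be rehearsed on a toy VCF before touching the real file. A minimal sketch (file names and the variant line are hypothetical; gzip stands in for bgzip here):

```shell
# toy hg19-style VCF without the chr prefix on the variant line
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n9\t12345\t.\tT\tA\n' \
  | gzip > toy.vcf.gz

# header lines keep their leading #, variant lines gain the chr prefix
zgrep "\#" toy.vcf.gz > header.txt
zgrep -v "\#" toy.vcf.gz | awk '{print("chr"$0)}' > variants.txt
cat header.txt variants.txt > toy.chr.vcf
grep "^chr9" toy.chr.vcf
```

The variant line comes out as `chr9 12345 . T A` while the `##` and `#CHROM` header lines are untouched, which is exactly what the GATK reference check needs.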
- Download the complete hg38.fa genome file
# files with all the chromosomes
# https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Decompress with gunzip
gunzip hg38.fa.gz
- Generate the .dict for hg38.fa.
./gatk-4.2.2.0/gatk CreateSequenceDictionary -R hg38.fa -O hg38.dict
- Run GATK's LiftoverVcf.
# remember: the VCF must be the one with the chr prefix
./gatk-4.2.2.0/gatk LiftoverVcf \
-I WP312.filtered.chr.vcf \
-O liftOver_WP312_hg19Tohg38.vcf.gz \
--CHAIN hg19ToHg38.over.chain \
--REJECT liftOver_Reject_WP312_hg19Tohg38.vcf \
-R hg38.fa
Let's take the two VCFs that are on the same hg38 version and compare the number of variants and the % match between the files.
Note: the comparison is based only on the references that match (they must be present in both VCF files).
Gitpod install
brew install vcftools
Colab install
!sudo apt-get install vcftools
Note: to use vcf-compare, the VCF files must be bgzip-compressed and have a tabix index.
I downloaded the WP312.filtered.pon.vcf.gz file from the Google Drive share and placed it on gitpod inside a directory called hg38-vcf-EP1
mkdir hg38-vcf-EP1
mv WP312.filtered.pon.vcf.gz WP312.filtered.pon.vcf.gz.tbi hg38-vcf-EP1/
Run vcf-compare
- vcf-compare file1.vcf file2.vcf ... fileN.vcf
- liftOver_WP312_hg19Tohg38.vcf.gz: the file we converted from hg19 to hg38
- hg38-vcf-EP1/WP312.filtered.pon.vcf.gz: the file from class EP01 (part one)
vcf-compare liftOver_WP312_hg19Tohg38.vcf.gz hg38-vcf-EP1/WP312.filtered.pon.vcf.gz
vcf-compare output
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN 1 .. number of sites unique to this particular combination of files
#VN 2- .. combination of files and space-separated number, a fraction of sites in the file
VN 166 hg38-vcf-EP1/WP312.filtered.pon.vcf.gz (0.2%) liftOver_WP312_hg19Tohg38.vcf.gz (1.0%)
VN 16971 liftOver_WP312_hg19Tohg38.vcf.gz (99.0%)
VN 78215 hg38-vcf-EP1/WP312.filtered.pon.vcf.gz (99.8%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN Number of REF matches: 165
SN Number of ALT matches: 163
SN Number of REF mismatches: 1
SN Number of ALT mismatches: 2
SN Number of samples in GT comparison: 0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
SN Number of lost sites: 2 0.0% 17139 17137 liftOver_WP312_hg19Tohg38.vcf.gz
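The VN and SN sections can be pulled out of the report exactly as its header comments suggest, using `grep ^VN | cut -f 2-`. A sketch over a saved copy of the report (the `report.txt` file name and its toy contents are hypothetical stand-ins):

```shell
# toy copy of a vcf-compare report: one VN line and one SN line,
# tab-separated as vcf-compare emits them
printf 'VN\t16971\tliftOver_WP312_hg19Tohg38.vcf.gz (99.0%%)\nSN\tNumber of REF matches:\t165\n' > report.txt

# extract just the Venn-diagram numbers, as the report header suggests
grep ^VN report.txt | cut -f 2-
```

In practice you would redirect the real run, e.g. `vcf-compare a.vcf.gz b.vcf.gz > report.txt`, and the same grep/cut works for the `^SN` summary lines.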
zgrep to separate the # header and the chr9 variants from the liftOver file:
zgrep "^\#\|chr9" liftOver_WP312_hg19Tohg38.vcf.gz > liftOver_WP312_hg19Tohg38_chr9.vcf
Compress the VCF file with the bgzip command:
bgzip liftOver_WP312_hg19Tohg38_chr9.vcf
Run the tabix command to index the compressed VCF file:
tabix -p vcf liftOver_WP312_hg19Tohg38_chr9.vcf.gz
Run a new comparison using the liftOver VCF restricted to the chr9 variants:
vcf-compare liftOver_WP312_hg19Tohg38_chr9.vcf.gz hg38-vcf-EP1/WP312.filtered.pon.vcf.gz
- Documentação SRA-ToolKit: https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
- fastq-dump: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
- fastq-dump parallel: https://github.com/rvalieris/parallel-fastq-dump
- LMA Brasil - VCFs hg19 - https://drive.google.com/drive/folders/1m2qmd0ca2Nwb7qcK58ER0zC8-1_9uAiE?usp=sharing
- UCSC liftOver - https://hgdownload.cse.ucsc.edu/admin/exe/
- liftOver Doc - https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#Liftover
- hg38ToHg19 (hg38ToHg19.over.chain.gz) - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/
- hg19ToHg38 (hg19ToHg38.over.chain.gz) - https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver
- Vcftools - https://vcftools.github.io/index.html