Reads to Analysis Ready Variants

Overview

This repository provides both an example and a scripted procedure for taking raw genomics reads to analysis-ready variants using Python, following GATK's best practices.

In particular, we follow GATK's steps for germline cohort data as outlined in Germline short variant discovery (SNPs + Indels).

This pipeline uses the latest versions of GATK and Hail at the time of writing:

  • GATK: 4.2.6.1
  • Hail: 0.2.95

Methodology

Obtaining data

All raw data was obtained from two resources, the 1000 Genomes Project FTP site and the Broad Institute b37 resource bundle, including unmapped read files for each sample (BAM), the reference genome (FASTA), and known-sites (VCF) data.

Executing GATK steps in docker

In order to keep our experiment reproducible, we utilize the GATK Docker image available from Docker Hub at broadinstitute/gatk:4.2.6.1, the latest version at the time of writing.
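
For illustration, a single GATK step can be run from Python by mounting a host directory into the container and invoking the gatk wrapper. The sketch below is a minimal example, not the repository's helper classes; the host path, mount point and file name are hypothetical placeholders.

import subprocess

# Minimal sketch: run one GATK preprocessing step inside the
# broadinstitute/gatk:4.2.6.1 container with a mounted host volume.
HOST_DATA_DIR = "/data/Genomics"   # hypothetical host directory holding the raw data
CONTAINER_DATA_DIR = "/data"       # mount point inside the container

cmd = [
    "docker", "run", "--rm",
    "-v", f"{HOST_DATA_DIR}:{CONTAINER_DATA_DIR}",
    "broadinstitute/gatk:4.2.6.1",
    "gatk", "CreateSequenceDictionary",
    "-R", f"{CONTAINER_DATA_DIR}/reference/human_g1k_v37.fasta",  # reference preprocessing step
]
subprocess.run(cmd, check=True)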

Scalability

We emphasize tools that promote scalability, including GATK's BETA Spark features (e.g. ReadsPipelineSpark) and Hail for scalable downstream analytics. Both tools leverage Spark to enable large-scale data processing, which allows scaling from a local workstation to a large HPC or cloud cluster.
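
For reference, a per-sample ReadsPipelineSpark invocation looks roughly like the sketch below; the paths are hypothetical placeholders, and the repository's helpers may pass additional arguments (for example Spark memory settings and GVCF output options). The arguments after the "--" separator are passed to Spark itself, which is what lets the same command move from a workstation to a cluster by changing the Spark master.

# Rough sketch of a per-sample ReadsPipelineSpark invocation (hypothetical paths).
reads_pipeline_spark_args = [
    "gatk", "ReadsPipelineSpark",
    "-I", "/data/reads/HG00731.bam",                          # per-sample reads
    "-R", "/data/reference/human_g1k_v37.fasta",              # reference genome
    "--known-sites", "/data/known_sites/dbsnp_138.b37.vcf",   # known sites for BQSR
    "-O", "/data/sample_gvcfs/HG00731.g.vcf",                 # per-sample output variants
    "--",                                                     # Spark options follow
    "--spark-runner", "LOCAL",                                # or SPARK plus a cluster master URL
    "--spark-master", "local[*]",
]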

Example Pipeline as Markdown

To see an example of the pipeline as Markdown, with bash commands executed from within the GATK Docker image, please see the pipeline_mds directory.

Helper classes for automating the pipeline with Python

The main contribution of this repository is scripting the pipeline in Python, running each of the GATK steps in Docker and automatically handling the stages of the workflow. For example, the Python code in the Getting Started section below performs the following steps (a sketch of the GATK tools behind steps 4-7 follows this list):

  1. Download all data
    1. Sample reads (i.e., a cohort of reads)
    2. Known sites
    3. Reference
  2. Perform data preprocessing
    1. Filtering, sorting and indexing cohort reads
    2. Creating indices for known sites
    3. Creating an index for the reference
  3. Perform ReadsPipelineSpark for each sample, obtaining analysis-ready variants for each sample of the cohort
  4. Combine cohort samples into a GenomicsDB
  5. Perform joint genotyping on the cohort with GenotypeGVCFs
  6. Fit a variant recalibration model (VQSR) to obtain variant quality scores
  7. Apply VQSR to filter low-quality variants
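
Steps 4 through 7 correspond to standard GATK tools. The sketch below lists approximately what those invocations look like; all paths, intervals and annotation choices are hypothetical placeholders, and the helper classes may pass additional or different arguments.

# Hedged sketch of the GATK tools behind steps 4-7 (hypothetical paths/arguments).
genomicsdb_import = [        # step 4: combine per-sample GVCFs into a GenomicsDB
    "gatk", "GenomicsDBImport",
    "-V", "HG00731.g.vcf", "-V", "HG00732.g.vcf", "-V", "HG00733.g.vcf",
    "-L", "20",
    "--genomicsdb-workspace-path", "cohort_genomicsdb",
]
genotype_gvcfs = [           # step 5: joint genotyping over the GenomicsDB
    "gatk", "GenotypeGVCFs",
    "-R", "human_g1k_v37.fasta",
    "-V", "gendb://cohort_genomicsdb",
    "-O", "cohort.vcf.gz",
]
variant_recalibrator = [     # step 6: fit the VQSR model
    "gatk", "VariantRecalibrator",
    "-R", "human_g1k_v37.fasta",
    "-V", "cohort.vcf.gz",
    "--resource:hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap_3.3.b37.vcf",
    "-an", "QD", "-an", "MQ", "-an", "FS",
    "-mode", "SNP",
    "-O", "cohort.recal",
    "--tranches-file", "cohort.tranches",
]
apply_vqsr = [               # step 7: filter low-quality variants with the fitted model
    "gatk", "ApplyVQSR",
    "-R", "human_g1k_v37.fasta",
    "-V", "cohort.vcf.gz",
    "--recal-file", "cohort.recal",
    "--tranches-file", "cohort.tranches",
    "-mode", "SNP",
    "--truth-sensitivity-filter-level", "99.0",
    "-O", "cohort.filtered.vcf.gz",
]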

Python package installation

Prerequisites

  • Docker (tested with 20.10.14)
  • ~32 GB RAM
  • ~20 GB SSD space
  • Linux recommended (tested on Ubuntu 20.04)

Setup environment

Create the conda environment

CONDA_ENV_NAME=reads_to_variants
conda create -n $CONDA_ENV_NAME python=3.7 -y

Activate the conda environment

conda activate $CONDA_ENV_NAME

Install the python package

pip install .

Install Jupyter Notebook

conda install ipykernel                                    

Add the created conda environment to the ipython kernel

ipython kernel install --user --name=$CONDA_ENV_NAME

(Optional) Open the example notebook: start the Jupyter server from the root of this repository and navigate to the localhost URL shown.

jupyter notebook

Navigate to the example notebook located at examples/trio_variant_analysis_pipeline.ipynb

Python Getting Started

Below we duplicate some Python code from the example notebook found at examples/trio_variant_analysis_pipeline.ipynb. In particular, we download raw reads for a father/mother/child trio from the 1000 Genomes Project and use the Python helpers to obtain analysis-ready variants that can be consumed by Hail.

from os.path import join

PHASE_3_1KG_ROOT_FTP = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data"
BROAD_INSTITUTE_B37_ROOT_FTP = "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37"

# Reads for father/mother/child
READS_BAM_URIS=[
    # Father
    join(PHASE_3_1KG_ROOT_FTP, "HG00731/alignment/HG00731.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
    # Mother
    join(PHASE_3_1KG_ROOT_FTP, "HG00732/alignment/HG00732.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130422.bam"),
    # Child
    join(PHASE_3_1KG_ROOT_FTP, "HG00733/alignment/HG00733.chrom20.ILLUMINA.bwa.PUR.low_coverage.20130415.bam"),
]

# Reference genome
REF_FA_GZ_URI = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"

# Known sites for variant extraction
KNOWN_EXTRACTION_VCF_GZ_URIS = [
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]

# Known sites for VQSR model
KNOWN_SITES_VQSR_URIS = [
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "hapmap_3.3.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_omni2.5.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "1000G_phase1.snps.high_confidence.b37.vcf.gz"),
    join(BROAD_INSTITUTE_B37_ROOT_FTP, "dbsnp_138.b37.vcf.gz")
]

Perform variant discovery to obtain analysis-ready variants

from src.gatk.extract_sample_variants import ExtractGenomicVariants

variant_discovery = ExtractGenomicVariants(
    volume_hostpath='/data/Genomics',
    paired_reads_file_bam_uris=READS_BAM_URIS,
    ref_file_fasta_uri=REF_FA_GZ_URI,
    known_sites_vcf_uris=KNOWN_EXTRACTION_VCF_GZ_URIS,
    dry_run=False,
    spark_driver_memory='5g',
    spark_executor_memory='8g',
    java_options='-Xmx48g',
)

# Run the pipeline to obtain the GVCF path for each sample
sample_gvcf_paths = variant_discovery.run_pipeline()
print(sample_gvcf_paths)
# ['/data/sample_gvcfs/human/HG00731.g.vcf', '/data/sample_gvcfs/human/HG00732.g.vcf', '/data/sample_gvcfs/human/HG00733.g.vcf']

Curate a genotype dataset and filter low quality variants

from src.gatk.curate_genotype_dataset import CurateGenotypeDataset

curated_genotyped_dataset = CurateGenotypeDataset(
    volume_hostpath='/data/Genomics',
    intervals=['20'],
    gvcf_paths=sample_gvcf_paths,
    annotations=[
        "QD",
        "MQ",
        "MQRankSum",
        "ReadPosRankSum",
        "FS",
        "SOR"
    ],
    known_sites_vqsr=KNOWN_SITES_VQSR_URIS,
    resources=[
        ('hapmap,known=false,training=true,truth=true,prior=15.0', 'hapmap_3.3.b37.vcf'),
        ('omni,known=false,training=true,truth=false,prior=12.0', '1000G_omni2.5.b37.vcf'),
        ('1000G,known=false,training=true,truth=false,prior=10.0', '1000G_phase1.snps.high_confidence.b37.vcf'),
        ('dbsnp,known=true,training=false,truth=false,prior=2.0', 'dbsnp_138.b37.vcf'),
    ],
    reference_path=variant_discovery.raw_data_container_paths_map[ExtractGenomicVariants.REFERENCE_DIR],
    dry_run=False
)

# Combine sample GVCFs and extract filtered variants for analysis
filtered_vcf_save_path = curated_genotyped_dataset.run_pipeline()

print(filtered_vcf_save_path)
# /data/curated_gvcfs/human/gvcf_sample_HG00731_HG00732_HG00733__n_gvcfs_3__intervals_20__annotations_QD_MQ_MQRankSum_ReadPosRankSum_FS_SOR.filtered_variants.g.vcf

Load the cohort VCF file into Hail as a MatrixTable

from src.hail.utils import vcf_path_to_mt, hl

mt = vcf_path_to_mt(filtered_vcf_save_path)
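
The vcf_path_to_mt helper ships with this repository; a minimal sketch of what such a helper can look like using Hail's public API is shown below. The function name, the GRCh37 reference choice and the idempotent initialization are assumptions, not necessarily the helper's actual implementation.

import hail as hl

def vcf_path_to_mt_sketch(vcf_path: str) -> hl.MatrixTable:
    # Minimal sketch (assumed behaviour): start or reuse a local Spark-backed Hail
    # session, then import the cohort VCF as a MatrixTable keyed by locus and alleles.
    hl.init(idempotent=True)
    return hl.import_vcf(vcf_path, reference_genome="GRCh37")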

Perform analysis

mt.GT.show(n_rows=100)
locus        alleles                    HG00731 GT   HG00732 GT   HG00733 GT
20:65900     ["G","A"]                  1/1          1/1          1/1
20:66370     ["G","A"]                  1/1          1/1          1/1
20:67500     ["T","TTGGTATCTAG"]        1/1          1/1          1/1
20:68749     ["T","C"]                  0/1          0/1          1/1
20:69094     ["G","A"]                  0/1          0/1          0/1
20:69481     ["CT","C"]                 0/0          0/1          0/0
20:69506     ["G","GACACAC","GACAC"]    0/1          1/2          0/1
20:72104     ["TA","T"]                 0/0          0/1          0/0
20:74347     ["G","A"]                  0/1          0/0          0/0
20:74729     ["G","GT"]                 0/0          0/1          0/0

showing top 10 rows

Filter missing rows

dataset = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == False)
print(f"Num NA rows dropped: {mt.count()[0] - dataset.count()[0]}")
mt.filter_rows(hl.agg.any(hl.is_missing(mt.GT)) == True).GT.show()
# Num NA rows dropped: 3
locus          alleles                                      HG00731 GT   HG00732 GT   HG00733 GT
20:37584126    ["AAAG","A"]                                 NA           0/0          1/1
20:44386269    ["GTGTATATATATGCATATGTATGTATATATATC","G"]    0/0          NA           1/1
20:50252339    ["A","AGGAG"]                                0/0          1|1          NA

Find sites where parents are both homozygous reference

parents_hom_ref = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_hom_ref())[0:2]) == True)

parents_hom_ref.GT.show()
print(f"Percentage of total: {round(parents_hom_ref.count()[0]/mt.count()[0] * 100, 2)}%")
locus        alleles           HG00731 GT   HG00732 GT   HG00733 GT
20:114916    ["G","A"]         0/0          0/0          1/1
20:142262    ["C","G"]         0/0          0/0          1|1
20:142339    ["T","C"]         0/0          0/0          1/1
20:221118    ["G","GAAA"]      0/0          0/0          0/1
20:248667    ["G","A"]         0/0          0/0          0/1
20:254517    ["C","CAAGAA"]    0/0          0/0          0|1
20:259563    ["C","A"]         0/0          0/0          1/1
20:259818    ["G","A"]         0/0          0/0          0/1
20:261059    ["C","T"]         0/0          0/0          0/1
20:285148    ["CA","C"]        0/0          0/0          0/1

showing top 10 rows

Percentage of total: 1.29%

Find sites where parents are both heterozygous

parents_het = mt.filter_rows((hl.literal([True, True]) == hl.agg.collect(mt.GT.is_het())[0:2]) == True)

parents_het.GT.show()
print(f"Percentage of total: {round(parents_het.count()[0]/mt.count()[0] * 100, 2)}%")
locus        alleles                    HG00731 GT   HG00732 GT   HG00733 GT
20:68749     ["T","C"]                  0/1          0/1          1/1
20:69094     ["G","A"]                  0/1          0/1          0/1
20:69506     ["G","GACACAC","GACAC"]    0/1          1/2          0/1
20:76962     ["T","C"]                  0/1          0/1          0/0
20:77965     ["G","GT","GTT"]           0/1          0/2          0/0
20:82012     ["T","C"]                  0/1          0/1          0/0
20:87416     ["A","C"]                  0/1          0/1          0/1
20:106703    ["C","CT","CTTT"]          0/1          1/2          1/1
20:126529    ["G","A"]                  0/1          0/1          0/0
20:126600    ["C","A"]                  0/1          0/1          0/0

showing top 10 rows

Percentage of total: 17.52%

Sites where child is homozygous variant

child_hom_var = mt.filter_rows((hl.agg.collect(mt.GT.is_hom_var())[2]==True) == True)

child_hom_var.GT.show()
print(f"Percentage of total: {round(child_hom_var.count()[0]/mt.count()[0] * 100, 2)}%")
locus        alleles                 HG00731 GT   HG00732 GT   HG00733 GT
20:65900     ["G","A"]               1/1          1/1          1/1
20:66370     ["G","A"]               1/1          1/1          1/1
20:67500     ["T","TTGGTATCTAG"]     1/1          1/1          1/1
20:68749     ["T","C"]               0/1          0/1          1/1
20:80655     ["A","G"]               1/1          1/1          1/1
20:106703    ["C","CT","CTTT"]       0/1          1/2          1/1
20:114916    ["G","A"]               0/0          0/0          1/1
20:127952    ["C","T"]               0/1          0/1          1/1
20:128863    ["G","C"]               0/1          0/1          1/1
20:129063    ["A","G"]               0/1          0/1          1/1

showing top 10 rows

Percentage of total: 27.36%
