Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data

Reference files are essential at various steps of the pipeline, including joint-genotyping, VQSR, the Variant Effect Predictor (VEP), and exomiser.

These files must be correctly downloaded and specified through pipeline parameters. This document provides a comprehensive list of the required reference files and explains how to set the pipeline parameters appropriately.

Reference Genome

The referenceGenome parameter specifies the directory containing the reference genome files.

This directory should contain the following files:

The reference genome FASTA file (e.g., Homo_sapiens_assembly38.fasta). This filename must be specified with the referenceGenomeFasta parameter.
The reference genome FASTA file index (e.g., Homo_sapiens_assembly38.fasta.fai). Its location will be automatically derived by appending .fai to the referenceGenomeFasta parameter.
The reference genome dictionary file (e.g., Homo_sapiens_assembly38.dict). Its location will be automatically derived by replacing the .fasta file extension of the referenceGenomeFasta parameter with .dict.

Broad reference data (VQSR)

The broad parameter specifies the directory containing the reference data files for VQSR. We chose the name broad because this data is from the Broad Institute, a collaborative research institution known for its contributions to genomics and biomedical research.

Files can be downloaded using this link: GATK Ressource Bundle

The broad directory must contain the following files:

Intervals Files: The genomic interval(s) over which we operate (WES, WGS or targeted sequencing). The filename of this list must be defined with the intervalsFile parameter (e.g., wgs_calling_regions.hg38.interval_list). For more details, see Gatk documentation.
Highly validated variance ressources currently required by VQSR. These are currently hard coded in the pipeline:
- HapMap file : hapmap_3.3.hg38.vcf.gz
- 1000G omni2.5 file : 1000G_omni2.5.hg38.vcf.gz
- 1000G reference file : 1000G_phase1.snps.high_confidence.hg38.vcf.gz
- SNP database : Homo_sapiens_assembly38.dbsnp138.vcf.gz

Extra settings (ex: resource prior probabilities, tranches, etc.) required to run the different VQSR steps are injected through pipeline parameters or hard coded in the vqsr modules. The values chosen for these settings are based on NIH Biowulf

VEP Cache Directory

The vepCache parameter specifies the directory for the vep cache. It is only required if vep is specified via the tools parameter.

The vep cache is not automatically populated by the pipeline. It must be pre-downloaded. You can obtain a copy of the data by following the vep installation procedure. Generally, we only need the human files obtainable from Ensembl.

Exomiser reference data

The exomiser reference data is only required if exomiser is specified via the tools parameter.

The exomiser_data_dir parameter specifies the path to the directory containing the exomiser reference files. This directory will be passed to the exomiser tool via the exomiser option --exomiser.data-directory.

It's content should look like this:

2402_hg19/
2402_hg38/
2402_phenotype/
remm/
  ReMM.v0.3.1.post1.hg38.tsv.gz
  ReMM.v0.3.1.post1.hg38.tsv.gz.tbi
cadd/1.7/
  gnomad.genomes.r4.0.indel.tsv.gz
  gnomad.genomes.r4.0.indel.tsv.gz.tbi
  whole_genome_SNVs.tsv.gz
  whole_genome_SNVs.tsv.gz.tbi

2402_hg19/ and 2402_hg38/: These folders contain data associated with the hg19 and hg38 genome assemblies, respectively. The number 2402 corresponds to the exomiser data version.
remm/: This folder is required only if REMM is used as a pathogenicity source in the exomiser analysis. In this case, additional parameters must be provided to specify the REMM data version (here 0.3.1.post1) and the name of the .tsv.gz file to be used within this folder. See below.
cadd/: This folder is required only if CADD is used as a pathogenicity source in the exomiser analysis. Here 1.7 is the CADD data version. As for REMM, additionnal parameters must be provided. See below.

To prepare the exomiser data directory, follow the instructions in the exomiser installation documentation

Together with the exomiser_data_dir parameter, these parameters must be provided to exomiser and should match the reference data available

exomiser_genome: The genome assembly version to be used by exomiser. Accepted values are hg38 or hg19.
exomiser_data_version: The exomiser data version. Example: 2402.
exomiser_cadd_version: The version of the CADD data to be used by exomiser (optional). Example: 1.7.
exomiser_cadd_indel_filename: The filename of the exomiser CADD indel data file (optional). Example: gnomad.genomes.r4.0.indel.tsv.gz
exomiser_cadd_snv_filename: The filename of the exomiser CADD snv data file (optional). Example: whole_genome_SNVs.tsv.gz
exomiser_remm_version: The version of the REMM data to be used by exomiser (optional). Example:0.3.1.post1
exomiser_remm_filename: The filename of the exomiser REMM data file (optional). Example: ReMM.v0.3.1.post1.hg38.tsv.gz

Exomiser analysis files

In addition to the reference data, exomiser requires an analysis file (.yml/.json) that contains, among others things, the variant frequency sources for prioritization of rare variants, variant pathogenicity sources to consider, the list of filters and prioretizers to apply, etc.

Typically, different analysis settings are used for whole exome sequencing (WES) and whole genome sequencing (WGS) data. Defaults analysis files are provided for each sequencing type in the assets folder:

assets/exomiser/default_exomiser_WES_analysis.yml
assets/exomiser/default_exomiser_WGS_analysis.yml

You can override these defaults and provide your own analysis file(s) via parameters exomiser_analyis_wes and exomiser_analysis_wgs. Note that the default analysis files do not include REMM or CADD pathogenicity sources.

The exomiser analysis file format follows the phenopacket standard and is described in detail here. There are typically multiple sections in the analysis file. To be compatible with the way we run the exomiser command, your analysis file should contain only the analysis section.

Reference data parameters summary

Parameter name	Required?	Description
`referenceGenome`	Required	Path to the directory containing the reference genome data
`referenceGenomeFasta`	Required	Filename of the reference genome .fasta file, within the specified `referenceGenome` directory
`broad`	Required	Path to the directory containing Broad reference data
`intervalsFile`	Required	Filename of the genome intervals list, within the specified `broad` directory
`vepCache`	Optional	Path to the vep cache data directory
`exomiser_data_dir`	Optional	Path to the exomiser reference data directory
`exomiser_genome`	Optional	Genome assembly version to be used by exomiser(`hg19` or `hg38`)
`exomiser_data_version`	Optional	Exomiser data version (e.g., `2402`)
`exomiser_cadd_version`	Optional	Version of the CADD data to be used by exomiser (e.g., `1.7`)
`exomiser_cadd_indel_filename`	Optional	Filename of the exomiser CADD indel data file (e.g., `gnomad.genomes.r4.0.indel.tsv.gz`)
`exomiser_cadd_snv_filename`	Optional	Filename of the exomiser CADD snv data file (e.g., `whole_genome_SNVs.tsv.gz`)
`exomiser_remm_version`	Optional	Version of the REMM data to be used by exomiser (e.g., `0.3.1.post1`)
`exomiser_remm_filename`	Optional	Filename of the exomiser REMM data file (e.g., `ReMM.v0.3.1.post1.hg38.tsv.gz`)
`exomiser_analysis_wes`	Optional	Path to the exomiser analysis file for WES data, if different from the default
`exomiser_analysis_wgs`	Optional	Path to the exomiser analysis file for WGS data, if different from the default

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reference_data.md

reference_data.md

Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data

Reference Genome

Broad reference data (VQSR)

VEP Cache Directory

Exomiser reference data

Exomiser analysis files

Reference data parameters summary

Files

reference_data.md

Latest commit

History

reference_data.md

File metadata and controls

Ferlab-Ste-Justine/Post-processing-Pipeline: Reference Data

Reference Genome

Broad reference data (VQSR)

VEP Cache Directory

Exomiser reference data

Exomiser analysis files

Reference data parameters summary