This repository contains a workflow description and tools for analyzing Primary Template-directed Amplification (PTA) and Whole Genome Sequencing (WGS) data. The pipeline includes all steps necessary for high-quality variant detection, including mapping, duplicate marking, realignment, variant calling, and filtration.
This pipeline processes WGS reads mapped against the GRCh38 reference genome, focusing on:
- Handling PTA-specific artifacts with the PTATO random forest model.
- High-quality variant calling with GATK.
- Comprehensive variant filtration for somatic and germline mutation detection.
- Support for both FASTQ and CRAM inputs.
- Modular WDL workflow for running on Terra or locally.
- Incorporation of existing GATK tools and custom PTA-specific artifact handling.
- FASTQ or CRAM files: Raw sequencing data.
- Reference Genome: GRCh38.
- Additional inputs: BED files, BAM indices, and genome-specific annotations.
- Tool: `bwa mem`
- Output: SAM/BAM files.
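The mapping step above could be driven from a small helper like the following. This is a minimal sketch, not the repository's actual task: the function name, file names, thread count, and read-group fields are all assumptions made for illustration. Only `bwa mem` with its `-t` and `-R` options is taken from BWA itself.

```python
import shlex


def bwa_mem_command(reference, fastq_r1, fastq_r2, sample, threads=4):
    """Build a `bwa mem` argument list for one paired-end sample (hypothetical helper)."""
    # Read-group header line; bwa expands the literal "\t" sequences itself.
    # The ID/SM/PL values here are placeholders, not values from this pipeline.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        "bwa", "mem",
        "-t", str(threads),  # number of worker threads
        "-R", read_group,    # read-group line added to the SAM header
        reference, fastq_r1, fastq_r2,
    ]


cmd = bwa_mem_command("GRCh38.fa", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1")
print(shlex.join(cmd))
```

`bwa mem` writes SAM to standard output, so a real task would still need to redirect or pipe that output to produce the BAM file.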
- Duplicate marking using Sambamba.
- Base quality score recalibration (the "realignment" step) with GATK BaseRecalibrator.
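As a rough illustration of the two post-mapping steps above, the sketch below assembles the command lines in order. The helper name, output file names, and known-sites file are assumptions. `ApplyBQSR` is not mentioned in this README and is included only because `BaseRecalibrator` by itself produces a recalibration table rather than a recalibrated BAM.

```python
def preprocessing_commands(bam, reference, known_sites, prefix, threads=4):
    """Return argument lists for duplicate marking and recalibration (hypothetical helper)."""
    dedup = f"{prefix}.dedup.bam"    # assumed output naming
    table = f"{prefix}.recal.table"
    recal = f"{prefix}.recal.bam"
    return [
        # Mark duplicate reads with Sambamba.
        ["sambamba", "markdup", "-t", str(threads), bam, dedup],
        # Model base-quality errors against a set of known variant sites.
        ["gatk", "BaseRecalibrator", "-I", dedup, "-R", reference,
         "--known-sites", known_sites, "-O", table],
        # Assumption: apply the table to obtain the recalibrated BAM.
        ["gatk", "ApplyBQSR", "-I", dedup, "-R", reference,
         "--bqsr-recal-file", table, "-O", recal],
    ]


for command in preprocessing_commands("s1.bam", "GRCh38.fa", "known_sites.vcf.gz", "s1"):
    print(" ".join(command))
```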
- Tool: GATK HaplotypeCaller
- Multi-sample mode with `EMIT_ALL_CONFIDENT_SITES`.
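One way to express the multi-sample call described above is to pass every BAM to a single HaplotypeCaller invocation. The sketch below assumes that `EMIT_ALL_CONFIDENT_SITES` is supplied through GATK's `--output-mode` argument; the helper name and file names are hypothetical.

```python
def haplotypecaller_command(reference, bams, output_vcf):
    """Build a multi-sample HaplotypeCaller argument list (hypothetical helper)."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]  # one -I per sample BAM gives joint, multi-sample calling
    cmd += ["-O", output_vcf, "--output-mode", "EMIT_ALL_CONFIDENT_SITES"]
    return cmd


hc_cmd = haplotypecaller_command(
    "GRCh38.fa", ["clone.recal.bam", "bulk.recal.bam"], "calls.vcf.gz"
)
print(" ".join(hc_cmd))
```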
- Tool: GATK VariantFiltration
- Filters: `QD < 2.0`, `MQ < 40.0`, `FS > 60.0`, `HaplotypeScore > 13.0`, `SOR > 4.0`, and more.
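To make the thresholds above concrete, the following sketch checks a variant's annotations against the five listed expressions. It only illustrates the logic and does not replace VariantFiltration. The function name and the choice to treat a missing annotation as passing are assumptions, and the filters covered by "and more" are not represented.

```python
import operator

# (annotation, comparison, threshold) for each filter listed in this README.
HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("HaplotypeScore", operator.gt, 13.0),
    ("SOR", operator.gt, 4.0),
]


def failed_filters(info):
    """Return the annotations whose filter expression is true for this variant."""
    failed = []
    for key, compare, threshold in HARD_FILTERS:
        value = info.get(key)
        # Assumption of this sketch: a missing annotation does not fail the filter.
        if value is not None and compare(value, threshold):
            failed.append(key)
    return failed


print(failed_filters({"QD": 1.5, "MQ": 60.0, "FS": 75.0}))
```

A variant with an empty result passes all five filters. Values exactly at a threshold (for example `QD` of 2.0) also pass, because the comparisons are strict.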
- PTATO Random Forest Model trained with 26 genomic features, including:
  - Allelic imbalance (most important).
  - DNA replication timing.
  - Distance to the nearest gene.
  - Repeat regions and sequence context.
- Docker
- Cromwell/WDL runtime (for Terra)
- Tools included:
  - BWA (v0.7.17)
  - Sambamba (v0.6.8)
  - GATK (v4.1.3.0)
  - Samtools (v1.10)
- Clone the repository:

      git clone git@github.com:broadinstitute/PTA_Analysis.git
      cd PTA_Analysis

- Install dependencies:
  - Use Docker containers for all tools.
- Prepare input files (FASTQ or CRAM, Reference Genome).
- Import the WDL workflows into Terra.
- Set up input JSON files for each step.
Run the pipeline locally using Cromwell:

    java -jar cromwell.jar run main.wdl -i inputs.json
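The inputs file passed with `-i` is a JSON object whose keys are qualified by the workflow name. The sketch below shows one way to generate it. The workflow name `PTAAnalysis` and the input names `reference_fasta`, `sample_names`, and `input_files` are placeholders, not the names defined in this repository's `main.wdl`.

```python
import json


def build_inputs(workflow, reference, samples):
    """Build a Cromwell inputs mapping with workflow-qualified keys (hypothetical names)."""
    return {
        f"{workflow}.reference_fasta": reference,
        f"{workflow}.sample_names": [s["name"] for s in samples],
        f"{workflow}.input_files": [s["path"] for s in samples],
    }


inputs = build_inputs(
    "PTAAnalysis",
    "GRCh38.fa",
    [{"name": "clone1", "path": "clone1.cram"}, {"name": "bulk", "path": "bulk.cram"}],
)
# Print the JSON; redirect it to inputs.json before running Cromwell.
print(json.dumps(inputs, indent=2))
```

The real key names should be taken from the workflow's declared inputs.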
- `workflows/`: WDL workflows for each pipeline step.
- `tasks/`: Individual WDL tasks for mapping, duplicate marking, realignment, and variant calling.
- `docker/`: Dockerfiles for each tool used in the pipeline.
- `examples/`: Example input files and configuration JSONs.
- GATK Documentation: https://gatk.broadinstitute.org/
- PTATO Tool: https://github.com/ToolsVanBox/PTATO
- Reference Genome (GRCh38): Ensembl