Oncoanalyser is a Nextflow implementation of the Hartwig pipeline, and is the recommended way to run components from WiGiTS. To get started with Oncoanalyser, please jump to the Usage section.
Some key features of Oncoanalyser that simplify running the Hartwig pipeline are:
- Pre-defined (but flexible) configuration for individual tools
- Automated on-demand staging of reference genomes and resource files
- Pre-built Docker images retrieved at runtime for each process
- Resume capability for each process
- Supports a range of compute environments including AWS, Azure, GCP, and HPC
- Integration with Seqera Platform, a user-friendly monitoring and management service for Nextflow pipelines
Further information on Nextflow, including its generic configuration options, can be found in the Nextflow documentation.
Oncoanalyser supports the below analysis workflows:
- Whole genome and/or transcriptome sequencing (WGTS)
- Targeted sequencing
Each workflow can run in the below sample modes:
- Paired tumor/normal
- Tumor-only
The targeted sequencing workflow has built-in support for the TSO500 panel, but can be run on any custom panel after creating panel-specific normalisation data.
Below is a detailed schematic of how each WiGiTS component is involved in each workflow.
- Nextflow >=22.10.5 (instructions)
- Docker (instructions)
Note
Docker on Windows and macOS can perform poorly. Running oncoanalyser on Linux is recommended.
BAM or FASTQ files are the starting inputs for Oncoanalyser.
Note
BAM files are expected to meet the below criteria:
All BAM files should be aligned to the Hartwig-distributed GRCh37 or GRCh38 reference genomes.
For DNA BAM files:
- Aligned with bwa-mem, bwa-mem2 or DRAGMAP, with supplementary alignment soft-clipping enabled (i.e. the -Y argument)
For RNA BAM files:
- Aligned with STAR with some essential settings
- Duplicates marked with Picard
- Use Ensembl v74 annotations for GRCh37
- Use Ensembl v105 annotations for GRCh38
The sample sheet is a comma separated table where each row represents an input file along with its associated metadata.
Column | Description |
---|---|
group_id | Group ID for a set of samples and inputs |
subject_id | Subject/patient ID |
sample_id | Sample ID |
sample_type | Sample type: tumor , normal |
sequence_type | Sequence type: dna , rna |
filetype | File type: bam , fastq , bai , bam_redux , etc |
info | For fastq file types, specify library id and lane, e.g. library_id:COLO829_library;lane:001 |
filepath | Absolute filepath to input file (can be local filepath, URL, S3 URI) |
The identifiers provided in the sample sheet are used to set output file paths:
- group_id: top-level output directory for analysis files, e.g. output/COLO829/
- tumor sample_id: output prefix for most filenames, e.g. COLO829T.purple.sv.vcf.gz
- normal sample_id: output prefix for some filenames, e.g. COLO829R.cobalt.ratio.pcf
Below is an example sample sheet with BAM inputs for the whole genome and transcriptome (WGTS) workflow:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
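A sample sheet like the one above can also be generated programmatically. The following is a minimal Python sketch using only the standard library; the helper names build_samplesheet and bam_row are hypothetical and not part of oncoanalyser, and reusing group_id as subject_id simply mirrors the example rows.

```python
import csv
import io

# Column order taken from the example sample sheet above.
COLUMNS = [
    "group_id", "subject_id", "sample_id", "sample_type",
    "sequence_type", "filetype", "filepath",
]


def bam_row(group_id, sample_id, sample_type, sequence_type, filepath):
    """Hypothetical helper: one BAM row, reusing group_id as subject_id."""
    return {
        "group_id": group_id,
        "subject_id": group_id,
        "sample_id": sample_id,
        "sample_type": sample_type,
        "sequence_type": sequence_type,
        "filetype": "bam",
        "filepath": filepath,
    }


def build_samplesheet(rows):
    """Return comma-separated sample sheet text for a list of row dicts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    bam_row("COLO829", "COLO829T", "tumor", "dna", "/path/to/COLO829T.dna.bam"),
    bam_row("COLO829", "COLO829R", "normal", "dna", "/path/to/COLO829R.dna.bam"),
    bam_row("COLO829", "COLO829T_RNA", "tumor", "rna", "/path/to/COLO829T.rna.bam"),
]
samplesheet_text = build_samplesheet(rows)
```

Writing samplesheet_text to a file such as samplesheet.csv gives an input that matches the WGTS example above.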
Note
Values in the sample_id and filepath columns must be unique in any sample sheet
Note
Input filepaths can be absolute local paths, URLs, or S3 URIs
Warning
BAM indexes are expected to exist alongside the respective input BAM but can also be provided as a separate samplesheet entry by using the bai filetype
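Before launching a run, it can be useful to check that every local BAM has an index. The sketch below is one possible pre-flight check, not something oncoanalyser provides. It assumes that "alongside" means a file named <name>.bam.bai or <name>.bai in the same directory, and that a bai row belongs to the BAM with the same sample_id and sequence_type; both are assumptions. Remote inputs (URLs, S3 URIs) are skipped.

```python
import csv
import io
from pathlib import Path


def bams_missing_index(samplesheet_text, exists=lambda p: Path(p).exists()):
    """Return local BAM filepaths with no index next to them and no 'bai' row."""
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    # Samples whose index is supplied explicitly through a 'bai' row.
    # Assumption: a 'bai' row is matched to its BAM by sample_id and sequence_type.
    explicit = {
        (r["sample_id"], r["sequence_type"]) for r in rows if r["filetype"] == "bai"
    }
    missing = []
    for r in rows:
        if r["filetype"] != "bam":
            continue
        path = r["filepath"]
        if "://" in path:
            # URLs and S3 URIs cannot be checked on the local filesystem.
            continue
        if (r["sample_id"], r["sequence_type"]) in explicit:
            continue
        # Assumption: the index is named <name>.bam.bai or <name>.bai.
        candidates = [path + ".bai", str(Path(path).with_suffix(".bai"))]
        if not any(exists(c) for c in candidates):
            missing.append(path)
    return missing
```

The exists parameter defaults to a real filesystem check and can be swapped for a stub when testing.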
Below is an example sample sheet with FASTQ inputs for the WGTS workflow:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
COLO829,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:001,/path/to/COLO829T.dna.001_R1.fastq.gz;/path/to/COLO829T.dna.001_R2.fastq.gz
COLO829,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:002,/path/to/COLO829T.dna.002_R1.fastq.gz;/path/to/COLO829T.dna.002_R2.fastq.gz
COLO829,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:003,/path/to/COLO829T.dna.003_R1.fastq.gz;/path/to/COLO829T.dna.003_R2.fastq.gz
COLO829,COLO829,COLO829T,tumor,dna,fastq,library_id:COLO829T_library;lane:004,/path/to/COLO829T.dna.004_R1.fastq.gz;/path/to/COLO829T.dna.004_R2.fastq.gz
COLO829,COLO829,COLO829R,normal,dna,fastq,library_id:COLO829R_library;lane:001,/path/to/COLO829R.dna.001_R1.fastq.gz;/path/to/COLO829R.dna.001_R2.fastq.gz
COLO829,COLO829,COLO829T_RNA,tumor,rna,fastq,library_id:COLO829T_RNA_library;lane:001,/path/to/COLO829T.rna.001_R1.fastq.gz;/path/to/COLO829T.rna.001_R2.fastq.gz
The additional info column provides the required lane and library info for FASTQ entries, with each field delimited by a semicolon. The forward and reverse FASTQ files are set in the filepath column and are also separated by a semicolon. They are strictly ordered, with forward reads in position one and reverse reads in position two.
When starting from FASTQ files, reads will be aligned against the selected reference genome using bwa-mem2 (DNA reads) or STAR (RNA reads).
Note
Only gzip-compressed, non-interleaved paired-end FASTQs are currently supported
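The info and filepath conventions for FASTQ rows can be illustrated with a small parser. This is a sketch of the format as described above, not the parsing code oncoanalyser itself uses; the function name parse_fastq_entry is hypothetical, and checking for a .gz suffix is only a name-based stand-in for "gzip-compressed".

```python
def parse_fastq_entry(info, filepath):
    """Split the info and filepath fields of a fastq row into their parts."""
    # info holds key:value pairs separated by semicolons,
    # e.g. library_id:COLO829T_library;lane:001
    fields = dict(item.split(":", 1) for item in info.split(";"))
    missing = {"library_id", "lane"} - fields.keys()
    if missing:
        raise ValueError(f"info is missing: {sorted(missing)}")

    # filepath holds exactly two paths: forward reads first, reverse second.
    paths = filepath.split(";")
    if len(paths) != 2:
        raise ValueError("expected forward and reverse FASTQ separated by ';'")
    forward, reverse = paths

    # Assumption: gzip compression is judged from the file name only.
    for path in (forward, reverse):
        if not path.endswith(".gz"):
            raise ValueError(f"FASTQ does not look gzip-compressed: {path}")

    return {
        "library_id": fields["library_id"],
        "lane": fields["lane"],
        "forward": forward,
        "reverse": reverse,
    }
```

Keeping lane as a string preserves leading zeros such as 001.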
The above examples have provided inputs for the WGTS workflow using paired tumor/normal samples. The example sample sheets below show how different workflow and/or sample modes can be run from BAM files (the same applies to other filetypes, e.g. FASTQ files).
Tumor-only DNA:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
Tumor-only DNA and RNA:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
RNA only:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
Multiple sample groups can also be provided in a single sample sheet. All rows with the same group_id value will be grouped together for processing.
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam
Here the SEQC group has been added. Since only a tumor DNA BAM is provided for this additional group, just a tumor-only WGS analysis is run for the SEQC sample.
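The grouping behaviour described above can be previewed before a run. The sketch below groups rows by group_id and applies a simplified rule for the sample mode (paired if both tumor and normal DNA are present, otherwise tumor-only). It illustrates the idea only and is an assumption about, not a copy of, the logic inside oncoanalyser; the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarise_groups(samplesheet_text):
    """Map each group_id to its sorted (sample_type, sequence_type) pairs."""
    groups = defaultdict(set)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        groups[row["group_id"]].add((row["sample_type"], row["sequence_type"]))
    return {group_id: sorted(pairs) for group_id, pairs in groups.items()}


def sample_mode(pairs):
    """Simplified rule: paired needs both tumor and normal DNA in the group."""
    if ("tumor", "dna") in pairs and ("normal", "dna") in pairs:
        return "paired tumor/normal"
    return "tumor-only"


samplesheet_text = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,bam,/path/to/COLO829T.dna.bam
COLO829,COLO829,COLO829R,normal,dna,bam,/path/to/COLO829R.dna.bam
COLO829,COLO829,COLO829T_RNA,tumor,rna,bam,/path/to/COLO829T.rna.bam
SEQC,SEQC,SEQCT,tumor,dna,bam,/path/to/SEQCT.dna.bam
"""

summary = summarise_groups(samplesheet_text)
modes = {group_id: sample_mode(pairs) for group_id, pairs in summary.items()}
```

For the sample sheet above, modes labels COLO829 as paired tumor/normal and SEQC as tumor-only.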
It is possible to run Oncoanalyser from any tool or stage as shown in the schematic in Supported workflows.
For example, you may already have the input data from the WiGiTS pipeline needed to run CUPPA. You would then provide a sample sheet with rows that use purple_dir, linx_anno_dir and isofox_dir in the filetype column:
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,filepath
COLO829,COLO829,COLO829T,tumor,dna,purple_dir,/path/to/purple/dir/
COLO829,COLO829,COLO829T,tumor,dna,linx_anno_dir,/path/to/linx/dir/
COLO829,COLO829,COLO829T,tumor,rna,isofox_dir,/path/to/isofox/dir/
Below are all possible values for filetype:
- Raw inputs: bam, bai, bam_redux, fastq
- Intermediate outputs: amber_dir, bamtools, bamtools_dir, cobalt_dir, esvee_vcf, esvee_vcf_tbi, isofox_dir, lilac_dir, linx_anno_dir, pave_vcf, purple_dir, sage_vcf, sage_vcf_tbi, sage_append_vcf, virusinterpreter_dir
- Running ORANGE: chord_dir, sigs_dir, cuppa_dir, linx_plot_dir, sage_dir
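The filetype values listed above lend themselves to a quick sanity check of a sample sheet. The sketch below copies the values from the list into Python sets and reports any filetype not among them. The set and function names are hypothetical, and the list may change between oncoanalyser versions.

```python
import csv
import io

# Values copied from the filetype list above.
RAW_INPUTS = {"bam", "bai", "bam_redux", "fastq"}
INTERMEDIATE_OUTPUTS = {
    "amber_dir", "bamtools", "bamtools_dir", "cobalt_dir",
    "esvee_vcf", "esvee_vcf_tbi", "isofox_dir", "lilac_dir",
    "linx_anno_dir", "pave_vcf", "purple_dir", "sage_vcf",
    "sage_vcf_tbi", "sage_append_vcf", "virusinterpreter_dir",
}
ORANGE_INPUTS = {"chord_dir", "sigs_dir", "cuppa_dir", "linx_plot_dir", "sage_dir"}
KNOWN_FILETYPES = RAW_INPUTS | INTERMEDIATE_OUTPUTS | ORANGE_INPUTS


def unknown_filetypes(samplesheet_text):
    """Return (line_number, filetype) for rows whose filetype is not listed."""
    problems = []
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["filetype"] not in KNOWN_FILETYPES:
            problems.append((line_number, row["filetype"]))
    return problems
```

An empty result means every row uses a filetype from the list.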
To launch oncoanalyser you must provide at least the input samplesheet, the reference genome used for read alignment, and the desired workflow. When running the targeted sequencing workflow the applicable panel name is also required.
Note
Setting -revision to use a specific version of oncoanalyser is strongly recommended to improve reproducibility and stability.
Warning
It is recommended to only run oncoanalyser with Docker, which is done with -profile docker.
nextflow run nf-core/oncoanalyser \
-profile docker \
-revision 0.4.5 \
--mode wgts \
--genome GRCh38_hmf \
--input samplesheet.csv \
--outdir output/
nextflow run nf-core/oncoanalyser \
-profile docker \
-revision 0.4.5 \
--mode targeted \
--panel tso500 \
--genome GRCh38_hmf \
--input samplesheet.csv \
--outdir output/
Argument | Group | Description |
---|---|---|
-profile | Nextflow | Profile name: docker (no other profiles supported at this time) |
-revision | Nextflow | Specific oncoanalyser version to run |
-resume | Nextflow | Use cache from existing run to resume |
--input | oncoanalyser | Samplesheet filepath |
--outdir | oncoanalyser | Output directory path |
--mode | oncoanalyser | Workflow name: wgts , targeted |
--panel | oncoanalyser | Panel name (only applicable with --mode targeted ): tso500 |
--genome | oncoanalyser | Reference genome: GRCh37_hmf , GRCh38_hmf |
--max_cpus | oncoanalyser | Enforce an upper limit of CPUs each process can use |
--max_memory | oncoanalyser | Enforce an upper limit of memory available to each process |
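When launching runs from a script, the command line can be assembled from the arguments in the table above. The following is a minimal sketch: build_command is a hypothetical helper, and its checks encode only the constraints stated in this document (the two workflow names, the two genomes, and a panel being required for, and only applicable to, the targeted workflow).

```python
import shlex


def build_command(input_csv, outdir, mode, genome, revision, panel=None, resume=False):
    """Assemble a nextflow command as an argument list, without running it."""
    if mode not in ("wgts", "targeted"):
        raise ValueError(f"unknown mode: {mode}")
    if genome not in ("GRCh37_hmf", "GRCh38_hmf"):
        raise ValueError(f"unknown genome: {genome}")
    if mode == "targeted" and panel is None:
        raise ValueError("--panel is required with --mode targeted")
    if mode != "targeted" and panel is not None:
        raise ValueError("--panel only applies to --mode targeted")

    cmd = [
        "nextflow", "run", "nf-core/oncoanalyser",
        "-profile", "docker",
        "-revision", revision,
        "--mode", mode,
    ]
    if panel is not None:
        cmd += ["--panel", panel]
    cmd += ["--genome", genome, "--input", input_csv, "--outdir", outdir]
    if resume:
        cmd.append("-resume")
    return cmd


cmd = build_command(
    "samplesheet.csv", "output/", mode="targeted", genome="GRCh38_hmf",
    revision="0.4.5", panel="tso500",
)
# shlex.join gives a copy-pasteable single-line form of the command.
command_line = shlex.join(cmd)
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the pipeline; building the list separately keeps the validation testable without Nextflow installed.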
The selected results files are written to the output directory and arranged into their corresponding groups by directories named with the respective group_id value from the input samplesheet. Within each group directory, outputs are further organised by tool.
All intermediate files used by each process are kept in the Nextflow work directory (default: work/). Once an analysis has completed this directory can be removed.
Report | Path | Description |
---|---|---|
ORANGE | <group_id>/orange/<tumor_sample_id>.orange.pdf | PDF summary report of key findings of the HMF pipeline |
LINX | <group_id>/linx/MDX210176_linx.html | Interactive HTML report of all SV plots |
Report | Path | Description |
---|---|---|
Execution | pipeline_info/execution_report_*.html | HTML report of execution metrics and details |
Timeline | pipeline_info/execution_timeline_*.html | Timeline diagram showing process execution (start/duration/finish) |
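The report paths in the tables above can be resolved from a script once a run has finished. The sketch below assumes that all listed paths are relative to the directory passed as --outdir, which the tables imply but do not state; both function names are hypothetical.

```python
from pathlib import Path


def orange_report_path(outdir, group_id, tumor_sample_id):
    """Build <outdir>/<group_id>/orange/<tumor_sample_id>.orange.pdf."""
    return Path(outdir) / group_id / "orange" / f"{tumor_sample_id}.orange.pdf"


def execution_reports(outdir):
    """List execution reports, assuming pipeline_info/ sits under outdir."""
    return sorted(Path(outdir).glob("pipeline_info/execution_report_*.html"))


report = orange_report_path("output", "COLO829", "COLO829T")
```

Here report points at output/COLO829/orange/COLO829T.orange.pdf; whether the file exists depends on the run having produced an ORANGE report.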
The following improvements are planned for the next few releases:
- longitudinal analysis of patient samples including ctDNA samples
- cloud-specific instructions and optimisations (i.e. for AWS, Azure and GCP)
The oncoanalyser pipeline was written by Stephen Watts at the University of Melbourne Centre for Cancer Research with the support of Oliver Hofmann and the Hartwig Medical Foundation Australia.