Martin Aryee edited this page Mar 16, 2015 · 14 revisions

These tools allow 1) sample demultiplexing of Illumina sequencing reads that are indexed with both sample-specific and molecule-specific (UMI) barcodes, and 2) consolidation of reads corresponding to the same original (pre-PCR) template molecule into a single representative read.

[Figure: flowchart]

It is currently implemented for dual-index paired-end data and requires four input FASTQ files: forward and reverse reads (R1, R2) and index reads (I1, I2). Sample and molecular barcodes are extracted from the index reads. For example:

[Figure: barcode]

This example shows a dual sample indexing scheme in which the 16-base sample barcode consists of two parts: 8 bases from index read 1 and 8 bases from index read 2. Index read 2 also contains an 8-base molecular barcode. We create a 'molecular index' by concatenating the molecular barcode with the first few (typically 6) bases of read 1. All read pairs with the same molecular index are presumed to represent PCR products of the same original template molecule and should be consolidated into a single representative read.
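
The barcode arithmetic described above can be sketched in a few lines of Python. This is an illustration, not the code used by these tools: the function name `split_barcodes` is hypothetical, and the assumption that the sample portion of index read 2 comes before the molecular barcode is ours (the text only states that both are present).

```python
def split_barcodes(index1, index2, read1,
                   sample_len=8, umi_len=8, read_prefix_len=6):
    """Return (sample_barcode, molecular_index) for one read pair.

    ASSUMPTION: index read 2 is laid out as
    [sample part (8 bases)][molecular barcode (8 bases)].
    """
    # 16-base sample barcode: 8 bases from I1 plus 8 bases from I2
    sample_barcode = index1[:sample_len] + index2[:sample_len]
    # 8-base molecular barcode taken from the rest of I2
    molecular_barcode = index2[sample_len:sample_len + umi_len]
    # Molecular index: molecular barcode plus the first bases of R1
    molecular_index = molecular_barcode + read1[:read_prefix_len]
    return sample_barcode, molecular_index


sample, umi = split_barcodes("ACGTACGT", "TTGGCCAAGATTACAG", "CCATGGTTAACC")
print(sample)  # ACGTACGTTTGGCCAA
print(umi)     # GATTACAGCCATGG
```

Two read pairs that yield the same `umi` value would be treated as PCR copies of one template molecule.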

Features:

  • Demultiplex reads based on sample barcodes
  • Consolidate reads with the same molecular index (representing the same template molecule) into a single consensus read

Dependencies

  • argparse
  • HTSeq

Input data

Four FASTQ files corresponding to forward and reverse reads (R1, R2) and index reads (I1, I2). The default MiSeq settings do not generate index reads. See Configuring a MiSeq to output index reads.

Usage example

The example directory contains undemultiplexed data from an Illumina MiSeq run:

  • example/undemux.r1.fastq.gz - Forward read
  • example/undemux.r2.fastq.gz - Reverse read
  • example/undemux.i1.fastq.gz - Index read 1
  • example/undemux.i2.fastq.gz - Index read 2

1. Demultiplex reads based on the sample index contained in the I1 and I2 reads

cd example
python ../demultiplex.py --min_reads 1000 --read1 undemux.r1.fastq.gz --read2 undemux.r2.fastq.gz --index1 undemux.i1.fastq.gz --index2 undemux.i2.fastq.gz --sample_barcodes samplekey.txt
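
Conceptually, this step bins reads by sample barcode. The sketch below shows that idea on in-memory data; it is not the logic of demultiplex.py itself. The function name `demultiplex`, the barcode-to-sample dictionary, the "undetermined" bin, and the reading of `min_reads` as "drop bins with fewer reads than this" are all assumptions made for illustration.

```python
from collections import defaultdict


def demultiplex(records, sample_key, min_reads=1):
    """Group (sample_barcode, read) records by sample name.

    ASSUMPTION: bins with fewer than min_reads reads are discarded,
    and barcodes missing from sample_key go to an 'undetermined' bin.
    """
    bins = defaultdict(list)
    for barcode, read in records:
        sample = sample_key.get(barcode, "undetermined")
        bins[sample].append(read)
    # Keep only the bins that reach the read-count threshold
    return {s: reads for s, reads in bins.items() if len(reads) >= min_reads}


records = [("AAAA", "r1"), ("AAAA", "r2"), ("CCCC", "r3"), ("GGGG", "r4")]
sample_key = {"AAAA": "sampleA", "CCCC": "sampleB"}
result = demultiplex(records, sample_key, min_reads=2)
print(result)  # {'sampleA': ['r1', 'r2']}
```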

2. Add a molecular index (UMI) tag to the header of the R1 and R2 reads

The UMI tag is added as the third field of the read name line. It consists of the molecular barcode extracted from the index read concatenated with the first six bases of R1.

python ../umitag.py --read1_in mysample.r1.fastq --read2_in mysample.r2.fastq --read1_out mysample.r1.umitagged.fastq --read2_out mysample.r2.umitagged.fastq --index1 mysample.i1.fastq --index2 mysample.i2.fastq
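
As a rough illustration of the tagging step, the sketch below places a molecular index in the third field of a read name line. It assumes the fields of the header are separated by single spaces, as in typical Illumina FASTQ headers; the helper name `add_umi_tag` is hypothetical and the exact format written by umitag.py may differ.

```python
def add_umi_tag(header, molecular_index):
    """Insert the molecular index as the third space-separated header field.

    ASSUMPTION: header fields are separated by single spaces.
    """
    fields = header.rstrip("\n").split(" ")
    # Keep the first two fields, then the tag, then any remaining fields
    return " ".join(fields[:2] + [molecular_index] + fields[2:])


tagged = add_umi_tag("@M01:1:FC:1:1101:1000:2000 1:N:0:1", "GATTACAGCCATGG")
print(tagged)  # @M01:1:FC:1:1101:1000:2000 1:N:0:1 GATTACAGCCATGG
```

Applying the same tag to both R1 and R2 of a pair keeps the mates associated with the same molecular index in later steps.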

3. Consolidate reads with the same molecular index

python ../consolidate.py mysample.r1.umitagged.fastq mysample.r1.consolidated.fastq 15 0.9
python ../consolidate.py mysample.r2.umitagged.fastq mysample.r2.consolidated.fastq 15 0.9
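
One way to picture consolidation is a per-position majority vote among reads that share a molecular index. The sketch below is a minimal version of that idea, not the algorithm in consolidate.py. The parameters `min_qual` and `min_freq` are hypothetical; whether they correspond to the trailing 15 and 0.9 arguments in the commands above is an assumption that this page does not confirm. Reads in a group are assumed to have equal length.

```python
from collections import Counter, defaultdict


def consensus(reads, min_qual=15, min_freq=0.9):
    """Build a consensus from (sequence, phred_qualities) pairs.

    ASSUMPTION: a base is called only if, among bases with quality
    >= min_qual, the most common one has frequency >= min_freq;
    otherwise 'N' is emitted.
    """
    called = []
    for i in range(len(reads[0][0])):
        bases = [seq[i] for seq, qual in reads if qual[i] >= min_qual]
        if not bases:
            called.append("N")
            continue
        base, count = Counter(bases).most_common(1)[0]
        called.append(base if count / len(bases) >= min_freq else "N")
    return "".join(called)


def consolidate(tagged_reads, min_qual=15, min_freq=0.9):
    """Group (molecular_index, sequence, qualities) and call one read per group."""
    groups = defaultdict(list)
    for umi, seq, qual in tagged_reads:
        groups[umi].append((seq, qual))
    return {umi: consensus(g, min_qual, min_freq) for umi, g in groups.items()}


tagged_reads = [
    ("UMI1", "ACGT", [30, 30, 30, 30]),
    ("UMI1", "ACGT", [30, 30, 30, 30]),
    ("UMI1", "ACTT", [30, 30, 10, 30]),  # low-quality mismatch is ignored
    ("UMI2", "GGCC", [30, 30, 30, 30]),
    ("UMI2", "GGCA", [30, 30, 30, 30]),  # unresolved disagreement gives N
]
consolidated = consolidate(tagged_reads)
print(consolidated)  # {'UMI1': 'ACGT', 'UMI2': 'GGCN'}
```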