Skip to content

Latest commit

 

History

History
81 lines (58 loc) · 5.22 KB

README.md

File metadata and controls

81 lines (58 loc) · 5.22 KB

CPP-BWT-Aligner

This is a C++ implementation of this repository (https://github.com/RodenLuo/BWT-Aligner). It is nowhere near perfect, but serves its purpose. If you're interested in this work and want to run it in your own environment, the only way to do it right now is to copy the cpp files into your own IDE (Sorry but I do not know how to right the right make file) (such as CLION by jetbrains) and change the absolute path to your own path then hit run button.

Above that repository, I added a function which reverse the read so that if the read can not be mapped onto the positive chain, we can try to map this read to the reverse chain. Also, I added some naive code to support pair end sequencing.(Or should I say, only pair end sequencing data is allowed in this version?)

The files in my code can be found in the test_files folder under this repository and I have included a pre-build tally index file for the genome. (the tally builder cpp file was a previous work.)

The rest of this README is basically just a copy from the original repository which can be found here https://github.com/RodenLuo/BWT-Aligner

Thanks a lot and all credit should go to RodenLuo

Perl implementation of Burrows-Wheeler Transform and the Alignment process, which is the core algorithm for Bowtie2 and BWA.

This is the course project for Bioinformatics(BI3204 2015.03-2015.07) at SUSTC.

Table of Contents

##Introduction to BWT and the alignment algorithms Use Bowtie2 as example:

Bowtie2_1

Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column).

Langmead et al. Genome Biology 2009 10:R25 doi:10.1186/gb-2009-10-3-r25

Bowtie2_2

Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query.

Langmead et al. Genome Biology 2009 10:R25 doi:10.1186/gb-2009-10-3-r25

##BWT-Aligner-in-Perl: For Lambda_virus #####Usages:

perl rotation.pl lambda_virus.fa lambda_rot
# Rotate the reference sequence and sort it.
# This will output two files: lambda_rot.sort and lambda_rot.sort.ref_name.bdx. One temporary file called lambda_rot.out will be generated and then deleted.
perl tally.pl lambda_rot.sort lambda.tally
# Build tally index for the rotated_sorted file.
# This will output two files: lambda.tally.bdx and lambda.tally. The files generated in the first step, lambda_rot.sort and lambda_rot.sort.ref_name.bdx, will be deleted in this step.
perl BWT-Aligner.pl lambda.tally reads_1.fq reads_1.sam
# Since the index is built, you can now align the raw reads onto the reference.
# This will output the alignment in SAM format, reads_1.sam.

#####View results

samtools sort -o reads_1.srt.bam -O bam -T temp reads_1.sam  # Convert it to sorted bam
 samtools index reads_1.srt.bam # Index it by samtools index
# View in IGV or other alignments viewer.

File source specification:

The reference file, lambda_virus.fa, and the reads file, reads_1.fq are originally from example files in Bowtie2. IGV_Figure

IGV_Figure

Zoom_in_1

zoom_in

Zoom_in_2

zoom_in

BWT-Aligner-in-Perl: For Human genome

Building BWT index for huge genome like human is a little more complicated because of the sorting process. This perl implementation uses 140bp header to sort in lambda virus reference. For human genome 1000bp has been used to sort but there are still some identical sequences. So I have sorted it step by step. I will push this part after I clean up my code.