Tool for:
- Indexing records in large BAM files by their QNAME (run once)
- Efficiently retrieving records by their QNAME (use index many times)
When there are multiple records having the same QNAME, such as for read pairs and supplementary alignments, this tool retrieves all records having the requested QNAME.
The tool can be accessed as a JAR file (java -jar atlantool.jar) or as a native Linux executable (atlantool-linux).
Check out the releases section to get the latest build.
The command line tool provides two sub commands, index and view, for the operations mentioned above. The basic usage format is:
$ atlantool-linux index <bam-path>
$ atlantool-linux view <bam-path> -n <qname-to-search>
or
$ atlantool-linux view <bam-path> -f <file-containing-qnames>
Each sub command has advanced options; run the sub command on its own to see detailed help.
The following command indexes the 1G.bam file and places the index files alongside the BAM file. Note that indexing takes time on large BAM files.
$ atlantool-linux index 1G.bam --thread-count=8
After the index has been built successfully, search requests can be executed on a QNAME string.
$ atlantool-linux view 1G.bam -n SOLEXA-1GA-1_0047_FC62472:5:52:15203:7914#0
SOLEXA-1GA-1_0047_FC62472:5:52:15203:7914#0 0 chr1 10158 25 36M * 0 0 AACCCTAACCCTAACCCTAACCTAACCCTAACCCTA ED?EEGDG?EEGGG4B@ABB@BD:49+=:=@;=;;D X0:i:1 MD:Z:36 NM:i:0
The output follows the SAM specification and should be recognised by samtools.
$ atlantool-linux view 1G.bam -n SOLEXA-1GA-1_0047_FC62472:5:52:15203:7914#0 -h | samtools view
SOLEXA-1GA-1_0047_FC62472:5:52:15203:7914#0 0 chr1 10158 25 36M * 0 0 AACCCTAACCCTAACCCTAACCTAACCCTAACCCTA ED?EEGDG?EEGGG4B@ABB@BD:49+=:=@;=;;D X0:i:1 MD:Z:36 NM:i:0
Indexing time depends on the size of the BAM file and the number of records in it. At the moment, indexing a 140 GB BAM file of 1.2 billion records using 8 threads takes around 1 hour (depending on hardware) and generates 12 GB of index files. Query time is sub-second.
There are two index files, qname.index.bgz and qname.data.bgz. Both are in BGZF format as described in SAMv1.pdf.
The data is in the following format (numbers are in little endian):
xx 1 byte: length of QNAME (N)
xx... N bytes: QNAME (key)
xx xx xx xx xx xx xx xx: 8 bytes: virtual offset (pointer)
The pointer is encoded as (coffset << 16) | uoffset (same as in BAM format).
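To make the record layout concrete, here is a minimal Python sketch that packs and unpacks one record in the format described above. The function names are hypothetical and not part of atlantool. The sketch assumes the QNAME is ASCII and that the 8-byte pointer is an unsigned integer. It works on raw, already decompressed bytes and does not handle the BGZF layer.

```python
import struct
from typing import Tuple


def encode_virtual_offset(coffset: int, uoffset: int) -> int:
    """Combine a compressed-block offset and an in-block offset into one pointer."""
    return (coffset << 16) | uoffset


def decode_virtual_offset(pointer: int) -> Tuple[int, int]:
    """Split a pointer back into (coffset, uoffset)."""
    return pointer >> 16, pointer & 0xFFFF


def pack_record(qname: str, pointer: int) -> bytes:
    """Serialise one record: 1-byte length, QNAME bytes, 8-byte little-endian pointer."""
    key = qname.encode("ascii")  # assumption: QNAMEs are ASCII
    if len(key) > 255:
        raise ValueError("QNAME does not fit in a 1-byte length field")
    return struct.pack("<B", len(key)) + key + struct.pack("<Q", pointer)


def unpack_record(buf: bytes, pos: int = 0) -> Tuple[str, int, int]:
    """Read one record at pos; return (qname, pointer, position of the next record)."""
    (length,) = struct.unpack_from("<B", buf, pos)
    key_start = pos + 1
    key_end = key_start + length
    qname = buf[key_start:key_end].decode("ascii")
    (pointer,) = struct.unpack_from("<Q", buf, key_end)
    return qname, pointer, key_end + 8
```

For example, a record for a 4-byte QNAME occupies 1 + 4 + 8 = 13 bytes, and unpack_record returns the position at which the next record starts, so records can be read back to back.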
qname.data.bgz contains all the QNAME to virtual offset mappings, sorted by QNAME. The offset is the position of the corresponding record in the BAM file.
qname.index.bgz contains a subset of QNAMEs. The pointer is an offset into qname.data.bgz where the first record with that QNAME is stored.
Because the data file is sorted, all records from that position onward have a QNAME that is equal or greater.
Given the above index files, a search for input is performed like this:
1. Iterate through qname.index.bgz to find the last record where QNAME <= input.
2. Starting from the offset from 1, iterate through qname.data.bgz to find QNAME == input records. Stop when we hit QNAME > input (we won't find more records).
3. Using the offsets from 2, look up the records in the BAM file.
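The lookup steps above can be sketched in Python with in-memory lists standing in for the two index files. This is an illustration only, not the tool's implementation. The names build_sparse_index and search are hypothetical, positions in the data list stand in for offsets into qname.data.bgz, and QNAMEs are assumed to be compared as plain strings. The final step, reading the records from the BAM file, is left out.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (QNAME, pointer)


def build_sparse_index(data: List[Entry], step: int) -> List[Entry]:
    """Sample every step-th QNAME from the sorted data list.

    Each index entry points at the FIRST data position holding that QNAME,
    so a scan starting there cannot miss earlier duplicates.
    """
    index: List[Entry] = []
    for i in range(0, len(data), step):
        key = data[i][0]
        first = i
        while first > 0 and data[first - 1][0] == key:
            first -= 1
        if not index or index[-1][0] != key:
            index.append((key, first))
    return index


def search(index: List[Entry], data: List[Entry], qname: str) -> List[int]:
    """Return the BAM offsets of every record whose QNAME equals qname."""
    # Find the last index entry with QNAME <= input.
    start = None
    for key, data_pos in index:
        if key <= qname:
            start = data_pos
        else:
            break
    if start is None:
        return []  # input sorts before every indexed QNAME

    # Scan the data list from that position until QNAME > input.
    offsets: List[int] = []
    for key, bam_offset in data[start:]:
        if key > qname:
            break
        if key == qname:
            offsets.append(bam_offset)
    return offsets


# Tiny example: three records share the QNAME "b" (e.g. a read pair plus a
# supplementary alignment); the second number is a made-up BAM offset.
data = [("a", 10), ("b", 20), ("b", 21), ("b", 22), ("c", 30), ("d", 40)]
index = build_sparse_index(data, step=2)
matches = search(index, data, "b")
```

In this example the scan for "b" starts at the first "b" record and stops as soon as it reaches "c", so all three offsets are collected without reading the rest of the data list.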
The number of records in the index file is calculated using the square root of the total number of records. That means the amount of data that needs to be linearly scanned for a lookup is about the same between the two levels of indexes.
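As a rough worked example of this sizing rule, the sketch below computes the number of index entries and the approximate number of data records each entry covers. The function name index_layout is hypothetical, and the exact rounding the tool uses is an assumption.

```python
import math
from typing import Tuple


def index_layout(total_records: int) -> Tuple[int, int]:
    """Return (index entries, data records covered per index entry).

    Assumption: the index holds floor(sqrt(N)) entries spaced evenly
    over the N records of the data file.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    index_entries = math.isqrt(total_records)
    # Integer ceiling division avoids floating-point rounding.
    records_per_entry = -(-total_records // index_entries)
    return index_entries, records_per_entry


entries, per_entry = index_layout(1_200_000_000)
```

Under that assumption, the 1.2 billion record file mentioned earlier would get about 34,641 index entries, each covering about 34,642 data records, so the worst-case linear scan is of similar length at both levels.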
Dr. Emma Rath - Victor Chang Cardiac Research Institute
The following Atlassian employees have participated in writing this software through a collaboration project between the Atlassian Foundation and the Victor Chang Cardiac Research Institute:
- Huy Le
- Efim Pyshnograev
- Amitdev Ranjitdev
- Robin Stocker