Writing this up in an issue so that it's not lost: as of writing this, the index size for a 120 GB file is 11 GB. To reduce it further, I looked into storing only block-level pointers.
Index changes
Currently, in the `.data.bgz` file we store the virtual offset (the location of the record in the BAM file) with each QNAME. That means each offset is a distinct 8-byte number, which is hard to compress.
Instead of storing individual offsets, we can:

- On the initial scan, remember the position of the first record in each block. Write those positions to a `.blocks` file, 8 bytes per pointer, one after the other. This file can be compressed or uncompressed.
- When storing the value for a QNAME in `.data.bgz`, store a block index (block number) instead of the 8-byte offset. The 120 GB file contains ~9 million blocks, so a fixed-width block index takes at most 3 bytes. Since we don't know the number of blocks in advance, a variable-length encoding like varint makes sense: earlier indexes need only 1, 2, or 3 bytes, and we avoid imposing a (potentially too low) limit on the number of blocks.
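As an illustrative sketch (not the tool's actual code), an LEB128-style varint encoder/decoder could look like this; block indexes below 2^21 fit in 3 bytes or fewer:

```python
# Varint sketch: 7 payload bits per byte, high bit set = more bytes follow.
# Small block numbers take 1 byte; ~9 million blocks need at most 4 bytes.

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes coming
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    """Decode a varint from the start of buf."""
    n = 0
    for shift, byte in enumerate(buf):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return n
```

Note that a 7-bits-per-byte varint crosses into 4 bytes at 2^21 (~2.1 million), so the largest indexes in a ~9-million-block file would take 4 bytes rather than 3; the average is still well below the fixed 8 bytes.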
Search changes
Now to look up a record in the BAM, we need to:
1. Find the QNAME in the data (same as before)
2. Read its block number
3. Look up the block position in the `.blocks` index. If it's uncompressed, all we need to do is seek to byte position `block_number * 8`
4. Go to that position in the BAM file and do a linear scan until the QNAME is found
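The block lookup step can be sketched as follows, assuming an uncompressed `.blocks` file of fixed 8-byte little-endian positions (function and parameter names here are illustrative, not from the tool):

```python
import struct

def block_position(blocks_file, block_number: int) -> int:
    """Return the BAM file position of the given block.

    Because every entry is a fixed 8 bytes, we can seek straight to
    entry block_number * 8 without reading anything else.
    """
    blocks_file.seek(block_number * 8)
    (pos,) = struct.unpack("<Q", blocks_file.read(8))
    return pos
```

After this, the reader would seek to `pos` in the BAM file and linearly scan that block (at most 64 KB) for the QNAME.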
Results
I did the indexing changes, and the resulting file sizes were:
That's almost a 50% reduction in index size, so pretty good.
I didn't have time to check the effect on search performance yet, but I don't think it would be too bad. In the average case, we'd have to scan a single BAM block, which has a maximum size of 64 KB.
Thoughts
I didn't compress the `.blocks` file, to allow for seeking, and 71 MB is small enough. But we could compress it, read it fully into memory, and use that to retrieve the block positions.
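If we did compress it, the in-memory variant could be as simple as this sketch (assuming plain gzip compression of the same fixed 8-byte layout; names are hypothetical):

```python
import gzip
import struct

def load_block_positions(path: str) -> list[int]:
    """Decompress the blocks file once at startup; afterwards every
    block lookup is a plain list index instead of a file seek."""
    with gzip.open(path, "rb") as f:
        raw = f.read()
    return [pos for (pos,) in struct.iter_unpack("<Q", raw)]
```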
Instead of having an additional `.blocks` file, could we just store the block pointer in `.data.bgz` and hope that compression takes care of things? Answer: I don't think that would work, because compression is done on chunks of QNAMEs, which don't necessarily share the same block pointers.