Smaller index size by storing block-level pointers only #22

Open
robinst opened this issue Sep 25, 2020 · 0 comments

robinst commented Sep 25, 2020

Writing this up in an issue so that it's not lost: As of writing this, the index size for a 120 GB file is 11 GB. To reduce it further, I looked into storing only block-level pointers.

Index changes

Currently, in the .data.bgz file we store the virtual offset (the location of the record in the BAM file) with each QNAME. That means each offset is a distinct 8-byte number, which is hard to compress.

Instead of storing individual offsets, we can:

  • On the initial scan, remember the position of the first record per block. Write those positions to a .blocks file, 8 bytes per pointer, one after the other. Can be compressed or uncompressed.
  • When storing the value for a QNAME in .data.bgz, instead of storing 8 bytes, just store a block index (block number). In the 120 GB file there are ~9 million blocks, so a fixed-width block index would take 3 bytes at worst. Because we don't know in advance how many blocks there are, using a variable-length encoding like varint makes sense: earlier indexes take only 1, 2, or 3 bytes, without putting a (potentially too low) limit on the number of blocks.
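As a sketch of the varint idea (an LEB128-style encoding is assumed here: 7 payload bits per byte, with the high bit marking continuation):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a varint: 7 bits per byte,
    high bit set on all bytes except the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at `pos`; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        value |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return value, pos
        shift += 7
```

With this scheme, block indices up to 127 take 1 byte, up to 16,383 take 2 bytes, and up to 2,097,151 take 3 bytes; only indices beyond that need a 4th byte.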

Search changes

Now to look up a record in the BAM, we need to:

  • Find the QNAME in data (same as before)
  • Read its block number
  • Look up the block position in the blocks index. If the file is uncompressed, this is just a seek to byte position block_number * 8
  • Go to that position in the BAM file and do a linear scan until that QNAME is found

Results

I did the indexing changes, and the resulting file sizes were:

6.0G  qname.data.bgz
531K  qname.index.bgz
71M   qname.blocks

That's almost a 50% reduction in index size, so pretty good.

I didn't have time to check the effect on search performance yet, but I don't think it would be too bad. In the average case, we'd have to scan a single block in the BAM, which has a maximum size of 64 KB.

Thoughts

  • I didn't compress the .blocks file to allow for seeking, and 71 MB is small enough. But we could compress it, read it all into memory and then use that to retrieve the block positions.
  • Instead of having an additional .blocks file, could we just store the block pointer in .data.bgz and hope that compression takes care of things? Answer: I don't think it would work because the compression is done on chunks of QNAMEs, which don't necessarily contain the same block pointers.
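The compressed-.blocks idea in the first bullet could look roughly like this (plain gzip is assumed here; the actual file might use BGZF like the other index files):

```python
import gzip
import struct

def load_block_positions(path: str) -> list:
    """Decompress the whole .blocks file and return all pointers in memory."""
    with gzip.open(path, "rb") as f:
        data = f.read()
    # 8 bytes per pointer, one after the other (little-endian assumed)
    return [struct.unpack_from("<Q", data, i)[0] for i in range(0, len(data), 8)]
```

At ~9 million blocks this would be a list of ~9 million integers, which fits comfortably in memory.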