Smaller index size by storing block-level pointers only #22

Open
robinst opened this issue Sep 25, 2020 · 0 comments

robinst commented Sep 25, 2020

Writing this up in an issue so that it's not lost: As of writing this, the index size for a 120 GB file is 11 GB. To reduce it further, I looked into storing only block-level pointers.

Index changes

Currently, in the .data.bgz file we store the virtual offset (the location of the record in the BAM file) with each QNAME. That means each offset is a distinct 8-byte number, which is hard to compress.

Instead of storing individual offsets, we can:

  • On the initial scan, remember the position of the first record per block. Write those positions to a .blocks file, 8 bytes per pointer, one after the other. Can be compressed or uncompressed.
  • When storing the value for a QNAME in .data.bgz, instead of storing 8 bytes, just store a block index (block number). In the 120 GB file there are ~9 million blocks, so a fixed-width block index would take 3 bytes at worst. Because we don't know in advance how many blocks there are, using a variable-length encoding like varint makes sense: earlier indexes take only 1, 2, or 3 bytes, without putting a (potentially too low) limit on the number of blocks.
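As a sketch of the varint idea (an LEB128-style encoding is assumed here: 7 payload bits per byte, with the high bit marking continuation):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a varint: 7 bits per byte,
    high bit set on all bytes except the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at `pos`; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        value |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return value, pos
        shift += 7
```

With this scheme, block indices up to 127 take 1 byte, up to 16,383 take 2 bytes, and up to 2,097,151 take 3 bytes; only indices beyond that need a 4th byte.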

Search changes

Now to look up a record in the BAM, we need to:

  • Find the QNAME in data (same as before)
  • Read its block number
  • Look up the block position in the blocks index. If the file is uncompressed, this is just a seek to byte position block_number * 8
  • Go to that position in the BAM file and do a linear scan until that QNAME is found

Results

I did the indexing changes, and the resulting file sizes were:

6.0G  qname.data.bgz
531K  qname.index.bgz
71M   qname.blocks

That's almost a 50% reduction in index size, so pretty good.

I didn't have time to check the effect on search performance yet, but I don't think it would be too bad. In the average case, we'd have to scan a single block in the BAM, which has a maximum size of 64 KB.

Thoughts

  • I didn't compress the .blocks file to allow for seeking, and 71 MB is small enough. But we could compress it, read it all into memory and then use that to retrieve the block positions.
  • Instead of having an additional .blocks file, could we just store the block pointer in .data.bgz and hope that compression takes care of things? Answer: I don't think it would work because the compression is done on chunks of QNAMEs, which don't necessarily contain the same block pointers.
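The compressed-.blocks idea in the first bullet could look roughly like this (plain gzip is assumed here; the actual file might use BGZF like the other index files):

```python
import gzip
import struct

def load_block_positions(path: str) -> list:
    """Decompress the whole .blocks file and return all pointers in memory."""
    with gzip.open(path, "rb") as f:
        data = f.read()
    # 8 bytes per pointer, one after the other (little-endian assumed)
    return [struct.unpack_from("<Q", data, i)[0] for i in range(0, len(data), 8)]
```

At ~9 million blocks this would be a list of ~9 million integers, which fits comfortably in memory.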