Skip to content

Bloom filter: what is the optimal number of bits to be set?

Frédéric Mahé edited this page Mar 25, 2024 · 3 revisions

That internal parameter applies when using the --fastidious option.

For each entry in the Bloom filter, a certain number of bits are used, represented by the variable bits (ranging from 2 to 64). By default it is 16. Some of these bits, k, are set when storing an item in the Bloom filter. The fraction of the bits that should be set is in theory 0.693 (log(2)) of all the bits, i.e. 11 of 16 by default. Now, the fraction is just to 0.4. Perhaps it should be adjusted.

The fraction of bits to set in the Bloom filter could also be determined experimentally on some typical datasets. It only influences the performance (speed) of the algorithm, not the results. It might be that 0.4 is better than 0.693 in practice.

The Bloom filter code divides the whole memory area (the bitmap) into 64-bit values. It uses pseudo-random 64-bit patterns with k bits set to indicate the presence of certain sequences (1,024 patterns in bloompat vs 65,536 patterns in bloomflex). The hash of each sequence is used to select a pattern and a region of the bitmap. The optimal values for both k and the number of patterns could be investigated.

With default settings the number of bits to set is 6 (floor(16 * 4 / 10)). Would swarm be faster if a different number of bits was set?

Let's find out!

experimental design:

  • test both d = 1 and fastidious,
  • three different datasets (18S V4, 18S V9, 16S V3V4) with the same number of nucleotides but different lengths and number of sequences,
  • two compilers (gcc and clang),
  • test all conditions (all possible number of bits set, for all possible bit numbers),
  • number of conditions = 2 + 3 + ... + 64 = 2,016
  • times the number of replicates (x 30?)
  • check output validity once per condition,
  • record memory footprint and duration
  • modify swarm's code: replace 4.0 / 10 with n / d where n is our target number of bits to set, and d is our number of bits (16 by default),
  • produce 4,032 binaries (ventilate into different folders)
  • example binary name: swarm_gcc_16_11
  • first pass to check that they all produce the expected results,
  • start testing default (16 bits) first, then extend to adjacent values,
  • final number of tests is: 4,032 * 3 * 31 (including the first verification pass)

open questions:

  • multithreaded or not?