I have tested rbloom, and it is really fast. However, it would benefit from a vectorized insert and query API. For example, it could accept a NumPy array or a PyArrow array of keys and return an array of results, instead of requiring one Python-level call per key.
import time
import uuid

from rbloom import Bloom

print("generating data")
N = 1000000
data = [uuid.uuid4() for i in range(N)]      # keys inserted into the filter
testdata = [uuid.uuid4() for i in range(N)]  # fresh keys, used to measure false positives
print("Number of keys", len(data))

bf = Bloom(len(data), 0.00001)
for d in data:
    bf.add(d)

# Every inserted key must be reported as present.
for d in data:
    assert d in bf

# Time one membership test per key, in a Python-level loop.
count = 0
start = time.time()
for x in testdata:
    count += x in bf
end = time.time()

querytime = end - start
fpp = count / N * 100.0
print(
    "false positive rate",
    "{:.5f}".format(fpp),
    "%",
    ", memory per key",
    "{:.1f}".format(bf.size_in_bits / N),
    "bits",
    ", millions of queries per second: ",
    "{:.2f}".format(N / querytime / 1000000),
    ", total memory",
    "{:.2f}".format(bf.size_in_bits / 8 / 1024.0 / 1024.0),
    "MiB",
)

Output
generating data
Number of keys 1000000
false positive rate 0.00100 % , memory per key 24.0 bits , millions of queries per second: 8.89 , total memory 2.86 MiB
My test environment is an Apple M4 Pro.
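To make the request concrete, here is a rough sketch of the interface shape I have in mind. The names `add_many` and `contains_many` are hypothetical and are not part of rbloom. The sketch only assumes an object that supports `add` and `in`, so a plain `set` stands in for the filter here. It still loops in Python, so it gives no speedup. A native version would run the loop inside the library and could return a NumPy or PyArrow boolean array instead of a list.

```python
from typing import Any, Hashable, Iterable, List


def add_many(filter_: Any, keys: Iterable[Hashable]) -> None:
    """Insert every key (hypothetical helper, not an rbloom function)."""
    for key in keys:
        filter_.add(key)


def contains_many(filter_: Any, keys: Iterable[Hashable]) -> List[bool]:
    """Return one membership flag per key, in input order (hypothetical helper)."""
    return [key in filter_ for key in keys]


# A plain set stands in for the Bloom filter so the sketch runs without rbloom;
# any object with add() and `in` support is assumed to work the same way.
stand_in = set()
add_many(stand_in, ["a", "b", "c"])
result = contains_many(stand_in, ["a", "x", "c"])
print(result)
```

With a real Bloom filter the flags for keys that were never inserted may occasionally be True, at the configured false positive rate.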