A disk-backed in-memory database of HyperLogLog data structures for estimating cardinality of many distinct sets. It uses memory mapped files to keep recently used data in memory and let's the OS layer sync data between disk and memory as space allows. A HyperLogLog is a near-optimal cardinality estimation algorithm. I built this key-value style database to facilitate a need we have at pubnub for tracking continuously growing sets for each customer for each granular period of time. HyperLogLog's ability to add values and estimate in cardinality in constant time and memory is essential. Sparse data sets also compress quite well.
This library contains a modified version of the HyperLogLog implementation by Vasiliy Evseenko: https://github.com/svpcom/hyperloglog
The original description of the HyperLogLog data structure can be found here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.4286
Either:
sudo pip install hyperloglogdb
Or:
sudo python setup.py install
- numpy is used for efficient merging of sets
>>> import hyperloglogdb
>>>
>>> my_hlldb = hyperloglogdb.HyperLogLogDB(file_path='my_hlldb.db', error_rate=0.01)
>>>
>>> # or:
>>> f = open('my_hlldb.db', 'r+b')
>>> my_hlldb = hyperloglogdb.HyperLogLogDB(fileobj=f, error_rate=0.01)
>>>
>>> # Add 'test_val' to the set stored at key 'test_key'
>>> my_hlldb.add('test_key', 'test_val')
>>> print my_hlldb.count('test_key')
1
>>> print len(my_hlldb.get_hll('test_key'))
1
>>> my_hlldb.flush()
>>> # you can now copy or compress my_hlldb.db and open it at a later time
>>>
>>> my_hlldb2 = hyperloglogdb.HyperLogLogDB(file_path='my_hlldb2.db', error_rate=0.01)
>>> my_hlldb2.add('test_key', 'test_val2')
>>> my_hlldb.merge(my_hlldb2)
>>> print my_hlldb.count('test_key')
2
A module abstracting a slice of a larger memory mapped file (python mmap
)
- mmap_file - the
mmap
file object - length - length in bytes of this slice
- offset - offset in bytes from the start of the mmap_file
Returns the number of occurrences of val in the slice
- val - a byte to search for
A single instance of a HyperLogLog data structure
- error_rate - ( float ) the approx. percentage error rate. This determines the size of the data.
- data - ( MmapSlice ) the data slice where this hyper log log should be stored
- bitcount_arr - ( list ) optionally include a pre-generated bitcount array so it doesn't need to be regenerated for each HLL in the DB.
Adds a single value to the set
- val - ( string ) to add to the set
Merges either a single
HyperLogLog
or a list ofHyperLogLog
s into the current data structure
- others - ( HyperLogLog or list of HyperLogLogs ) to merge into the set
Returns the estimated cardinality of the set
A disk-backed key-value stores of HyperLogLog
data structures
- file_path - ( string ) a relative path to the location of the file storing the data. If the file does not exist it will be created. Either file_path or fileobj must be provided.
- fileobj - ( file object ) a file object containing the file for storing data. Either file_path or fileobj must be provided.
- error_rate - ( float ) the approx. percentage error rate. This determines the size of each
HyperLogLog
.
Syncs any in-memory updates to disk
Creates an empty
HyperLogLog
data structure and returns it.
- key - ( string ) the key that the HyperLogLog is associated with
Returns the
HyperLogLog
associated with key orNone
if the key does not exist
- key - ( string ) the key that the HyperLogLog is associated with
Merges either a single
HyperLogLogDB
or a list ofHyperLogLogDB
s into the current database. If a key in others does not exist in the current structure it will be created.
- others - ( HyperLogLogDB or list of HyperLogLogDBs ) to merge into the database
Merges either a single
HyperLogLog
or a list ofHyperLogLog
s into the HLL associated with key. If key does not exist in the current structure it will be created.
- key - ( string ) the key that the HyperLogLog is associated with
- others - ( HyperLogLog or list of HyperLogLogs ) to merge into the HLL associated with key
Copies the data from from_hll over the data in to_hll
- from_hll - ( HyperLogLog ) the HyperLogLog instance to copy data from
- to_hll - ( HyperLogLog ) the HyperLogLog instance to copy data to
Add a value to the
HyperLogLog
associated with key and create key if it does not exist.
- key - ( string ) the key that the HyperLogLog is associated with
- val - ( string ) the value to add to the set
Returns the estimated cardinality of the
HyperLogLog
associated with key or 0 if key does not exist
- key - ( string ) the key that the HyperLogLog is associated with