Skip to content

A disk-based database of HyperLogLog data structures for estimating cardinality of many distinct sets

License

Notifications You must be signed in to change notification settings

zacwitte/HyperLogLogDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HyperLogLogDB

A disk-backed in-memory database of HyperLogLog data structures for estimating cardinality of many distinct sets. It uses memory mapped files to keep recently used data in memory and let's the OS layer sync data between disk and memory as space allows. A HyperLogLog is a near-optimal cardinality estimation algorithm. I built this key-value style database to facilitate a need we have at pubnub for tracking continuously growing sets for each customer for each granular period of time. HyperLogLog's ability to add values and estimate in cardinality in constant time and memory is essential. Sparse data sets also compress quite well.

This library contains a modified version of the HyperLogLog implementation by Vasiliy Evseenko: https://github.com/svpcom/hyperloglog

The original description of the HyperLogLog data structure can be found here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.4286

Install

Either:

sudo pip install hyperloglogdb

Or:

sudo python setup.py install

Dependencies

  • numpy is used for efficient merging of sets

Examples

>>> import hyperloglogdb
>>>
>>> my_hlldb = hyperloglogdb.HyperLogLogDB(file_path='my_hlldb.db', error_rate=0.01)
>>>
>>> # or:
>>> f = open('my_hlldb.db', 'r+b')
>>> my_hlldb = hyperloglogdb.HyperLogLogDB(fileobj=f, error_rate=0.01)
>>>
>>> # Add 'test_val' to the set stored at key 'test_key'
>>> my_hlldb.add('test_key', 'test_val')
>>> print my_hlldb.count('test_key')
1
>>> print len(my_hlldb.get_hll('test_key'))
1
>>> my_hlldb.flush()
>>> # you can now copy or compress my_hlldb.db and open it at a later time
>>>
>>> my_hlldb2 = hyperloglogdb.HyperLogLogDB(file_path='my_hlldb2.db', error_rate=0.01)
>>> my_hlldb2.add('test_key', 'test_val2')
>>> my_hlldb.merge(my_hlldb2)
>>> print my_hlldb.count('test_key')
2

Documentation

class hyperloglogdb.MmapSlice( mmap_file, length, offset=0 )

A module abstracting a slice of a larger memory mapped file (python mmap)

  • mmap_file - the mmap file object
  • length - length in bytes of this slice
  • offset - offset in bytes from the start of the mmap_file

count( val )

Returns the number of occurrences of val in the slice

  • val - a byte to search for

class hyperloglogdb.HyperLogLog( error_rate, data, bitcount_arr=None )

A single instance of a HyperLogLog data structure

  • error_rate - ( float ) the approx. percentage error rate. This determines the size of the data.
  • data - ( MmapSlice ) the data slice where this hyper log log should be stored
  • bitcount_arr - ( list ) optionally include a pre-generated bitcount array so it doesn't need to be regenerated for each HLL in the DB.

add( val )

Adds a single value to the set

  • val - ( string ) to add to the set

update( others )

Merges either a single HyperLogLog or a list of HyperLogLogs into the current data structure

  • others - ( HyperLogLog or list of HyperLogLogs ) to merge into the set

length()

Returns the estimated cardinality of the set

class hyperloglogdb.HyperLogLogDB( file_path=None, fileobj=None, error_rate=0.01 )

A disk-backed key-value stores of HyperLogLog data structures

  • file_path - ( string ) a relative path to the location of the file storing the data. If the file does not exist it will be created. Either file_path or fileobj must be provided.
  • fileobj - ( file object ) a file object containing the file for storing data. Either file_path or fileobj must be provided.
  • error_rate - ( float ) the approx. percentage error rate. This determines the size of each HyperLogLog.

flush()

Syncs any in-memory updates to disk

create( key )

Creates an empty HyperLogLog data structure and returns it.

  • key - ( string ) the key that the HyperLogLog is associated with

get_hll( key )

Returns the HyperLogLog associated with key or None if the key does not exist

  • key - ( string ) the key that the HyperLogLog is associated with

merge( others )

Merges either a single HyperLogLogDB or a list of HyperLogLogDBs into the current database. If a key in others does not exist in the current structure it will be created.

  • others - ( HyperLogLogDB or list of HyperLogLogDBs ) to merge into the database

update( key, others )

Merges either a single HyperLogLog or a list of HyperLogLogs into the HLL associated with key. If key does not exist in the current structure it will be created.

  • key - ( string ) the key that the HyperLogLog is associated with
  • others - ( HyperLogLog or list of HyperLogLogs ) to merge into the HLL associated with key

copy_hll( from_hll, to_hll )

Copies the data from from_hll over the data in to_hll

  • from_hll - ( HyperLogLog ) the HyperLogLog instance to copy data from
  • to_hll - ( HyperLogLog ) the HyperLogLog instance to copy data to

add( key, val )

Add a value to the HyperLogLog associated with key and create key if it does not exist.

  • key - ( string ) the key that the HyperLogLog is associated with
  • val - ( string ) the value to add to the set

count( key )

Returns the estimated cardinality of the HyperLogLog associated with key or 0 if key does not exist

  • key - ( string ) the key that the HyperLogLog is associated with

About

A disk-based database of HyperLogLog data structures for estimating cardinality of many distinct sets

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages