A stream sampler maintains one or more simple random samples, each with a fixed number of elements. As stream elements become available, the samples are updated to remain simple random samples.

LiorKogan/StreamSampler

StreamSampler

A header-only C++11 library

Copyright © 2015 Lior Kogan (koganlior1 [at] gmail [dot] com)

Released under the Apache License, Version 2.0

--

A stream is a sequence of data elements made available over time. The number of elements in the stream is usually unknown a priori and can be very large.

A simple random sample of a stream is a subset of the stream elements, such that each stream element (from the start of the sampling till the latest available element) has an equal probability of being included in the sample.

A stream sampler maintains one or more simple random samples, each with a fixed number of elements. As stream elements become available, the samples are updated to remain simple random samples. Stream samplers are implemented using online algorithms: the size of the stream is unknown, and only one pass over the stream is possible. The time complexity of stream samplers is linear or sub-linear in the number of stream elements, and the space complexity is constant with respect to the stream length (it depends only on the sample size).

The following seven randomized reservoir algorithms for unweighted sampling without replacement are implemented: R, X, Y, Z, K, L, and M.

Algorithm R is the standard 'textbook algorithm'. Algorithms X, Y, Z, K, L, and M offer a large performance improvement by drawing the number of stream elements to skip at each stage, so far fewer random numbers need to be generated, especially for large streams (hence the sub-linear time complexity). Z, K, L, and M are typically much faster than R, and M is usually the fastest.
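To make the cost of the textbook approach concrete, here is a minimal sketch of Algorithm R as it is commonly described. It is not taken from this library: the class name ReservoirR and its members Add, Sample, and Seen are hypothetical and chosen only for illustration. Note that one random number is drawn for every element after the fill phase, which is the work the skip-based algorithms avoid.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper, not part of StreamSampler: textbook Algorithm R.
// Keeps a uniform random sample of up to n of the elements passed to Add().
template <typename T>
class ReservoirR {
public:
    ReservoirR(std::size_t n, unsigned seed) : n_(n), seen_(0), rng_(seed) {}

    void Add(const T& x) {
        ++seen_;
        if (sample_.size() < n_) {   // fill phase: keep the first n elements
            sample_.push_back(x);
            return;
        }
        // Draw j uniformly from [0, seen_ - 1]; the new element replaces
        // slot j only when j falls inside the reservoir (probability n / seen_).
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        const std::size_t j = pick(rng_);
        if (j < n_) sample_[j] = x;
    }

    const std::vector<T>& Sample() const { return sample_; }
    std::size_t Seen() const { return seen_; }

private:
    std::size_t n_;
    std::size_t seen_;
    std::mt19937 rng_;
    std::vector<T> sample_;
};
```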

In the papers that introduced these algorithms, the algorithm itself controls element fetching from the stream: it calls an external function, GetNextElement(). Such flow control is usually less suitable for real-world scenarios. In this implementation, the algorithms were reformulated so that a process can fetch elements from the stream itself and pass them to a member function of the stream sampler class, AddElement. This function returns the number of stream elements the caller should skip before calling it again.
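The push-style flow described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class name SkipSampler, its constructor, Sample(), and the driver SampleInts are hypothetical, and the skip computation follows the commonly published form of Algorithm L rather than anything read from this repository. The sketch assumes a sample size of at least 1.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch, not the StreamSampler class itself. Assumes n >= 1.
template <typename T>
class SkipSampler {
public:
    SkipSampler(std::size_t n, unsigned seed)
        : n_(n), w_(1.0), rng_(seed), unit_(0.0, 1.0) {}

    // Accepts one element and returns how many upcoming stream elements
    // the caller should discard before calling AddElement again.
    std::size_t AddElement(const T& x) {
        if (sample_.size() < n_) {
            sample_.push_back(x);                 // fill phase
            if (sample_.size() < n_) return 0;    // still filling: skip nothing
        } else {
            // The caller has already skipped, so x always enters the sample.
            std::uniform_int_distribution<std::size_t> slot(0, n_ - 1);
            sample_[slot(rng_)] = x;
        }
        // Algorithm L style update (assumed form): shrink w, then draw the skip.
        w_ *= std::exp(std::log(Unit()) / static_cast<double>(n_));
        return static_cast<std::size_t>(
            std::floor(std::log(Unit()) / std::log1p(-w_)));
    }

    const std::vector<T>& Sample() const { return sample_; }

private:
    // Uniform draw with 0 excluded, so std::log never receives 0.
    double Unit() {
        double u;
        do { u = unit_(rng_); } while (u <= 0.0);
        return u;
    }

    std::size_t n_;
    double w_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> unit_;
    std::vector<T> sample_;
};

// Example driver: feeds 0..count-1 and honours the returned skip counts.
inline std::vector<int> SampleInts(int count, std::size_t n, unsigned seed) {
    SkipSampler<int> s(n, seed);
    std::size_t skip = 0;
    for (int i = 0; i < count; ++i) {
        if (skip > 0) { --skip; continue; }   // discard without touching the sampler
        skip = s.AddElement(i);
    }
    return s.Sample();
}
```

The caller owns the loop: skipped elements never reach the sampler, and no random numbers are drawn for them.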

This implementation also extends the algorithms by supporting the construction of multiple independent samples.

Two versions of AddElement are provided: one with copy semantics (AddElement(const ElementType& Element)) and one with move semantics (AddElement(ElementType&& Element)).
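The difference between the two overloads shows up at the call site. The sketch below uses a hypothetical, simplified class (OverloadDemo, with an Items() accessor) that stores every element and does no sampling; only the two AddElement parameter types mirror the signatures quoted above, and the void return type is a simplification.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical container used only to illustrate the overload pair.
template <typename ElementType>
class OverloadDemo {
public:
    // Copy semantics: the caller's object is left untouched.
    void AddElement(const ElementType& Element) { items_.push_back(Element); }

    // Move semantics: the element's resources are transferred, not copied.
    void AddElement(ElementType&& Element) { items_.push_back(std::move(Element)); }

    const std::vector<ElementType>& Items() const { return items_; }

private:
    std::vector<ElementType> items_;
};

inline OverloadDemo<std::string> OverloadDemoUsage() {
    OverloadDemo<std::string> d;
    std::string kept = "kept by caller";
    d.AddElement(kept);                      // lvalue: selects the copy overload
    std::string given = "handed over";
    d.AddElement(std::move(given));          // rvalue: selects the move overload
    d.AddElement(std::string("temporary"));  // temporary: also the move overload
    return d;
}
```

The move overload is worth using when elements are expensive to copy and the caller no longer needs them after the call.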

StreamSamplerTest contains a usage example, StreamSamplerExample(); a comparative performance benchmark function, StreamSamplerPerformanceBenchmark(); and a uniformity test function, StreamSamplerTestUniformity().
