Skip to content

Distributed Random Reservoir Sampling Algorithm for Streaming Data

Notifications You must be signed in to change notification settings

hippalus/stream-sampler

Repository files navigation

stream-sampler

Distributed Random Reservoir Sampling Algorithm for Streaming Data

A simple in memory implementation of Reservoir Sampling with replacement and with only one pass through the input iteration whose size is unpredictable. In the first phase, we generate weights for each element K times, so that each element can get selected multiple times. This implementation refers to the algorithm described in Weighted random sampling with a reservoir.

But when we use on a single instance and single thread or thread parallelism they have some problem in terms of CPU and memory consumption. So they need to data partitioning and algorithm scalability. When the application is run with stdin, stdin is read with BufferedReader and each data is published to Kafka topic.

A KafkaStream topology is start, which consumes the topic created with sample size information and runs the mentioned above algorithm on the streaming data. This application is compatible for scalability.For example, if the character-kafka-topic has 10 partitions, this application can be scaled up to 10 instances. However, it needs some refactoring, such as separating the producer and the consumer (KStream Topology) into different processes. By using redis priority queue and redis lock for random sample, we ensured that the application can run stateless when scaled and the algorithm is consistent.

The KStream Application will subscribe to topic and continue to consume continuously, unless it is shut down or there is no exception. The output of the application (sample) log periodically , not in the end of KStream because of the end of the KStream is hard to predict.

TODO : Perhaps Rest endpoint can be developed to retrieve that the sample at any time.

System Design

g drawio (1)

Requirements and Run

  • docker, docker-compose
  • java >= 17
  • maven >= 3.8.1
  cd projectdirectory

  mvn clean install -DskipTests

  docker-compose up -d

Make sure the Docker containers are successfully installed and up.

  cat input.txt | ./stream-sampler -n SAMPLE_SIZE

  or

  dd if=/dev/urandom count=100 bs=10MB | base64 | ./stream-sampler -n SAMPLE_SIZE

  or

  ./stream-sampler -n SAMPLE_SIZE -g INPUT_SIZE
usage: cat file.txt | ./stream-sampler.sh -n SAMPLE SIZE
Creates a random representative sample of length SAMPLE SIZE out of the input.
Input is either STDIN, or randomly generated within application.

 -g,--generate <INPUT_SIZE>   Size of random input
    --help
 -n,--size <SAMPLE_SIZE>      Sample size
  

Unit and Integration Tests

Make sure docker containers are not running. Integration tests you may get the error Port Already in Used.

docker kill $(docker ps -q)
mvn clean test

Code Covarage

Screenshot from 2022-01-28 13-22-05

Screenshot from 2022-01-28 13-23-47

Used Technologies

Java 17, Spring Boot, KafkaStream, Redis, Docker

About

Distributed Random Reservoir Sampling Algorithm for Streaming Data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages