stream-sampler

Distributed Random Reservoir Sampling Algorithm for Streaming Data

A simple in memory implementation of Reservoir Sampling with replacement and with only one pass through the input iteration whose size is unpredictable. In the first phase, we generate weights for each element K times, so that each element can get selected multiple times. This implementation refers to the algorithm described in Weighted random sampling with a reservoir.

But when we use on a single instance and single thread or thread parallelism they have some problem in terms of CPU and memory consumption. So they need to data partitioning and algorithm scalability. When the application is run with stdin, stdin is read with BufferedReader and each data is published to Kafka topic.

A KafkaStream topology is start, which consumes the topic created with sample size information and runs the mentioned above algorithm on the streaming data. This application is compatible for scalability.For example, if the character-kafka-topic has 10 partitions, this application can be scaled up to 10 instances. However, it needs some refactoring, such as separating the producer and the consumer (KStream Topology) into different processes. By using redis priority queue and redis lock for random sample, we ensured that the application can run stateless when scaled and the algorithm is consistent.

The KStream Application will subscribe to topic and continue to consume continuously, unless it is shut down or there is no exception. The output of the application (sample) log periodically , not in the end of KStream because of the end of the KStream is hard to predict.

TODO : Perhaps Rest endpoint can be developed to retrieve that the sample at any time.

System Design

Requirements and Run

docker, docker-compose
java >= 17
maven >= 3.8.1

  cd projectdirectory

  mvn clean install -DskipTests

  docker-compose up -d

Make sure the Docker containers are successfully installed and up.

  cat input.txt | ./stream-sampler -n SAMPLE_SIZE

  or

  dd if=/dev/urandom count=100 bs=10MB | base64 | ./stream-sampler -n SAMPLE_SIZE

  or

  ./stream-sampler -n SAMPLE_SIZE -g INPUT_SIZE

usage: cat file.txt | ./stream-sampler.sh -n SAMPLE SIZE
Creates a random representative sample of length SAMPLE SIZE out of the input.
Input is either STDIN, or randomly generated within application.

 -g,--generate <INPUT_SIZE>   Size of random input
    --help
 -n,--size <SAMPLE_SIZE>      Sample size

Unit and Integration Tests

Make sure docker containers are not running. Integration tests you may get the error Port Already in Used.

docker kill $(docker ps -q)

mvn clean test

Code Covarage

Used Technologies

Java 17, Spring Boot, KafkaStream, Redis, Docker

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
.mvn/wrapper		.mvn/wrapper
src		src
.gitignore		.gitignore
README.md		README.md
docker-compose.yaml		docker-compose.yaml
input.txt		input.txt
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml
stream-sampler		stream-sampler

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

stream-sampler

System Design

Requirements and Run

Unit and Integration Tests

Code Covarage

Used Technologies

About

Releases

Packages

Languages

hippalus/stream-sampler

Folders and files

Latest commit

History

Repository files navigation

stream-sampler

System Design

Requirements and Run

Unit and Integration Tests

Code Covarage

Used Technologies

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages