GitHub - pythian/spark_streaming_percentile: This is the repository for my blog post on calculating percentile on a streaming dataset using spark streaming.

##Purpose

This is a repo for my blog post on the Pythian blog.

##Prerequisites

Have Docker, Python 2.7, JRE 1.7 and Scala installed, plus basic familiarity with Spark and the concept of RDDs.

##Setup

1. Clone this repo.
2. Create a virtualenv for this project: `mkvirtualenv streaming_percentile` (optional).
3. Install requirements using `pip install -r requirements.pip`.
4. Install the Kafka Docker container. If you already have a Kafka cluster set up, you can skip this step.
   * On Mac: `docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=$(boot2docker ip) --env ADVERTISED_PORT=9092 spotify/kafka`
   * On Linux: `docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=127.0.0.1 --env ADVERTISED_PORT=9092 spotify/kafka`
   * More info on the container can be found here.
   * Download and extract the Kafka binaries. This location will now be referred to as `KAFKA_HOME`.
5. Install Spark.
   * This can be run either locally or in a Docker container.
   * To run locally, download the binaries here.
   * `wget http://apache.claz.org/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz`
   * `tar -xvf spark-1.4.0-bin-hadoop2.6.tgz`
   * Run the pyspark shell to confirm the install using `./bin/pyspark`
   * If you have IPython installed, you can also run the shell with IPython using `IPYTHON=1 ./bin/pyspark`
   * You can also run Spark as a Docker container using `docker run -i -t -h sandbox -p 8888:8888 -v my_code:/app anantasty/ubuntu_spark_ipython:1.0 bash`
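
Once the Kafka container is running, it can help to confirm that the two published ports (2181 for ZooKeeper, 9092 for Kafka) accept connections before starting any of the streaming scripts. The snippet below is a minimal sketch using only the Python standard library; `port_open` is a hypothetical helper and not part of this repo, and the host is assumed to be `127.0.0.1` as in the Linux command above (on a Mac, substitute the boot2docker IP).

```python
from __future__ import print_function

import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or timed out.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Assumption: ports are published on 127.0.0.1 (the Linux setup).
    host = "127.0.0.1"
    for name, port in (("ZooKeeper", 2181), ("Kafka", 9092)):
        state = "reachable" if port_open(host, port) else "NOT reachable"
        print("%s on %s:%d is %s" % (name, host, port, state))
```

This only checks that something is listening on each port; it does not verify that the broker is healthy or that `ADVERTISED_HOST` is set correctly.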
