PySparkStreaming

Reads streaming data from Twitter and displays aggregated bar plots based on hashtags.
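
To make the hashtag aggregation concrete, here is a minimal sketch in plain Python. It is not the code from the notebooks: the function names, the regular expression, and the sample tweets are assumptions made up for illustration, and the notebooks do the equivalent counting with Spark on a live stream.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Count hashtags (case-insensitively) across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    return count_hashtags(tweets).most_common(n)


# Made-up sample texts, not real Twitter data.
sample = [
    "Learning #Spark streaming with #Python",
    "#spark makes aggregation easy",
    "No tags here",
]

if __name__ == "__main__":
    for tag, count in top_hashtags(sample, 2):
        print(tag, count)
```

The (tag, count) pairs returned by top_hashtags are the kind of data a bar plot would be drawn from.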

Twitter account setup:
Set up an account at developer.twitter.com, accept the developer agreement, keep the default options, and provide the reason for requesting the account.
Then create an app and set up its credentials at https://developer.twitter.com/en/apps (ex: https://anirbank.twitter.com).

Install the following libraries:

sudo apt-get install default-jre (the default Linux JRE; make sure version 1.8.x is installed)
sudo apt-get install scala (check for version 2.11.6-6)
sudo pip3 install py4j
Download Spark (version 2.1 with Hadoop 2.7 works best) and extract the tar file:
wget -q https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
pip install pyspark
pip install findspark
pip install python-twitter (a Python library that connects your Python code to the Twitter developer account)
pip install tweepy
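
After installing, a quick check that the Python packages can be imported may save time later. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and the module list assumes the usual import names (python-twitter is imported as twitter).

```python
import importlib.util

# Import names for the packages listed above (python-twitter imports as "twitter").
REQUIRED_MODULES = ["py4j", "pyspark", "findspark", "twitter", "tweepy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required Python packages are importable.")
```

Run it with the same interpreter (python3) that will run the notebooks, since pip and pip3 can install into different environments.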

Set the following environment variables:
export SPARK_HOME=.... (path where the spark-2.1.1-bin-hadoop2.7 is extracted)
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

In addition, you might have to set JAVA_HOME (I am using the default Linux JRE, so I do not set it).
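
To confirm the variables are visible to Python before starting a notebook, something like the following sketch can help. The helper name missing_spark_vars is hypothetical, the path /opt/spark in the usage note is a placeholder, and PYSPARK_DRIVER_PYTHON_OPTS is the standard Spark name for the driver options variable.

```python
import os

# Variables this section asks you to export.
REQUIRED_VARS = (
    "SPARK_HOME",
    "PYTHONPATH",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
)


def missing_spark_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_spark_vars()
    if missing:
        print("Unset variables:", ", ".join(missing))
    else:
        print("SPARK_HOME =", os.environ["SPARK_HOME"])
```

For example, missing_spark_vars({"SPARK_HOME": "/opt/spark"}) reports the other three names as missing. Once SPARK_HOME is set, calling findspark.init() at the top of a notebook should pick it up and make pyspark importable.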
