Reads streaming data from Twitter and displays aggregated bar plots based on hashtags.
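The aggregation step can be pictured without Spark or the Twitter API. The following is a minimal sketch in plain Python, assuming tweets arrive as plain strings. The sample tweets, the regular expression and the count_hashtags name are illustrative assumptions, not code from this project:

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG = re.compile(r"#\w+")

def count_hashtags(tweets):
    """Count hashtags case-insensitively across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

# Made-up sample input, standing in for the streamed tweets.
sample = ["Loving #Spark and #python", "More #spark streaming", "no tags here"]
counts = count_hashtags(sample)

# The most frequent hashtags are the ones a bar plot would show.
for tag, n in counts.most_common(10):
    print(tag, n)
```

In the real pipeline the same counting would presumably be done over a Spark stream, with the top counts passed to the plotting code.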
Twitter account setup:
Set up an account at developer.twitter.com: accept the developer agreement, keep the default options, and state the reason you need the account.
Then create an app and set up its credentials at https://developer.twitter.com/en/apps (ex: https://anirbank.twitter.com)
Install the following libraries:
sudo apt-get install default-jre (this is the default Linux JRE; make sure version 1.8.x is installed)
sudo apt-get install scala (check for version 2.11.6-6)
sudo pip3 install py4j
Download Spark (version 2.1 with Hadoop 2.7 works best) and extract the tar file:
wget -q https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
pip install pyspark
pip install findspark
pip install python-twitter (a Python library that connects your Python code to the Twitter dev account)
pip install tweepy
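To confirm that the Python packages above are importable, a small check along these lines may help. It is only a sketch: the REQUIRED_MODULES list and the missing_modules helper are illustrative names, and note that python-twitter is imported as "twitter" rather than under its pip name:

```python
import importlib.util

# Import names for the pip packages listed above.
# python-twitter installs a module named "twitter".
REQUIRED_MODULES = ["py4j", "pyspark", "findspark", "twitter", "tweepy"]

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

Any name printed as missing can be installed with the matching pip command from the list above.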
Set up the following environment variables:
export SPARK_HOME=.... (path where spark-2.1.1-bin-hadoop2.7 was extracted)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
In addition, you might have to set JAVA_HOME (I am using the default Linux JRE, so I do not set it).
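A quick way to check these variables from Python before starting the notebook is sketched below. The check_spark_env helper is a hypothetical name, and the expected values ("jupyter", "notebook") simply mirror the exports above:

```python
import os

def check_spark_env(env=None):
    """Return a list of problems found in the given environment mapping."""
    env = os.environ if env is None else env
    problems = []
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        # PYTHONPATH should include the python directory under SPARK_HOME.
        python_dir = os.path.join(spark_home, "python")
        entries = env.get("PYTHONPATH", "").split(os.pathsep)
        if python_dir not in entries:
            problems.append("PYTHONPATH does not contain " + python_dir)
    if env.get("PYSPARK_DRIVER_PYTHON") != "jupyter":
        problems.append("PYSPARK_DRIVER_PYTHON is not 'jupyter'")
    if env.get("PYSPARK_DRIVER_PYTHON_OPTS") != "notebook":
        problems.append("PYSPARK_DRIVER_PYTHON_OPTS is not 'notebook'")
    return problems

# Report anything that looks wrong in the current shell environment.
for problem in check_spark_env():
    print("warning:", problem)
```

If nothing is printed, the variables match the setup described here. JAVA_HOME is deliberately not checked, since it is optional with the default JRE.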