Track the favorite words of your favorite artist! The project is built from the following components:
- A Hadoop cluster: HDFS stores the raw lyrics and artist data in JSON format.
- HBase: the persistent database for the processed results.
- HBase Thrift server: serializes and deserializes the data flowing into and out of HBase.
- Spark: processes the raw data and saves the results into HBase.
- YARN: the resource manager and job coordinator of the Hadoop cluster.
- Flask backend server: communicates with the HBase database (through Thrift, as sketched below) and services queries from the front end.
- Bootstrap, jQuery, and D3.js: power the front end.
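All Python pieces of the project reach HBase through this Thrift server rather than talking to HBase directly. A minimal sketch of such a connection with happybase (the host and table name are placeholders, not the project's actual schema):

```python
import happybase

# happybase speaks HBase's Thrift protocol, which is why the Thrift server
# is started on port 9090 in the cluster setup below.
conn = happybase.Connection("localhost", port=9090)  # host is a placeholder
table = conn.table("artist_to_word_count")           # table name is an assumption
table.put(b"some-artist", {b"cf:hello": b"42"})      # write one cell
print(table.row(b"some-artist"))                     # read the row back
conn.close()
```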
The following dependencies are prerequisites and need to be installed first:
yum install python-devel
pip install happybase
pip install pyspark
pip install flask
pip install raven
pip install numpy
pip install gunicorn
Please refer to the official documentation of Hadoop, Spark, and HBase for how to build a cluster. This project uses the latest version of each; install them under the /usr/local directory.
To start the cluster, the following scripts need to be run:
/usr/local/hadoop/sbin/start-all.sh
/usr/local/hbase/bin/start-hbase.sh
/usr/local/hbase/bin/hbase-daemon.sh start thrift -p 9090 --infoport 9095
/usr/local/spark/sbin/start-all.sh
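If everything came up correctly, running jps on the master node should list daemons such as NameNode, ResourceManager, HMaster, ThriftServer, and the Spark Master.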
Before running the Spark jobs, the raw data has to be downloaded from s3://w251lyrics-project/lyric.json.gz and s3://w251lyrics-project/full_US.json.gz and saved to the HDFS /resources/raw_data directory as raw_lyrics.txt and raw_artists.txt respectively. The environment variables should then be exported by sourcing the setup script:
source ~/lyrics_project/scripts/env_setup.sh
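As a concrete sketch of the download step (assuming the AWS CLI is installed and has read access to the bucket):
aws s3 cp s3://w251lyrics-project/lyric.json.gz .
aws s3 cp s3://w251lyrics-project/full_US.json.gz .
gunzip lyric.json.gz full_US.json.gz
hadoop fs -mkdir -p /resources/raw_data
hadoop fs -put lyric.json /resources/raw_data/raw_lyrics.txt
hadoop fs -put full_US.json /resources/raw_data/raw_artists.txt
With the data in place, submit the four Spark jobs to YARN: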
/usr/local/spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 12 --executor-memory 6G --py-files /root/lyrics_project/scripts/pyspark/dep.zip /root/lyrics_project/scripts/pyspark/lyricid_to_artist_job.py
/usr/local/spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 12 --executor-memory 6G --py-files /root/lyrics_project/scripts/pyspark/dep.zip /root/lyrics_project/scripts/pyspark/corpus_word_count_job.py
/usr/local/spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 12 --executor-memory 6G --py-files /root/lyrics_project/scripts/pyspark/dep.zip /root/lyrics_project/scripts/pyspark/artist_to_word_count_job.py
/usr/local/spark/bin/spark-submit --master yarn --deploy-mode cluster --num-executors 12 --executor-memory 6G --py-files /root/lyrics_project/scripts/pyspark/dep.zip /root/lyrics_project/scripts/pyspark/artist_to_word_tfidf_job.py
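The job scripts live under lyrics_project/scripts/pyspark. As a rough illustration of their shape (a sketch only, not the project's actual code), a job in the spirit of corpus_word_count_job.py might count word frequencies over the raw lyrics and write the totals to HBase; the JSON field name, HBase host, table, and column family below are all assumptions:

```python
import json

import happybase
from pyspark import SparkContext

sc = SparkContext(appName="corpus_word_count_sketch")

# Count word frequencies across every lyric stored in HDFS.
counts = (
    sc.textFile("hdfs:///resources/raw_data/raw_lyrics.txt")
      .map(json.loads)                                     # one JSON record per line (assumption)
      .flatMap(lambda rec: rec["lyrics"].lower().split())  # field name "lyrics" is an assumption
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

def write_partition(pairs):
    # Open one Thrift connection per partition, not per record.
    conn = happybase.Connection("localhost", port=9090)  # host is a placeholder
    table = conn.table("corpus_word_count")              # table name is an assumption
    for word, count in pairs:
        table.put(word.encode("utf-8"), {b"cf:count": str(count).encode("utf-8")})
    conn.close()

counts.foreachPartition(write_partition)
sc.stop()
```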
This GitHub repository needs to be checked out locally on the instance or server, with lyrics_project/app as the present working directory. The web server can then be started in the background:
gunicorn -w 4 -b <host_ip_address>:80 server:app -D
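gunicorn loads the app object from server.py in lyrics_project/app. A minimal sketch of what that module might look like (the route, table, and column names are assumptions, not the project's actual API):

```python
import happybase
from flask import Flask, jsonify

app = Flask(__name__)  # gunicorn loads this object as server:app

@app.route("/api/words/<artist>")  # the route is an assumption
def favorite_words(artist):
    # Open one Thrift connection per request to keep the sketch simple.
    conn = happybase.Connection("localhost", port=9090)
    table = conn.table("artist_to_word_tfidf")  # table name is an assumption
    row = table.row(artist.encode("utf-8"))
    conn.close()
    # HBase returns bytes for both column names and values.
    return jsonify({col.decode("utf-8"): val.decode("utf-8") for col, val in row.items()})
```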