GitHub - PerryXDeng/stack_exchange_text_mining: data mining and visualization: buzzword frequency, cyclical site traffic, sentiment stream, context prediction

PerryXDeng / stack_exchange_text_mining Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

data mining and visualization: buzzword frequency, cyclical site traffic, sentiment stream, context prediction

pdeng.student.rit.edu

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 119 Commits
common		common
webserver		webserver
.gitignore		.gitignore
10k_so_ac_SVM_model.joblib		10k_so_ac_SVM_model.joblib
10k_so_ac_bayes_model.joblib		10k_so_ac_bayes_model.joblib
10k_so_se_SVM_model.joblib		10k_so_se_SVM_model.joblib
10k_so_se_bayes_model.joblib		10k_so_se_bayes_model.joblib
1_month_model.joblib		1_month_model.joblib
ETL.py		ETL.py
README.txt		README.txt
_config.yml		_config.yml
fourier_analysis.ipynb		fourier_analysis.ipynb
index.html		index.html
mysql-connector-java-8.0.16.jar		mysql-connector-java-8.0.16.jar
post_size.ipynb		post_size.ipynb
predict_site.py		predict_site.py
slurm_submission.batch		slurm_submission.batch
spark_test.ipynb		spark_test.ipynb
train.ipynb		train.ipynb

Repository files navigation

#####################
EASY WAY
#####################

We have a pre-setup webpage for you to access if you would like here:
https://pdeng.student.rit.edu

If you do want to go through the setup process please continue below, otherwise enjoy!

#####################
Database
#####################

Database Proxy
1. Download the proxy from step 2 here: https://cloud.google.com/sql/docs/mysql/connect-admin-proxy
2. ./cloud_sql_proxy -instances=<INSTANCE_CONNECTION_NAME>=tcp:3306 -credential_file=<PATH_TO_KEY_FILE>

Credentials
Add username and password to the creds.py file in the form

USERNAME = <USERNAME>
PASSWORD = <PASSWORD>

Contact [email protected] for credentials if needed.

######################
Spark
######################
Download spark: https://spark.apache.org/

Download hadoop: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1 (this is the windows version). https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-common/SingleCluster.html (I believe you can find the linux version here)

Download Scala: https://www.scala-lang.org/download/

Download a driver for MYSQL: https://dev.mysql.com/downloads/connector/j/

Install jupyter notebook packages for python if you don't already have them setup.

There is also a dependency on JDK 8 (spark appears to use a jdbc on the backend)

Here are the environment variables that are necessary on windows (they might differ slightly for linux, but are likely very similar)

HADOOP_HOME - the path to the hadoop binaries

JAVA_HOME  - path to jdk 8

SCALA_HOME - path to scala installation

SPARK_HOME - path to spark installation

PYSPARK_PYTHON - the path to your python executable

PYSPARK_DRIVER_PYTHON - the path to the location of the jupyter installation

PYSPARK_DRIVER_PYTHON_OPTS = notebook (this will let  pyspark launch in jupyter notebook)

you  can test if this all worked by typing spark-shell or pyspark (which will launch the notebook)

if you wan't to test a connection to our database, make sure to launch with the arguments:

--driver-class-path <PATH TO MYSQL CONNECTOR JAR>

--jars <PATH TO MYSQL CONNECTOR JAR>

#####################
BOKEH
#####################

Testing
To test an app over an unsecured connection, execute `bokeh serve <filename> --port <portnum>`.

Deployment
To securely deploy an app,
configure an nginx reverse proxy with SSL (https://bokeh.pydata.org/en/latest/docs/user_guide/server.html#reverse-proxying-with-nginx-and-ssl).

Then execute
`bokeh serve <filename> --use-xheaders --allow-websocket-origin=<domain> --port <portnum>`.

Example:
`bokeh serve test_app.py --use-xheaders --allow-websocket-origin=pdeng.student.rit.edu --port 5006`

Dependencies
pip3 install mysql-connector-python
pip3 install bokeh
pip3 install pandas
pip3 install joblib

###############
EXAMPLE DATA
###############

Example data can be found in worldbuilding.meta.stackexcahnge.

###############
NOTES
###############

You can adjust the visualizations by clicking on the wheel zone on the right
then placing the mouse and scrolling on the x axis and y axis to adjust x range and y range, respectively