Reddit Controversiality Analysis

Initial Attempts to read Data

Uncompressed:

json_to_sql

Compressed:

Read_DF_SQLite

PySpark Jupyter Integration

demo: spark

Part of this project uses PySpark embedded in a Jupyter environment.

Requirements

Java 7+
Python 2.6+
Apache Spark 3.0
Hadoop 2.7
Winutils.exe for Hadoop 2.6

Configure SPARK_HOME and HADOOP_HOME as system variables, and add them to Path.

Run PySpark Instance

To help Jupyter Lab bind to Spark, you can use the findspark library.
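As a minimal sketch of the findspark approach, the helper below locates Spark and returns a session. It assumes findspark and pyspark are installed and that SPARK_HOME is set as described under Requirements; the function name `start_spark` and the application name are illustrative and not taken from the project notebooks.

```python
def start_spark(app_name="reddit-controversiality"):
    """Locate the local Spark install and return a SparkSession."""
    # Imports are deferred so the helper can be defined before Spark is needed.
    import findspark

    # With no argument, findspark.init() falls back to the SPARK_HOME variable
    # and adds Spark's Python libraries to sys.path.
    findspark.init()

    # pyspark is imported only after findspark has made it importable.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()
```

Calling `spark = start_spark()` in a notebook cell should then give a session that the later cells can share.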

Option 2: Drive PySpark with Jupyter

To open PySpark in Jupyter automatically on startup, add these system variables:

set PYSPARK_DRIVER_PYTHON to jupyter
set PYSPARK_DRIVER_PYTHON_OPTS to 'lab'
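If you would rather not change system-wide settings, the same two variables can be set for a single launch. This is only a sketch: the helper name `jupyter_driver_env` is hypothetical, and the commented launch line assumes the `pyspark` launcher is on your Path.

```python
import os


def jupyter_driver_env(base_env=None):
    """Return a copy of the environment with the two PySpark driver variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "lab"
    return env


# Example launch (assumption: `pyspark` resolves on Path; on Windows the
# launcher is a .cmd script, so shell=True may be needed):
# import subprocess
# subprocess.run(["pyspark"], env=jupyter_driver_env())
```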

In-cell SQL

Insert the functions from here

New Data: Big Query Extraction

reddit-data-prep

EDA

eda_vinay_controversial

csv_analysis

Controversiality Analysis

one month data

three month data

Controversiality Prediction

log_reg_controversy

Reddit Dataset 1

This dataset contains comments from Reddit.com, as collected by Reddit user Stuck_In_The_Matrix. It is distributed as one torrent per year, from 2005 to 2017 (the last one is incomplete). Downloaded from https://files.pushshift.io/comments. Example code for working with the dataset can be found at https://github.com/dewarim/reddit-data-tools. Intended use is for scientific / non-commercial purposes.
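To give a rough idea of how a monthly dump can be loaded into SQLite, here is a sketch that uses only the standard library. It assumes the dump is a bz2-compressed file with one JSON comment per line and that each comment carries `id`, `subreddit`, `body` and `controversiality` fields. The function name `load_comments` and the table layout are illustrative, and this is not necessarily how the json_to_sql notebook does it.

```python
import bz2
import json
import sqlite3


def load_comments(dump_path, db_path, table="comments"):
    """Copy a few fields from a bz2-compressed, newline-delimited JSON dump into SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id TEXT PRIMARY KEY, subreddit TEXT, body TEXT, controversiality INTEGER)"
    )
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        # Parse lazily so the whole dump never has to fit in memory.
        rows = (
            (c.get("id"), c.get("subreddit"), c.get("body"), c.get("controversiality"))
            for c in (json.loads(line) for line in dump if line.strip())
        )
        with conn:  # commits once all rows are inserted, rolls back on error
            conn.executemany(
                f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?, ?)", rows
            )
    return conn
```

With a hypothetical file `comments.bz2`, `load_comments("comments.bz2", "comments.db")` returns an open connection, and a query such as `SELECT COUNT(*) FROM comments WHERE controversiality = 1` then counts the comments flagged as controversial.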

Reddit Dataset 2

This dataset contains three months of BigQuery data from r/worldnews.
