demo: spark
Part of this project uses PySpark, embedded in a Jupyter environment.
- Java 7+
- Python 2.6+
- Apache Spark 3.0
- Hadoop 2.7
- Winutils.exe for Hadoop 2.6

Configure SPARK_HOME and HADOOP_HOME as system variables, and add them to Path.
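A quick way to confirm the variables are visible to Python is sketched below. The helper name `missing_spark_env` is hypothetical and not part of this project; it only checks that the variables are set, not that they point to valid installations.

```python
import os


def missing_spark_env(env=None, required=("SPARK_HOME", "HADOOP_HOME")):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_spark_env()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("SPARK_HOME and HADOOP_HOME are set.")
```

If a variable is reported as not set right after you configured it, restarting the terminal or Jupyter usually helps, since running processes keep their old environment.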
To help Jupyter Lab bind to Spark, you can use the findspark library.
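A minimal sketch of how this might look in a notebook cell, assuming findspark and pyspark are installed and SPARK_HOME is set. The function name `start_local_spark`, the app name "demo", and the `local[*]` master are illustrative choices, not settings taken from this project.

```python
def start_local_spark(app_name="demo"):
    """Locate the Spark installation and return a local SparkSession."""
    # Imported inside the function so that defining it does not require
    # findspark or pyspark to be importable yet.
    import findspark  # third-party: pip install findspark

    # With no arguments, findspark looks up the installation (SPARK_HOME)
    # and adds its pyspark package to sys.path.
    findspark.init()

    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # assumption: run locally on all cores
        .appName(app_name)
        .getOrCreate()
    )
```

Calling `spark = start_local_spark()` in the first cell should then make `spark` available to the rest of the notebook.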
To open PySpark in Jupyter Lab automatically on startup, add these system variables:
- set PYSPARK_DRIVER_PYTHON to jupyter
- set PYSPARK_DRIVER_PYTHON_OPTS to 'lab'
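If you prefer not to change the system-wide settings, the same two variables can be applied to a single launch only. The sketch below builds such an environment; `jupyter_driver_env` is a hypothetical helper name, and only the two variable names and values come from the list above.

```python
import os


def jupyter_driver_env(base=None):
    """Return a copy of the environment with the Jupyter driver settings."""
    env = dict(os.environ if base is None else base)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"   # run the driver through Jupyter
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "lab"  # open Jupyter Lab
    return env
```

The returned dict can be passed as the `env` argument of `subprocess.run` when starting the `pyspark` launcher, assuming the launcher is on your Path.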
Insert the functions from here
eda_vinay_controversial csv_analysis
one month data three month data
This dataset contains comments from Reddit.com, as collected by Reddit user Stuck_In_The_Matrix. There is one torrent per year, from 2005 to 2017 (the last one is incomplete). It was downloaded from https://files.pushshift.io/comments. Example code for working with the dataset can be found at https://github.com/dewarim/reddit-data-tools. Intended use is for scientific / non-commercial purposes.
This dataset contains three months of BigQuery data from r/worldnews.