This repository contains the sample source code and presentation used in the ignite session I gave at JFall 2015. I also wrote a blog post on the subject, which you can find here.
The presentation (as PDF) can be found here.
A small runnable example of how to do a word-count analysis is shown in HelloSparkWorld.java.
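To give an idea of what a word count boils down to, here is a minimal sketch in plain Java streams. It is not the code from HelloSparkWorld.java and it does not use Spark: the class name `WordCountSketch`, the method `countWords` and the sample sentence are made up for illustration. It only shows the same "split into words, group by word, count" idea on a single string.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in, NOT the Spark code in HelloSparkWorld.java.
public class WordCountSketch {

    // Lower-cases the text, splits it on anything that is not a-z,
    // and counts how often each remaining word occurs.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                .filter(word -> !word.isEmpty()) // a leading separator yields an empty token
                .collect(Collectors.groupingBy(
                        Function.identity(),     // group by the word itself
                        TreeMap::new,            // sorted keys for readable output
                        Collectors.counting())); // count the members of each group
    }

    public static void main(String[] args) {
        // Sample sentence is an assumption, chosen only for the demo.
        Map<String, Long> counts = countWords("Hello Spark world, hello JFall");
        counts.forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

The grouping step plays the role that a reduce-by-key operation plays in a distributed word count; here everything simply runs in one JVM on one string.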
The 5GB dataset can be downloaded with your favorite torrent client using this link.
You should end up with an RC_2015-01.bz2 file of around 5GB in size.
The application.properties file has the default input set to /tmp/RC_2015-01.bz2. If you downloaded the file to a different location, please change the properties file accordingly.
The application has two config settings that you need to set if their defaults are incorrect. Both settings are contained in application.properties.
The input property should point to the RC_2015-01.bz2 file you just downloaded. The output property should point to an empty directory. The application will create the full directory path if possible.
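As a rough illustration of how these two settings could be read and how the output directory could be created, here is a small sketch using only the JDK. It is not the application's actual loading code: the class `SettingsSketch`, its methods, and the output value `/tmp/analysis-output` are assumptions for the example. Only the property names `input` and `output` and the default input path come from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch, not the repository's real configuration code.
public class SettingsSketch {

    // Parses properties from text; a real application would read the file itself.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Creates the output directory, including any missing parent directories.
    static Path prepareOutput(Properties properties) {
        Path output = Paths.get(properties.getProperty("output"));
        try {
            return Files.createDirectories(output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The output value below is an assumed example, not a known default.
        Properties properties = parse(
                "input=/tmp/RC_2015-01.bz2\n"
              + "output=/tmp/analysis-output\n");
        System.out.println("Reading from " + properties.getProperty("input"));
        System.out.println("Writing to   " + prepareOutput(properties));
    }
}
```

`Files.createDirectories` does nothing if the directory already exists and fails with an exception if the path cannot be created, which matches the "if possible" caveat above.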
You can run the analysis by running the Main class. It should start a Spark context and begin an analysis run. You can then connect to http://localhost:4040/ to follow the progress. Keep in mind that this process will take quite some time: more than one hour on my machine.
First, it reads all the JSON, parses it into internal comment structures, and analyses these. The resulting data is stored in a temporary object store location. This isn't strictly needed, but since this part takes by far the most time, it's done for convenience: running new reduce operations on this dataset takes a lot less time than going through the entire deserialization again.
The object file is then used to do the count and sentiment reductions, which are then written to their corresponding files.
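The "parse once, store, then reduce several times" idea can be sketched without Spark as follows. This is a stand-in using plain Java serialization, not the application's implementation: the `Comment` class with its `group` and `sentiment` fields, the method names, and the sample data are all hypothetical, and the real grouping key and sentiment scoring may well differ.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of the two-phase flow; plain Java instead of Spark.
public class CachedAnalysisSketch {

    // Assumed shape of an already parsed and analysed comment.
    static class Comment implements Serializable {
        private static final long serialVersionUID = 1L;
        final String group;  // hypothetical grouping key
        final int sentiment; // hypothetical sentiment score
        Comment(String group, int sentiment) {
            this.group = group;
            this.sentiment = sentiment;
        }
    }

    // Phase 1: store the parsed comments so the expensive parsing runs only once.
    static void store(List<Comment> comments, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(comments));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the stored comments back instead of parsing the JSON again.
    @SuppressWarnings("unchecked")
    static List<Comment> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<Comment>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Phase 2a: count reduction, number of comments per group.
    static Map<String, Long> countPerGroup(List<Comment> comments) {
        return comments.stream().collect(Collectors.groupingBy(
                (Comment c) -> c.group, TreeMap::new, Collectors.counting()));
    }

    // Phase 2b: sentiment reduction, summed sentiment per group.
    static Map<String, Integer> sentimentPerGroup(List<Comment> comments) {
        return comments.stream().collect(Collectors.groupingBy(
                (Comment c) -> c.group, TreeMap::new,
                Collectors.summingInt((Comment c) -> c.sentiment)));
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "comments-sketch.bin");
        store(Arrays.asList(new Comment("a", 1), new Comment("a", -2), new Comment("b", 3)), file);
        List<Comment> cached = load(file); // both reductions reuse the same stored data
        System.out.println("counts:    " + countPerGroup(cached));
        System.out.println("sentiment: " + sentimentPerGroup(cached));
    }
}
```

Adding a third reduction only needs another method that works on the loaded list, which is the convenience the intermediate store is meant to buy.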