Spark

Summary for Learning Apache Spark

Big Data Computing Homework

Input.

The input text is an ebook from The Project Gutenberg, titled ”Plutarch: Lives of the noble Grecians and Romans by Plutarch” which is available at the URL:

https://www.gutenberg.org/ebooks/674.txt.utf-8

Output.

a) For each word in the input text, you must compute the occurrences of other words appear in every line that it appeared. For example, assume that we have the following line as an input text. “Theseus seemed to me to resemble Romulus in many particulars.” With the word “Theseus”: “to” occurs twice, “seemed” occurs once, “me” occurs once and so on. With the word “to”, “Theseus” occurs once, “to” occurs once (not zero times!), and so on. For this assignment, you are going to compute the occurrences of every word for all lines, not just for a single line.

b) Produce a list of words, with the top-10 words that appear with this word for all lines.

c) For each word in the input text, you must compute the occurrences of other words that appear in every line that it appeared, right after the appearance of the word. For example, assume that we have the following line as an input text “Theseus seemed to me to resemble Romulus in many particulars.” With the word “Theseus”, “seemed” occurs once. With the word “seemed”, “to” occurs once. Produce a list of words, with the top-10 words that appear right after this word for all lines.

d) Generalize step (c) so that you include all words that appear up to distance k after a given word.

Guidelines :

Convert every string to lower-case
Remove any punctuation, i.e., any character other than [a...z]
For each sub-question submit 2 files: text file with the results (any format you believe is more suitable) and the python code.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
flink		flink
README.md		README.md
Spark.ipynb		Spark.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark

About

Releases

Packages

Languages

P1Analytics/Spark-and-Flink

Folders and files

Latest commit

History

Repository files navigation

Spark

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages