MapReduce and Pig Latin Query

To get some hands on experience with the pig/latin language and capabilities, you will implement a few different queries -- aggregating content from two different collections. For you to have a better appreciation for the features provided by Pig Latin, you will also implement a join between the two collections we provide using both MapReduce and Pig Latin.

As input you will read from two CSV files that contain descriptions about users and tweets, respectively will be upload into your blackboard (each group will have different users and tweets file).

Each line in the user collection contains login, name and state from a specific user. Each line in the collection of tweets has the tweet id, content, and a reference to the user who wrote that tweet.

Here are the problems you will solve:

Write a Pig Latin query that outputs the login of all users in NY state.
Write a Pig Latin query that returns all the tweets that include the word 'favorite', ordered by tweet id.
Write the equivalent join using Pig Latin.
Write a Pig Latin query that returns the number of tweets for each user name (not login). You should output one user per line, in the following format: user_name, number_of_tweets
Write a Pig Latin query that returns the number of tweets for each user name (not login), ordered from most active to least active users. You should output one user per line, in the following format: user_name, number_of_tweets
Write a Pig Latin query that returns the name of users that posted at least two tweets. You should output one user name per line.
Write a Pig Latin query that returns the name of users that posted no tweets. You should output one user name per line.

Requirements

You can write your join in Java.

Java: In addition to the source code, you will submit a JAR file that can be called using the following command:

hadoop -jar join.jar Join -input1 <file_name> -input2 <file_name> -output <file_name>

Note: Copy the input files to your HDFS directory. Your program should read the input files from the HDFS directory.

You must write the Pig Latin code as stand-alone scripts, one script per query. The name of the file should correspond to the name of the query. For example, using the first query, you should be able to run:

pix -x local 1.pig

For each query, the result should be in files whose names have as prefix the query name, and extension '.result'. For instance, for the first query the result should be in 1a.result

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
data		data
java		java
pig		pig
result		result
src		src
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MapReduce and Pig Latin Query

Requirements

About

Releases

Packages

Languages

likweitan/mapreduce-join-and-pig-latin-query

Folders and files

Latest commit

History

Repository files navigation

MapReduce and Pig Latin Query

Requirements

About

Topics

Resources

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages