Commit: updates to taxi

herbertli committed Dec 10, 2018
1 parent 0644d61 commit 19435fe

Showing 33 changed files with 124 additions and 28 deletions.
11 changes: 11 additions & 0 deletions PROJECT_README.txt
@@ -0,0 +1,11 @@
READ ME FIRST

In this folder we have provided readmes for each dataset:

taxi_readme.txt
weather_readme.txt
turnstile_readme.txt

the data should be browseable at: /user/hl1785/data/

see the individual readmes for more details
22 changes: 22 additions & 0 deletions data_ingest/taxi/README
@@ -0,0 +1,22 @@
Data Source:
-----------------------
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Data Ingest:
-----------------------
* For Green Cab Data (screenshot: "ingest"):
> curl -o green_tripdata_2018-06.csv https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2018-06.csv

* For Yellow Cab Data:
> curl -o yellow_tripdata_2018-05.csv https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-05.csv

* For FHV:
> curl -o fhv_tripdata_2017-11.csv https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2017-11.csv

All of these commands get one month of data; I ran each of them multiple times to get all the data I needed.
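
A sketch of one way to script those repeated fetches, using curl's numeric URL globbing (the month range below is illustrative; substitute the months you actually need):
> curl -o "green_tripdata_2018-0#1.csv" "https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2018-0[4-6].csv"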

* For taxi-zones (used for ETL):
> curl -o taxi-zone.csv https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

Finally, load all data into dumbo, e.g.
> hdfs dfs -put yellow* data/yellow/
9 files renamed without changes.
etl_code/taxi/nyc-taxi/main/java/DataProfiler.java
@@ -13,20 +13,20 @@

public class DataProfiler {

-   public static class ProfileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
+   public static class ProfileMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int colInd = context.getConfiguration().getInt("colInd", 0);
            String[] rowSplit = value.toString().split(",");
-           context.write(new Text(rowSplit[colInd]), new IntWritable(1));
+           context.write(new Text(rowSplit[colInd]), new LongWritable(1));
        }
    }

-   public static class ProfileReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
-       public void reduce(Text key, Iterable<IntWritable> values, Context context)
+   public static class ProfileReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
+       public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
-           for (IntWritable value: values) {
+           for (LongWritable value: values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
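
The hunk above swaps the mapper's output value type from IntWritable to LongWritable; one plausible motivation is that the reducer's input and output value types now both match the mapper's output, which is a precondition for reusing the reducer as a combiner. The commit does not show the job driver, so the following is only a sketch of a standard driver consistent with the readme's usage line (DataProfiler <input path> <output path> <col Ind>), not the repository's actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataProfiler {

    // ... ProfileMapper and ProfileReducer as in the diff above ...

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Column index that ProfileMapper reads back via getInt("colInd", 0)
        conf.setInt("colInd", Integer.parseInt(args[2]));
        Job job = Job.getInstance(conf, "data profiler");
        job.setJarByClass(DataProfiler.class);
        job.setMapperClass(ProfileMapper.class);
        // Only legal after this commit: mapper output and reducer input/output
        // value types all agree (Text, LongWritable), so the reducer can
        // double as a combiner.
        job.setCombinerClass(ProfileReducer.class);
        job.setReducerClass(ProfileReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
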
4 changes: 4 additions & 0 deletions profiling_code/taxi/README
@@ -0,0 +1,4 @@
For analyzing NYC taxi data, I used maven to build/package all of my MapReduce code

As a result, the code for profiling NYC taxi data is also bundled with the ETL code for taxi
in the directory /etl_code/taxi/nyc-taxi/
105 changes: 82 additions & 23 deletions taxi_readme.txt
@@ -1,14 +1,24 @@
-----------------------
Taxi Data:
-----------------------


-----------------------
Screenshots:
-----------------------
There are screenshots in the screenshots/taxi/ directory showing
-output of running code in order to analyze green taxi data
+the output of running code to analyze green taxi data in particular;
+the processes for analyzing yellow cab and FHV data are virtually identical, so screenshots are not provided.


-Source:
-----------------------
+Data Source:
-----------------------
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml



-----------------------
Data Ingest:
-----------------------
* For Green Cab Data (screenshot: "ingest"):
@@ -24,17 +34,24 @@ All of these commands get one month of data; I ran each of them multiple times

* For taxi-zones (used for ETL):
> curl -o taxi-zone.csv https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
----------- End Data Ingest -------------

-Maven
-------------------------
+Finally, load all data into dumbo, e.g.
+> hdfs dfs -put yellow* data/yellow/




+-----------------------
+Data ETL
+-----------------------
+See under etl_code/taxi/nyc-taxi/main/java/ :
+IdToNeighborhoodJob, LocalTimeJob, LocalTimeMapper, LocalTimeReducer

I used maven to build/package all of my MapReduce source files,
so in order to run any of the below commands, please run (on dumbo):
-> cd nyc-taxi
+> cd etl_code/taxi/nyc-taxi
> mvn clean package

-Data Cleaning/Profiling
-------------------------
The following removes unnecessary columns and malformed data (screenshot: "cleaning"):
-> cd nyc-taxi
> hadoop jar target/nyc-taxi-1.0.jar LocationTimeJob data/green/*.csv data/green/cleaned
@@ -43,23 +60,52 @@ Usage:
hadoop jar target/nyc-taxi-1.0.jar LocationTimeJob <input path> <output path>
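
LocationTimeJob's source isn't shown in this commit, so the following is a hypothetical illustration of the usual shape of such a cleaning step: a map-only pass that keeps selected columns and drops rows that fail to parse. The column indices follow the raw green-cab CSV layout and are assumptions:

// needs: java.io.IOException, org.apache.hadoop.io.{LongWritable, NullWritable, Text},
//        org.apache.hadoop.mapreduce.Mapper
public static class CleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 7) return;          // malformed row: too few columns
        try {                              // malformed row: non-numeric location IDs
            Integer.parseInt(f[5]);        // PULocationID
            Integer.parseInt(f[6]);        // DOLocationID
        } catch (NumberFormatException e) {
            return;
        }
        // Keep only pickup/dropoff times and location IDs (illustrative choice)
        String kept = String.join(",", f[1], f[2], f[5], f[6]);
        context.write(new Text(kept), NullWritable.get());
    }
}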

The following adds borough and location/neighborhood information (screenshot: "addBoro"):
-> cd nyc-taxi
> hadoop jar target/nyc-taxi-1.0.jar IdToNeighborhoodJob data/green/cleaned data/green/withBoro data/taxi_zone.csv

Usage:
hadoop jar target/nyc-taxi-1.0.jar IdToNeighborhoodJob <input path> <output path> <taxi zone path>
----------- End Data Cleaning/Profiling -------------
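
IdToNeighborhoodJob's source isn't shown here either; a common shape for this kind of step is a map-side join that loads the small taxi-zone lookup once per task and tags every record with its borough/zone. A hypothetical sketch (the "zonePath" configuration key and the column positions are assumptions; the real lookup file is quoted CSV with a header row, which a robust version would handle):

// needs: java.io.{BufferedReader, IOException, InputStreamReader}, java.util.{HashMap, Map},
//        org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path},
//        org.apache.hadoop.io.{LongWritable, NullWritable, Text}, org.apache.hadoop.mapreduce.Mapper
public static class ZoneJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> zones = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path zonePath = new Path(conf.get("zonePath")); // set by the driver from the third argument
        FileSystem fs = zonePath.getFileSystem(conf);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(zonePath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                // taxi-zone.csv: LocationID,Borough,Zone,service_zone
                String[] f = line.split(",");
                if (f.length >= 3) zones.put(f[0], f[1] + "," + f[2]);
            }
        }
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        // Assume the cleaned rows carry the pickup location ID in column 2
        String boroZone = zones.getOrDefault(f[2], "Unknown,Unknown");
        context.write(new Text(value.toString() + "," + boroZone), NullWritable.get());
    }
}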

-Spark
--------------------------
-I used sbt to build/package all of my spark source files,

+-----------------------
+Data Profiling
+-----------------------
+See under etl_code/taxi/nyc-taxi/main/java/ :
+DataProfiler

+The following outputs <k, v> pairs counting the # of occurrences of a particular key
+for a specified column of the data (screenshot: "profiling"):
+> hadoop jar target/nyc-taxi-1.0.jar DataProfiler data/green/*.csv data/green/profile 1

+Usage:
+hadoop jar target/nyc-taxi-1.0.jar DataProfiler <input path> <output path> <col Ind>

+Where <col Ind> is the zero-based index of the CSV column to profile (cf. rowSplit[colInd] in DataProfiler).



+-----------------------
+Data Iterations
+-----------------------
+See under etl_code/taxi/nyc-taxi/main/java/old/ :

+This directory contains code from previous iterations of the process.




+-----------------------
+Data Joining
+-----------------------
+See under etl_code/taxi/nyc-spark/ :
+DataSchema, JoinWeatherAndFHV, JoinWeatherAndGreen, JoinWeatherAndYellow

+For the next steps, I used sbt to build/package all of my spark source files,
so in order to run any of the below commands, please run (on dumbo):
> module load sbt
-> cd nyc-spark
+> cd etl_code/taxi/nyc-spark
> sbt package

-Data Joining
-------------------------

The following joins taxi (Green cab) and weather data (screenshot: "join1" and "join2"):

> cd nyc-spark
@@ -69,11 +115,17 @@ Usage:
spark2-submit --class JoinWeatherAndGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar <input path> <weather data> <output path>

Use JoinWeatherAndYellow and JoinWeatherAndFHV for yellow cab and FHV data, respectively.
----------- End Data Joining -------------
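
The Scala sources for these join jobs aren't included in this commit, so as a rough illustration only, here is the general shape of such a join (via Spark's Java API; the "date" join column and CSV options are assumptions):

// needs: org.apache.spark.sql.{SparkSession, Dataset, Row}
SparkSession spark = SparkSession.builder().appName("JoinWeatherAndGreen").getOrCreate();
Dataset<Row> taxi = spark.read().option("header", "true").csv(args[0]);    // cleaned taxi records
Dataset<Row> weather = spark.read().option("header", "true").csv(args[1]); // daily weather data
// Assuming both sides expose a yyyy-MM-dd "date" column to join on:
Dataset<Row> joined = taxi.join(weather, "date");
joined.write().csv(args[2]);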




+-----------------------
Linear Regression
-------------------------
-The following create a prediction model and saves it to some directory (screenshot: "linear_reg1" and "linear_reg2"):
+-----------------------
+See under etl_code/taxi/nyc-spark/ :
+PredGreen, RFGreen

+The following creates a prediction model and saves it to some directory (screenshots: "linear_reg1" and "linear_reg2"):

> cd nyc-spark
> spark2-submit --class RFGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar data/green/joined data/lr/output
@@ -90,11 +142,18 @@ Usage:
spark2-submit --class PredGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar <model path> <input path> <output path>

As of right now, only green taxi works, and predictions aren't entirely accurate; sorry about that!
----------- End Linear Regression -------------
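
PredGreen and RFGreen are likewise Scala sources not shown in this commit ("RFGreen" suggests a random-forest variant). As a rough illustration of the "create a model and save it" step in Spark ML's Java API (the column names and the use of LinearRegression are assumptions):

// needs: org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
LinearRegression lr = new LinearRegression()
        .setFeaturesCol("features")   // assembled feature vector
        .setLabelCol("label");        // e.g. the quantity to predict
LinearRegressionModel model = lr.fit(training);  // training: Dataset<Row> of joined data
model.save(args[1]);                             // persist the model to the output directory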





+-----------------------
Impala Queries
-------------------------
-See nyc-impala/ for the SQL commands used to create tables/views and queries for the taxi data
+-----------------------
+See under etl_code/taxi/nyc-impala/ :
+fhv.sql, greentaxi.sql, yellowtaxi.sql

+These files contain the SQL commands used to create tables/views and queries for the taxi data.
Screenshots (and what they show):
create_table - create table from joined data
create_view - create view with date fields
@@ -103,4 +162,4 @@ no_snow_by_boro - taxi usage on days where it didn't snow, grouped by borough
num_of_snow_days - count # of snow days
by_avg_temp - taxi usage vs. average temp
by_avg_temp_brooklyn - taxi usage vs. average temp for a particular borough (Brooklyn)
----------- End Impala Queries -------------
