-----------------------
Taxi:
-----------------------
-----------------------
HDFS Data:
-----------------------
Please email [email protected] if you need access.
All of the data is stored on dumbo, in the folder /user/hl1785/data/.
Here are the directories pertaining to taxi:
data/
    fhv/
        fhv_tripdata* - raw trip data from NYC TLC
        cleaned/ - data with nulls and unneeded columns removed
        joined/ - data joined with weather data
        withBoro/ - data with borough and neighborhood added
    green/
        green_tripdata* - raw trip data from NYC TLC
        cleaned/ - data with nulls and unneeded columns removed
        joined/ - data joined with weather data
        withBoro/ - data with borough and neighborhood added
        prof/ - data generated from profiling drop-off times of green-cab data
    lr/
        output/ - folder containing the saved linear regression model
        test_data.csv - example test data
        output_data/ - predictions from running the model on the test data
    yellow/
        yellow_tripdata* - raw trip data from NYC TLC
        cleaned/ - data with nulls and unneeded columns removed
        joined/ - data joined with weather data
        withBoro/ - data with borough and neighborhood added
    taxi_zone.csv - CSV mapping zone ids to borough and neighborhood
    weatherdata - CSV containing daily weather data
-----------------------
Screenshots:
-----------------------
There are screenshots in the screenshots/taxi/ directory showing
the output of running the code to analyze green taxi data in particular;
the processes for analyzing yellow cab and FHV data are virtually identical,
so screenshots for those are not provided.
-----------------------
Data Source:
-----------------------
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
-----------------------
Data Ingest:
-----------------------
* For Green Cab Data (screenshot: "ingest"):
> curl -o green_tripdata_2018-06.csv https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2018-06.csv
* For Yellow Cab Data:
> curl -o yellow_tripdata_2018-05.csv https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-05.csv
* For FHV:
> curl -o fhv_tripdata_2017-11.csv https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2017-11.csv
Each of these commands fetches one month of data; I ran each of them multiple times to get all the months I needed.
* For taxi zones (used in the ETL processes):
> curl -o taxi_zone.csv https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
Finally, load all of the data into HDFS on dumbo, e.g.
> hdfs dfs -put green* data/green/
-----------------------
Data ETL
-----------------------
See under etl_code/taxi/nyc-taxi/main/java/ :
IdToNeighborhoodJob, LocalTimeJob, LocalTimeMapper, LocalTimeReducer
I used Maven to build/package all of the MapReduce source files,
so before running any of the commands below, please run (on dumbo):
> cd etl_code/taxi/nyc-taxi
> mvn clean package
Step 1:
The following removes unneeded columns and drops malformed rows (screenshot: "cleaning"):
> cd nyc-taxi
> hadoop jar target/nyc-taxi-1.0.jar LocationTimeJob data/green/*.csv data/green/cleaned
Usage:
hadoop jar target/nyc-taxi-1.0.jar LocationTimeJob <input path> <output path>
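For reference, the cleaning step follows the standard Hadoop mapper pattern.
The sketch below is not the actual job source, just an illustration of its
general shape; the column indices and validity checks are hypothetical:

    // Illustrative cleaning mapper -- column positions and checks are made up.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CleaningMapperSketch
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 8) return;      // drop malformed rows
            String pickup  = fields[1].trim();  // pickup datetime (hypothetical index)
            String dropoff = fields[2].trim();  // dropoff datetime (hypothetical index)
            String locId   = fields[7].trim();  // location id (hypothetical index)
            if (pickup.isEmpty() || dropoff.isEmpty() || locId.isEmpty())
                return;                         // drop rows with missing values
            // keep only the columns needed downstream
            context.write(NullWritable.get(),
                    new Text(pickup + "," + dropoff + "," + locId));
        }
    }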
Step 2:
The following adds borough and neighborhood information (screenshot: "addBoro"):
> hadoop jar target/nyc-taxi-1.0.jar IdToNeighborhoodJob data/green/cleaned data/green/withBoro data/taxi_zone.csv
Usage:
hadoop jar target/nyc-taxi-1.0.jar IdToNeighborhoodJob <input path> <output path> <taxi zone path>
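Since taxi_zone.csv is small, one natural implementation (and roughly what
this step amounts to) is a map-side join: load the zone table into memory in
setup() and look up each trip's location id in map(). This sketch is not the
real IdToNeighborhoodJob; the file handling, CSV parsing, and column
positions are assumptions:

    // Map-side join sketch -- assumes taxi_zone.csv was shipped to each task
    // (e.g. via -files) and uses naive CSV parsing for illustration.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ZoneJoinMapperSketch
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        private final Map<String, String> zones = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("taxi_zone.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    if (f.length >= 3)
                        zones.put(f[0], f[1] + "," + f[2]);  // id -> borough,neighborhood
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            String boroZone = zones.get(f[2]);   // location id column (hypothetical)
            if (boroZone != null)                // drop unmatched ids
                context.write(NullWritable.get(),
                        new Text(value.toString() + "," + boroZone));
        }
    }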
-----------------------
Data Profiling
-----------------------
See under etl_code/taxi/nyc-taxi/main/java/ :
DataProfiler
The following outputs <key, count> pairs counting the number of occurrences
of each distinct value in a specified column of the data (screenshot: "data_prof"):
> hadoop jar target/nyc-taxi-1.0.jar DataProfiler data/green/*.csv data/green/profile 1
Usage:
hadoop jar target/nyc-taxi-1.0.jar DataProfiler <input path> <output path> <col ind>
Where <col ind> is the 0-based index of the column you want to profile
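DataProfiler is essentially word count restricted to one column: the mapper
emits (column value, 1) and the reducer sums the counts. A minimal sketch
(the "profile.col" configuration key and class names are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DataProfilerSketch {

        public static class ProfileMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private int col;

            @Override
            protected void setup(Context context) {
                // 0-based column index passed in through the job configuration
                col = context.getConfiguration().getInt("profile.col", 0);
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (col < fields.length)
                    context.write(new Text(fields[col]), ONE);
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));  // <value, # of occurrences>
            }
        }
    }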
-----------------------
Data Iterations
-----------------------
See under etl_code/taxi/nyc-taxi/main/java/old/ :
This directory contains code from previous iterations of the process,
most of which no longer compiles.
-----------------------
Data Joining
-----------------------
See under etl_code/taxi/nyc-spark/ :
DataSchema, JoinWeatherAndFHV, JoinWeatherAndGreen, JoinWeatherAndYellow
For the next steps, I used sbt to build/package all of the Spark source files,
so before running any of the commands below, please run (on dumbo):
> module load sbt
> cd etl_code/taxi/nyc-spark
> sbt package
The following joins taxi (green cab) and weather data (screenshots: "join1" and "join2"):
> cd nyc-spark
> spark2-submit --class JoinWeatherAndGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar data/green/withBoro data/weatherdata data/green/joined/
Usage:
spark2-submit --class JoinWeatherAndGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar <input path> <weather data> <output path>
Use JoinWeatherAndYellow and JoinWeatherAndFHV for yellow cab and FHV data, respectively.
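Conceptually, the join keys each trip to the weather record for its date.
The repo's code is Scala; the sketch below uses the Spark Java API instead,
and the column names ("pickup_date", "DATE") are assumptions -- the real
schema lives in DataSchema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JoinWeatherSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("JoinWeatherSketch").getOrCreate();

            Dataset<Row> taxi    = spark.read().option("header", "true").csv(args[0]);
            Dataset<Row> weather = spark.read().option("header", "true").csv(args[1]);

            // Attach each trip to the weather record for its pickup date
            Dataset<Row> joined = taxi.join(weather,
                    taxi.col("pickup_date").equalTo(weather.col("DATE")));

            joined.write().option("header", "true").csv(args[2]);
            spark.stop();
        }
    }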
-----------------------
Linear Regression
-----------------------
See under etl_code/taxi/nyc-spark/ :
PredGreen, RFGreen
The following creates a prediction model and saves it to a given output directory (screenshots: "linear_reg1" and "linear_reg2"):
> cd nyc-spark
> spark2-submit --class RFGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar data/green/joined data/lr/output
Usage:
spark2-submit --class RFGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar <input path> <output path>
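The training step conceptually reads the joined data, packs the weather
predictors into a feature vector, fits a regression, and saves the model.
The real RFGreen is Scala; this hedged sketch uses the Java MLlib API, and
the feature/label column names are made up:

    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TrainSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("TrainSketch").getOrCreate();

            Dataset<Row> joined = spark.read()
                    .option("header", "true").option("inferSchema", "true")
                    .csv(args[0]);

            // Pack the predictor columns into one feature vector
            // ("avg_temp" and "precipitation" are hypothetical names)
            Dataset<Row> training = new VectorAssembler()
                    .setInputCols(new String[]{"avg_temp", "precipitation"})
                    .setOutputCol("features")
                    .transform(joined);

            LinearRegressionModel model = new LinearRegression()
                    .setFeaturesCol("features")
                    .setLabelCol("trip_count")   // label column is assumed
                    .fit(training);

            model.save(args[1]);                 // e.g. data/lr/output
            spark.stop();
        }
    }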
The following loads a prediction model and predicts usage for a given set of input data
(screenshots: green_pred1, green_pred2, green_pred3).
An example of the input is given under source_code/nyc-spark/test_data.csv.
> cd nyc-spark
> hdfs dfs -put test_data.csv data/lr
> spark2-submit --class PredGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar data/lr/output data/lr/test_data.csv data/lr/output_data
Usage:
spark2-submit --class PredGreen --master yarn target/scala-2.11/nyc-spark_2.11-0.1.jar <model path> <input path> <output path>
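Prediction is the mirror image: load the saved model, vectorize the test
rows the same way, and write out the "prediction" column that MLlib appends.
Again a Java sketch with the same assumed column names:

    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.LinearRegressionModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PredictSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("PredictSketch").getOrCreate();

            LinearRegressionModel model = LinearRegressionModel.load(args[0]);

            Dataset<Row> test = spark.read()
                    .option("header", "true").option("inferSchema", "true")
                    .csv(args[1]);
            Dataset<Row> features = new VectorAssembler()
                    .setInputCols(new String[]{"avg_temp", "precipitation"})
                    .setOutputCol("features")
                    .transform(test);

            // transform() appends a "prediction" column; keep only scalar
            // columns so the result can be written out as CSV
            model.transform(features)
                    .select("avg_temp", "precipitation", "prediction")
                    .write().option("header", "true").csv(args[2]);
            spark.stop();
        }
    }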
As of right now, only the green taxi pipeline works, and its predictions are not entirely accurate, sorry about that!
-----------------------
Impala Queries
-----------------------
See under etl_code/taxi/nyc-impala/ :
fhv.sql, greentaxi.sql, yellowtaxi.sql
These files contain the SQL commands used to create tables/views and to run queries over the taxi data.
The commands appear in the order in which they were run; see the inline comments.
Screenshots (and what they show):
create_table - create table from joined data
create_view - create view with date fields
snow_by_boro - taxi usage on days when it snowed, grouped by borough
no_snow_by_boro - taxi usage on days when it didn't snow, grouped by borough
num_of_snow_days - count # of snow days
by_avg_temp - taxi usage vs. average temp
by_avg_temp_brooklyn - taxi usage vs. average temp for a particular borough (Brooklyn)