-
Notifications
You must be signed in to change notification settings - Fork 0
/
GCDA_CaseStudy_Cyclistic_2020Q1.Rmd
156 lines (139 loc) · 7.34 KB
/
GCDA_CaseStudy_Cyclistic_2020Q1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "Cyclistic Rider Comparison"
author: "Ana Faria"
date: "2020 Q1"
output:
pdf_document: default
html_document: default
---
## Introduction
This notebook contains my first case study for my Google's Data Analytics Professional Certificate. This case study requires the analyst to follow to the steps of the data analysis process (ask, prepare, process, analyze, share,and act) for a data set that represents a fictional company, Cyclistic Bike-Share.
## Case Study Objective - Ask
For this Case Study our objective business task is to find how bike usage differs from annual membership holders to casual riders who use the Cyclistic bike service in order to understand the best marketing strategy to convert casual riders into membership holders, which will maximize profits. This project looks to find the main differences between casual riders and member riders.
#### Package downloads
```{r install package }
install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
```
```{r load packages}
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle data attributes
library(ggplot2) #for data visualization
setwd("/cloud/project") #sets your working directory to simplify calls to datalib
```
## Data Collection - Prepare
For this case study we are using a public dataset provided by Divvy Bikes as a proxy for our fictitious company’s ridership history. The dataset covers 2019 service usage, including type of bike used, date time ride started, date time ride ended, start station name and id, end station name and id, start latitude, start longitude, end latitude, and end longitude. Data Privacy laws limit the amount of personally identifiable data that could lead us to relevant information regarding casual rider’s bike usage, including if they live close to a Cyclistic service area or if they purchase multiple one day passes.
#### Reading the data
```{r dataset upload}
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
```
## Wrangle Data and Combine Into One File - Process
#### Dataset column names
```{r observe column names}
colnames(q1_2020)
```
#### Data types per column
```{r inspect of dataframes}
str(q1_2020)
```
#### Removing columns on latitude and longitude
```{r remove lat, long}
q1_2020 <- q1_2020 %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
```
## Data cleanup and preparation for analysis
```{r new table inspection}
colnames(q1_2020) #List of column names
nrow(q1_2020) #How many rows are in data frame?
dim(q1_2020) #Dimensions of the data frame?
head(q1_2020) #See the first 6 rows of data frame. Also tail(qs_raw)
str(q1_2020) #See list of columns and data types (numeric, character, etc)
summary(q1_2020) #Statistical summary of data. Mainly for numerics
```
**Problems in data set that needs fixing before analysis**
- The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.
- We will want to add a calculated field for length of ride since the 2020Q1 data did not have the "tripduration" column. We will add "ride_length" to the entire data frame for consistency.
- There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.
```{r aggregate data at ride-level}
q1_2020$date <- as.Date(q1_2020$started_at) #The default format is yyyy-mm-dd
q1_2020$month <- format(as.Date(q1_2020$date), "%m")
q1_2020$day <- format(as.Date(q1_2020$date), "%d")
q1_2020$year <- format(as.Date(q1_2020$date), "%Y")
q1_2020$day_of_week <- format(as.Date(q1_2020$date), "%A")
```
#### Calculation for ride_length in seconds
```{r add a ride_length in seconds calculation }
q1_2020$ride_length <- difftime(q1_2020$ended_at,q1_2020$started_at)
```
```{r inspect new data structure with added column}
str(q1_2020)
```
#### Converting ride_length factor into numeric
```{r convert ride_length to numeric factor to be able to run calculations}
is.factor(q1_2020$ride_length)
q1_2020$ride_length <- as.numeric(as.character(q1_2020$ride_length))
is.numeric(q1_2020$ride_length)
```
#### Removal of bad data for new data set all_trips_v2
```{r remove bad data for new data set all_trips_v2}
all_trips_v2 <- q1_2020[!(q1_2020$start_station_name == "HQ QR" | q1_2020$ride_length<0),]
```
## Descriptive Analysis - Analysis
### Descriptive analysis on ride_length, all figures in seconds
```{r descriptive analysis}
mean(all_trips_v2$ride_length) # average ride length in seconds
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths in seconds
max(all_trips_v2$ride_length) #longest ride in seconds
min(all_trips_v2$ride_length) #shortest ride in seconds
```
### Descriptive Analysis Summary For ride_length
##### *Note to self: Should add a normal distribution graph here*
```{r summarized descriptive analysis}
summary(all_trips_v2$ride_length)
```
### Member vs. Casual Users Comparison
```{r compare members and casual users}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean) #average ride length in seconds
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median) #mid-point of ride length in seconds
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max) #longest ride in seconds
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min) #shortest ride in seconds
```
### Member vs. Casual Users Rides per Day of the week
```{r ordered days of the week for average ride time}
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) # to order days of the week
```
```{r ordered days of the week for average ride time calculation}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean) #average ride length per day of the week for each type of user
```
```{r analysis of ridership data by type and weekday}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by user type and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% #calculates the average duration
arrange(member_casual, weekday) #sorts
```
## Data Visualizations - Share
### Number of Rides each Day of the Week per Rider Type
```{r visualize number of rides by rider type}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
```
### Average Ride each Day of the Week per Rider Type
```{r visualize average ride duration}
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
```