Author: Yinhui(Kami) Yang
Flight delays cause a great deal of inconvenience. Customers do not want to waste precious vacation or work time waiting for hours at a busy airport. That is the motivation for this exploratory data analysis of the United Airlines (carrier code UA) dataset. The main focus is to improve both efficiency and customer satisfaction.
In this study, I investigate what causes departure and arrival delays. Do delays get better or worse at different times of the day? Is there a relationship between departure delays and the time of year? Do visibility, wind speed, temperature, and precipitation affect an airplane's ability to arrive on time?
The nycflights13 package contains on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013.
Goal: To use the exploratory data analysis methods that we have been studying, together with permutation tests, to analyze departure delays for a single carrier, United Airlines (carrier code UA).
The distribution of departure delays is skewed to the right, with a mean departure delay of roughly 10 minutes. There are also some negative values, which signify early departures, while the long right tail contains exceptionally long departure delays.
We can observe from the boxplots above that the median departure delay is nearly the same for all times of the day, around 0, with some outliers. The IQR is larger in the evening. Early departures appear to be more common in the morning and at night. To investigate whether there is a relationship between departure delays and time of day, we perform two-sided permutation tests. The null hypothesis is that there is no difference in mean departure delay between times of the day. The alternative hypothesis is that there is a genuine difference between different times of the day. Next, we run the permutation tests and show the results with histograms.
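The report's own code is not shown, so the following is only a minimal Python sketch of a two-sided permutation test for a difference in means. The function name, the number of resamples, the seed, and the convention of doubling the smaller tail are all assumptions made for illustration, and the delay values are made up.

```python
import random
from statistics import mean


def permutation_test_mean_diff(group_a, group_b, n_resamples=9999, seed=0):
    """Two-sided permutation test for a difference in means.

    n_resamples and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least = 0  # simulated differences >= observed
    at_most = 0   # simulated differences <= observed
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least += 1
        if diff <= observed:
            at_most += 1

    # The +1 counts the observed labeling as one of the permutations.
    one_sided = (min(at_least, at_most) + 1) / (n_resamples + 1)
    # Assumed convention: double the smaller tail, capped at 1.
    return observed, min(1.0, 2 * one_sided)


# Made-up delays in minutes, only to show the call.
morning = [-5, -3, 0, 2, -4, 1, -2, 0]
evening = [10, 25, 3, 40, 15, 8, 30, 12]
observed, p_value = permutation_test_mean_diff(morning, evening, n_resamples=999)
print(observed, p_value)
```

A small P-value here means that random relabeling rarely produces a difference in means as extreme as the observed one.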
Permutation test result for Morning (06:00 - 11:59) vs. Afternoon (12:00 - 17:59) :[1] 2e-04
Permutation test result for Morning (06:00 - 11:59) vs. Evening (18:00 - 23:59) :[1] 2e-04
Permutation test result for Morning (06:00 - 11:59) vs. Night (00:00 - 05:59) :[1] 0.0012
Permutation test result for Afternoon (12:00 - 17:59) vs. Evening (18:00 - 23:59) :[1] 2e-04
Permutation test result for Afternoon (12:00 - 17:59) vs. Night (00:00 - 05:59) :[1] 2e-04
Permutation test result for Evening (18:00 - 23:59) vs. Night (00:00 - 05:59) :[1] 2e-04
According to the test results above, and as the histograms show, the simulated differences in means are approximately normally distributed around 0. The observed differences lie far from the simulated distribution, indicating that a difference as large as the observed one is seldom obtained when the time-of-day labels are assigned at random. Most of our P-values are as low as 2e-04 = 0.0002; the one exception is the comparison of morning and night, which has a higher P-value of 0.0012. Since all of the P-values are less than 5%, we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the mean departure delays for the different periods of the day.
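The repeated value 2e-04 is consistent with the smallest P-value that a test of this kind can report. As a hedged illustration, suppose each test drew N random relabelings, counted the observed labeling itself as one permutation, and doubled the smaller tail. The value N = 9999 below is an assumption, since the report does not state the number of resamples; d_obs is the observed difference in means and d*_i are the simulated differences.

```latex
\[
p \;=\; 2 \cdot \min\!\left(
  \frac{1 + \#\{\, i : d^{*}_{i} \ge d_{\mathrm{obs}} \,\}}{N + 1},\;
  \frac{1 + \#\{\, i : d^{*}_{i} \le d_{\mathrm{obs}} \,\}}{N + 1}
\right),
\qquad
p_{\min} \;=\; \frac{2\,(0 + 1)}{9999 + 1} \;=\; 2 \times 10^{-4}.
\]
```

Under that assumption, a reported 2e-04 means that none of the simulated differences was as extreme as the observed one, so it is best read as "P-value at most 0.0002" rather than as an exact probability.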
From the flightsUA data, the number of flights per season is Fall 14531, Spring 14828, Summer 15016, and Winter 13604. The bar chart above shows that the number of flights is nearly the same in each season, with the exception of winter, which has somewhat fewer. Next, we use boxplots for a more direct comparison across seasons.
The graph shows that there are outliers in every season. The median is around zero in the summer and winter and somewhat lower in the fall and spring. Therefore, compared to summer and winter, we have more early departures in the fall and spring. We can use permutation tests to see whether there is a relationship between departure delays and the season. The null hypothesis is that the mean departure delay is the same in all seasons. The alternative hypothesis is that the mean departure delay differs between seasons. Next, we run the permutation tests and show the results with histograms.
Permutation test result for Fall (Sep - Nov) vs. Spring (Mar - May) :[1] 2e-04
Permutation test result for Fall (Sep - Nov) vs. Summer (Jun - Aug) :[1] 2e-04
Permutation test result for Fall (Sep - Nov) vs. Winter (Dec - Feb) :[1] 2e-04
Permutation test result for Spring (Mar - May) vs. Summer (Jun - Aug) :[1] 2e-04
Permutation test result for Spring (Mar - May) vs. Winter (Dec - Feb) :[1] 0.0076
Permutation test result for Summer (Jun - Aug) vs. Winter (Dec - Feb) :[1] 2e-04
We can see from the graphs that the simulated differences in means are approximately normally distributed around 0. The observed differences lie far from the simulated distribution, indicating that a difference as large as the observed one is seldom obtained when the season labels are assigned at random. Most of our P-values are as low as 2e-04 = 0.0002; the one exception is the comparison of Spring and Winter, which has a higher P-value of 0.0076. Since all of the P-values are less than 5%, we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the mean departure delays for the different seasons of the year.
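The time-of-day and season groups compared above can be derived from a flight's hour and month. The sketch below uses the boundaries given in the test labels; the function names are hypothetical, and whether the original analysis binned on the scheduled or the actual departure hour is not stated.

```python
def time_of_day(hour):
    """Map an hour (0-23) to the time-of-day label used in the comparisons."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    if 6 <= hour < 12:
        return "Morning"    # 06:00 - 11:59
    if 12 <= hour < 18:
        return "Afternoon"  # 12:00 - 17:59
    if hour >= 18:
        return "Evening"    # 18:00 - 23:59
    return "Night"          # 00:00 - 05:59


# Month number (1-12) to season, following the labels in the test output.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season(month):
    """Map a month number (1-12) to its season label."""
    if month not in SEASON_BY_MONTH:
        raise ValueError(f"month out of range: {month}")
    return SEASON_BY_MONTH[month]


print(time_of_day(5), time_of_day(18), season(12), season(9))
```

Because all flights are from 2013, the Winter group under this mapping combines January, February, and December of the same calendar year.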
We can observe from the graphs above that the average departure delay increases as visibility decreases. This demonstrates how weather conditions like visibility affect flights in New York City. Next, we calculate the permutation test result and the mean departure delay for low and high visibility.
Permutation test result: [1] 1e-04
Low visibility mean result: [1] 11.74591
High visibility mean result: [1] 17.89801
The permutation test on the difference in means gives a P-value of 1e-04 = 0.0001, and the mean departure delays reported for low and high visibility are 11.74591 and 17.89801 minutes, respectively. Since the P-value is less than 5%, we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the mean departure delays at different visibility levels. Note that the larger of the two reported means belongs to the high-visibility group, the opposite of the trend in the graphs, so the two labels in the output may be swapped.
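The report does not state where the line between low and high visibility was drawn. A minimal sketch of such a split is shown below; the 5-mile cutoff and the function name are hypothetical, the rows are made up, and only the column names visib and dep_delay come from the nycflights13 tables.

```python
from statistics import mean

# Hypothetical cutoff in miles; the report does not state the one it used.
VISIBILITY_CUTOFF = 5.0


def split_delays(rows, cutoff=VISIBILITY_CUTOFF):
    """Return (low, high) lists of dep_delay, split on the visib column."""
    low = [r["dep_delay"] for r in rows if r["visib"] < cutoff]
    high = [r["dep_delay"] for r in rows if r["visib"] >= cutoff]
    return low, high


# Made-up rows, only to show the shape of the input.
rows = [
    {"visib": 10.0, "dep_delay": -2},
    {"visib": 10.0, "dep_delay": 4},
    {"visib": 2.5, "dep_delay": 30},
    {"visib": 0.5, "dep_delay": 12},
]
low, high = split_delays(rows)
print(len(low), mean(low), len(high), mean(high))
```

Printing the group sizes alongside the means makes it easy to confirm which label belongs to which group.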
Here, we grouped the flights by wind speed and calculated the average departure delay for each wind speed value (wind speed is really a continuous variable, but only specific values were observed in the dataset). The graph above shows that the average departure delay increases slightly as the wind speed increases. Next, we calculate the permutation test result and the standard deviation of departure delay for low and high wind speeds.
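The grouping step described above can be sketched as follows. This is an illustrative Python version with a hypothetical function name and made-up observations, not the code behind the graph.

```python
from collections import defaultdict
from statistics import mean


def mean_delay_by_wind_speed(observations):
    """Average dep_delay for each distinct wind_speed value.

    observations is an iterable of (wind_speed, dep_delay) pairs.
    """
    groups = defaultdict(list)
    for wind_speed, dep_delay in observations:
        groups[wind_speed].append(dep_delay)
    # Sort by wind speed so the result can be plotted directly.
    return {speed: mean(delays) for speed, delays in sorted(groups.items())}


# Made-up (wind_speed, dep_delay) pairs, only to show the call.
observations = [(5.75, -1), (5.75, 3), (11.5, 4), (11.5, 10), (17.26, 20)]
averages = mean_delay_by_wind_speed(observations)
print(averages)
```

Wind speeds with very few flights produce noisy averages, so it helps to check the group sizes before reading a trend from such a plot.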
Permutation test result: [1] 0.3839
Low wind speed standard deviation result: [1] 36.39903
High wind speed standard deviation result: [1] 34.79615
The permutation test on the difference in standard deviations gives a P-value of 0.3839, and the standard deviations of departure delay for low and high wind speeds are 36.39903 and 34.79615, respectively. Since the P-value is greater than 5%, we fail to reject the null hypothesis: the difference between the standard deviations of departure delays at low and high wind speeds is not statistically significant.
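A permutation test does not have to compare means; the same relabeling idea works for standard deviations or medians. The sketch below is a hedged, one-sided Python version that takes the statistic as a parameter. The function name, resample count, and seed are assumptions, and the data are made up.

```python
import random
from statistics import median, stdev


def permutation_test(group_a, group_b, statistic, n_resamples=9999, seed=0):
    """One-sided permutation test: is statistic(a) - statistic(b) unusually large?

    n_resamples and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    observed = statistic(group_a) - statistic(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if diff >= observed:
            at_least += 1

    # The +1 counts the observed labeling as one of the permutations.
    return observed, (at_least + 1) / (n_resamples + 1)


# Made-up delays in minutes, only to show the calls.
low_wind = [0, 10, 20, 30]
high_wind = [5, 10, 15, 20]
sd_diff, sd_p = permutation_test(low_wind, high_wind, stdev, n_resamples=999)
med_diff, med_p = permutation_test(low_wind, high_wind, median, n_resamples=999)
print(sd_diff, sd_p, med_diff, med_p)
```

Passing the statistic as an argument keeps one implementation for the mean, standard deviation, and median comparisons.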
We can observe from the graphs above that the average departure delay increases as the temperature increases. This demonstrates how weather conditions like temperature affect flights in New York City. Next, we calculate the permutation test result and the mean departure delay for low and high temperatures.
Permutation test result: [1] 1e-04
Low temperature mean result: [1] 15.82985
High temperature mean result: [1] 10.10027
The permutation test on the difference in means gives a P-value of 1e-04 = 0.0001, and the mean departure delays reported for low and high temperatures are 15.82985 and 10.10027 minutes, respectively. Since the P-value is less than 5%, we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the mean departure delays at different temperatures. The reported low-temperature mean is the larger one, which runs against the trend in the graphs, so the labels on these two values should be checked.
We can observe from the graphs above that the average departure delay increases as precipitation increases. This demonstrates how weather conditions like precipitation affect flights in New York City. Next, we calculate the permutation test result and the median departure delay for low and high precipitation.
Permutation test result: [1] 1e-04
Low precipitation median result: [1] 9
High precipitation median result: [1] 0
The permutation test on the difference in medians gives a P-value of 1e-04 = 0.0001, and the median departure delays reported for low and high precipitation are 9 and 0 minutes, respectively. Since the P-value is less than 5%, we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a significant difference between the median departure delays at different precipitation levels. As reported, the higher median belongs to the low-precipitation group, contrary to the trend in the graphs, which suggests that these two labels may also be reversed.