diff --git a/DESCRIPTION b/DESCRIPTION index 45239a3f..feae599e 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,8 +1,7 @@ Type: Package Package: nflfastR -Title: Functions to Efficiently Scrape NFL Play by Play and - Roster Data -Version: 1.1.10 +Title: Functions to Efficiently Scrape NFL Play by Play Data +Version: 2.0.0 Authors@R: c(person(given = "Sebastian", family = "Carl", @@ -12,11 +11,10 @@ Authors@R: role = c("cre", "aut"), email = "bbaldwin206@gmail.com")) Description: nflfastR is a set of functions to efficiently - scrape NFL play-by-play and roster data. + scrape NFL play-by-play data. License: MIT + file LICENSE Depends: - R (>= 3.5.0), - nflscrapR + R (>= 3.5.0) Imports: dplyr, glue, @@ -26,12 +24,12 @@ Imports: lubridate, magrittr, mgcv, + progressr, purrr, stringr, tibble, tidyr, - tidyselect, - xml2 + tidyselect Suggests: furrr, future diff --git a/NAMESPACE b/NAMESPACE index 72153d41..d17893b5 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,10 +1,8 @@ # Generated by roxygen2: do not edit by hand -export("%>%") +export(add_qb_epa) +export(calculate_expected_points) export(clean_pbp) export(fast_scraper) -export(fast_scraper_clips) -export(fast_scraper_roster) export(fast_scraper_schedules) -export(fix_fumbles) importFrom(magrittr,"%>%") diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 00000000..f61267a9 --- /dev/null +++ b/NEWS.md @@ -0,0 +1,121 @@ +# nflfastR 2.0.0 + +### Models +* Added new models for Expected Points, Win Probability and Completion Probability +and removed `nflscrapR` dependency. This is a **major** change as we are stepping away +from the well established `nflscrapR` models. But we believe it is a good step forward. +See `data-raw/MODEL-README.md` for detailed model information. + +* Added internal functions for `EPA` and `WPA` to `helper_add_ep_wp.R`. + +* Added new function `calculate_expected_points()` usable for the enduser. + +### Functions +* Completely overhauled `fastcraper()` to make it work with the NFL's new server +backend. The option `source` is still available but will be deprecated since there +is only one source now. There are some changes in the output as well (please see below). + +* `fastcraper()` now adds game data to the play by play data set courtesy of Lee Sharpe. +Game data include: +away_score, home_score, location, result, total, spread_line, total_line, div_game, +roof, surface, temp, wind, home_coach, away_coach, stadium, stadium_id, gameday + +* `fastcraper_schedules()` now incorporates Lee Sharpe's `games.rds`. + +* The functions `fast_scraper_clips()` and `fast_scraper_roster()` are deactivated +due to the missing data source. They might be reactivated or completely dropped +in future versions. + +* The function `fix_fumbles()` has been renamed to `add_qb_epa()` as the new name +much better describes what the function is actually doing. + +### Miscellaneous + +* Added progress information using the `progressr`package and removed the +`furrr` progress bars. + +* `clean_pbp()` now adds the column `ìd` which is the id of the player in the column `name`. +Because we have to piece together different data to cover the full span of years, +**player IDs are not consistent between the early (1999-2010) and recent (2011 onward) +periods**. + +* Added a `NEWS.md` file to track changes to the package. + +* Fixed several bugs inhereted from `nflscrapR`, including one where EPA was missing +when a play was followed by two timeouts (for example, a two-minute warning followed by a timeout), +and another where `play_type` was incorrect on plays with declined penalties. + +* Fixed a bug, where `receiver_player_name` and `receiver` didn't name the correct +players on plays with lateral passes. + +### Play-by-Play Output +The output has changed a little bit. + +#### The following variables were dropped + +| Dropped Variables | Description | +|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| game_key | RS feed game identifier. | +| game_time_local | Kickoff time in local time zone. | +| iso_time | Kickoff time according ISO 8601. | +| game_type | One of 'REG', 'WC', 'DIV', 'CON', 'SB' indicating if a game was a regular season game or one of the playoff rounds. | +| site_id | RS feed id for game site. | +| site_city | Game site city. | +| site_state | Game site state. | +| drive_possession_team_abbr | Abbreviation of the possession team in a given drive. | +| scoring_team_abbr | Abbreviation of the scoring team if the play was a scoring play. | +| scoring_type | String indicating the scoring type. One of 'FG', 'TD', 'PAT', 'SFTY', 'PAT2'. | +| alert_play_type | String describing the play type of a play the NFL has listed as alert play. For most of those plays there are highlight clips available through fast_scraper_clips. | +| time_of_day | Local time at the beginning of the play. | +| yards | Analogue yards_gained but with the kicking team being the possession team (which means that there are many yards gained through kickoffs and punts). | +| end_yardline_number | Yardline number within the above given side at the end of the given play. | +| end_yardline_side | String indicating the side of the field at the end of the given play. | + +#### The following variables were renamed + +| Renamed Variables | Description | +|-----------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| game_time_eastern -> start_time | Kickoff time in eastern time zone. | +| site_fullname -> stadium | Game site name. | +| drive_how_started -> drive_start_transition | String indicating how the offense got the ball. | +| drive_how_ended -> drive_end_transition | String indicating how the offense lost the ball. | +| drive_start_time -> drive_game_clock_start | Game time at the beginning of a given drive. | +| drive_end_time -> drive_game_clock_end | Game time at the end of a given drive. | +| drive_start_yardline -> drive_start_yard_line | String indicating where a given drive started consisting of team half and yard line number. | +| drive_end_yardline -> drive_end_yard_line | String indicating where a given drive ended consisting of team half and yard line number. | +| roof_type -> roof | One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) | + +#### The following variables were added + +| Added Variables | Description | +|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| vegas_wp | Estimated win probabiity for the posteam given the current situation at the start of the given play, incorporating pre-game Vegas line. | +| vegas_home_wp | Estimated win probability for the home team incorporating pre-game Vegas line. | +| weather | String describing the weather including temperature, humidity and wind (direction and speed). Doesn't change during the game! | +| nfl_api_id | UUID of the game in the new NFL API. | +| play_clock | Time on the playclock when the ball was snapped. | +| play_deleted | Binary indicator for deleted plays. | +| end_clock_time | Game time at the end of a given play. | +| end_yard_line | String indicating the yardline at the end of the given play consisting of team half and yard line number. | +| drive_real_start_time | Local day time when the drive started (currently not used by the NFL and therefore mostly 'NA'). | +| drive_ended_with_score | Binary indicator the drive ended with a score. | +| drive_quarter_start | Numeric value indicating in which quarter the given drive has started. | +| drive_quarter_end | Numeric value indicating in which quarter the given drive has ended. | +| drive_play_id_started | Play_id of the first play in the given drive. | +| drive_play_id_ended | Play_id of the last play in the given drive. | +| away_score | Total points scored by the away team. | +| home_score | Total points scored by the home team. | +| location | Either 'Home' o 'Neutral' indicating if the home team played at home or at a neutral site. | +| result | Equals home_score - away_score and means the game outcome from the perspective of the home team. | +| total | Equals home_score + away_score and means the total points scored in the given game. | +| spread_line | The closing spread line for the game. A positive number means the home team was favored by that many points, a negative number means the away team was favored by that many points. (Source: Pro-Football-Reference) | +| total_line | The closing total line for the game. (Source: Pro-Football-Reference) | +| div_game | Binary indicator for if the given game was a division game. | +| roof | One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) | +| surface | What type of ground the game was played on. (Source: Pro-Football-Reference) | +| temp | The temperature at the stadium only for 'roof' = 'outdoors' or 'open'.(Source: Pro-Football-Reference) | +| wind | The speed of the wind in miles/hour only for 'roof' = 'outdoors' or 'open'. (Source: Pro-Football-Reference) | +| home_coach | First and last name of the home team coach. (Source: Pro-Football-Reference) | +| away_coach | First and last name of the away team coach. (Source: Pro-Football-Reference) | +| stadium_id | ID of the stadium the game was played in. (Source: Pro-Football-Reference) | +| game_stadium | Name of the stadium the game was played in. (Source: Pro-Football-Reference) | diff --git a/R/data_documentation.R b/R/data_documentation.R index c5fcef30..bb7d08e4 100644 --- a/R/data_documentation.R +++ b/R/data_documentation.R @@ -21,5 +21,8 @@ #' \item{team_logo_wikipedia}{Url to Team logo on wikipedia} #' \item{team_logo_espn}{Url to higher quality logo on espn} #' } +#' The colors are taken from Lee Sharpe's teamcolors.csv who has taken them from the +#' `teamcolors` package created by Ben Baumer and Gregory Matthews. +#' The Wikipeadia logo urls are taken from Lee Sharpe's logos.csv #' "teams_colors_logos" diff --git a/R/ep_calculator.R b/R/ep_calculator.R new file mode 100644 index 00000000..999120f0 --- /dev/null +++ b/R/ep_calculator.R @@ -0,0 +1,68 @@ +#' Compute expected points +#' +#' for provided plays. Returns the data with +#' probabilities of each scoring event and EP added. The following columns +#' must be present: season, week, home_team, posteam, roof (coded as 'open', +#' 'closed', or 'retractable'), half_seconds_remaining, yardline_100, +#' ydstogo, posteam_timeouts_remaining, defteam_timeouts_remaining +#' +#' @param pbp_data Play-by-play dataset to estimate expected points for. +#' @details Computes expected points for provided plays. Returns the data with +#' probabilities of each scoring event and EP added. The following columns +#' must be present: season, week, home_team, posteam, roof (coded as 'open', +#' 'closed', or 'retractable'), half_seconds_remaining, yardline_100, +#' ydstogo, posteam_timeouts_remaining, defteam_timeouts_remaining +#' @return The original pbp_data with the following columns appended to it: +#' \itemize{ +#' \item{ep} - expected points. +#' \item{no_score_prob} - probability of no more scoring this half. +#' \item{opp_fg_prob} - probability next score opponent field goal this half. +#' \item{opp_safety_prob} - probability next score opponent safety this half. +#' \item{opp_td_prob} - probability of next score opponent touchdown this half. +#' \item{fg_prob} - probability next score field goal this half. +#' \item{safety_prob} - probability next score safety this half. +#' \item{td_prob} - probability text score touchdown this half. +#' } +#' @export + +calculate_expected_points <- function(pbp_data) { + + suppressWarnings( + model_data <- pbp_data %>% + #drop existing values of ep and the probs before making new ones + dplyr::select(-one_of(drop.cols)) %>% + make_model_mutations() %>% + ep_model_select() + ) + + + preds <- as.data.frame( + matrix(stats::predict(ep_model, as.matrix(model_data)), ncol=7, byrow=TRUE) + ) + + colnames(preds) <- c("td_prob","opp_td_prob","fg_prob","opp_fg_prob", + "safety_prob","opp_safety_prob","no_score_prob") + + preds <- preds %>% + dplyr::mutate( + ep = + (-3 * opp_fg_prob) + + (-2 * opp_safety_prob) + + (-7 * opp_td_prob) + + (3 * fg_prob) + + (2 * safety_prob) + + (7 * td_prob) + ) %>% + dplyr::bind_cols(pbp_data) + + return(preds) + +} + + +drop.cols <- c( + 'ep', 'td_prob', 'opp_td_prob', 'fg_prob', 'opp_fg_prob', + 'safety_prob', 'opp_safety_prob', 'no_score_prob' +) + + diff --git a/R/helper_add_cp_cpoe.R b/R/helper_add_cp_cpoe.R index bf227624..0623c64d 100644 --- a/R/helper_add_cp_cpoe.R +++ b/R/helper_add_cp_cpoe.R @@ -1,42 +1,35 @@ ################################################################################ # Author: Ben Baldwin, Sebastian Carl -# Purpose: Function to add cp and cpoe variables +# Purpose: Function to add cp and cpoe variables. +# CP model created by Zach Feldman: https://github.com/z-feldman/Expected_Completion_NFL # Code Style Guide: styler::tidyverse_style() ################################################################################ add_cp <- function(pbp) { - passes <- pbp %>% - dplyr::filter(complete_pass == 1 | incomplete_pass == 1 | interception == 1) %>% - #this step is necessary because NFL doesn't record receiver on incomplete passes prior to 2009 - dplyr::mutate(receiver_player_name = - stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?")) %>% - dplyr::filter( - !is.na(air_yards) & air_yards >= -15 & air_yards < 70 & !is.na(receiver_player_name) & - !is.na(pass_location) & season >= 2006 - ) %>% - dplyr::mutate(air_is_zero = dplyr::if_else(air_yards == 0, 1, 0)) %>% - dplyr::select( - game_id, play_id, complete_pass, air_yards, pass_location, - air_is_zero, yardline_100, ydstogo, down, season - ) %>% - dplyr::mutate( - #since cpoe model uses season as a factor and there's no 2020 data to estimate the model - #pretend that seasons > 2019 are 2019 - #if anyone has better ideas here, please get in touch! - season = dplyr::if_else(season > 2019, 2019, season) - ) - - if (!nrow(passes) == 0) { - passes$cp <- mgcv::predict.gam(cp_models, passes, type = "response") - passes_with_cp <- passes %>% - dplyr::select(game_id, play_id, cp) + + #testing only + #pbp <- g + + passes <- prepare_cp_data(pbp) + + if (!nrow(passes %>% dplyr::filter(valid_pass == 1)) == 0) { + + pred <- stats::predict(cp_model, as.matrix(passes %>% dplyr::select(-complete_pass, -valid_pass))) %>% + tibble::as_tibble() %>% + dplyr::rename(cp = value) %>% + dplyr::bind_cols(passes) %>% + dplyr::select(cp, valid_pass) pbp <- pbp %>% - dplyr::left_join(passes_with_cp, by = c("game_id", "play_id")) %>% + dplyr::bind_cols(pred) %>% dplyr::mutate( - cp = dplyr::if_else(!is.na(cp), cp, NA_real_), + cp = dplyr::if_else( + valid_pass == 1, cp, NA_real_ + ), cpoe = dplyr::if_else(!is.na(cp), 100 * (complete_pass - cp), NA_real_) - ) + ) %>% + dplyr::select(-valid_pass) + message("added cp and cpoe") } else { pbp <- pbp %>% @@ -51,3 +44,32 @@ add_cp <- function(pbp) { } +### helper function for getting the data ready +prepare_cp_data <- function(pbp) { + + # valid pass play: at least -15 air yards, less than 70 air yards, has intended receiver, has pass location + passes <- pbp %>% + dplyr::mutate(receiver_player_name = + stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"), + pass_middle = dplyr::if_else(pass_location == 'middle', 1, 0), + air_is_zero= dplyr::if_else(air_yards == 0,1,0), + distance_to_sticks = air_yards - ydstogo, + valid_pass = dplyr::if_else( + (complete_pass == 1 | incomplete_pass == 1 | interception == 1) & + !is.na(air_yards) & air_yards >= -15 & air_yards <70 & !is.na(receiver_player_name) & !is.na(pass_location), + 1, 0 + ) + ) %>% + dplyr::select( + complete_pass, air_yards, yardline_100, ydstogo, + down1, down2, down3, down4, air_is_zero, pass_middle, + era2, era3, era4, qb_hit, home, model_week, + outdoors, retractable, dome, distance_to_sticks, valid_pass + ) + +} + + + + + diff --git a/R/helper_add_ep_wp.R b/R/helper_add_ep_wp.R index e0374388..9d15df58 100644 --- a/R/helper_add_ep_wp.R +++ b/R/helper_add_ep_wp.R @@ -9,9 +9,10 @@ # could happen add_ep <- function(pbp) { - out <- - pbp %>% - nflscrapR::add_ep_variables() + + out <- pbp %>% + add_ep_variables() + message("added ep variables") return(out) } @@ -44,7 +45,7 @@ add_air_yac_ep <- function(pbp) { message("No non-NA air_yards detected. air_yac_ep variables set to NA") } else { out <- pbp %>% - nflscrapR::add_air_yac_ep_variables() + add_air_yac_ep_variables() message("added air_yac_ep variables") } return(out) @@ -53,7 +54,7 @@ add_air_yac_ep <- function(pbp) { add_wp <- function(pbp) { out <- pbp %>% - nflscrapR::add_wp_variables() + add_wp_variables() message("added wp variables") return(out) } @@ -86,8 +87,1334 @@ add_air_yac_wp <- function(pbp) { message("No non-NA air_yards detected. air_yac_wp variables set to NA") } else { out <- pbp %>% - nflscrapR::add_air_yac_wp_variables() + add_air_yac_wp_variables() message("added air_yac_wp variables") } return(out) } + +#get predictions for a set of pbp data +#for predict stage +get_preds <- function(pbp) { + + preds <- as.data.frame( + matrix(stats::predict(ep_model, as.matrix(pbp %>% ep_model_select())), ncol=7, byrow=TRUE) + ) + + colnames(preds) <- c("Touchdown","Opp_Touchdown","Field_Goal","Opp_Field_Goal", + "Safety","Opp_Safety","No_Score") + + return(preds) +} + +#get predictions for a set of pbp data +#for predict stage +get_preds_wp <- function(pbp) { + + preds <- stats::predict(wp_model, as.matrix(pbp %>% wp_model_select())) + + return(preds) +} + +#get predictions for a set of pbp data +#for predict stage +get_preds_wp_spread <- function(pbp) { + + preds <- stats::predict(wp_model_spread, as.matrix(pbp %>% wp_spread_model_select())) + + return(preds) +} + + + +#get the columns needed for ep predictions +#making sure they're in the right order +ep_model_select <- function(pbp) { + + pbp <- pbp %>% + dplyr::select( + half_seconds_remaining, + yardline_100, + home, + retractable, + dome, + outdoors, + ydstogo, + era0, era1, era2, era3, era4, + down1, down2, down3, down4, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + model_week + ) + + return(pbp) + +} + +#get the columns needed for wp predictions +#making sure they're in the right order +wp_model_select <- function(pbp) { + + pbp <- pbp %>% + dplyr::select( + receive_2h_ko, + half_seconds_remaining, + game_seconds_remaining, + ExpScoreDiff_Time_Ratio, + ep, + score_differential, + down, + ydstogo, + home, + posteam_timeouts_remaining, + defteam_timeouts_remaining + ) + + return(pbp) + +} + +#get the columns needed for wp predictions +#making sure they're in the right order +wp_spread_model_select <- function(pbp) { + + pbp <- pbp %>% + dplyr::select( + receive_2h_ko, + spread_time, + half_seconds_remaining, + game_seconds_remaining, + ExpScoreDiff_Time_Ratio, + ep, + score_differential, + down, + ydstogo, + home, + posteam_timeouts_remaining, + defteam_timeouts_remaining + ) + + return(pbp) + +} + +prepare_wp_data <- function(pbp) { + + pbp <- pbp %>% + dplyr::group_by(game_id) %>% + dplyr::mutate( + receive_2h_ko = dplyr::if_else(qtr <= 2 & posteam == dplyr::first(na.omit(defteam)), 1, 0) + ) %>% + dplyr::ungroup() %>% + dplyr::mutate( + ExpScoreDiff = ep + score_differential, + posteam_spread = dplyr::if_else(home == 1, spread_line, -1 * spread_line), + spread_time = posteam_spread * log(3600 / (50 + (3600 - game_seconds_remaining))), + ExpScoreDiff_Time_Ratio = ExpScoreDiff / (game_seconds_remaining + 1) + ) + + return(pbp) + +} + + +#add ep variables +#All of these are heavily borrowed from nflscrapR (Maksim Horowitz, Ronald Yurko, and Samuel Ventura) +add_ep_variables <- function(pbp_data) { + + #testing + #pbp_data <- g + + #this function is below + base_ep_preds <- get_preds(pbp_data) + + # ---------------------------------------------------------------------------- + # ---- special case: deal with FG attempts + # Now make another dataset that to get the EP probabilities from a missed FG: + missed_fg_data <- pbp_data + # Subtract 5.065401 from TimeSecs: + missed_fg_data$half_seconds_remaining <- missed_fg_data$half_seconds_remaining - 5.065401 + + # Correct the yrdline100: + missed_fg_data$yardline_100 <- 100 - (missed_fg_data$yardline_100 + 8) + # Now first down: + missed_fg_data$down1 <- rep(1,nrow(pbp_data)) + missed_fg_data$down2 <- rep(0,nrow(pbp_data)) + missed_fg_data$down3 <- rep(0,nrow(pbp_data)) + missed_fg_data$down4 <- rep(0,nrow(pbp_data)) + # 10 ydstogo: + missed_fg_data$ydstogo <- rep(10,nrow(pbp_data)) + + # Get the new predicted probabilites: + if (nrow(missed_fg_data) > 1) { + missed_fg_ep_preds <- get_preds(missed_fg_data) + } else{ + missed_fg_ep_preds <- get_preds(missed_fg_data) + } + + # Find the rows where TimeSecs_Remaining became 0 or negative and make all the probs equal to 0: + end_game_i <- which(missed_fg_data$half_seconds_remaining <= 0) + missed_fg_ep_preds[end_game_i,] <- rep(0,ncol(missed_fg_ep_preds)) + + # Get the probability of making the field goal: + make_fg_prob <- as.numeric(mgcv::predict.bam(fg_model, newdata= pbp_data, type="response")) + + # Multiply each value of the missed_fg_ep_preds by the 1 - make_fg_prob + missed_fg_ep_preds <- missed_fg_ep_preds * (1 - make_fg_prob) + # Find the FG attempts: + fg_attempt_i <- which(pbp_data$play_type == "field_goal") + + # Now update the probabilities for the FG attempts (also includes Opp_Field_Goal probability from missed_fg_ep_preds) + base_ep_preds[fg_attempt_i, "Field_Goal"] <- make_fg_prob[fg_attempt_i] + missed_fg_ep_preds[fg_attempt_i,"Opp_Field_Goal"] + # Update the other columns based on the opposite possession: + base_ep_preds[fg_attempt_i, "Touchdown"] <- missed_fg_ep_preds[fg_attempt_i,"Opp_Touchdown"] + base_ep_preds[fg_attempt_i, "Opp_Field_Goal"] <- missed_fg_ep_preds[fg_attempt_i,"Field_Goal"] + base_ep_preds[fg_attempt_i, "Opp_Touchdown"] <- missed_fg_ep_preds[fg_attempt_i,"Touchdown"] + base_ep_preds[fg_attempt_i, "Safety"] <- missed_fg_ep_preds[fg_attempt_i,"Opp_Safety"] + base_ep_preds[fg_attempt_i, "Opp_Safety"] <- missed_fg_ep_preds[fg_attempt_i,"Safety"] + base_ep_preds[fg_attempt_i, "No_Score"] <- missed_fg_ep_preds[fg_attempt_i,"No_Score"] + + # ---------------------------------------------------------------------------------- + # ---- special case: deal with kickoffs + # Calculate the EP for receiving a touchback (from the point of view for recieving team) + # and update the columns for Kickoff plays: + kickoff_data <- pbp_data + + # Change the yard line to be 80 for 2009-2015 and 75 otherwise + # (accounting for the fact that Jan 2016 is in the 2015 season: + kickoff_data$yardline_100 <- with(kickoff_data, + ifelse(season < 2016, + 80, 75)) + # Now first down: + kickoff_data$down1 <- rep(1,nrow(pbp_data)) + kickoff_data$down2 <- rep(0,nrow(pbp_data)) + kickoff_data$down3 <- rep(0,nrow(pbp_data)) + kickoff_data$down4 <- rep(0,nrow(pbp_data)) + # 10 ydstogo: + kickoff_data$ydstogo <- rep(10,nrow(pbp_data)) + + # Get the new predicted probabilites: + kickoff_preds <- get_preds(kickoff_data) + + # Find the kickoffs: + kickoff_i <- which(pbp_data$play_type == "kickoff") + + # Now update the probabilities: + base_ep_preds[kickoff_i, "Field_Goal"] <- kickoff_preds[kickoff_i, "Field_Goal"] + base_ep_preds[kickoff_i, "Touchdown"] <- kickoff_preds[kickoff_i, "Touchdown"] + base_ep_preds[kickoff_i, "Opp_Field_Goal"] <- kickoff_preds[kickoff_i, "Opp_Field_Goal"] + base_ep_preds[kickoff_i, "Opp_Touchdown"] <- kickoff_preds[kickoff_i, "Opp_Touchdown"] + base_ep_preds[kickoff_i, "Safety"] <- kickoff_preds[kickoff_i, "Safety"] + base_ep_preds[kickoff_i, "Opp_Safety"] <- kickoff_preds[kickoff_i, "Opp_Safety"] + base_ep_preds[kickoff_i, "No_Score"] <- kickoff_preds[kickoff_i, "No_Score"] + + # ---------------------------------------------------------------------------------- + # Insert probabilities of 0 for everything but No_Score for QB Kneels that + # occur on the possession team's side of the field: + # Find these QB Kneels: + qb_kneels_i <- which(pbp_data$play_type == "qb_kneel" & pbp_data$yardline_100 > 50) + + # Now update the probabilities: + base_ep_preds[qb_kneels_i, "Field_Goal"] <- 0 + base_ep_preds[qb_kneels_i, "Touchdown"] <- 0 + base_ep_preds[qb_kneels_i, "Opp_Field_Goal"] <- 0 + base_ep_preds[qb_kneels_i, "Opp_Touchdown"] <- 0 + base_ep_preds[qb_kneels_i, "Safety"] <- 0 + base_ep_preds[qb_kneels_i, "Opp_Safety"] <- 0 + base_ep_preds[qb_kneels_i, "No_Score"] <- 1 + + + # ---------------------------------------------------------------------------------- + # Create two new columns, ExPoint_Prob and TwoPoint_Prob, for the PAT events: + base_ep_preds$ExPoint_Prob <- 0 + base_ep_preds$TwoPoint_Prob <- 0 + + # Find the indices for these types of plays: + extrapoint_i <- which(pbp_data$play_type == "extra_point") + twopoint_i <- which(pbp_data$two_point_attempt == 1) + + # Assign the make_fg_probs of the extra-point PATs: + base_ep_preds$ExPoint_Prob[extrapoint_i] <- make_fg_prob[extrapoint_i] + + # Assign the TwoPoint_Prob with the historical success rate: + base_ep_preds$TwoPoint_Prob[twopoint_i] <- 0.4735 + + # ---------------------------------------------------------------------------------- + # Insert NAs for timeouts and end of play rows: + missing_i <- which((pbp_data$timeout == 1 & pbp_data$play_type == "no_play") | is.na(pbp_data$play_type)) + + # Now update the probabilities for missing and PATs: + base_ep_preds$Field_Goal[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$Touchdown[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$Opp_Field_Goal[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$Opp_Touchdown[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$Safety[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$Opp_Safety[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + base_ep_preds$No_Score[c(missing_i, extrapoint_i, twopoint_i)] <- 0 + + # Rename the events to all have _Prob at the end of them: + base_ep_preds <- dplyr::rename(base_ep_preds, + Field_Goal_Prob = Field_Goal, + Touchdown_Prob = Touchdown, + Opp_Field_Goal_Prob = Opp_Field_Goal, + Opp_Touchdown_Prob = Opp_Touchdown, + Safety_Prob = Safety, + Opp_Safety_Prob = Opp_Safety, + No_Score_Prob = No_Score) + + + # Join them together: + pbp_data <- cbind(pbp_data, base_ep_preds) + + # Calculate the ExpPts: + pbp_data_ep <- dplyr::mutate(pbp_data, + ExpPts = (0*No_Score_Prob) + (-3 * Opp_Field_Goal_Prob) + + (-2 * Opp_Safety_Prob) + + (-7 * Opp_Touchdown_Prob) + (3 * Field_Goal_Prob) + + (2 * Safety_Prob) + (7 * Touchdown_Prob) + + (1 * ExPoint_Prob) + (2 * TwoPoint_Prob)) + + ################################################################# + # Calculate EPA: + + ### Adding Expected Points Added (EPA) column + + # Create multiple types of EPA columns + # for each of the possible cases, + # grouping by GameID (will then just use + # an ifelse statement to decide which one + # to use as the final EPA): + pbp_data_ep %>% + dplyr::group_by(game_id) %>% + dplyr::mutate(# Now conditionally assign the EPA, first for possession team + # touchdowns: + EPA = dplyr::if_else(!is.na(td_team), + dplyr::if_else(td_team == posteam, + 7 - ExpPts, -7 - ExpPts), + 0), + # 7 - ExpPts, 0), + # Offense field goal: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 1, + 3 - ExpPts, EPA), + # Offense extra-point: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 1, + 1 - ExpPts, EPA), + # Offense two-point conversion: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + (two_point_rush_good == 1 | + two_point_pass_good == 1 | + two_point_pass_reception_good == 1), + 2 - ExpPts, EPA), + # Failed PAT (both 1 and 2): + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + ((extra_point_failed == 1 | + extra_point_blocked == 1 | + extra_point_aborted == 1) | + (two_point_rush_failed == 1 | + two_point_pass_failed == 1 | + two_point_pass_reception_failed == 1)), + 0 - ExpPts, EPA), + # Opponent safety: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 1, + -2 - ExpPts, EPA), + # Defense touchdown + #EPA = dplyr::if_else(touchdown == 1 & td_team == defteam, + # -7 - ExpPts, EPA), + # Change of possession without defense scoring + # and no timeout, two minute warning, or quarter end follows: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + drive != dplyr::lead(drive) & + posteam != dplyr::lead(posteam) & + !is.na(dplyr::lead(play_type)) & + (dplyr::lead(timeout) == 0 | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) != "no_play")), + -dplyr::lead(ExpPts) - ExpPts, EPA), + # Same thing except for when timeouts and end of play follow: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + (is.na(dplyr::lead(play_type)) | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) == "no_play")) & + drive != dplyr::lead(drive, 2) & + posteam != dplyr::lead(posteam, 2), + -dplyr::lead(ExpPts, 2) - ExpPts, EPA), + # Same thing except for when back to back rows of end of + # play that can potentially occur because the NFL likes to + # make my life difficult: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + (is.na(dplyr::lead(play_type)) & + is.na(dplyr::lead(play_type, 2))) & + drive != dplyr::lead(drive, 3) & + posteam != dplyr::lead(posteam, 3), + -dplyr::lead(ExpPts, 3) - ExpPts, EPA), + # Team keeps possession and no timeout or end of play follows: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + posteam == dplyr::lead(posteam) & + !is.na(dplyr::lead(play_type)) & + (dplyr::lead(timeout) == 0 | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) != "no_play")), + dplyr::lead(ExpPts) - ExpPts, EPA), + # Same but timeout or end of play follows: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + (is.na(dplyr::lead(play_type)) | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) == "no_play")) & + posteam == dplyr::lead(posteam, 2), + dplyr::lead(ExpPts, 2) - ExpPts, EPA), + # Same as above but when two rows without play info follow: + EPA = dplyr::if_else(is.na(td_team) & field_goal_made == 0 & + extra_point_good == 0 & + extra_point_failed == 0 & + extra_point_blocked == 0 & + extra_point_aborted == 0 & + two_point_rush_failed == 0 & + two_point_pass_failed == 0 & + two_point_pass_reception_failed == 0 & + two_point_rush_good == 0 & + two_point_pass_good == 0 & + two_point_pass_reception_good == 0 & + safety == 0 & + ( + #next play is missing play type or has timeout + ( is.na(dplyr::lead(play_type)) | (dplyr::lead(timeout) == 1 & dplyr::lead(play_type) == "no_play") ) & + #same for play after that + ( is.na(dplyr::lead(play_type, 2)) | (dplyr::lead(timeout, 2) == 1 & dplyr::lead(play_type, 2) == "no_play") ) + ) & + posteam == dplyr::lead(posteam, 3), + dplyr::lead(ExpPts, 3) - ExpPts, EPA)) %>% + # Now rename each of the expected points columns to match the style of + # the updated code: + dplyr::rename(ep = ExpPts, epa = EPA, + no_score_prob = No_Score_Prob, + opp_fg_prob = Opp_Field_Goal_Prob, + opp_safety_prob = Opp_Safety_Prob, + opp_td_prob = Opp_Touchdown_Prob, + fg_prob = Field_Goal_Prob, + safety_prob = Safety_Prob, + td_prob = Touchdown_Prob, + extra_point_prob = ExPoint_Prob, + two_point_conversion_prob = TwoPoint_Prob) %>% + # Create columns with cumulative epa totals for both teams: + dplyr::mutate(ep = dplyr::if_else(timeout == 1 & play_type == "no_play", + dplyr::lead(ep), ep), + epa = dplyr::if_else(timeout == 1 & play_type == "no_play", + 0, epa), + # Change epa for plays occurring at end of half with no scoring + # plays to be just the difference between 0 and starting ep: + epa = dplyr::if_else(((qtr == 2 & + (dplyr::lead(qtr) == 3 | + dplyr::lead(desc) == "END QUARTER 2")) | + (qtr == 4 & + (dplyr::lead(qtr) == 5 | + dplyr::lead(desc) == "END QUARTER 4"))) & + sp == 0 & + !is.na(play_type), + 0 - ep, epa), + home_team_epa = dplyr::if_else(posteam == home_team, + epa, -epa), + away_team_epa = dplyr::if_else(posteam == away_team, + epa, -epa), + home_team_epa = dplyr::if_else(is.na(home_team_epa), + 0, home_team_epa), + away_team_epa = dplyr::if_else(is.na(away_team_epa), + 0, away_team_epa), + total_home_epa = cumsum(home_team_epa), + total_away_epa = cumsum(away_team_epa), + # Same thing but separating passing and rushing: + home_team_rush_epa = dplyr::if_else(play_type == "run", + home_team_epa, 0), + away_team_rush_epa = dplyr::if_else(play_type == "run", + away_team_epa, 0), + home_team_rush_epa = dplyr::if_else(is.na(home_team_rush_epa), + 0, home_team_rush_epa), + away_team_rush_epa = dplyr::if_else(is.na(away_team_rush_epa), + 0, away_team_rush_epa), + total_home_rush_epa = cumsum(home_team_rush_epa), + total_away_rush_epa = cumsum(away_team_rush_epa), + home_team_pass_epa = dplyr::if_else(play_type == "pass", + home_team_epa, 0), + away_team_pass_epa = dplyr::if_else(play_type == "pass", + away_team_epa, 0), + home_team_pass_epa = dplyr::if_else(is.na(home_team_pass_epa), + 0, home_team_pass_epa), + away_team_pass_epa = dplyr::if_else(is.na(away_team_pass_epa), + 0, away_team_pass_epa), + total_home_pass_epa = cumsum(home_team_pass_epa), + total_away_pass_epa = cumsum(away_team_pass_epa)) %>% + dplyr::ungroup() %>% + return +} + + +################################################################# +# Calculate WP and WPA: + +add_wp_variables <- function(pbp_data) { + + #testing only + #pbp_data <- g + + # Initialize the df to store predicted win probability + OffWinProb <- rep(NA_real_, nrow(pbp_data)) + OffWinProb_spread <- rep(NA_real_, nrow(pbp_data)) + + pbp_data <- pbp_data %>% + prepare_wp_data() + + # First check if there's any overtime plays: + if (any(pbp_data$qtr == 5 | pbp_data$qtr == 6)){ + # Find the rows that are overtime: + overtime_i <- which(pbp_data$qtr == 5 | pbp_data$qtr == 6) + + # Separate the dataset into regular_df and overtime_df: + regular_df <- pbp_data[-overtime_i,] + overtime_df <- pbp_data[overtime_i,] + + # Use the win prob model to predict the win probability for + # regulation time plays: + OffWinProb[-overtime_i] <- get_preds_wp(regular_df) + OffWinProb_spread[-overtime_i] <- get_preds_wp_spread(regular_df) + + ## now we need to fix WP on kickoffs + kickoff_data <- regular_df + + # Change the yard line to be 80 for 2009-2015 and 75 otherwise + kickoff_data$yardline_100 <- with(kickoff_data, + ifelse(season < 2016, + 80, 75)) + # Now first down: + kickoff_data$down1 <- rep(1,nrow(regular_df)) + kickoff_data$down2 <- rep(0,nrow(regular_df)) + kickoff_data$down3 <- rep(0,nrow(regular_df)) + kickoff_data$down4 <- rep(0,nrow(regular_df)) + # 10 ydstogo: + kickoff_data$ydstogo <- rep(10,nrow(regular_df)) + + # Get the new predicted probabilites: + kickoff_preds <- get_preds_wp(kickoff_data) + kickoff_preds_spread <- get_preds_wp_spread(kickoff_data) + + # Find the kickoffs: + kickoff_i <- which(regular_df$play_type == "kickoff") + + # Now update the probabilities: + OffWinProb[kickoff_i] <- kickoff_preds[kickoff_i] + OffWinProb_spread[kickoff_i] <- kickoff_preds_spread[kickoff_i] + + ## end fix for kickoffs + + + # Separate routine for overtime: + + # Create a column that is just the first drive of overtime repeated: + overtime_df$First_Drive <- rep(min(overtime_df$drive, + na.rm = TRUE), + nrow(overtime_df)) + + # Calculate the difference in drive number + overtime_df <- dplyr::mutate(overtime_df, + Drive_Diff = drive - First_Drive) + + # Create an indicator column that means the posteam is losing by 3 and + # its the second drive of overtime: + overtime_df$One_FG_Game <- ifelse(overtime_df$score_differential == -3 & + overtime_df$Drive_Diff == 1, 1, 0) + + # Now create a copy of the dataset to then make the EP predictions for when + # a field goal is scored and its not sudden death: + overtime_df_ko <- overtime_df + + overtime_df_ko$yrdline100 <- with(overtime_df_ko, + ifelse(game_year < 2016 | + (game_year == 2016 & game_month < 4), + 80, 75)) + + # Now first down: + overtime_df_ko$down1 <- rep(1,nrow(overtime_df_ko)) + overtime_df_ko$down2 <- rep(0,nrow(overtime_df_ko)) + overtime_df_ko$down3 <- rep(0,nrow(overtime_df_ko)) + overtime_df_ko$down4 <- rep(0,nrow(overtime_df_ko)) + # 10 ydstogo: + overtime_df_ko$ydstogo <- rep(10,nrow(overtime_df_ko)) + + # Get the predictions from the EP model and calculate the necessary probability: + overtime_df_ko_preds <- get_preds(overtime_df_ko) + + overtime_df_ko_preds <- dplyr::mutate(overtime_df_ko_preds, + Win_Back = No_Score + Opp_Field_Goal + Opp_Safety + Opp_Touchdown) + + # Calculate the two possible win probability types, Sudden Death and one Field Goal: + overtime_df$Sudden_Death_WP <- overtime_df$fg_prob + overtime_df$td_prob + overtime_df$safety_prob + overtime_df$One_FG_WP <- overtime_df$td_prob + (overtime_df$fg_prob * overtime_df_ko_preds$Win_Back) + + + # Decide which win probability to use: + OffWinProb[overtime_i] <- ifelse(overtime_df$game_year >= 2012 & (overtime_df$Drive_Diff == 0 | (overtime_df$Drive_Diff == 1 & overtime_df$One_FG_Game == 1)), + overtime_df$One_FG_WP, overtime_df$Sudden_Death_WP) + OffWinProb_spread[overtime_i] <- OffWinProb[overtime_i] + + + } else { + + OffWinProb <- get_preds_wp(pbp_data) + OffWinProb_spread <- get_preds_wp_spread(pbp_data) + + ## now we need to fix WP on kickoffs + kickoff_data <- pbp_data + + # Change the yard line to be 80 for 2009-2015 and 75 otherwise + kickoff_data$yardline_100 <- with(kickoff_data, + ifelse(season < 2016, + 80, 75)) + # Now first down: + kickoff_data$down1 <- rep(1,nrow(pbp_data)) + kickoff_data$down2 <- rep(0,nrow(pbp_data)) + kickoff_data$down3 <- rep(0,nrow(pbp_data)) + kickoff_data$down4 <- rep(0,nrow(pbp_data)) + # 10 ydstogo: + kickoff_data$ydstogo <- rep(10,nrow(pbp_data)) + + # Get the new predicted probabilites: + kickoff_preds <- get_preds_wp(kickoff_data) + kickoff_preds_spread <- get_preds_wp_spread(kickoff_data) + + # Find the kickoffs: + kickoff_i <- which(pbp_data$play_type == "kickoff") + + # Now update the probabilities: + OffWinProb[kickoff_i] <- kickoff_preds[kickoff_i] + OffWinProb_spread[kickoff_i] <- kickoff_preds_spread[kickoff_i] + + ## end fix + + } + + + # Now create the win probability columns and return: + pbp_data <- pbp_data %>% + dplyr::mutate( + wp = OffWinProb, + wp = dplyr::if_else(is.na(posteam), NA_real_, wp), + def_wp = 1 - wp, + home_wp = dplyr::if_else(posteam == home_team, + wp, def_wp), + away_wp = dplyr::if_else(posteam == away_team, + wp, def_wp), + #add columns for WP taking into account spread + vegas_wp = OffWinProb_spread, + vegas_wp = dplyr::if_else(is.na(posteam), NA_real_, vegas_wp), + vegas_home_wp = dplyr::if_else(posteam == home_team, + vegas_wp, 1 - vegas_wp), + #make 1 or 0 the final win prob + vegas_home_wp = dplyr::if_else( + stringr::str_detect( + tolower(desc), "(end of game)|(end game)" + ), + dplyr::case_when( + home_score > away_score ~ 1, + away_score > home_score ~ 0, + home_score == away_score ~ .5 + ), + vegas_home_wp + ) + ) + + # For now follow the code from before, will need to update later: + # Create the possible WPA values + pbp_data <- dplyr::mutate(pbp_data, + # Team keeps possession (most general case): + WPA_base = dplyr::lead(wp) - wp, + # Team keeps possession but either Timeout, Two Minute Warning, + # Quarter End is the following row + WPA_base_nxt = dplyr::lead(wp,2) - wp, + # Change of possession and no timeout, + # two minute warning, or quarter end follows: + WPA_change = (1 - dplyr::lead(wp)) - wp, + # Change of possession but either Timeout, + # Two Minute Warning, or + # Quarter End is the following row: + WPA_change_nxt = (1 - dplyr::lead(wp, 2)) - wp, + # End of quarter, half or end rows: + WPA_halfend_to = 0) + # Create a WPA column for the last play of the game: + pbp_data$WPA_final <- ifelse(pbp_data$score_differential_post > 0 & pbp_data$posteam == pbp_data$home_team, + 1 - pbp_data$home_wp, + ifelse(pbp_data$score_differential_post > 0 & pbp_data$posteam == pbp_data$away_team, + 1 - pbp_data$away_wp, + ifelse(pbp_data$score_differential_post <= 0 & pbp_data$posteam == pbp_data$home_team, + 0 - pbp_data$home_wp, + ifelse(pbp_data$score_differential_post <= 0 & pbp_data$posteam == pbp_data$away_team, + 0 - pbp_data$away_wp, 0)))) + + pbp_data$WPA_base_nxt_ind <- with(pbp_data, + ifelse(posteam == dplyr::lead(posteam, 2) & + #drive == dplyr::lead(drive, 2) & + (is.na(dplyr::lead(play_type)) | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) == "no_play")), 1, 0)) + + pbp_data$WPA_change_nxt_ind <- with(pbp_data, + ifelse(posteam != dplyr::lead(posteam, 2) & + #drive != dplyr::lead(drive, 2) & + (is.na(dplyr::lead(play_type)) | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) == "no_play")), 1, 0)) + + pbp_data$WPA_change_ind <- with(pbp_data, + ifelse(posteam != dplyr::lead(posteam) & + #drive != dplyr::lead(drive) & + !is.na(dplyr::lead(play_type)) & + (dplyr::lead(timeout) == 0 | + (dplyr::lead(timeout) == 1 & + dplyr::lead(play_type) != "no_play")), 1, 0)) + pbp_data$WPA_halfend_to_ind <- with(pbp_data, + ifelse(is.na(play_type) | + (timeout == 1 & play_type == "no_play"), 1, 0)) + pbp_data$WPA_final_ind <- with(pbp_data, ifelse(stringr::str_detect(dplyr::lead(tolower(desc)), + "(end of game)|(end game)"), 1, 0)) + + # Replace the missings with 0 due to how ifelse treats missings + pbp_data$WPA_base_nxt_ind[is.na(pbp_data$WPA_base_nxt_ind)] <- 0 + pbp_data$WPA_change_nxt_ind[is.na(pbp_data$WPA_change_nxt_ind)] <- 0 + pbp_data$WPA_change_ind[is.na(pbp_data$WPA_change_ind)] <- 0 + pbp_data$WPA_halfend_to_ind[is.na(pbp_data$WPA_halfend_to_ind)] <- 0 + pbp_data$WPA_final_ind[is.na(pbp_data$WPA_final_ind)] <- 0 + + + # Assign WPA using these indicator columns: + pbp_data$wpa <- with(pbp_data, + ifelse(WPA_final_ind == 1, WPA_final, + ifelse(WPA_halfend_to_ind == 1, WPA_halfend_to, + ifelse(WPA_change_nxt_ind == 1, WPA_change_nxt, + ifelse(WPA_base_nxt_ind == 1, WPA_base_nxt, + ifelse(WPA_change_ind == 1, WPA_change, + WPA_base)))))) + # Home and Away post: + + pbp_data$home_wp_post <- ifelse(pbp_data$posteam == pbp_data$home_team, + pbp_data$home_wp + pbp_data$wpa, + pbp_data$home_wp - pbp_data$wpa) + pbp_data$away_wp_post <- ifelse(pbp_data$posteam == pbp_data$away_team, + pbp_data$away_wp + pbp_data$wpa, + pbp_data$away_wp - pbp_data$wpa) + + # If next thing is end of game, and post score differential is tied because it's + # overtime then make both the home_wp_post and away_wp_post equal to 0: + pbp_data <- pbp_data %>% + dplyr::mutate(home_wp_post = dplyr::if_else(qtr == 5 & + stringr::str_detect(tolower(dplyr::lead(desc)), + "(end of game)|(end game)") & + score_differential_post == 0, + 0, home_wp_post), + away_wp_post = dplyr::if_else(qtr == 5 & + stringr::str_detect(tolower(dplyr::lead(desc)), + "(end of game)|(end game)") & + score_differential_post == 0, + 0, away_wp_post)) + + + # For plays with playtype of End of Game, use the previous play's WP_post columns + # as the pre and post, since those are already set to be 1 and 0: + pbp_data$home_wp <- with(pbp_data, + ifelse(stringr::str_detect(tolower(desc), + "(end of game)|(end game)"), + dplyr::lag(home_wp_post), + ifelse(dplyr::lag(play_type) == "no_play" & play_type == "no_play", dplyr::lag(home_wp),home_wp))) + + pbp_data$home_wp_post <- with(pbp_data, + ifelse(stringr::str_detect(tolower(desc), + "(end of game)|(end game)"), dplyr::lag(home_wp_post), + ifelse(dplyr::lag(play_type) == "no_play" & play_type == "no_play", dplyr::lag(home_wp_post),home_wp_post))) + pbp_data$away_wp <- with(pbp_data, + ifelse(stringr::str_detect(tolower(desc), + "(end of game)|(end game)"), + dplyr::lag(away_wp_post), + ifelse(dplyr::lag(play_type) == "no_play" & play_type == "no_play", dplyr::lag(away_wp),away_wp))) + + pbp_data$away_wp_post <- with(pbp_data, + ifelse(stringr::str_detect(tolower(desc), + "(end of game)|(end game)"), dplyr::lag(away_wp_post), + ifelse(dplyr::lag(play_type) == "no_play" & play_type == "no_play", dplyr::lag(away_wp_post),away_wp_post))) + + + + # Now drop the unnecessary columns, rename variables back, and return: + pbp_data %>% dplyr::select(-c(WPA_base,WPA_base_nxt,WPA_change_nxt,WPA_change, + WPA_halfend_to, WPA_final, + WPA_base_nxt_ind, WPA_change_nxt_ind, + WPA_change_ind, WPA_halfend_to_ind, WPA_final_ind + )) %>% + dplyr::mutate( + # Generate columns to keep track of cumulative rushing and + # passing WPA values: + home_team_wpa = dplyr::if_else(posteam == home_team, + wpa, -wpa), + away_team_wpa = dplyr::if_else(posteam == away_team, + wpa, -wpa), + home_team_wpa = dplyr::if_else(is.na(home_team_wpa), + 0, home_team_wpa), + away_team_wpa = dplyr::if_else(is.na(away_team_wpa), + 0, away_team_wpa), + # Same thing but separating passing and rushing: + home_team_rush_wpa = dplyr::if_else(play_type == "run", + home_team_wpa, 0), + away_team_rush_wpa = dplyr::if_else(play_type == "run", + away_team_wpa, 0), + home_team_rush_wpa = dplyr::if_else(is.na(home_team_rush_wpa), + 0, home_team_rush_wpa), + away_team_rush_wpa = dplyr::if_else(is.na(away_team_rush_wpa), + 0, away_team_rush_wpa), + total_home_rush_wpa = cumsum(home_team_rush_wpa), + total_away_rush_wpa = cumsum(away_team_rush_wpa), + home_team_pass_wpa = dplyr::if_else(play_type == "pass", + home_team_wpa, 0), + away_team_pass_wpa = dplyr::if_else(play_type == "pass", + away_team_wpa, 0), + home_team_pass_wpa = dplyr::if_else(is.na(home_team_pass_wpa), + 0, home_team_pass_wpa), + away_team_pass_wpa = dplyr::if_else(is.na(away_team_pass_wpa), + 0, away_team_pass_wpa), + total_home_pass_wpa = cumsum(home_team_pass_wpa), + total_away_pass_wpa = cumsum(away_team_pass_wpa)) %>% + return + +} + + + + +################################################################# +# air and YAC EP: +# as with the rest, heavily borrowed from nflscrapR: +# https://github.com/maksimhorowitz/nflscrapR/blob/master/R/add_ep_wp_variables.R + +add_air_yac_ep_variables <- function(pbp_data) { + + #testing + #pbp_data <- g + + # Final all pass attempts that are not sacks: + pass_plays_i <- which(!is.na(pbp_data$air_yards) & pbp_data$play_type == 'pass') + pass_pbp_data <- pbp_data[pass_plays_i,] + + # Using the air_yards need to update the following: + # - yrdline100 + # - TimeSecs_Remaining + # - ydstogo + # - down + # - timeouts + + # Get everything set up for calculation + pass_pbp_data <- pass_pbp_data %>% + dplyr::mutate( + posteam_timeouts_pre = posteam_timeouts_remaining, + defeam_timeouts_pre = defteam_timeouts_remaining + ) %>% + # Rename the old columns to update for calculating the EP from the air: + dplyr::rename(old_yrdline100 = yardline_100, + old_ydstogo = ydstogo, + old_TimeSecs_Remaining = half_seconds_remaining, + old_down = down) %>% + dplyr::mutate(Turnover_Ind = dplyr::if_else(old_down == 4 & air_yards < old_ydstogo, + 1, 0), + yardline_100 = dplyr::if_else(Turnover_Ind == 0, + old_yrdline100 - air_yards, + 100 - (old_yrdline100 - air_yards)), + ydstogo = dplyr::if_else(air_yards >= old_ydstogo | + Turnover_Ind == 1, + 10, old_ydstogo - air_yards), + down = dplyr::if_else(air_yards >= old_ydstogo | + Turnover_Ind == 1, + 1, as.numeric(old_down) + 1), + half_seconds_remaining = old_TimeSecs_Remaining - 5.704673, + down1 = dplyr::if_else(down == 1, 1, 0), + down2 = dplyr::if_else(down == 2, 1, 0), + down3 = dplyr::if_else(down == 3, 1, 0), + down4 = dplyr::if_else(down == 4, 1, 0), + posteam_timeouts_remaining = dplyr::if_else(Turnover_Ind == 1, + defeam_timeouts_pre, + posteam_timeouts_pre), + defteam_timeouts_remaining = dplyr::if_else(Turnover_Ind == 1, + posteam_timeouts_pre, + defeam_timeouts_pre) + ) + + + #get EP predictions + pass_pbp_data_preds <- get_preds(pass_pbp_data) + + # Convert to air EP: + pass_pbp_data_preds <- dplyr::mutate(pass_pbp_data_preds, airEP = (Opp_Safety*-2) + (Opp_Field_Goal*-3) + + (Opp_Touchdown*-7) + (Safety*2) + (Field_Goal*3) + (Touchdown*7)) + + # Return back to the passing data: + pass_pbp_data$airEP <- pass_pbp_data_preds$airEP + + # For the plays that have TimeSecs_Remaining 0 or less, set airEP to 0: + pass_pbp_data$airEP[which(pass_pbp_data$half_seconds_remaining <= 0)] <- 0 + + # Calculate the airEPA based on 4 scenarios: + pass_pbp_data$airEPA <- with(pass_pbp_data, ifelse(old_yrdline100 - air_yards <= 0, + 7 - ep, + ifelse(old_yrdline100 - air_yards > 99, + -2 - ep, + ifelse(Turnover_Ind == 1, + (-1*airEP) - ep, + airEP - ep)))) + + # If the play is a two-point conversion then change the airEPA to NA since + # no air yards are provided: + pass_pbp_data$airEPA <- with(pass_pbp_data, ifelse(two_point_attempt == 1, + NA, airEPA)) + # Calculate the yards after catch EPA: + pass_pbp_data <- dplyr::mutate(pass_pbp_data, yacEPA = epa - airEPA) + + + # if Yards after catch is 0 make yacEPA set to 0: + pass_pbp_data$yacEPA <- ifelse(pass_pbp_data$penalty == 0 & pass_pbp_data$yards_after_catch == 0 & pass_pbp_data$complete_pass==1, + 0, pass_pbp_data$yacEPA) + + # if Yards after catch is 0 make airEPA set to EPA: + pass_pbp_data$airEPA <- ifelse(pass_pbp_data$penalty == 0 & pass_pbp_data$yards_after_catch == 0 & pass_pbp_data$complete_pass == 1, + pass_pbp_data$epa, pass_pbp_data$airEPA) + + # Now add airEPA and yacEPA to the original dataset: + pbp_data$airEPA <- NA + pbp_data$yacEPA <- NA + pbp_data$airEPA[pass_plays_i] <- pass_pbp_data$airEPA + pbp_data$yacEPA[pass_plays_i] <- pass_pbp_data$yacEPA + + # Now change the names to be the right style, calculate the completion form + # of the variables, as well as the cumulative totals and return: + pbp_data %>% + dplyr::rename(air_epa = airEPA, + yac_epa = yacEPA) %>% + dplyr::mutate(comp_air_epa = dplyr::if_else(complete_pass == 1, + air_epa, 0), + comp_yac_epa = dplyr::if_else(complete_pass == 1, + yac_epa, 0), + home_team_comp_air_epa = dplyr::if_else(posteam == home_team, + comp_air_epa, -comp_air_epa), + away_team_comp_air_epa = dplyr::if_else(posteam == away_team, + comp_air_epa, -comp_air_epa), + home_team_comp_yac_epa = dplyr::if_else(posteam == home_team, + comp_yac_epa, -comp_yac_epa), + away_team_comp_yac_epa = dplyr::if_else(posteam == away_team, + comp_yac_epa, -comp_yac_epa), + home_team_comp_air_epa = dplyr::if_else(is.na(home_team_comp_air_epa), + 0, home_team_comp_air_epa), + away_team_comp_air_epa = dplyr::if_else(is.na(away_team_comp_air_epa), + 0, away_team_comp_air_epa), + home_team_comp_yac_epa = dplyr::if_else(is.na(home_team_comp_yac_epa), + 0, home_team_comp_yac_epa), + away_team_comp_yac_epa = dplyr::if_else(is.na(away_team_comp_yac_epa), + 0, away_team_comp_yac_epa), + total_home_comp_air_epa = cumsum(home_team_comp_air_epa), + total_away_comp_air_epa = cumsum(away_team_comp_air_epa), + total_home_comp_yac_epa = cumsum(home_team_comp_yac_epa), + total_away_comp_yac_epa = cumsum(away_team_comp_yac_epa), + # Same but for raw - not just completions: + home_team_raw_air_epa = dplyr::if_else(posteam == home_team, + air_epa, -air_epa), + away_team_raw_air_epa = dplyr::if_else(posteam == away_team, + air_epa, -air_epa), + home_team_raw_yac_epa = dplyr::if_else(posteam == home_team, + yac_epa, -yac_epa), + away_team_raw_yac_epa = dplyr::if_else(posteam == away_team, + yac_epa, -yac_epa), + home_team_raw_air_epa = dplyr::if_else(is.na(home_team_raw_air_epa), + 0, home_team_raw_air_epa), + away_team_raw_air_epa = dplyr::if_else(is.na(away_team_raw_air_epa), + 0, away_team_raw_air_epa), + home_team_raw_yac_epa = dplyr::if_else(is.na(home_team_raw_yac_epa), + 0, home_team_raw_yac_epa), + away_team_raw_yac_epa = dplyr::if_else(is.na(away_team_raw_yac_epa), + 0, away_team_raw_yac_epa), + total_home_raw_air_epa = cumsum(home_team_raw_air_epa), + total_away_raw_air_epa = cumsum(away_team_raw_air_epa), + total_home_raw_yac_epa = cumsum(home_team_raw_yac_epa), + total_away_raw_yac_epa = cumsum(away_team_raw_yac_epa)) %>% + return +} + + +################################################################# +# air and YAC WP: +# as with the rest, heavily borrowed from nflscrapR: +# https://github.com/maksimhorowitz/nflscrapR/blob/master/R/add_ep_wp_variables.R + +add_air_yac_wp_variables <- function(pbp_data) { + + #testing + #pbp_data <- g + + # Change the names to reflect the old style - will update this later on: + pbp_data <- pbp_data %>% + dplyr::mutate( + posteam_timeouts_pre = posteam_timeouts_remaining, + defeam_timeouts_pre = defteam_timeouts_remaining + ) + + # Final all pass attempts that are not sacks: + pass_plays_i <- which(!is.na(pbp_data$air_yards) & pbp_data$play_type == 'pass') + pass_pbp_data <- pbp_data[pass_plays_i,] + + pass_pbp_data <- pass_pbp_data %>% + dplyr::mutate(ExpScoreDiff = ep + air_epa + score_differential, + half_seconds_remaining = half_seconds_remaining - 5.704673, + game_seconds_remaining = game_seconds_remaining - 5.704673, + ExpScoreDiff_Time_Ratio = ExpScoreDiff / (game_seconds_remaining + 1), + Turnover_Ind = dplyr::if_else(down == 4 & air_yards < ydstogo, + 1, 0), + ExpScoreDiff = dplyr::if_else(Turnover_Ind == 1, + -1 * ExpScoreDiff, ExpScoreDiff), + ExpScoreDiff_Time_Ratio = dplyr::if_else(Turnover_Ind == 1, + -1 * ExpScoreDiff_Time_Ratio, + ExpScoreDiff_Time_Ratio), + posteam_timeouts_remaining = dplyr::if_else(Turnover_Ind == 1, + defeam_timeouts_pre, + posteam_timeouts_pre), + defteam_timeouts_remaining = dplyr::if_else(Turnover_Ind == 1, + posteam_timeouts_pre, + defeam_timeouts_pre)) + + # Calculate the airWP: + pass_pbp_data$airWP <- get_preds_wp(pass_pbp_data) + + # Now for plays marked with Turnover_Ind, use 1 - airWP to flip back to the original + # team with possession: + pass_pbp_data$airWP <- ifelse(pass_pbp_data$Turnover_Ind == 1, + 1 - pass_pbp_data$airWP, pass_pbp_data$airWP) + + # For the plays that have TimeSecs_Remaining 0 or less, set airWP to 0: + pass_pbp_data$airWP[which(pass_pbp_data$half_seconds_remaining <= 0)] <- 0 + pass_pbp_data$airWP[which(pass_pbp_data$game_seconds_remaining <= 0)] <- 0 + + # Calculate the airWPA and yacWPA: + pass_pbp_data <- dplyr::mutate(pass_pbp_data, airWPA = airWP - wp, + yacWPA = wpa - airWPA) + + + # If the play is a two-point conversion then change the airWPA to NA since + # no air yards are provided: + pass_pbp_data$airWPA <- with(pass_pbp_data, ifelse(two_point_attempt == 1, + NA, airWPA)) + pass_pbp_data$yacWPA <- with(pass_pbp_data, ifelse(two_point_attempt == 1, + NA, yacWPA)) + + # Check to see if there is any overtime plays, if so then need to calculate + # by essentially taking the same process as the airEP calculation and using + # the resulting probabilities for overtime: + + # First check if there's any overtime plays: + if (any(pass_pbp_data$qtr == 5 | pass_pbp_data$qtr == 6)){ + # Find the rows that are overtime: + pass_overtime_i <- which(pass_pbp_data$qtr == 5 | pass_pbp_data$qtr == 6) + pass_overtime_df <- pass_pbp_data[pass_overtime_i,] + + # Find the rows that are overtime: + + # Need to generate same overtime scenario data as before in the wp function: + # Find the rows that are overtime: + overtime_i <- which(pbp_data$qtr == 5 | pbp_data$qtr == 6) + + overtime_df <- pbp_data[overtime_i,] + + # Separate routine for overtime: + + # Create a column that is just the first drive of overtime repeated: + overtime_df$First_Drive <- rep(min(overtime_df$drive, + na.rm = TRUE), + nrow(overtime_df)) + + # Calculate the difference in drive number + overtime_df <- dplyr::mutate(overtime_df, + Drive_Diff = drive - First_Drive) + + # Create an indicator column that means the posteam is losing by 3 and + # its the second drive of overtime: + overtime_df$One_FG_Game <- ifelse(overtime_df$score_differential == -3 & + overtime_df$Drive_Diff == 1, 1, 0) + + # Now create a copy of the dataset to then make the EP predictions for when + # a field goal is scored and its not sudden death: + overtime_df_ko <- overtime_df + + overtime_df_ko$yardline_100 <- with(overtime_df_ko, + ifelse(game_year < 2016 | + (game_year == 2016 & game_month < 4), + 80, 75)) + + # Now first down: + overtime_df_ko$down1 <- rep(1,nrow(overtime_df_ko)) + overtime_df_ko$down2 <- rep(0,nrow(overtime_df_ko)) + overtime_df_ko$down3 <- rep(0,nrow(overtime_df_ko)) + overtime_df_ko$down4 <- rep(0,nrow(overtime_df_ko)) + # 10 ydstogo: + overtime_df_ko$ydstogo <- rep(10,nrow(overtime_df_ko)) + + # Get the predictions from the EP model and calculate the necessary probability: + if (nrow(overtime_df_ko) > 1) { + overtime_df_ko_preds <- get_preds(overtime_df_ko) + } else{ + overtime_df_ko_preds <- get_preds(overtime_df_ko) + } + + overtime_df_ko_preds <- dplyr::mutate(overtime_df_ko_preds, + Win_Back = No_Score + Opp_Field_Goal + Opp_Safety + Opp_Touchdown) + + # Calculate the two possible win probability types, Sudden Death and one Field Goal: + overtime_df$Sudden_Death_WP <- overtime_df$fg_prob + overtime_df$td_prob + overtime_df$safety_prob + overtime_df$One_FG_WP <- overtime_df$td_prob + (overtime_df$fg_prob * overtime_df_ko_preds$Win_Back) + + # Find all Pass Attempts that are also actual plays in overtime: + overtime_pass_plays_i <- which(overtime_df$play_type == "pass" & + !is.na(overtime_df$air_yards)) + + overtime_pass_df <- overtime_df[overtime_pass_plays_i,] + overtime_df_ko_preds_pass <- overtime_df_ko_preds[overtime_pass_plays_i,] + + # Using the AirYards need to update the following: + # - yardline_100 + # - half_seconds_remaining + # - ydstogo + # - down + + # First rename the old columns to update for calculating the EP from the air: + overtime_pass_df <- dplyr::rename(overtime_pass_df, + old_yrdline100 = yardline_100, + old_ydstogo = ydstogo, + old_TimeSecs_Remaining = half_seconds_remaining, + old_down = down) + + # Create an indicator column for the air yards failing to convert the first down: + overtime_pass_df$Turnover_Ind <- ifelse(overtime_pass_df$old_down == 4 & + overtime_pass_df$air_yards < overtime_pass_df$old_ydstogo, + 1, 0) + # Adjust the field position variables: + overtime_pass_df$yardline_100 <- ifelse(overtime_pass_df$Turnover_Ind == 0, + overtime_pass_df$old_yrdline100 - overtime_pass_df$air_yards, + 100 - (overtime_pass_df$old_yrdline100 - overtime_pass_df$air_yards)) + + overtime_pass_df$ydstogo <- ifelse(overtime_pass_df$air_yards >= overtime_pass_df$old_ydstogo | + overtime_pass_df$Turnover_Ind == 1, + 10, overtime_pass_df$old_ydstogo - overtime_pass_df$air_yards) + + overtime_pass_df$down <- ifelse(overtime_pass_df$air_yards >= overtime_pass_df$old_ydstogo | + overtime_pass_df$Turnover_Ind == 1, + 1, as.numeric(overtime_pass_df$old_down) + 1) + + # Adjust the time with the average incomplete pass time: + overtime_pass_df$half_seconds_remaining <- overtime_pass_df$old_TimeSecs_Remaining - 5.704673 + + overtime_pass_df <- overtime_pass_df %>% + dplyr::mutate( + down1 = dplyr::if_else(down == 1, 1, 0), + down2 = dplyr::if_else(down == 2, 1, 0), + down3 = dplyr::if_else(down == 3, 1, 0), + down4 = dplyr::if_else(down == 4, 1, 0) + ) + + # Get the predictions from the EP model and calculate the necessary probability: + if (nrow(overtime_df_ko) > 1) { + overtime_pass_data_preds <- get_preds(overtime_pass_df) + } else{ + overtime_pass_data_preds <- get_preds(overtime_pass_df) + } + + # For the turnover plays flip the scoring probabilities: + overtime_pass_data_preds <- dplyr::mutate(overtime_pass_data_preds, + old_Opp_Field_Goal = Opp_Field_Goal, + old_Opp_Safety = Opp_Safety, + old_Opp_Touchdown = Opp_Touchdown, + old_Field_Goal = Field_Goal, + old_Safety = Safety, + old_Touchdown = Touchdown) + overtime_pass_data_preds$Opp_Field_Goal <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Field_Goal, + overtime_pass_data_preds$Opp_Field_Goal) + overtime_pass_data_preds$Opp_Safety <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Safety, + overtime_pass_data_preds$Opp_Safety) + overtime_pass_data_preds$Opp_Touchdown <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Touchdown, + overtime_pass_data_preds$Opp_Touchdown) + overtime_pass_data_preds$Field_Goal <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Opp_Field_Goal, + overtime_pass_data_preds$Field_Goal) + overtime_pass_data_preds$Safety <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Opp_Safety, + overtime_pass_data_preds$Safety) + overtime_pass_data_preds$Touchdown <- ifelse(overtime_pass_df$Turnover_Ind == 1, + overtime_pass_data_preds$old_Opp_Touchdown, + overtime_pass_data_preds$Touchdown) + + # Calculate the two possible win probability types, Sudden Death and one Field Goal: + pass_overtime_df$Sudden_Death_airWP <- with(overtime_pass_data_preds, Field_Goal + Touchdown + Safety) + pass_overtime_df$One_FG_airWP <- overtime_pass_data_preds$Touchdown + (overtime_pass_data_preds$Field_Goal*overtime_df_ko_preds_pass$Win_Back) + + # Decide which win probability to use: + pass_overtime_df$airWP <- ifelse(overtime_pass_df$game_year >= 2012 & (overtime_pass_df$Drive_Diff == 0 | (overtime_pass_df$Drive_Diff == 1 & overtime_pass_df$One_FG_Game == 1)), + pass_overtime_df$One_FG_airWP, pass_overtime_df$Sudden_Death_airWP) + + # For the plays that have TimeSecs_Remaining 0 or less, set airWP to 0: + pass_overtime_df$airWP[which(overtime_pass_df$half_seconds_remaining <= 0)] <- 0 + + # Calculate the airWPA and yacWPA: + pass_overtime_df <- dplyr::mutate(pass_overtime_df, airWPA = airWP - wp, + yacWPA = wpa - airWPA) + + # If the play is a two-point conversion then change the airWPA to NA since + # no air yards are provided: + pass_overtime_df$airWPA <- with(pass_overtime_df, ifelse(two_point_attempt == 1, + NA, airWPA)) + pass_overtime_df$yacWPA <- with(pass_overtime_df, ifelse(two_point_attempt == 1, + NA, yacWPA)) + + + pass_overtime_df <- pass_pbp_data[pass_overtime_i,] + + # Now update the overtime rows in the original pass_pbp_data for airWPA and yacWPA: + pass_pbp_data$airWPA[pass_overtime_i] <- pass_overtime_df$airWPA + pass_pbp_data$yacWPA[pass_overtime_i] <- pass_overtime_df$yacWPA + } + + # if Yards after catch is 0 make yacWPA set to 0: + pass_pbp_data$yacWPA <- ifelse(pass_pbp_data$penalty == 0 & pass_pbp_data$yards_after_catch == 0 & + pass_pbp_data$complete_pass == 1, + 0, pass_pbp_data$yacWPA) + # if Yards after catch is 0 make airWPA set to WPA: + pass_pbp_data$airWPA <- ifelse(pass_pbp_data$penalty == 0 & pass_pbp_data$yards_after_catch == 0 & + pass_pbp_data$complete_pass == 1, + pass_pbp_data$wpa, pass_pbp_data$airWPA) + + # Now add airWPA and yacWPA to the original dataset: + pbp_data$airWPA <- NA + pbp_data$yacWPA <- NA + pbp_data$airWPA[pass_plays_i] <- pass_pbp_data$airWPA + pbp_data$yacWPA[pass_plays_i] <- pass_pbp_data$yacWPA + + + # Now change the names to be the right style, calculate the completion form + # of the variables, as well as the cumulative totals and return: + pbp_data %>% + dplyr::rename(air_wpa = airWPA, + yac_wpa = yacWPA) %>% + dplyr::mutate(comp_air_wpa = dplyr::if_else(complete_pass == 1, + air_wpa, 0), + comp_yac_wpa = dplyr::if_else(complete_pass == 1, + yac_wpa, 0), + home_team_comp_air_wpa = dplyr::if_else(posteam == home_team, + comp_air_wpa, -comp_air_wpa), + away_team_comp_air_wpa = dplyr::if_else(posteam == away_team, + comp_air_wpa, -comp_air_wpa), + home_team_comp_yac_wpa = dplyr::if_else(posteam == home_team, + comp_yac_wpa, -comp_yac_wpa), + away_team_comp_yac_wpa = dplyr::if_else(posteam == away_team, + comp_yac_wpa, -comp_yac_wpa), + home_team_comp_air_wpa = dplyr::if_else(is.na(home_team_comp_air_wpa), + 0, home_team_comp_air_wpa), + away_team_comp_air_wpa = dplyr::if_else(is.na(away_team_comp_air_wpa), + 0, away_team_comp_air_wpa), + home_team_comp_yac_wpa = dplyr::if_else(is.na(home_team_comp_yac_wpa), + 0, home_team_comp_yac_wpa), + away_team_comp_yac_wpa = dplyr::if_else(is.na(away_team_comp_yac_wpa), + 0, away_team_comp_yac_wpa), + total_home_comp_air_wpa = cumsum(home_team_comp_air_wpa), + total_away_comp_air_wpa = cumsum(away_team_comp_air_wpa), + total_home_comp_yac_wpa = cumsum(home_team_comp_yac_wpa), + total_away_comp_yac_wpa = cumsum(away_team_comp_yac_wpa), + # Same but for raw - not just completions: + home_team_raw_air_wpa = dplyr::if_else(posteam == home_team, + air_wpa, -air_wpa), + away_team_raw_air_wpa = dplyr::if_else(posteam == away_team, + air_wpa, -air_wpa), + home_team_raw_yac_wpa = dplyr::if_else(posteam == home_team, + yac_wpa, -yac_wpa), + away_team_raw_yac_wpa = dplyr::if_else(posteam == away_team, + yac_wpa, -yac_wpa), + home_team_raw_air_wpa = dplyr::if_else(is.na(home_team_raw_air_wpa), + 0, home_team_raw_air_wpa), + away_team_raw_air_wpa = dplyr::if_else(is.na(away_team_raw_air_wpa), + 0, away_team_raw_air_wpa), + home_team_raw_yac_wpa = dplyr::if_else(is.na(home_team_raw_yac_wpa), + 0, home_team_raw_yac_wpa), + away_team_raw_yac_wpa = dplyr::if_else(is.na(away_team_raw_yac_wpa), + 0, away_team_raw_yac_wpa), + total_home_raw_air_wpa = cumsum(home_team_raw_air_wpa), + total_away_raw_air_wpa = cumsum(away_team_raw_air_wpa), + total_home_raw_yac_wpa = cumsum(home_team_raw_yac_wpa), + total_away_raw_yac_wpa = cumsum(away_team_raw_yac_wpa)) %>% + return + +} diff --git a/R/helper_add_game_data.R b/R/helper_add_game_data.R new file mode 100644 index 00000000..f49fc833 --- /dev/null +++ b/R/helper_add_game_data.R @@ -0,0 +1,54 @@ +################################################################################ +# Author: Ben Baldwin +# Purpose: Function to add Lee Sharpe's game data +# Code Style Guide: styler::tidyverse_style() +################################################################################ + +# Thanks Lee! + +add_game_data <- function(pbp) { + out <- pbp + tryCatch( + expr = { + url <- "https://github.com/leesharpe/nfldata/blob/master/data/games.rds?raw=true" + + request <- httr::HEAD(url) + + if (request$status_code %in% c(404, 500)) { + warning(warn <- 1) + } + + out <- out %>% + dplyr::left_join( + readRDS(url(url)) %>% + dplyr::select( + game_id, away_score, home_score, location, result, total, + spread_line, total_line, div_game, roof, surface, temp, wind, + home_coach, away_coach, stadium, stadium_id, gameday + ) %>% + dplyr::rename(game_stadium = stadium), + by = c("game_id") + ) %>% + dplyr::mutate( + game_date = gameday + ) + + message("added game variables") + }, + error = function(e) { + message("The following error has occured:") + message(e) + }, + warning = function(w) { + if (warn == 1) { + message(glue::glue("Warning: The data hosting servers are down, so we can't add game data in the moment!")) + } else { + message("The following warning has occured:") + message(w) + } + }, + finally = { + } + ) + return(out) +} diff --git a/R/helper_add_nflscrapr_mutations.R b/R/helper_add_nflscrapr_mutations.R index 1fa99543..d517b47d 100644 --- a/R/helper_add_nflscrapr_mutations.R +++ b/R/helper_add_nflscrapr_mutations.R @@ -5,6 +5,10 @@ ################################################################################ add_nflscrapr_mutations <- function(pbp) { + + #testing only + #pbp <- combined + out <- pbp %>% dplyr::mutate(index = 1 : dplyr::n()) %>% # to re-sort after removing duplicates @@ -63,6 +67,9 @@ add_nflscrapr_mutations <- function(pbp) { nchar(yardline) == 0 | is.null(yardline) | yardline == "NULL" | is.na(yardline), dplyr::lag(yardline), yardline ), + yardline_number = dplyr::if_else( + yardline == "MID 50", 50, yardline_number + ), yardline_100 = dplyr::if_else( yardline_side == posteam | yardline == "MID 50", 100 - yardline_number, yardline_number @@ -489,7 +496,50 @@ add_nflscrapr_mutations <- function(pbp) { qtr = quarter ) %>% dplyr::ungroup() %>% - dplyr::mutate(game_id = as.numeric(game_id)) + dplyr::mutate(game_id = as.character(game_id), drive_real_start_time = as.character(drive_real_start_time)) %>% + make_model_mutations() + + message("added nflscrapR variables") return(out) } + + +##some steps to prepare the data for the EP/WP/CP/FG models +make_model_mutations <- function(pbp) { + + pbp <- pbp %>% + dplyr::mutate( + #for EP, CP, and WP model, xgb needs 0/1 for eras + era0 = dplyr::if_else(season <= 2001, 1, 0), + era1 = dplyr::if_else(season > 2001 & season <= 2005, 1, 0), + era2 = dplyr::if_else(season > 2005 & season <= 2013, 1, 0), + era3 = dplyr::if_else(season > 2013 & season <= 2017, 1, 0), + era4 = dplyr::if_else(season > 2017, 1, 0), + #for fg model, an era factor + era = dplyr::case_when( + era0 == 1 ~ 0, + era1 == 1 ~ 1, + era2 == 1 ~ 2, + era3 | era4 == 1 ~ 3 + ), + era = as.factor(era), + #treat playoff games as week 17 as they aren't used for training + model_week = dplyr::if_else(week > 17, as.integer(17), as.integer(week)), + down1 = dplyr::if_else(down == 1, 1, 0), + down2 = dplyr::if_else(down == 2, 1, 0), + down3 = dplyr::if_else(down == 3, 1, 0), + down4 = dplyr::if_else(down == 4, 1, 0), + home = dplyr::if_else(posteam == home_team, 1, 0), + model_roof = dplyr::if_else(is.na(roof) | roof == 'open' | roof == 'closed', as.character('retractable'), as.character(roof)), + model_roof = as.factor(model_roof), + retractable = dplyr::if_else(model_roof == 'retractable', 1, 0), + dome = dplyr::if_else(model_roof == 'dome', 1, 0), + outdoors = dplyr::if_else(model_roof == 'outdoors', 1, 0) + ) + + return(pbp) +} + + + diff --git a/R/helper_add_series_data.R b/R/helper_add_series_data.R index e84f8dc4..e293d147 100644 --- a/R/helper_add_series_data.R +++ b/R/helper_add_series_data.R @@ -56,7 +56,7 @@ add_series_data <- function(pbp) { ) %>% dplyr::group_by(game_id, series) %>% # set series_success value for the whole series - dplyr::mutate(series_success = last(series_success)) %>% + dplyr::mutate(series_success = dplyr::last(series_success)) %>% dplyr::ungroup() %>% dplyr::select(-first_down, -trigger) diff --git a/R/helper_additional_functions.R b/R/helper_additional_functions.R index 48987abf..623da8f2 100644 --- a/R/helper_additional_functions.R +++ b/R/helper_additional_functions.R @@ -1,20 +1,36 @@ ################################################################################ -# Author: Ben Baldwin +# Author: Ben Baldwin, Sebastian Carl # Stlyeguide: styler::tidyverse_style() ################################################################################ #' Clean Play by Play Data #' -#' @param pbp is a dataframe of play-by-play data scraped using \code{\link{fast_scraper}}. +#' @param pbp is a Data frame of play-by-play data scraped using \code{\link{fast_scraper}}. #' @details Build columns that capture what happens on all plays, including -#' penalties, using string extraction from play description. The created 'name' -#' column denotes the dropback player on dropbacks or the rusher on rush attempts. +#' penalties, using string extraction from play description. #' Loosely based on Ben's nflscrapR guide (https://gist.github.com/guga31bb/5634562c5a2a7b1e9961ac9b6c568701) #' but updated to work with the RS data, which has a different player format in #' the play description; e.g. 24-M.Lynch instead of M.Lynch. #' The function also standardizes team abbreviations so that, for example, #' the Chargers are always represented by 'LAC' regardless of which year it was. -#' Also creates a 'play' column denoting 'normal' plays (Ie pass play or run play) +#' @return The input Data Frame of the paramter 'pbp' with the following columns +#' added: +#' \itemize{ +#' \item{success} - Binary indicator wheter epa > 0 in the given play. +#' \item{passer} - Name of the dropback player (scrambles included) including plays with penalties. +#' \item{rusher} - Name of the rusher (no scrambles) including plays with penalties. +#' \item{receiver} - Name of the receiver including plays with penalties. +#' \item{pass} - Binary indicator if the play was a pass play (sacks and scrambles included). +#' \item{rush} - Binary indicator if the play was a rushing play. +#' \item{special} - Binary indicator if the play was a special teams play. +#' \item{first_down} - Binary indicator if the play ended in a first down. +#' \item{play} - Binary indicator: 1 if the play was a 'normal' play (including penalties), 0 otherwise. +#' \item{passer_id} - ID of the player in the 'passer' column (NOTE: ids vary pre and post 2011) +#' \item{rusher_id} - ID of the player in the 'rusher' column (NOTE: ids vary pre and post 2011) +#' \item{receiver_id} - ID of the player in the 'receiver' column (NOTE: ids vary pre and post 2011) +#' \item{name} - Name of the 'passer' if it is not 'NA', or name of the 'rusher' otherwise. +#' \item{id} - ID of the player in the 'name' column. +#' } #' @export clean_pbp <- function(pbp) { message('Cleaning up play-by-play. If you run this with a lot of seasons this could take a few minutes.') @@ -94,8 +110,11 @@ clean_pbp <- function(pbp) { receiver == "F.R" ~ "F.Jones", TRUE ~ receiver ), - name = dplyr::if_else(!is.na(passer), passer, rusher), first_down = dplyr::if_else(first_down_rush == 1 | first_down_pass == 1 | first_down_penalty == 1, 1, 0), + # easy filter: play is 1 if a "special teams" play, or 0 otherwise + # with thanks to Lee Sharpe for the code + special=dplyr::if_else(play_type %in% + c("extra_point","field_goal","kickoff","punt"), 1, 0), # easy filter: play is 1 if a "normal" play (including penalties), or 0 otherwise # with thanks to Lee Sharpe for the code play=dplyr::if_else(!is.na(epa) & !is.na(posteam) & @@ -127,6 +146,10 @@ clean_pbp <- function(pbp) { dplyr::mutate(receiver = dplyr::if_else(is.na(receiver_id), NA_character_, custom_mode(receiver))) %>% dplyr::ungroup() %>% + dplyr::mutate( + name = dplyr::if_else(!is.na(passer), passer, rusher), + id = dplyr::if_else(!is.na(passer_id), passer_id, rusher_id) + ) %>% dplyr::arrange(index) %>% dplyr::select(-index) @@ -178,15 +201,15 @@ custom_mode <- function(x, na.rm = TRUE) { #' Compute QB epa #' -#' @param d is a dataframe of play-by-play data scraped using \code{\link{fast_scraper}}. +#' @param d is a Data frame of play-by-play data scraped using \code{\link{fast_scraper}}. #' @details Add the variable 'qb_epa', which gives QB credit for EPA for up to the point where #' a receiver lost a fumble after a completed catch and makes EPA work more #' like passing yards on plays with fumbles #' @export -fix_fumbles <- function(d) { +add_qb_epa <- function(d) { + fumbles_df <- d %>% dplyr::filter(complete_pass == 1 & fumble_lost == 1 & !is.na(epa)) %>% - dplyr::select(desc, game_id, play_id, epa, posteam, half_seconds_remaining, yardline_100, down, ydstogo, yards_gained, goal_to_go, ep) %>% dplyr::mutate( down = as.numeric(down), # save old stuff for testing/checking @@ -203,16 +226,12 @@ fix_fumbles <- function(d) { ydstogo = dplyr::if_else(change == 1, 10, ydstogo), # flip field for possession change yardline_100 = dplyr::if_else(change == 1, 100 - yardline_100, yardline_100), - goal_to_go = dplyr::if_else(yardline_100 == ydstogo, 1, 0), ep_old = ep ) %>% dplyr::select(-ep, -epa) if (nrow(fumbles_df) > 0) { - new_ep_df <- nflscrapR::calculate_expected_points( - fumbles_df, "half_seconds_remaining", "yardline_100", - "down", "ydstogo", "goal_to_go" - ) %>% + new_ep_df <- calculate_expected_points(fumbles_df) %>% dplyr::mutate(ep = dplyr::if_else(change == 1, -ep, ep), fixed_epa = ep - ep_old) %>% dplyr::select(game_id, play_id, fixed_epa) diff --git a/R/helper_scrape_gc.R b/R/helper_scrape_gc.R index 63ac456f..97065bdd 100644 --- a/R/helper_scrape_gc.R +++ b/R/helper_scrape_gc.R @@ -4,62 +4,60 @@ ################################################################################ # Build a tidy version of scraped gamecenter data -# Data exist since 2009 +# Data exist since 1999 # # @param gameId Specifies the game - - get_pbp_gc <- function(gameId) { combined <- data.frame() tryCatch( expr = { - if (gameId == 2013092206) { + #testing only + #gameId = '2013120812' + #gameId = '2019_01_GB_CHI' + #gameId = '2009_18_NYJ_CIN' + #gameId = '2007_01_ARI_SF' + #gameId = '1999_01_BAL_STL' + + if (gameId %in% c("2000_03_SD_KC", "2000_06_BUF_MIA", "1999_01_BAL_STL")) { warning(warn <- 1) } - if (gameId %in% c(2013112401, 2013120101)) - message( - glue::glue( - "Note: most yardage columns for game ID {as.character(gameId)} are missing. Use the RS scraper instead with source = 'rs'" - ) - ) + season <- as.integer(substr(gameId, 1, 4)) - url = paste0("http://www.nfl.com/liveupdate/game-center/", gameId, "/", - gameId, "_gtd.json") - request <- httr::GET(url) + #postseason games are in here too + url <- glue::glue('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/raw/{season}/{gameId}.rds') + request <- httr::HEAD(url = url) if (request$status_code == 404) { warning(warn <- 3) + } else if (request$status_code == 500) { + warning(warn <- 4) } - date_parse <- stringr::str_extract(paste(gameId), pattern = "[0-9]{8}") + raw <- readRDS(url(url)) + + game_json <- raw %>% purrr::pluck(1) + + date_parse <- names(raw)[1] %>% stringr::str_extract(pattern = "[0-9]{8}") date_year <- stringr::str_sub(date_parse, 1, 4) date_month <- stringr::str_sub(date_parse, 5, 6) date_day <- stringr::str_sub(date_parse, nchar(date_parse) - 1, nchar(date_parse)) - #fix for season maker - if (date_month == '01' | date_month == '02') { - date_season <- as.numeric(date_year) - 1 + week <- as.integer(substr(gameId, 6, 7)) + if (week <= 17) { + season_type <- 'REG' } else { - date_season <- as.numeric(date_year) + season_type <- 'POST' } - if (date_year < 2009) { + if (date_year < 1999) { warning(warn <- 2) } - game_json <- request %>% - httr::content(as = "text", encoding = "UTF-8") %>% - jsonlite::fromJSON(flatten = TRUE) %>% - purrr::pluck(1) - - # message(glue::glue("Scraping gamecenter play by play data for GameID {gameId}...")) - - #excluding last element since it's "crntdrv" and not an actual drives <- game_json$drives[-length(game_json$drives)] @@ -104,26 +102,63 @@ get_pbp_gc <- function(gameId) { sum_play_stats(x, stats = stats) }) - + #drive info + d <- tibble::tibble(drives) %>% + tidyr::unnest_wider(drives) %>% + #dplyr::select(-plays) %>% + tidyr::unnest_wider(start, names_sep="_") %>% + tidyr::unnest_wider(end, names_sep="_") %>% + dplyr::mutate(drive = 1 : dplyr::n()) %>% + dplyr::rename( + drive_play_count = numplays, + drive_time_of_possession = postime, + drive_first_downs = fds, + drive_inside20 = redzone, + drive_quarter_start = start_qtr, + drive_quarter_end = end_qtr, + drive_end_transition = result, + drive_game_clock_start = start_time, + drive_game_clock_end = end_time, + drive_start_yard_line = start_yrdln, + drive_end_yard_line = end_yrdln + ) %>% + dplyr::mutate( + drive_inside20 = dplyr::if_else(drive_inside20, 1, 0), + drive_how_ended_description = drive_end_transition, + drive_ended_with_score = dplyr::if_else(drive_how_ended_description == "Touchdown" | drive_how_ended_description == 'Field Goal', 1, 0), + drive_start_transition = dplyr::lag(drive_how_ended_description, 1), + drive_how_started_description = drive_start_transition + ) %>% + dplyr::select( + drive, drive_play_count, drive_time_of_possession, + drive_first_downs, drive_inside20, drive_ended_with_score, + drive_quarter_start, drive_quarter_end, + drive_end_transition, drive_how_ended_description, + drive_game_clock_start, drive_game_clock_end, + drive_start_yard_line, drive_end_yard_line, + drive_start_transition, drive_how_started_description + ) combined <- plays %>% dplyr::left_join(pbp_stats, by = "play_id") %>% dplyr::mutate_if(is.logical, as.numeric) %>% dplyr::mutate_if(is.integer, as.numeric) %>% - dplyr::select(-players) %>% - dplyr::rename(yardline = yrdln, quarter = qtr, play_description = desc, yards_to_go = ydstogo) %>% - tidyr::unnest(cols = c(sp, quarter, down, time, yardline, yards_to_go, ydsnet, posteam, - play_description, note)) %>% + dplyr::select(-players, -note) %>% + #Weirdly formatted and missing anyway + dplyr::mutate(note = NA_character_) %>% + dplyr::rename(yardline = yrdln, quarter = qtr, play_description = desc, yards_to_go = ydstogo) %>% + tidyr::unnest(cols = c(sp, quarter, down, time, yardline, yards_to_go, ydsnet, posteam, play_description, note)) %>% + dplyr::left_join(d, by = "drive") %>% dplyr::mutate( posteam_id = posteam, game_id = gameId, - game_year = date_year, - game_month = date_month, + game_year = as.integer(date_year), + game_month = as.integer(date_month), game_date = as.Date(paste(date_month, date_day, date_year, sep = "/"), format = "%m/%d/%Y"), - season = date_season, + season = season, #fix up yardline before doing stuff. from nflscrapr yardline = dplyr::if_else(yardline == "50", "MID 50", yardline), @@ -141,9 +176,31 @@ get_pbp_gc <- function(gameId) { (yards_to_go <= 1 & yardline_number == 1)), 1, 0), down = as.double(down), - quarter = as.double(quarter) + quarter = as.double(quarter), + week = week, + season_type = season_type, + #missing from older gc data + drive_real_start_time = NA_character_, + start_time = NA_character_, + stadium = NA_character_, + weather = NA_character_, + nfl_api_id = NA_character_, + play_clock = NA_character_, + play_deleted = NA_real_, + play_type_nfl = NA_character_, + end_clock_time = NA_character_, + end_yard_line = NA_character_ + ) %>% + dplyr::group_by(drive) %>% + dplyr::mutate( + drive_play_id_started = min(play_id, na.rm = T), + drive_play_seq_started = min(play_id, na.rm = T), + drive_play_id_ended = max(play_id, na.rm = T), + drive_play_seq_ended = max(play_id, na.rm = T) + + ) %>% + dplyr::ungroup() - ) }, error = function(e) { @@ -152,11 +209,13 @@ get_pbp_gc <- function(gameId) { }, warning = function(w) { if (warn == 1) { - message(glue::glue("You asked for {gameId}, which is broken. Use the RS scraper instead")) + message(glue::glue("You asked for {gameId}, which is broken. Skipping.")) } else if (warn == 2) { - message(glue::glue("You asked a game from {date_year}, which only goes back to 2009. Use the RS scraper instead with source = 'rs'")) + message(glue::glue("You asked a game from {date_year}, but data only goes back to 1999.")) } else if (warn == 3) { message(glue::glue("Warning: The requested GameID {gameId} is invalid!")) + } else if (warn == 4) { + message(glue::glue("Warning: The data hosting servers are down, please try again later!")) } else { message("The following warning has occured:") message(w) @@ -170,3 +229,6 @@ get_pbp_gc <- function(gameId) { } + + + diff --git a/R/helper_scrape_nfl.R b/R/helper_scrape_nfl.R new file mode 100644 index 00000000..692ebd63 --- /dev/null +++ b/R/helper_scrape_nfl.R @@ -0,0 +1,225 @@ +################################################################################ +# Author: Sebastian Carl, Ben Baldwin +# Purpose: Function for scraping pbp data from the new NFL web site +# Code Style Guide: styler::tidyverse_style() +################################################################################ + +# Build a tidy version of scraped NFL data +# +# @param id Specifies the game + +get_pbp_nfl <- function(id) { + combined <- data.frame() + tryCatch( + expr = { + + #testing + #id = '2019_01_GB_CHI' + #id = '2015_01_CAR_JAX' + #id = '2011_01_NO_GB' + + season <- substr(id, 1, 4) + week <- as.integer(substr(id, 6, 7)) + + path <- "https://github.com/guga31bb/nflfastR-data/raw/master/raw" + + request <- httr::HEAD(glue::glue("{path}/{season}/{id}.rds")) + + if (request$status_code == 404) { + warning(warn <- 1) + } else if (request$status_code == 500) { + warning(warn <- 2) + } + + raw_data <- readRDS(url(glue::glue("{path}/{season}/{id}.rds"))) + + season_type <- dplyr::if_else(week <= 17, "REG", "POST") + + # game_info <- raw_data$data$viewer$gameDetail + + game_id <- raw_data$data$viewer$gameDetail$id + home_team <- raw_data$data$viewer$gameDetail$homeTeam$abbreviation + away_team <- raw_data$data$viewer$gameDetail$visitorTeam$abbreviation + weather <- dplyr::if_else( + is.null(raw_data$data$viewer$gameDetail$weather$shortDescription), + NA_character_, + raw_data$data$viewer$gameDetail$weather$shortDescription + ) + stadium <- dplyr::if_else( + is.null(raw_data$data$viewer$gameDetail$stadium), + NA_character_, + raw_data$data$viewer$gameDetail$stadium + ) + start_time <- raw_data$data$viewer$gameDetail$startTime + + game_info <- data.frame( + game_id, + home_team, + away_team, + weather, + stadium, + start_time + ) %>% + tibble::as_tibble() %>% + dplyr::mutate(game_id = as.character(game_id)) + + plays <- raw_data$data$viewer$gameDetail$plays %>% dplyr::mutate(game_id = as.character(game_id)) + + #fill missing posteam info for this time + if ((home_team == 'JAC' | away_team == 'JAC') & season <= 2015) { + plays <- plays %>% + dplyr::mutate( + possessionTeam.abbreviation = stringr::str_extract(plays$prePlayByPlay, '[A-Z]{2,3}(?=\\s)'), + possessionTeam.abbreviation = dplyr::if_else( + possessionTeam.abbreviation %in% c('OUT', 'END', 'NA'), + NA_character_, possessionTeam.abbreviation + ), + possessionTeam.abbreviation = dplyr::if_else( + possessionTeam.abbreviation == 'JAX', 'JAC', possessionTeam.abbreviation + ) + ) + } + + drives <- raw_data$data$viewer$gameDetail$drives %>% + dplyr::mutate(ydsnet = yards + yardsPenalized) %>% + # these are already in plays + dplyr::select( + -possessionTeam.abbreviation, + -possessionTeam.nickName, + -possessionTeam.franchise.currentLogo.url + ) %>% + janitor::clean_names() + colnames(drives) <- paste0("drive_", colnames(drives)) + + stats <- tidyr::unnest(plays %>% dplyr::select(-yards), cols = c(playStats)) %>% + dplyr::mutate( + yards = as.integer(yards), + statId = as.numeric(statId), + team.abbreviation = as.character(team.abbreviation) + ) %>% + dplyr::rename( + player.esbId = gsisPlayer.id, + player.displayName = playerName, + teamAbbr = team.abbreviation + ) %>% + dplyr::select( + playId, + statId, + yards, + teamAbbr, + player.displayName, + player.esbId + ) + + # if I don't put this here it breaks + suppressWarnings( + pbp_stats <- + purrr::map_df(unique(stats$playId), function(x) { + sum_play_stats(x, stats = stats) + }) %>% + dplyr::mutate(play_id = as.integer(play_id)) + ) + + combined <- game_info %>% + dplyr::left_join(plays %>% dplyr::select(-playStats), by = c("game_id")) %>% + dplyr::left_join(drives, by = c("driveSequenceNumber" = "drive_order_sequence")) %>% + dplyr::left_join(pbp_stats, by = c("playId" = "play_id")) %>% + dplyr::mutate_if(is.logical, as.numeric) %>% + dplyr::mutate_if(is.integer, as.numeric) %>% + dplyr::mutate_if(is.factor, as.character) %>% + janitor::clean_names() %>% + dplyr::select(-drive_play_count, -drive_time_of_possession, -next_play_type) %>% + dplyr::rename( + time = clock_time, + play_type_nfl = play_type, + posteam = possession_team_abbreviation, + yardline = yard_line, + sp = scoring_play, + drive = drive_sequence_number, + nfl_api_id = game_id, + drive_play_count = drive_play_count_2, + drive_time_of_possession = drive_time_of_possession_2, + ydsnet = drive_ydsnet + ) %>% + dplyr::mutate( + posteam_id = posteam, + # have to do all this nonsense to make goal_to_go and yardline_side for compatibility with later functions + yardline_side = purrr::map_chr( + stringr::str_split(yardline, " "), + function(x) x[1] + ), + yardline_number = as.numeric(purrr::map_chr( + stringr::str_split(yardline, " "), + function(x) x[2] + )), + quarter_end = dplyr::if_else(stringr::str_detect(play_description, "END QUARTER"), 1, 0), + game_year = as.integer(season), + season = as.integer(season), + # this is only needed for epa and dropped later + game_month = as.integer(11), + game_id = as.character(glue::glue('{season}_{formatC(week, width=2, flag=\"0\")}_{away_team}_{home_team}')), + play_description = play_description_with_jersey_numbers, + week = week, + season_type = season_type, + play_clock = as.character(play_clock), + st_play_type = as.character(st_play_type), + td_team = dplyr::if_else( + season >= 2011 & season <= 2015 & posteam == 'JAC' & + drive_how_ended_description == 'Touchdown' & !is.na(td_team), + 'JAC', td_team + ), + yardline_side = dplyr::if_else( + season >= 2011 & season <= 2015 & yardline_side == 'JAX', + 'JAC', yardline_side + ) + ) %>% + dplyr::mutate_all(dplyr::na_if, "") + + # combined <- + # combined %>% + # dplyr::select( + # tidyselect::one_of( + # c(pbp_stat_columns, api_cols, save_cols) + # ) + # ) + }, + error = function(e) { + message("The following error has occured:") + message(e) + }, + warning = function(w) { + if (warn == 1) { + message(glue::glue("Warning: The requested GameID {id} is invalid!")) + } else if (warn == 2) { + message(glue::glue("Warning: The data hosting servers are down, please try again later!")) + } else { + message("The following warning has occured:") + message(w) + } + }, + finally = { + } + ) + return(combined) +} + + +# otherwise scraping a lot of seasons breaks +save_cols <- c( + "game_id", "nfl_api_id", "home_team", "away_team", + "season", "game_month", + "game_year", "time", "down", "drive_net_yards", + "drive", "first_down", "goal_to_go", "order_sequence", + "play_description", "play_review_status", + "play_type_nfl", "quarter", "sp", + "scoring_play_type", "special_teams_play", + "time_of_day", + "yardline", "yards", + "yards_to_go", "latest_play", + "posteam", + "scoring_team_id", + "scoring_team_abbreviation", "scoring_team_nick_name", + "ydsnet", "drive_yards_penalized", + "posteam_id", "yardline_side", + "yardline_number", "quarter_end" +) diff --git a/R/helper_scrape_schedule.R b/R/helper_scrape_schedule.R index 90a42d04..4d5c96f9 100644 --- a/R/helper_scrape_schedule.R +++ b/R/helper_scrape_schedule.R @@ -1,6 +1,6 @@ ################################################################################ # Author: Sebastian Carl and Ben Baldwin -# Purpose: Function for scraping season schedule from the NFL RS Feed +# Purpose: Function for scraping games that have been put in github repo # Code Style Guide: styler::tidyverse_style() ################################################################################ @@ -8,146 +8,25 @@ get_season_schedule <- function(season) { season_schedule <- data.frame() tryCatch( expr = { - request <- - httr::GET(url = glue::glue("http://www.nfl.com/feeds-rs/schedules/{season}")) + url <- glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/schedules/sched_{season}.rds") + + request <- httr::HEAD(url = url) if (request$status_code == 404) { + warning(warn <- 2) + } else if (request$status_code == 500) { warning(warn <- 1) } - #detect xml or json - #first do the xml stuff if detected - if (stringr::str_detect(request$headers$`content-type`, 'xml')) { - #message('Detected XML, parsing schedule!') - - raw_data <- request %>% - xml2::read_xml() %>% - xml2::xml_find_all('.//gameSchedule') - - if (length(raw_data) == 0) { - warning(warn <- 2) - } - - #get all the main stuff - attrs <- xml2::xml_attrs(xml2::xml_find_all(raw_data, "//gameSchedule")) - names <- attrs[[1]] %>% names() - sched <- dplyr::bind_cols(attrs) %>% - t() %>% - as.data.frame() %>% - tibble::remove_rownames() - names(sched) <- names - sched <- sched %>% janitor::clean_names() - - #now dig out the site stuff - path = xml2::xml_find_all(raw_data, "//site") - - site_id <- xml2::xml_attr(x = path, attr="siteId") %>% as.data.frame() - site_city <- xml2::xml_attr(x = path, attr="siteCity") %>% as.data.frame() - site_state <- xml2::xml_attr(x = path, attr="siteState") %>% as.data.frame() - site_fullname <- xml2::xml_attr(x = path, attr="siteFullname") %>% as.data.frame() - site_roof_type <- xml2::xml_attr(x = path, attr="roofType") %>% as.data.frame() - - sites <- dplyr::bind_cols( - site_id, - site_city, - site_fullname, - site_state, - site_roof_type - ) - names(sites) <- names(xml2::xml_attrs(path[[1]])) - sites <- sites %>% janitor::clean_names() - - season_schedule <- dplyr::bind_cols(sched, sites) %>% - tibble::as_tibble() %>% - tibble::remove_rownames() + # I know it's bad to call the server twice but couldn't figure out how + # to parse the content in the request variable - season_schedule[] <- lapply(season_schedule, as.character) - season_schedule <- season_schedule %>% - dplyr::mutate( - week = as.integer(week), - game_id = as.integer(game_id), - season = as.integer(season), - # add Lee Sharpe's very useful alt_game_id - alt_game_id = dplyr::if_else( - season_type %in% c("REG", "POST"), - glue::glue("{season}_{formatC(week, width=2, flag=\"0\")}_{visitor_team_abbr}_{home_team_abbr}"), - NA_character_ - ) - ) %>% - dplyr::select( - season, - season_type, - week, - game_id, - alt_game_id, - game_date, - game_time_eastern, - game_time_local, - home_team = home_team_abbr, - away_team = visitor_team_abbr, - home_team_name = home_display_name, - away_team_name = visitor_display_name, - home_nickname, - away_nickname = visitor_nickname, - home_team_id, - away_team_id = visitor_team_id, - game_type, - week_name, - site_city, - site_fullname, - site_state, - site_roof_type = roof_type + season_schedule <- + readRDS( + url( + glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/schedules/sched_{season}.rds") ) - #or do the json stuff if no xml detected - } else { - #message('Detected JSON, parsing schedule!') - raw_data <- request %>% - httr::content(as = "text", encoding = "UTF-8") %>% - jsonlite::fromJSON(flatten = TRUE) - - if (is.null(raw_data %>% purrr::pluck("gameSchedules"))) { - warning(warn <- 2) - } - - season_schedule <- raw_data %>% - purrr::pluck("gameSchedules") %>% - as.data.frame() %>% - janitor::clean_names() %>% - dplyr::mutate( - # add Lee Sharpe's very useful alt_game_id - alt_game_id = dplyr::if_else( - season_type %in% c("REG", "POST"), - glue::glue("{season}_{formatC(week, width=2, flag=\"0\")}_{visitor_team_abbr}_{home_team_abbr}"), - NA_character_ - ) - ) %>% - dplyr::select( - season, - season_type, - week, - game_id, - alt_game_id, - game_date, - game_time_eastern, - game_time_local, - home_team = home_team_abbr, - away_team = visitor_team_abbr, - home_team_name = home_display_name, - away_team_name = visitor_display_name, - home_nickname, - away_nickname = visitor_nickname, - home_team_id, - away_team_id = visitor_team_id, - game_type, - week_name, - site_city = site_site_city, - site_fullname = site_site_fullname, - site_state = site_site_state, - site_roof_type - #network_channel - ) %>% - dplyr::arrange(game_id) - } + ) }, error = function(e) { message("The following error has occured:") @@ -155,7 +34,7 @@ get_season_schedule <- function(season) { }, warning = function(w) { if (warn == 1) { - message(glue::glue("Warning: The requested Season {season} is invalid!")) + message(glue::glue("Warning: The data hosting servers are down, please try again later!")) } else if (warn == 2) { message(glue::glue("Warning: Either the requested season {season} is invalid or no data available at this time!")) } else { diff --git a/R/helper_tidy_play_stats.R b/R/helper_tidy_play_stats.R index b7312dba..c1710b48 100644 --- a/R/helper_tidy_play_stats.R +++ b/R/helper_tidy_play_stats.R @@ -1228,9 +1228,13 @@ sum_play_stats <- function(play_Id, stats) { } else if (play_stats$statId[index] == 113) { row$pass_attempt <- 1 row$complete_pass <- 1 - row$receiver_player_id <- play_stats$player.esbId[index] - row$receiver_player_name <- play_stats$player.displayName[index] - row$yards_after_catch <- play_stats$yards[index] + if (is.na(row$receiver_player_id)) { + row$receiver_player_id <- play_stats$player.esbId[index] + row$receiver_player_name <- play_stats$player.displayName[index] + } + if (is.na(row$yards_after_catch)) { + row$yards_after_catch <- play_stats$yards[index] + } } else if (play_stats$statId[index] == 115) { row$pass_attempt <- 1 row$receiver_player_id <- play_stats$player.esbId[index] diff --git a/R/helper_variable_selector.R b/R/helper_variable_selector.R index 9294b35d..261a899c 100644 --- a/R/helper_variable_selector.R +++ b/R/helper_variable_selector.R @@ -5,38 +5,29 @@ ################################################################################ select_variables <- function(pbp) { - if (!"game_key" %in% colnames(pbp)) { - suppressWarnings( - out <- - pbp %>% - dplyr::select( - tidyselect::one_of( - c(nflscrapr_cols, new_cols) - ) - ) - ) - } else { + suppressWarnings( out <- pbp %>% dplyr::select( tidyselect::one_of( - c(nflscrapr_cols, new_cols, rs_cols) + c(nflscrapr_cols, new_cols, api_cols) ) ) ) - } return(out) } # columns that are not in gamecenter that we created -new_cols <- c("season", "cp", "cpoe") +new_cols <- c("season", "cp", "cpoe", "series", "series_success") # original nflscrapr columns nflscrapr_cols <- c( "play_id", "game_id", "home_team", "away_team", + #added these to new gc scraper + "season_type", "week", "posteam", "posteam_type", "defteam", "side_of_field", "yardline_100", "game_date", "quarter_seconds_remaining", "half_seconds_remaining", "game_seconds_remaining", "game_half", "quarter_end", "drive", "sp", "qtr", "down", "goal_to_go", "time", "yrdln", "ydstogo", "ydsnet", @@ -57,8 +48,10 @@ nflscrapr_cols <- "comp_yac_epa", "total_home_comp_air_epa", "total_away_comp_air_epa", "total_home_comp_yac_epa", "total_away_comp_yac_epa", "total_home_raw_air_epa", "total_away_raw_air_epa", "total_home_raw_yac_epa", - "total_away_raw_yac_epa", "wp", "def_wp", "home_wp", "away_wp", "wpa", "home_wp_post", - "away_wp_post", "total_home_rush_wpa", "total_away_rush_wpa", + "total_away_raw_yac_epa", + "wp", "def_wp", "home_wp", "away_wp", "wpa", "home_wp_post", "away_wp_post", + "vegas_wp", "vegas_home_wp", + "total_home_rush_wpa", "total_away_rush_wpa", "total_home_pass_wpa", "total_away_pass_wpa", "air_wpa", "yac_wpa", "comp_air_wpa", "comp_yac_wpa", "total_home_comp_air_wpa", "total_away_comp_air_wpa", "total_home_comp_yac_wpa", "total_away_comp_yac_wpa", @@ -123,6 +116,33 @@ rs_cols <- c( "drive_time_of_possession", "drive_inside20", "drive_first_downs", "drive_possession_team_abbr", "scoring_team_abbr", "scoring_type", "alert_play_type", "play_type_nfl", "time_of_day", - "yards", "end_yardline_side", "end_yardline_number", "series", "series_success" + "yards", "end_yardline_side", "end_yardline_number" +) + + +# these are columns in the new API that aren't in nflscrapR +api_cols <- c( + "start_time", + "stadium", "weather", "nfl_api_id", + "play_clock", "play_deleted", + "play_type_nfl", + "end_clock_time", "end_yard_line", + "drive_real_start_time", + + "drive_play_count", "drive_time_of_possession", + "drive_first_downs", "drive_inside20", "drive_ended_with_score", + "drive_quarter_start", "drive_quarter_end", + "drive_yards_penalized", + + "drive_start_transition", "drive_end_transition", + + "drive_game_clock_start", "drive_game_clock_end", + "drive_start_yard_line", "drive_end_yard_line", + "drive_play_id_started", "drive_play_id_ended", + "away_score", "home_score", "location", "result", + "total", "spread_line", "total_line", "div_game", "roof", + "surface", "temp", "wind", "home_coach", "away_coach", + "stadium_id", "game_stadium" ) + diff --git a/R/sysdata.rda b/R/sysdata.rda index 53c414e7..85ffac0e 100644 Binary files a/R/sysdata.rda and b/R/sysdata.rda differ diff --git a/R/top-level_scraper.R b/R/top-level_scraper.R index b1c19d24..a14de013 100644 --- a/R/top-level_scraper.R +++ b/R/top-level_scraper.R @@ -6,30 +6,27 @@ #' Get NFL Play by Play Data #' -#' @param game_ids Vector of numeric or character ids (see details for further information) -#' @param source Character - either "rs" or "gc" (see details for further information) +#' @param game_ids Vector of character ids (see details for further information) +#' @param source Character - must now be \code{nfl} or unspecified (see details for further information) #' @param pp Logical - either \code{TRUE} or \code{FALSE} (see details for further information) #' @details To load valid game_ids please use the package function \code{\link{fast_scraper_schedules}}. #' #' The \code{source} parameter controls from which source the data is being -#' scraped. The following sources are available: -#' \itemize{ -#' \item{\code{rs}} - the RS Feed on NFL.com. It is being considered the more complete -#' source (data available back to 2000) but is not able to scrape live games. -#' \item{\code{gc}} - the gamecenter. It is less complete (back to 2009) and includes -#' less variables but is able to scrape live games. -#' } +#' scraped. The old parameters \code{rs} as well as \code{gc} +#' are not valid anymore. Please use \code{nfl} or leave unspecified. #' The \code{pp} parameter controls if the scraper should use parallel processing. #' Please note that the initiating process takes a few seconds which means it #' may be better to set \code{pp = FALSE} if you are scraping just a few games. #' @return Data frame where each individual row represents a single play for -#' all passed game_ids scraped from the choosen source containing the following -#' detailed information (description mostly extracted from nflscrapR): +#' all passed game_ids containing the following +#' detailed information (description partly extracted from nflscrapR): #' \itemize{ #' \item{play_id} - Numeric play id that when used with game_id and drive provides the unique identifier for a single play. #' \item{game_id} - Ten digit identifier for NFL game. #' \item{home_team} - String abbreviation for the home team. #' \item{away_team} - String abbreviation for the away team. +#' \item{season_type} - 'REG' or 'POST' indicating if the game belongs to regular or post season. +#' \item{week} - Season week. #' \item{posteam} - String abbreviation for the team with possession. #' \item{posteam_type} - String indicating whether the posteam team is home or away. #' \item{defteam} - String abbreviation for the team on defense. @@ -120,6 +117,8 @@ #' \item{wpa} - Win probability added (WPA) for the posteam. #' \item{home_wp_post} - Estimated win probability for the home team at the start of the play. #' \item{away_wp_post} - Estimated win probability for the away team at the start of the play. +#' \item{vegas_wp} - Estimated win probabiity for the posteam given the current situation at the start of the given play, incorporating pre-game Vegas line. +#' \item{vegas_home_wp} - Estimated win probability for the home team incorporating pre-game Vegas line. #' \item{total_home_rush_wpa} - Cumulative total rushing WPA for the home team in the game so far. #' \item{total_away_rush_wpa} - Cumulative total rushing WPA for the away team in the game so far. #' \item{total_home_pass_wpa} - Cumulative total passing WPA for the home team in the game so far. @@ -145,8 +144,8 @@ #' \item{fourth_down_converted} - Binary indicator for if the first down was converted on fourth down. #' \item{fourth_down_failed} - Binary indicator for if the posteam failed to convert first down on fourth down. #' \item{incomplete_pass} - Binary indicator for if the pass was incomplete. -#' \item{interception} - Binary indicator for if the pass was intercepted. #' \item{touchback} - Binary indicator for if a touchback occurred on the play. +#' \item{interception} - Binary indicator for if the pass was intercepted. #' \item{punt_inside_twenty} - Binary indicator for if the punt ended inside the twenty yard line. #' \item{punt_in_endzone} - Binary indicator for if the punt was in the endzone. #' \item{punt_out_of_bounds} - Binary indicator for if the punt went out of bounds. @@ -283,71 +282,87 @@ #' \item{defensive_extra_point_attempt} - Binary indicator whether or not the defense was able to have an attempt on an extra point attempt, this results following a blocked attempt that the defense recovers the ball. #' \item{defensive_extra_point_conv} - Binary indicator whether or not the defense successfully scored on an extra point attempt. #' \item{season} - 4 digit number indicating to which season the game belongs to. -#' \item{cp} - Numeric value indicationg the probability for a complete pass based on air yards, field position, down, yards to go, pass location, and season based on comparable game situations. +#' \item{cp} - Numeric value indicating the probability for a complete pass based on comparable game situations. #' \item{cpoe} - For a single pass play this is 1 - cp when the pass was completed or 0 - cp when the pass was incomplete. Analyzed for a whole game or season an indicator for the passer how much over or under expectation his completion percentage was. -#' \item{season_type} - 'REG' or 'POST' indicating if the game belongs to regular or post season. -#' \item{week} - Season week. -#' \item{game_key} - RS feed game identifier. -#' \item{game_time_eastern} - Kickoff time in eastern time zone. -#' \item{game_time_local} - Kickoff time in local time zone. -#' \item{iso_time} - Kickoff time according ISO 8601. -#' \item{game_type} - One of 'REG', 'WC', 'DIV', 'CON', 'SB' indicating if a game was a regular season game or one of the playoff rounds. -#' \item{site_id} - RS feed id for game site. -#' \item{site_city} - Game site city. -#' \item{site_fullname} - Game site name. -#' \item{site_state} - Game site state. -#' \item{roof_type} - Game site roof type. -#' \item{drive_start_time} - Game time at the beginning of a given drive. -#' \item{drive_end_time} - Game time at the end of a given drive. -#' \item{drive_start_yardline} - String indicating where a given drive started consisting of team half and yard line number. -#' \item{drive_end_yardline} - String indicating where a given drive ended consisting of team half and yard line number. -#' \item{drive_how_started} - String indicating how the offense got the ball. -#' \item{drive_how_ended} - String indicating how the offense lost the ball. +#' \item{series} - Starts at 1, each new first down increments, numbers shared across both teams NA: kickoffs, extra point/two point conversion attempts, non-plays, no posteam +#' \item{series_success} - 1: scored touchdown, gained enough yards for first down 0: punt, interception, fumble lost, turnover on downs, FG attempt NA: series is NA, series contains QB spike/kneel +#' \item{start_time} - Kickoff time in eastern time zone. +#' \item{stadium} - Game site name. +#' \item{weather} - String describing the weather including temperature, humidity and wind (direction and speed). Doesn't change during the game! +#' \item{nfl_api_id} - UUID of the game in the new NFL API. +#' \item{play_clock} - Time on the playclock when the ball was snapped. +#' \item{play_deleted} - Binary indicator for deleted plays. +#' \item{play_type_nfl} - Play type as listed in the NFL source. Slightly different to the regular play_type variable. +#' \item{end_clock_time} - Game time at the end of a given play. +#' \item{end_yard_line} - String indicating the yardline at the end of the given play consisting of team half and yard line number. +#' \item{drive_real_start_time} - Local day time when the drive started (currently not used by the NFL and therefore mostly 'NA'). #' \item{drive_play_count} - Numeric value of how many regular plays happened in a given drive. -#' \item{drive_yards_penalized} - Numeric value of how many yards the offense gained or lost through penalties. #' \item{drive_time_of_possession} - Time of possession in a given drive. -#' \item{drive_inside20} - Binary indicator if the offense was able to get inside the opponents 20 yard line. #' \item{drive_first_downs} - Number of forst downs in a given drive. -#' \item{drive_possession_team_abbr} - Abbreviation of the possession team in a given drive. -#' \item{scoring_team_abbr} - Abbreviation of the scoring team if the play was a scoring play. -#' \item{scoring_type} - String indicating the scoring type. One of 'FG', 'TD', 'PAT', 'SFTY', 'PAT2'. -#' \item{alert_play_type} - String describing the play type of a play the NFL has listed as alert play. For most of those plays there are highlight clips available through \code{\link{fast_scraper_clips}}. -#' \item{play_type_nfl} - Play type as listed in the rs feed. Slightly different to the regular play_type variable. -#' \item{time_of_day} - Local time at the beginning of the play. -#' \item{yards} - Analogue yards_gained but with the kicking team being the possession team (which means that there are many yards gained through kickoffs and punts). -#' \item{end_yardline_side} - String indicating the side of the field at the end of the given play. -#' \item{end_yardline_number} - Yardline number within the above given side at the end of the given play. -#' \item{series} - Starts at 1, each new first down increments, numbers shared across both teams. Is NA for: kickoffs, extra point/two point conversion attempts, no posteam. -#' \item{series_success} - 1 when scored touchdown, gained enough yards for first down. 0 when punt, interception, fumble lost, turnover on downs, 4th down FG attempt. NA when series is NA, series contains QB spike/kneel. +#' \item{drive_inside20} - Binary indicator if the offense was able to get inside the opponents 20 yard line. +#' \item{drive_ended_with_score} - Binary indicator the drive ended with a score. +#' \item{drive_quarter_start} - Numeric value indicating in which quarter the given drive has started. +#' \item{drive_quarter_end} - Numeric value indicating in which quarter the given drive has ended. +#' \item{drive_yards_penalized} - Numeric value of how many yards the offense gained or lost through penalties in the given drive. +#' \item{drive_start_transition} - String indicating how the offense got the ball. +#' \item{drive_end_transition} - String indicating how the offense lost the ball. +#' \item{drive_game_clock_start} - Game time at the beginning of a given drive. +#' \item{drive_game_clock_end} - Game time at the end of a given drive. +#' \item{drive_start_yard_line} - String indicating where a given drive started consisting of team half and yard line number. +#' \item{drive_end_yard_line} - String indicating where a given drive ended consisting of team half and yard line number. +#' \item{drive_play_id_started} - Play_id of the first play in the given drive. +#' \item{drive_play_id_ended} - Play_id of the last play in the given drive. +#' \item{away_score} - Total points scored by the away team. +#' \item{home_score} - Total points scored by the home team. +#' \item{location} - Either 'Home' o 'Neutral' indicating if the home team played at home or at a neutral site. +#' \item{result} - Equals home_score - away_score and means the game outcome from the perspective of the home team. +#' \item{total} - Equals home_score + away_score and means the total points scored in the given game. +#' \item{spread_line} - The closing spread line for the game. A positive number means the home team was favored by that many points, a negative number means the away team was favored by that many points. (Source: Pro-Football-Reference) +#' \item{total_line} - The closing total line for the game. (Source: Pro-Football-Reference) +#' \item{div_game} - Binary indicator for if the given game was a division game. +#' \item{roof} - One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) +#' \item{surface} - What type of ground the game was played on. (Source: Pro-Football-Reference) +#' \item{temp} - The temperature at the stadium only for 'roof' = 'outdoors' or 'open'.(Source: Pro-Football-Reference) +#' \item{wind} - The speed of the wind in miles/hour only for 'roof' = 'outdoors' or 'open'. (Source: Pro-Football-Reference) +#' \item{home_coach} - First and last name of the home team coach. (Source: Pro-Football-Reference) +#' \item{away_coach} - First and last name of the away team coach. (Source: Pro-Football-Reference) +#' \item{stadium_id} - ID of the stadium the game was played in. (Source: Pro-Football-Reference) +#' \item{game_stadium} - Name of the stadium the game was played in. (Source: Pro-Football-Reference) #' } #' @export #' @examples -#' # Get pbp data for two 2006 games using the rs feed and parallel processing -#' # game_ids <- c("2006091009", "2006123103") -#' # pbp <- fast_scraper(game_ids, source = "rs", pp = TRUE) -#' -#' # Get pbp data for two 2019 games using gamecenter and no parallel processing -#' # game_ids <- c("2019090804", "2019101700") -#' # pbp <- fast_scraper(game_ids, source = "gc", pp = FALSE) -fast_scraper <- function(game_ids, source = "rs", pp = FALSE) { +#' \dontrun{ +#' # Get pbp data for two games using parallel processing +#' game_ids <- c("2019_01_GB_CHI", "2013_21_SEA_DEN") +#' pbp <- fast_scraper(game_ids, pp = TRUE) +#' } +fast_scraper <- function(game_ids, source = "nfl", pp = FALSE) { # Error handling to correct source type - if (!source %in% c("rs", "gc")) { - stop("Please choose source of 'rs' or 'gc'") - } else if (source == "rs") { - scraper_func <- get_pbp_rs - } else { - scraper_func <- get_pbp_gc + if (source != "nfl") { + stop("You tried to specify a source that isn't the new NFL web page. Please remove source from your request or use source = 'nfl'. The 'source' option will soon be deprecated.") } # No parallel processing demanded -> use purrr if (pp == FALSE) { suppressWarnings({ - pbp <- purrr::map_dfr(game_ids, scraper_func) + progressr::with_progress({ + p <- progressr::progressor(along = game_ids) + pbp <- purrr::map_dfr(game_ids, function(x){ + if (substr(x, 1, 4) < 2011) { + plays <- get_pbp_gc(x) + } else { + plays <- get_pbp_nfl(x) + } + p(sprintf("x=%s", as.character(x))) + return(plays) + }) + }) if(purrr::is_empty(pbp) == FALSE) { message("Download finished. Adding variables...") - pbp <- pbp %>% + pbp <- pbp %>% + add_game_data() %>% add_nflscrapr_mutations() %>% add_ep() %>% add_air_yac_ep() %>% @@ -370,12 +385,24 @@ fast_scraper <- function(game_ids, source = "rs", pp = FALSE) { message(glue::glue("You have passed only {length(game_ids)} GameIDs to parallel processing.\nPlease note that the initiating process takes a few seconds\nand consider using pp=FALSE for a small number of games.")) } suppressWarnings({ - future::plan("multiprocess") - pbp <- furrr::future_map_dfr(game_ids, scraper_func, .progress = TRUE) + progressr::with_progress({ + p <- progressr::progressor(along = game_ids) + future::plan("multiprocess") + pbp <- furrr::future_map_dfr(game_ids, function(x){ + if (substr(x, 1, 4) < 2011) { + plays <- get_pbp_gc(x) + } else { + plays <- get_pbp_nfl(x) + } + p(sprintf("x=%s", as.character(x))) + return(plays) + }) + }) if(purrr::is_empty(pbp) == FALSE) { message("Download finished. Adding variables...") pbp <- pbp %>% + add_game_data() %>% add_nflscrapr_mutations() %>% add_ep() %>% add_air_yac_ep() %>% @@ -402,13 +429,16 @@ fast_scraper <- function(game_ids, source = "rs", pp = FALSE) { #' may be better to set \code{pp = FALSE} if you are scraping just a few games. #' @return Data frame containing game_id, play_id for all plays with available #' highlightclip and the clip url -#' @export +# @export +#' @noRd #' @examples #' #' # Get highlight clips for two 2019 games using parallel processing #' # game_ids <- c("2019090804", "2019101700") #' # clips <- fast_scraper_clips(game_ids, pp = TRUE) fast_scraper_clips <- function(game_ids, pp = FALSE) { + stop("The NFL removed the public available data feed. We are working on a new solution.\n Meanwhile please check https://github.com/guga31bb/nflfastR-data/tree/master/legacy-data for data of the seasons 2000-2019") + scraper_func <- get_pbp_highlights # No parallel processing demanded -> use purrr @@ -489,14 +519,17 @@ fast_scraper_clips <- function(game_ids, pp = FALSE) { #' \item{teamPlayers.profile_url} #' } #' @examples +#' \dontrun{ #' # Roster of Steelers in 2018, no parallel processing -#' # rosters <- fast_scraper_roster("3900", 2018, pp = FALSE) +#' rosters <- fast_scraper_roster("3900", 2018, pp = FALSE) #' #' # Roster of Steelers and Seahawks in 2016 & 2019 using parallel processing -#' # rosters <- fast_scraper_roster(c("3900", "4600"), c("2016", "2019"), pp = TRUE) -#' @export -#' +#' rosters <- fast_scraper_roster(c("3900", "4600"), c("2016", "2019"), pp = TRUE) +#' } +# @export +#' @noRd fast_scraper_roster <- function(team_ids, seasons, pp = FALSE) { + stop("The NFL removed the public available data feed. We are working on a new solution.\n Meanwhile please check https://github.com/guga31bb/nflfastR-data/tree/master/roster-data for data of the seasons 2000-2019") # No parallel processing demanded -> use purrr if (pp == FALSE) { @@ -542,49 +575,52 @@ fast_scraper_roster <- function(team_ids, seasons, pp = FALSE) { #' #' @param seasons Vector of numeric or character 4 digit seasons #' @param pp Logical - either \code{TRUE} or \code{FALSE} (see details for further information) -#' @details The \code{pp} parameter controls if the scraper should use parallel processing. +#' @details This functions now incorporates the games file provided and maintained +#' by Lee Sharpe. +#' +#' The \code{pp} parameter controls if the scraper should use parallel processing. #' Please note that the initiating process takes a few seconds which means it #' may be better to set \code{pp = FALSE} if you are scraping less than 10 seasons. -#' @return Data frame containing the follwoing detailed game information: +#' @return Data frame containing the following detailed game information: #' \itemize{ +#' \item{game_id} - Character identifier including season, week, away team and home team #' \item{season} - 4 digit season year. -#' \item{season_type} - Either 'PRE', 'REG', 'POST', 'PRO'. -#' \item{week} - Numeric week number. -#' \item{game_id} - Unique game identifier. -#' \item{alt_game_id} - Alternative and much more intuitive identifier introduced by Lee Sharpe and set to \code{NA} for Hall of Fame Week, Preseason and Pro Bowl -#' \item{game_date} - Game date in format dd/mm/yyyy. -#' \item{game_time_eastern} - Kickoff time in eastern time zone. -#' \item{game_time_local} - Kickoff time in local time zone. -#' \item{iso_time} - Kickoff time according ISO 8601. -#' \item{home_team} - Home team abbreviation -#' \item{away_team} - Away team abbreviation -#' \item{home_team_name} - Home team full name -#' \item{away_team_name} - Away team full name -#' \item{home_nickname} - Home team nick name -#' \item{away_nickname} - Away team nick name -#' \item{home_team_id} - Home team id (can be used with the package function \code{\link{fast_scraper_roster}}). -#' \item{away_team_id} - Away team id (can be used with the package function \code{\link{fast_scraper_roster}}). #' \item{game_type} - One of 'REG', 'WC', 'DIV', 'CON', 'SB' indicating if a game was a regular season game or one of the playoff rounds. -#' \item{week_name} - Full description of week -#' \item{site_city} - Game site city. -#' \item{site_fullname} - Game site name. -#' \item{site_state} - Game site state. -#' \item{site_roof_type} - Game site roof type. -#' \item{network_channel} - Name of broadcasting network channel. +#' \item{week} - Numeric week number. +#' \item{gameday} - Game date in format yyyy/mm/dd. +#' \item{weekday} - The day of the week on which the game occcured. +#' \item{gametime} - The kickoff time of the game. This is represented in 24-hour time and the Eastern time zone, regardless of what time zone the game was being played in. +#' \item{away_team} - Away team abbreviation. +#' \item{home_team} - Home team abbreviation. +#' \item{away_score} - The number of points the away team scored. Is 'NA' for games which haven't yet been played. +#' \item{home_score} - The number of points the home team scored. Is 'NA' for games which haven't yet been played. +#' \item{home_result} - Equals home_score - away_score and means the game outcome from the perspective of the home team. +#' \item{stadium} - Name of the stadium the game was or will be played in. (Source: Pro-Football-Reference) +#' \item{location} - Either 'Home' o 'Neutral' indicating if the home team played at home or at a neutral site. +#' \item{roof} - One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) +#' \item{surface} - What type of ground the game was played on. (Source: Pro-Football-Reference) +#' \item{old_game_id} - Unique game identifier of the old NFL API. #' } #' @export #' @examples -#' +#'\dontrun{ #' # Get schedules for the whole 2015 - 2018 seasons -#' # seasons <- 2015:2018 -#' # schedules <- fast_scraper_schedules(seasons) +#' seasons <- 2015:2018 +#' schedules <- fast_scraper_schedules(seasons) +#' } fast_scraper_schedules <- function(seasons, pp = FALSE) { - scraper_func <- get_season_schedule # No parallel processing demanded -> use purrr if (pp == FALSE) { suppressWarnings( - schedules <- purrr::map_dfr(seasons, scraper_func) + progressr::with_progress({ + p <- progressr::progressor(along = seasons) + schedules <- purrr::map_dfr(seasons, function(x){ + sched <- get_season_schedule(x) + p(sprintf("x=%s", as.integer(x))) + return(sched) + }) + }) ) } @@ -598,8 +634,15 @@ fast_scraper_schedules <- function(seasons, pp = FALSE) { message(glue::glue("You have passed only {length(seasons)} season(s) to parallel processing.\nPlease note that the initiating process takes a few seconds\nand consider using pp=FALSE for a small number of seasons.")) } suppressWarnings({ - future::plan("multiprocess") - schedules <- furrr::future_map_dfr(seasons, scraper_func, .progress = TRUE) + progressr::with_progress({ + p <- progressr::progressor(along = seasons) + future::plan("multiprocess") + schedules <- furrr::future_map_dfr(seasons, function(x){ + sched <- get_season_schedule(x) + p(sprintf("x=%s", as.integer(x))) + return(sched) + }) + }) }) } return(schedules) diff --git a/R/utils-pipe.R b/R/utils-pipe.R index e79f3d80..75b26405 100644 --- a/R/utils-pipe.R +++ b/R/utils-pipe.R @@ -5,7 +5,8 @@ #' @name %>% #' @rdname pipe #' @keywords internal -#' @export +# @export #' @importFrom magrittr %>% #' @usage lhs \%>\% rhs +#' @noRd NULL diff --git a/README.Rmd b/README.Rmd index de787e3f..a3316441 100644 --- a/README.Rmd +++ b/README.Rmd @@ -22,14 +22,14 @@ knitr::opts_chunk$set( -`nflfastR` is a set of functions to efficiently scrape NFL play-by-play and roster data. `nflfastR` expands upon the features of nflscrapR: +`nflfastR` is a set of functions to efficiently scrape NFL play-by-play data. `nflfastR` expands upon the features of nflscrapR: -* By incorporating the NFL's RS feed, the package currently supports full play-by-play back to 2000 -* As suggested by the package name, it scrapes games **much** faster +* The package contains NFL play-by-play data back to 1999 +* As suggested by the package name, it obtains games **much** faster * Includes completion probability (`cp`) and completion percentage over expected (`cpoe`) in play-by-play going back to 2006 -* The default RS feed includes drive information, including drive starting position and drive result -* Includes fast functions for roster and highlight scraping -* Hosts [a repository of play-by-play data going back to 2000](https://github.com/guga31bb/nflfastR-data) for very quick access +* Includes drive information, including drive starting position and drive result +* Includes series information, including series number and series success +* Hosts [a repository of play-by-play data going back to 1999](https://github.com/guga31bb/nflfastR-data) for very quick access We owe a debt of gratitude to the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions and inspiration this package would not exist. @@ -42,52 +42,46 @@ You can load and install nflfastR from [GitHub](https://github.com/) with: # If 'devtools' isn't installed run # install.packages("devtools") -# If 'nflscrapR' isn't installed run -# devtools::install_github("maksimhorowitz/nflscrapR") devtools::install_github("mrcaseb/nflfastR") ``` ## Usage -### Example 1: replicate `nflscrapR` with `fast_scraper` +``` {r load, warning = FALSE, message = FALSE} +library(nflfastR) +library(tidyverse) +``` -The functionality of `nflscrapR` can be duplicated by using `fast_scraper` with the 'gc' (for Gamecenter) option specified. This scrapes from the same source as `nflscrapR` but much more quickly. +### Example 1: replicate `nflscrapR` with `fast_scraper` -Reasons to use the `source = "gc"` option include (a) duplicating the output of `nflscrapR` or (b) when scraping a live or recently-completed game: Gamecenter updates live and the RS feed does not. For scraping old seasons, we recommend not specifying a source option and letting the scraper default to the RS feed (see Example 2 below). +The functionality of `nflscrapR` can be duplicated by using `fast_scraper` This obtains the same information contained in `nflscrapR` (plus some extra) but much more quickly. To compare to `nflscrapR`, we use their data repository as the program no longer functions now that the NFL has taken down the old Gamecenter feed. Note that EP differs from nflscrapR as we use a newer era-adjusted model (more on this below). This example also uses the built-in function `clean_pbp` to create a "name' column for the primary player involved (the QB on pass play or ball-carrier on run play). ``` {r ex1-nflscrapR, warning = FALSE, message = FALSE} -library(nflfastR) -library(tidyverse) -library(nflscrapR) - -gameId <- 2019111100 -nflscrapR::scrape_json_play_by_play(gameId) %>% - select(desc, play_type, epa, home_wp) %>% head(5) %>% +read_csv(url('https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true')) %>% + filter(home_team == 'SF' & away_team == 'SEA') %>% + select(desc, play_type, ep, epa, home_wp) %>% head(5) %>% knitr::kable(digits = 3) ``` ``` {r ex1-fs, warning = FALSE, message = FALSE} #The 'gc' option specifies scraping gamecenter like nflscrapR does, as opposed to 'rs' -fast_scraper(gameId, source = "gc") %>% +fast_scraper('2019_10_SEA_SF') %>% clean_pbp() %>% - select(desc, play_type, epa, home_wp, name) %>% head(5) %>% + select(desc, play_type, ep, epa, home_wp, name) %>% head(6) %>% knitr::kable(digits = 3) ``` ### Example 2: scrape a batch of games very quickly with `fast_scraper` and parallel processing -This is a demonstration of `nflfastR`'s capabilities. While `nflfastR` can scrape a batch of games very quickly, **please be respectful of the NFL's servers and use the [data repository](https://github.com/guga31bb/nflfastR-data) which hosts all the scraped and cleaned data**. The only reason to ever actually use the scraper is if it's in the middle of the season and we haven't updated the repository with recent games (but we will try to keep it updated). +This is a demonstration of `nflfastR`'s capabilities. While `nflfastR` can scrape a batch of games very quickly, **please be respectful of Github's servers and use the [data repository](https://github.com/guga31bb/nflfastR-data) which hosts all the scraped and cleaned data** whenever possible. The only reason to ever actually use the scraper is if it's in the middle of the season and we haven't updated the repository with recent games (but we will try to keep it updated). ``` {r ex2-bigscrape, warning = FALSE, message = FALSE} #get list of some games from 2019 -games_2019 <- fast_scraper_schedules(2019) %>% filter(season_type == 'REG') %>% head(3) %>% pull(game_id) +games_2019 <- fast_scraper_schedules(2019) %>% head(10) %>% pull(game_id) tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:')) f <- fast_scraper(games_2019, pp = TRUE) tictoc::toc() -tictoc::tic(glue::glue('{length(games_2019)} games with nflscrapR:')) -n <- map_df(games_2019, nflscrapR::scrape_json_play_by_play) -tictoc::toc() ``` ### Example 3: completion percentage over expected (CPOE) @@ -108,15 +102,15 @@ games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>% ``` ### Example 4: using drive information -When scraping from the default RS feed, drive results are automatically included. Let's look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2006. +When scraping from the default RS feed, drive results are automatically included. Let's look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000. ``` {r ex4, warning = FALSE, message = FALSE} games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds')) games_2015 <-readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds')) -pbp <- rbind(games_2000, games_2015) +pbp <- bind_rows(games_2000, games_2015) pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>% - mutate(drive_score = if_else(drive_how_ended %in% c("Touchdown", "Field_Goal"), 1, 0)) %>% + mutate(drive_score = if_else(drive_end_transition %in% c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL"), 1, 0)) %>% group_by(season) %>% summarize(drive_score = mean(drive_score)) %>% knitr::kable(digits = 3) @@ -124,28 +118,11 @@ pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 = So about 23% of 1st & 10 plays from teams' own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has implications for EPA models (see below). -### Example 5: scrape rosters with `fast_scraper_roster` -``` {r ex5, warning = FALSE, message = FALSE} -# Roster of Steelers and Seahawks in 2016 & 2019 using parallel processing -# teams_colors_logos is included in the package -team_ids <- teams_colors_logos %>% filter(team_abbr %in% c("SEA", "PIT")) %>% pull(team_id) -fast_scraper_roster(team_ids, c("2016", "2019"), pp = TRUE) %>% - select(2,9:13) %>% head() %>% - knitr::kable() -``` - -### Example 6: scrape highlight clips with `fast_scraper_clips` -``` {r ex6, warning = FALSE, message = FALSE} -#use same week 1 games from above -vids <- fast_scraper_clips(games_2019) -vids %>% select(highlight_video_url) %>% head(2) %>% knitr::kable() -``` - -### Example 7: Plot offensive and defensive EPA per play for a given season +### Example 5: Plot offensive and defensive EPA per play for a given season Let's build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The logo urls of the espn logos are integrated into the 'team_colors_logos' data frame which is delivered with the package. Let's also use the included helper function `clean_pbp`, which creates "rush" and "pass" columns that (a) properly count sacks and scrambles as pass plays and (b) properly include plays with penalties. Using this, we can keep only rush or pass plays. -```{r ex7, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600} +```{r ex5, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600} library(ggimage) pbp <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds')) %>% filter(season_type == 'REG') %>% filter(!is.na(posteam) & (rush == 1 | pass == 1)) @@ -164,7 +141,7 @@ offense %>% labs( x = "Offense EPA/play", y = "Defense EPA/play", - caption = "Data: @nflfastR | EPA model: @nflscrapR", + caption = "Data: @nflfastR", title = "2005 NFL Offensive and Defensive EPA per Play" ) + theme_bw() + @@ -175,59 +152,42 @@ offense %>% scale_y_reverse() ``` -### Example 8: Working with roster and position data -The `clean_pbp()` function does a lot of work cleaning up player names and IDs for the purpose of joining them to roster data, and has already been performed on the data hosted in the repository. We recommend doing the join on `passer_id`, `rusher_id`, or `receiver_id`. Below we used `receiver_id` to figure out the position of targeted players, and then measure the top 5 at each position in terms of total EPA gained on pass targets. -``` {r ex8-position, warning = FALSE, message = FALSE} -#this file contains all roster data from 1999 through present -rosters <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/roster-data/roster.rds')) %>% - select(team.season, teamPlayers.gsisId, teamPlayers.displayName, team.abbr, teamPlayers.position) %>% - as_tibble() - -pbp <- - bind_rows( - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2017.rds')), - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2018.rds')), - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds')) - ) %>% - filter(season_type == 'REG') %>% - filter(!is.na(posteam) & !is.na(epa) & (rush == 1 | pass == 1)) - -joined <- pbp %>% - filter(!is.na(receiver_id)) %>% - select(posteam, season, desc, receiver, receiver_id, epa) %>% - left_join(rosters, by = c('receiver_id' = 'teamPlayers.gsisId', 'season' = 'team.season')) - -#the real work is done, this just makes a table and has it look nice -joined %>% - filter(teamPlayers.position %in% c('WR', 'TE', 'RB')) %>% - group_by(receiver_id, receiver, teamPlayers.position) %>% - summarize(tot_epa = sum(epa), n=n()) %>% - arrange(-tot_epa) %>% - ungroup() %>% - group_by(teamPlayers.position) %>% - mutate(position_rank = 1:n()) %>% - filter(position_rank <= 5) %>% - dplyr::rename(Pos_Rank = position_rank, Player = receiver, Pos = teamPlayers.position, Tgt = n, EPA = tot_epa) %>% - select(Player, Pos, Pos_Rank, Tgt, EPA) %>% - knitr::kable(digits = 0) -``` +### Example 6: Working with roster and position data -Not surprisingly, all 5 of the top 5 WRs and 4 of the top 5 TEs in terms of EPA added come in ahead of the top RB. Note that the number of targets won't match official stats because we're including plays with penalties. +This section used to contain an example of working with roster data. Unfortunately, we have not found a way to obtain roster data that can be joined to the new play by play, so for now, it is empty. We would like to be able to get position data but haven't yet. -## More information +The `clean_pbp()` function does a lot of work cleaning up player names and IDs for the purpose of joining them to roster data, but we do not have any roster data to join to. **Note that player IDs are inconsistent between the old (1999-2010) and new (2011 - present) data sources so use IDs with caution**. Unfortunately there is nothing we can do about this as the NFL changed their system for IDs in the underlying data. -`nflfastR` scrapes NFL Gamecenter or RS feeds, defaulting to the RS feed. **Live games are only available from Gamecenter (we think) so when scraping ongoing or recent games, use `source = 'gc'`**. Columns that exist in both GC and RS are consistent across the two scrapers (e.g., player_id, play_id, etc.) but there are some columns in RS that do not exist in GC (drive_how_ended, roof_type, game_time_eastern, etc.). +## `nflfastR` models -`nflfastR` uses the Expected Points and Win Probability models developed by the `nflscrapR` team and provided by the `nflscrapR` package. For a description of the models, please see the paper [here](https://arxiv.org/pdf/1802.00998.pdf). When using EP or WP from this package, please cite `nflscrapR` as it is their work behind the models (see the example in the caption of the figure above). Because these models were trained on more recent seasons, they should be used with caution for games in the early 2000s (note the means being not centered at zero in the figure above). If you would like to help us extend the EPA model to work better in the early 2000s, we are very open to contributions from others. +`nflfastR` uses its own models for Expected Points, Win Probability, and Completion Percentage. To read about the models, [please see here](https://github.com/mrcaseb/nflfastR/blob/master/data-raw/MODEL-README.md). For a more detailed description of Expected Points models, we highly recommend this paper [from the nflscrapR team located here](https://arxiv.org/pdf/1802.00998.pdf). -## Data repository +`nflfastR` includes two win probability models: one with and one without incorporating the pre-game spread. -Even though `nflfastR` is very fast, **for completed seasons we recommend downloading the data from [here](https://github.com/guga31bb/nflfastR-data) as in Examples 4 and 7 above**. These data sets include play-by-play data of complete seasons going back to 2000 and we will update them in 2020 once the season starts. The files contain both regular season and postseason data, and one can use game_type or week to figure out which games occurred in the postseason. Data are available as either .csv or .rds, but if you're using R, the .rds files are much smaller and thus faster to download. +``` {r wp, warning = FALSE, message = FALSE} +fast_scraper('2019_03_NYJ_NE') %>% + select(desc, spread_line, home_wp, vegas_home_wp) %>% + head(5) %>% + knitr::kable(digits = 3) -`fast_scraper` can also scrape the 1999 season. However, several games of the 1999 season are missing play-by-play data completely. `nflfastR` will point this out when trying to scrape this season and specify the missing games. +``` + +Because the Patriots were favored by 20.5 points in this Week 3 Jets-Patriots game, the naive (no-spread) and spread win probabilities differ dramatically. In the no-spread model, `home_wp` begins at 0.56 (greater than 0.50 because home team advantage is included in the model) and `vegas_home_wp` is 0.858. We have also included `vegas_wp` which is the win probability of the possession team, incorporating the spread. + +## Data repository + +Even though `nflfastR` is very fast, **for historical games we recommend downloading the data from [here](https://github.com/guga31bb/nflfastR-data) as in Examples 4 and 5 above**. These data sets include play-by-play data of complete seasons going back to 1999 and we will update them in 2020 once the season starts. The files contain both regular season and postseason data, and one can use game_type or week to figure out which games occurred in the postseason. Data are available as .csv.gz, .parquet, or .rds. ## About `nflfastR` was developed by [Sebastian Carl](https://twitter.com/mrcaseb) and [Ben Baldwin](https://twitter.com/benbbaldwin). -Special thanks to [Florian Schmitt](https://twitter.com/Flosch1006) for the logo design! +## Special thanks + +* To [Nick Shoemaker](https://twitter.com/WeightRoomShoe) for [finding and making available JSON-formatted NFL play-by-play back to 1999](https://github.com/CroppedClamp/nfl_pbps) (`nflfastR` uses this source for 1999-2010) +* To [Lee Sharpe](https://twitter.com/LeeSharpeNFL) for curating a resource for game information +* To [Florian Schmitt](https://twitter.com/Flosch1006) for the logo design +* To [Zach Feldman](https://twitter.com/ZachFeldman3) for developing the CP model and recommending xgboost to us (which now drives the EP and WP models as well) +* To [Peter Owen](https://twitter.com/JSmoovesBrekkie) for [many helpful suggestions for the CP model](https://twitter.com/JSmoovesBrekkie/status/1268885950626623490) +* The many users who found and reported bugs in `nflfastR` 1.0 +* And of course, the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, whose work represented a dramatic step forward for the state of public NFL research diff --git a/README.md b/README.md index 06c2a608..ce093b9f 100644 --- a/README.md +++ b/README.md @@ -12,18 +12,15 @@ nflfastR (CPOE)](#example-3-completion-percentage-over-expected-cpoe) - [Example 4: using drive information](#example-4-using-drive-information) - - [Example 5: scrape rosters with - `fast_scraper_roster`](#example-5-scrape-rosters-with-fast_scraper_roster) - - [Example 6: scrape highlight clips with - `fast_scraper_clips`](#example-6-scrape-highlight-clips-with-fast_scraper_clips) - - [Example 7: Plot offensive and defensive EPA per play for a + - [Example 5: Plot offensive and defensive EPA per play for a given - season](#example-7-plot-offensive-and-defensive-epa-per-play-for-a-given-season) - - [Example 8: Working with roster and position - data](#example-8-working-with-roster-and-position-data) - - [More information](#more-information) + season](#example-5-plot-offensive-and-defensive-epa-per-play-for-a-given-season) + - [Example 6: Working with roster and position + data](#example-6-working-with-roster-and-position-data) + - [`nflfastR` models](#nflfastr-models) - [Data repository](#data-repository) - [About](#about) + - [Special thanks](#special-thanks) @@ -32,18 +29,18 @@ nflfastR `nflfastR` is a set of functions to efficiently scrape NFL play-by-play -and roster data. `nflfastR` expands upon the features of nflscrapR: +data. `nflfastR` expands upon the features of nflscrapR: - - By incorporating the NFL’s RS feed, the package currently supports - full play-by-play back to 2000 - - As suggested by the package name, it scrapes games **much** faster + - The package contains NFL play-by-play data back to 1999 + - As suggested by the package name, it obtains games **much** faster - Includes completion probability (`cp`) and completion percentage over expected (`cpoe`) in play-by-play going back to 2006 - - The default RS feed includes drive information, including drive - starting position and drive result - - Includes fast functions for roster and highlight scraping + - Includes drive information, including drive starting position and + drive result + - Includes series information, including series number and series + success - Hosts [a repository of play-by-play data going back - to 2000](https://github.com/guga31bb/nflfastR-data) for very quick + to 1999](https://github.com/guga31bb/nflfastR-data) for very quick access We owe a debt of gratitude to the original @@ -60,87 +57,81 @@ with: # If 'devtools' isn't installed run # install.packages("devtools") -# If 'nflscrapR' isn't installed run -# devtools::install_github("maksimhorowitz/nflscrapR") devtools::install_github("mrcaseb/nflfastR") ``` ## Usage +``` r +library(nflfastR) +library(tidyverse) +``` + ### Example 1: replicate `nflscrapR` with `fast_scraper` The functionality of `nflscrapR` can be duplicated by using -`fast_scraper` with the ‘gc’ (for Gamecenter) option specified. This -scrapes from the same source as `nflscrapR` but much more quickly. - -Reasons to use the `source = "gc"` option include (a) duplicating the -output of `nflscrapR` or (b) when scraping a live or recently-completed -game: Gamecenter updates live and the RS feed does not. For scraping old -seasons, we recommend not specifying a source option and letting the -scraper default to the RS feed (see Example 2 below). +`fast_scraper` This obtains the same information contained in +`nflscrapR` (plus some extra) but much more quickly. To compare to +`nflscrapR`, we use their data repository as the program no longer +functions now that the NFL has taken down the old Gamecenter feed. Note +that EP differs from nflscrapR as we use a newer era-adjusted model +(more on this below). This example also uses the built-in function `clean_pbp` to create a "name’ column for the primary player involved (the QB on pass play or ball-carrier on run play). ``` r -library(nflfastR) -library(tidyverse) -library(nflscrapR) - -gameId <- 2019111100 -nflscrapR::scrape_json_play_by_play(gameId) %>% - select(desc, play_type, epa, home_wp) %>% head(5) %>% +read_csv(url('https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true')) %>% + filter(home_team == 'SF' & away_team == 'SEA') %>% + select(desc, play_type, ep, epa, home_wp) %>% head(5) %>% knitr::kable(digits = 3) ``` -| desc | play\_type | epa | home\_wp | -| :------------------------------------------------------------------------------------------------------------------ | :--------- | ------: | -------: | -| J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.000 | NA | -| (15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | \-0.606 | 0.500 | -| (14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | \-1.146 | 0.485 | -| (13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | 3.223 | 0.453 | -| (12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no\_play | 0.774 | 0.551 | +| desc | play\_type | ep | epa | home\_wp | +| :------------------------------------------------------------------------------------------------------------------ | :--------- | ------: | ------: | -------: | +| J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.815 | 0.000 | NA | +| (15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | 0.815 | \-0.606 | 0.500 | +| (14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | 0.209 | \-1.146 | 0.485 | +| (13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | \-0.937 | 3.223 | 0.453 | +| (12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no\_play | 2.286 | 0.774 | 0.551 | ``` r #The 'gc' option specifies scraping gamecenter like nflscrapR does, as opposed to 'rs' -fast_scraper(gameId, source = "gc") %>% +fast_scraper('2019_10_SEA_SF') %>% clean_pbp() %>% - select(desc, play_type, epa, home_wp, name) %>% head(5) %>% + select(desc, play_type, ep, epa, home_wp, name) %>% head(6) %>% knitr::kable(digits = 3) ``` -| desc | play\_type | epa | home\_wp | name | -| :------------------------------------------------------------------------------------------------------------------ | :--------- | ------: | -------: | :---------- | -| J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.000 | NA | NA | -| (15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | \-0.606 | 0.500 | T.Coleman | -| (14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | \-1.146 | 0.485 | T.Coleman | -| (13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | 3.223 | 0.453 | J.Garoppolo | -| (12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no\_play | 0.774 | 0.551 | NA | +| desc | play\_type | ep | epa | home\_wp | name | +| :--------------------------------------------------------------------------------------------------------------------------- | :--------- | ----: | ------: | -------: | :---------- | +| GAME | NA | NA | NA | NA | NA | +| 5-J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 1.483 | 0.000 | 0.560 | NA | +| (15:00) 26-T.Coleman left guard to SF 26 for 1 yard (90-J.Clowney). | run | 1.483 | \-0.581 | 0.560 | T.Coleman | +| (14:19) 26-T.Coleman right tackle to SF 25 for -1 yards (97-P.Ford). | run | 0.903 | \-0.810 | 0.543 | T.Coleman | +| (13:45) (Shotgun) 10-J.Garoppolo pass short middle to 84-K.Bourne to SF 41 for 16 yards (24-J.Taylor). Caught at SF39. 2-yac | pass | 0.092 | 2.349 | 0.506 | J.Garoppolo | +| (12:58) PENALTY on SEA-91-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no\_play | 2.441 | 0.820 | 0.591 | NA | ### Example 2: scrape a batch of games very quickly with `fast_scraper` and parallel processing This is a demonstration of `nflfastR`’s capabilities. While `nflfastR` -can scrape a batch of games very quickly, **please be respectful of the -NFL’s servers and use the [data +can scrape a batch of games very quickly, **please be respectful of +Github’s servers and use the [data repository](https://github.com/guga31bb/nflfastR-data) which hosts all -the scraped and cleaned data**. The only reason to ever actually use the -scraper is if it’s in the middle of the season and we haven’t updated -the repository with recent games (but we will try to keep it updated). +the scraped and cleaned data** whenever possible. The only reason to +ever actually use the scraper is if it’s in the middle of the season and +we haven’t updated the repository with recent games (but we will try to +keep it updated). ``` r #get list of some games from 2019 -games_2019 <- fast_scraper_schedules(2019) %>% filter(season_type == 'REG') %>% head(3) %>% pull(game_id) +games_2019 <- fast_scraper_schedules(2019) %>% head(10) %>% pull(game_id) tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:')) f <- fast_scraper(games_2019, pp = TRUE) -#> Progress: ------------------------------------------------------------------------------------------------ 100% Progress: ------------------------------------------------------------------------------------------------ 100% -tictoc::toc() -#> 3 games with nflfastR:: 8.64 sec elapsed -tictoc::tic(glue::glue('{length(games_2019)} games with nflscrapR:')) -n <- map_df(games_2019, nflscrapR::scrape_json_play_by_play) tictoc::toc() -#> 3 games with nflscrapR:: 90.06 sec elapsed +#> 10 games with nflfastR:: 12.35 sec elapsed ``` ### Example 3: completion percentage over expected (CPOE) @@ -155,7 +146,7 @@ reads .rds files, but .csv is also available). tictoc::tic('loading all games from 2009') games_2009 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2009.rds')) %>% filter(season_type == 'REG') tictoc::toc() -#> loading all games from 2009: 3.2 sec elapsed +#> loading all games from 2009: 2.65 sec elapsed games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>% summarize(cpoe = mean(cpoe), Atts=n()) %>% filter(Atts > 200) %>% @@ -166,28 +157,27 @@ games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>% | passer\_player\_name | cpoe | Atts | | :------------------- | ---: | ---: | -| D.Brees | 9.3 | 509 | -| P.Manning | 7.4 | 569 | -| B.Favre | 6.6 | 526 | -| P.Rivers | 6.2 | 475 | -| B.Roethlisberger | 5.7 | 503 | +| D.Brees | 7.4 | 509 | +| P.Rivers | 6.5 | 474 | +| P.Manning | 6.3 | 569 | +| B.Favre | 6.1 | 527 | +| B.Roethlisberger | 5.4 | 503 | ### Example 4: using drive information When scraping from the default RS feed, drive results are automatically included. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year -before touchbacks on kickoffs changed to the 25) than in -2006. +before touchbacks on kickoffs changed to the 25) than in 2000. ``` r games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds')) games_2015 <-readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds')) -pbp <- rbind(games_2000, games_2015) +pbp <- bind_rows(games_2000, games_2015) pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>% - mutate(drive_score = if_else(drive_how_ended %in% c("Touchdown", "Field_Goal"), 1, 0)) %>% + mutate(drive_score = if_else(drive_end_transition %in% c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL"), 1, 0)) %>% group_by(season) %>% summarize(drive_score = mean(drive_score)) %>% knitr::kable(digits = 3) @@ -195,48 +185,14 @@ pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 = | season | drive\_score | | -----: | -----------: | -| 2000 | 0.233 | +| 2000 | 0.234 | | 2015 | 0.305 | So about 23% of 1st & 10 plays from teams’ own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has -implications for EPA models (see -below). - -### Example 5: scrape rosters with `fast_scraper_roster` - -``` r -# Roster of Steelers and Seahawks in 2016 & 2019 using parallel processing -# teams_colors_logos is included in the package -team_ids <- teams_colors_logos %>% filter(team_abbr %in% c("SEA", "PIT")) %>% pull(team_id) -fast_scraper_roster(team_ids, c("2016", "2019"), pp = TRUE) %>% - select(2,9:13) %>% head() %>% - knitr::kable() -``` - -| teamPlayers.displayName | teamPlayers.position | teamPlayers.nflId | teamPlayers.esbId | teamPlayers.gsisId | teamPlayers.birthDate | -| :---------------------- | :------------------- | ----------------: | :---------------- | :----------------- | :-------------------- | -| Shamarko Thomas | SS | 2539937 | THO379701 | 00-0030412 | 02/23/1991 | -| Sean Davis | SS | 2555386 | DAV746549 | 00-0033053 | 10/23/1993 | -| Javon Hargrave | NT | 2555239 | HAR143881 | 00-0033109 | 02/07/1993 | -| Mike Hilton | DB | 2556559 | HIL796239 | 00-0032521 | 03/09/1994 | -| Shaquille Riddick | LB | 2552584 | RID186261 | 00-0032111 | 03/12/1993 | -| Ricardo Mathews | DE | 1037901 | MAT188704 | 00-0027829 | 07/30/1987 | - -### Example 6: scrape highlight clips with `fast_scraper_clips` - -``` r -#use same week 1 games from above -vids <- fast_scraper_clips(games_2019) -vids %>% select(highlight_video_url) %>% head(2) %>% knitr::kable() -``` - -| highlight\_video\_url | -| :--------------------------------------------------------------------------------------------------------------------------------- | -| | -| | +implications for EPA models (see below). -### Example 7: Plot offensive and defensive EPA per play for a given season +### Example 5: Plot offensive and defensive EPA per play for a given season Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The logo urls of the @@ -267,7 +223,7 @@ offense %>% labs( x = "Offense EPA/play", y = "Defense EPA/play", - caption = "Data: @nflfastR | EPA model: @nflscrapR", + caption = "Data: @nflfastR", title = "2005 NFL Offensive and Defensive EPA per Play" ) + theme_bw() + @@ -278,113 +234,66 @@ offense %>% scale_y_reverse() ``` - + -### Example 8: Working with roster and position data +### Example 6: Working with roster and position data + +This section used to contain an example of working with roster data. +Unfortunately, we have not found a way to obtain roster data that can be +joined to the new play by play, so for now, it is empty. We would like +to be able to get position data but haven’t yet. The `clean_pbp()` function does a lot of work cleaning up player names -and IDs for the purpose of joining them to roster data, and has already -been performed on the data hosted in the repository. We recommend doing -the join on `passer_id`, `rusher_id`, or `receiver_id`. Below we used -`receiver_id` to figure out the position of targeted players, and then -measure the top 5 at each position in terms of total EPA gained on pass -targets. +and IDs for the purpose of joining them to roster data, but we do not +have any roster data to join to. **Note that player IDs are inconsistent +between the old (1999-2010) and new (2011 - present) data sources so use +IDs with caution**. Unfortunately there is nothing we can do about this +as the NFL changed their system for IDs in the underlying data. + +## `nflfastR` models + +`nflfastR` uses its own models for Expected Points, Win Probability, and +Completion Percentage. To read about the models, [please see +here](https://github.com/mrcaseb/nflfastR/blob/master/data-raw/MODEL-README.md). +For a more detailed description of Expected Points models, we highly +recommend this paper [from the nflscrapR team located +here](https://arxiv.org/pdf/1802.00998.pdf). + +`nflfastR` includes two win probability models: one with and one without +incorporating the pre-game spread. ``` r -#this file contains all roster data from 1999 through present -rosters <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/roster-data/roster.rds')) %>% - select(team.season, teamPlayers.gsisId, teamPlayers.displayName, team.abbr, teamPlayers.position) %>% - as_tibble() - -pbp <- - bind_rows( - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2017.rds')), - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2018.rds')), - readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds')) - ) %>% - filter(season_type == 'REG') %>% - filter(!is.na(posteam) & !is.na(epa) & (rush == 1 | pass == 1)) - -joined <- pbp %>% - filter(!is.na(receiver_id)) %>% - select(posteam, season, desc, receiver, receiver_id, epa) %>% - left_join(rosters, by = c('receiver_id' = 'teamPlayers.gsisId', 'season' = 'team.season')) - -#the real work is done, this just makes a table and has it look nice -joined %>% - filter(teamPlayers.position %in% c('WR', 'TE', 'RB')) %>% - group_by(receiver_id, receiver, teamPlayers.position) %>% - summarize(tot_epa = sum(epa), n=n()) %>% - arrange(-tot_epa) %>% - ungroup() %>% - group_by(teamPlayers.position) %>% - mutate(position_rank = 1:n()) %>% - filter(position_rank <= 5) %>% - dplyr::rename(Pos_Rank = position_rank, Player = receiver, Pos = teamPlayers.position, Tgt = n, EPA = tot_epa) %>% - select(Player, Pos, Pos_Rank, Tgt, EPA) %>% - knitr::kable(digits = 0) +fast_scraper('2019_03_NYJ_NE') %>% + select(desc, spread_line, home_wp, vegas_home_wp) %>% + head(5) %>% + knitr::kable(digits = 3) ``` -| Player | Pos | Pos\_Rank | Tgt | EPA | -| :----------- | :-- | --------: | --: | --: | -| M.Thomas | WR | 1 | 505 | 291 | -| D.Adams | WR | 2 | 435 | 207 | -| D.Hopkins | WR | 3 | 535 | 203 | -| K.Allen | WR | 4 | 477 | 198 | -| T.Kelce | TE | 1 | 438 | 198 | -| T.Hill | WR | 5 | 341 | 183 | -| G.Kittle | TE | 2 | 327 | 140 | -| R.Gronkowski | TE | 3 | 196 | 121 | -| Z.Ertz | TE | 4 | 413 | 109 | -| C.McCaffrey | RB | 1 | 390 | 106 | -| J.Cook | TE | 5 | 265 | 99 | -| A.Kamara | RB | 2 | 320 | 75 | -| J.White | RB | 3 | 298 | 70 | -| A.Ekeler | RB | 4 | 199 | 67 | -| K.Hunt | RB | 5 | 149 | 60 | - -Not surprisingly, all 5 of the top 5 WRs and 4 of the top 5 TEs in terms -of EPA added come in ahead of the top RB. Note that the number of -targets won’t match official stats because we’re including plays with -penalties. - -## More information - -`nflfastR` scrapes NFL Gamecenter or RS feeds, defaulting to the RS -feed. **Live games are only available from Gamecenter (we think) so when -scraping ongoing or recent games, use `source = 'gc'`**. Columns that -exist in both GC and RS are consistent across the two scrapers (e.g., -player\_id, play\_id, etc.) but there are some columns in RS that do not -exist in GC (drive\_how\_ended, roof\_type, game\_time\_eastern, etc.). - -`nflfastR` uses the Expected Points and Win Probability models developed -by the `nflscrapR` team and provided by the `nflscrapR` package. For a -description of the models, please see the paper -[here](https://arxiv.org/pdf/1802.00998.pdf). When using EP or WP from -this package, please cite `nflscrapR` as it is their work behind the -models (see the example in the caption of the figure above). Because -these models were trained on more recent seasons, they should be used -with caution for games in the early 2000s (note the means being not -centered at zero in the figure above). If you would like to help us -extend the EPA model to work better in the early 2000s, we are very open -to contributions from others. +| desc | spread\_line | home\_wp | vegas\_home\_wp | +| :----------------------------------------------------------------------------------------------- | -----------: | -------: | --------------: | +| GAME | 20.5 | NA | NA | +| 3-S.Gostkowski kicks 65 yards from NE 35 to end zone, Touchback. | 20.5 | 0.563 | 0.858 | +| (15:00) 26-L.Bell up the middle to NYJ 32 for 7 yards (71-D.Shelton; 93-L.Guy). | 20.5 | 0.563 | 0.858 | +| (14:29) 8-L.Falk pass short right to 82-J.Crowder pushed ob at NYJ 47 for 15 yards (31-J.Jones). | 20.5 | 0.552 | 0.876 | +| (13:59) 26-L.Bell up the middle to NYJ 48 for 1 yard (54-D.Hightower, 55-J.Simon). | 20.5 | 0.516 | 0.780 | + +Because the Patriots were favored by 20.5 points in this Week 3 +Jets-Patriots game, the naive (no-spread) and spread win probabilities +differ dramatically. In the no-spread model, `home_wp` begins at 0.56 +(greater than 0.50 because home team advantage is included in the model) +and `vegas_home_wp` is 0.858. We have also included `vegas_wp` which is +the win probability of the possession team, incorporating the spread. ## Data repository -Even though `nflfastR` is very fast, **for completed seasons we -recommend downloading the data from -[here](https://github.com/guga31bb/nflfastR-data) as in Examples 4 and 7 +Even though `nflfastR` is very fast, **for historical games we recommend +downloading the data from +[here](https://github.com/guga31bb/nflfastR-data) as in Examples 4 and 5 above**. These data sets include play-by-play data of complete seasons -going back to 2000 and we will update them in 2020 once the season +going back to 1999 and we will update them in 2020 once the season starts. The files contain both regular season and postseason data, and one can use game\_type or week to figure out which games occurred in the -postseason. Data are available as either .csv or .rds, but if you’re -using R, the .rds files are much smaller and thus faster to download. - -`fast_scraper` can also scrape the 1999 season. However, several games -of the 1999 season are missing play-by-play data completely. `nflfastR` -will point this out when trying to scrape this season and specify the -missing games. +postseason. Data are available as .csv.gz, .parquet, or .rds. ## About @@ -392,5 +301,25 @@ missing games. Carl](https://twitter.com/mrcaseb) and [Ben Baldwin](https://twitter.com/benbbaldwin). -Special thanks to [Florian Schmitt](https://twitter.com/Flosch1006) for -the logo design\! +## Special thanks + + - To [Nick Shoemaker](https://twitter.com/WeightRoomShoe) for [finding + and making available JSON-formatted NFL play-by-play back + to 1999](https://github.com/CroppedClamp/nfl_pbps) (`nflfastR` uses + this source for 1999-2010) + - To [Lee Sharpe](https://twitter.com/LeeSharpeNFL) for curating a + resource for game information + - To [Florian Schmitt](https://twitter.com/Flosch1006) for the logo + design + - To [Zach Feldman](https://twitter.com/ZachFeldman3) for developing + the CP model and recommending xgboost to us (which now drives the EP + and WP models as well) + - To [Peter Owen](https://twitter.com/JSmoovesBrekkie) for [many + helpful suggestions for the CP + model](https://twitter.com/JSmoovesBrekkie/status/1268885950626623490) + - The many users who found and reported bugs in `nflfastR` 1.0 + - And of course, the original + [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, + Maksim Horowitz, Ronald Yurko, and Samuel Ventura, whose work + represented a dramatic step forward for the state of public NFL + research diff --git a/data-raw/CP.R b/data-raw/CP.R deleted file mode 100644 index 4029af08..00000000 --- a/data-raw/CP.R +++ /dev/null @@ -1,30 +0,0 @@ -# calculate cp model - -#library(nflfastR) -#library(tidyverse) -seasons <- 2006:2019 -pbp_data <- purrr::map_df(seasons, function(x) { - readRDS( - url( - glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds") - ) - ) -}) %>% - dplyr::mutate(receiver_player_name = - stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?")) - - # valid pass play: at least -15 air yards, less than 70 air yards, has intended receiver, has pass location - passes <- pbp_data %>% - filter(complete_pass == 1 | incomplete_pass == 1 | interception == 1) %>% - filter(!is.na(air_yards) & air_yards >= -15 & air_yards <70 & !is.na(receiver_player_name) & !is.na(pass_location)) %>% - mutate(air_is_zero=ifelse(air_yards == 0,1,0)) - - # estimate CP - cp_models <- gam(complete_pass ~ s(air_yards) + s(yardline_100) +log(ydstogo) + air_is_zero + - factor(down) + factor(pass_location) + factor(season), - data=passes, method="REML", family = "binomial") - - - -# Save the models using this code. It will generate the file R/sysdata.rda -usethis::use_data(cp_models, internal = TRUE, overwrite = TRUE) diff --git a/data-raw/EP_functions.R b/data-raw/EP_functions.R new file mode 100644 index 00000000..9b44d3c7 --- /dev/null +++ b/data-raw/EP_functions.R @@ -0,0 +1,165 @@ + +## Helper functions + +#All of these are heavily borrowed from nflscrapR (Maksim Horowitz, Ronald Yurko, and Samuel Ventura) + +#original code +#https://github.com/ryurko/nflscrapR-models/blob/master/R/init_models/init_ep_fg_models.R + +######################################################################## +#### helper function for building dataset to estimate EPA model ######## +######################################################################## +#ben is reasonably confident that this works and is finished +#this is only needed to estimate the EP model, not actually adding EP to data +find_game_next_score_half <- function(pbp_dataset) { + + # Which rows are the scoring plays: + score_plays <- which(pbp_dataset$sp == 1 & pbp_dataset$play_type != "no_play") + + # Define a helper function that takes in the current play index, + # a vector of the scoring play indices, play-by-play data, + # and returns the score type and drive number for the next score: + find_next_score <- function(play_i, score_plays_i,pbp_df) { + + # Find the next score index for the current play + # based on being the first next score index: + next_score_i <- score_plays_i[which(score_plays_i >= play_i)[1]] + + # If next_score_i is NA (no more scores after current play) + # or if the next score is in another half, + # then return No_Score and the current drive number + if (is.na(next_score_i) | + (pbp_df$qtr[play_i] %in% c(1, 2) & pbp_df$qtr[next_score_i] %in% c(3, 4, 5)) | + (pbp_df$qtr[play_i] %in% c(3, 4) & pbp_df$qtr[next_score_i] == 5)) { + + score_type <- "No_Score" + + # Make it the current play index + score_drive <- pbp_df$drive[play_i] + + # Else return the observed next score type and drive number: + } else { + + # Store the score_drive number + score_drive <- pbp_df$drive[next_score_i] + + # Then check the play types to decide what to return + # based on several types of cases for the next score: + + # 1: Return TD + if (pbp_df$touchdown[next_score_i] == 1 & (pbp_df$td_team[next_score_i] != pbp_df$posteam[next_score_i])) { + + # For return touchdowns the current posteam would not have + # possession at the time of return, so it's flipped: + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Opp_Touchdown" + + } else { + + score_type <- "Touchdown" + + } + } else if (identical(pbp_df$field_goal_result[next_score_i], "made")) { + + # 2: Field Goal + # Current posteam made FG + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Field_Goal" + + # Opponent made FG + } else { + + score_type <- "Opp_Field_Goal" + + } + + # 3: Touchdown (returns already counted for) + } else if (pbp_df$touchdown[next_score_i] == 1) { + + # Current posteam TD + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Touchdown" + + # Opponent TD + } else { + + score_type <- "Opp_Touchdown" + + } + # 4: Safety (similar to returns) + } else if (pbp_df$safety[next_score_i] == 1) { + + if (identical(pbp_df$posteam[play_i],pbp_df$posteam[next_score_i])) { + + score_type <- "Opp_Safety" + + } else { + + score_type <- "Safety" + + } + # 5: Extra Points + } else if (identical(pbp_df$extra_point_result[next_score_i], "good")) { + + # Current posteam Extra Point + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Extra_Point" + + # Opponent Extra Point + } else { + + score_type <- "Opp_Extra_Point" + + } + # 6: Two Point Conversions + } else if (identical(pbp_df$two_point_conv_result[next_score_i], "success")) { + + # Current posteam Two Point Conversion + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Two_Point_Conversion" + + # Opponent Two Point Conversion + } else { + + score_type <- "Opp_Two_Point_Conversion" + + } + + # 7: Defensive Two Point (like returns) + } else if (identical(pbp_df$defensive_two_point_conv[next_score_i], 1)) { + + if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) { + + score_type <- "Opp_Defensive_Two_Point" + + } else { + + score_type <- "Defensive_Two_Point" + + } + + # 8: Errors of some sort so return NA (but shouldn't take place) + } else { + + score_type <- NA + + } + } + + return(data.frame(Next_Score_Half = score_type, + Drive_Score_Half = score_drive)) + } + + # Using lapply and then bind_rows is much faster than + # using map_dfr() here: + lapply(c(1:nrow(pbp_dataset)), find_next_score, + score_plays_i = score_plays, pbp_df = pbp_dataset) %>% + bind_rows() %>% + return +} + diff --git a/data-raw/MODEL-README.Rmd b/data-raw/MODEL-README.Rmd new file mode 100644 index 00000000..79ba100a --- /dev/null +++ b/data-raw/MODEL-README.Rmd @@ -0,0 +1,716 @@ +--- +title: "nflfastR EP, WP, and CP models" +author: Ben Baldwin +output: + github_document: + toc: true + toc_depth: 3 +--- + +```{r opts, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.path = "man/figures/README-", + out.width = "100%" +) +``` + +## About + +This page describes the nflfastR Expected Points (EP), Win Probability (WP), and Completion Percentage (CP) models before showing that they are well calibrated using the procedure [introduced by Yurko, Ventura, and Horowitz](https://arxiv.org/pdf/1802.00998.pdf). Because the 2020 season will mark 22 seasons of nflfastR data, the main purpose behind creating new models for EP and WP was to build in era adjustments to fascilitate better cross-era comparisons. However, we also discovered that switching to tree-based methods could improve model calibration, especially for end-of-half situations with complicated nonlinear interactions between variables. Because we are introducing new models, we compare our calibation results to nflscrapR to show that these new models are somewhat better calibrated. If they weren't, there would be no point in updating the models! + +nflfastR switching from the nflscrapR EP and WP models to its own model should not be thought of as a criticism of nflscrapR: the improvements are relatively minor and nflscrapR provided the code base to perform much of this analysis, breaking new ground in the process. + +## Model features + +The EP, WP, and CP models are trained using xgboost, which uses training data to create [decision trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html). + +**EP model features** + +* Seconds remaining in half +* Yard line +* Whether possession team is at home +* Roof type: retractable, dome, or outdoors +* Down +* Yards to go +* Era: 1999-2001 (pre-expansion), 2002-2005 (pre-CPOE), 2006-2013 (pre-LOB rules change), 2014-2017, 2018 and beyond +* Week (the model is trained on regular season only, so for playoff games, we impute week = 17) +* Timeouts remaining for each team + +**WP model features** + +* Seconds remaining in half +* Seconds remaining in game +* Yard line +* Expected Points +* Score differential +* Ratio of expected score differential (expected points + point differential) to time remaining (feature borrowed from nflscrapR) +* Down +* Yards to go +* Timeouts remaining for each team +* Whether team will receive 2nd half kickoff +* Whether possession team is at home +* [Model with Vegas line only: point spread * log(3600 / (50 + (seconds elapsed in game))] + +**CP model features** + +* Yard line +* Whether possession team is at home +* Roof type: retractable, dome, or outdoors +* Down +* Yards to go +* Distance to sticks (air yards - yards to go) +* Era: 2006-2013, 2014-2017, 2018 and beyond (note that air yards data only go back to 2006, so there is no CP for earlier years) +* Week (the model is trained on regular season only, so for playoff games, we impute week = 17) +* Air yards +* Whether air yards is 0 (probably unnecessary with tree-based method and a relic from earlier models where it was included because completion percentage is much lower for 0 air yard passes) +* Pass location (binary: middle or not middle) +* Whether quarterback was hit on the play + +## EP Model Calibration Results + +The goal of this section is to show that the nflfastR EP model is well calibrated. To measure calibration, we follow Yurko et al. and perform leave-one-season-out (LOSO) calibration. In particular, for each of the 20 available seasons (2000-2019), we exclude one season, train the EP model on the other 19 seasons, and then compare the model's predictions in the holdout season to what actually happened in that season. If the model is well calibrated, we would expect that, for example, 50 percent of plays with a touchdown probability of 50 percent prior to the play would have the next score be a touchdown for the possession team. + +Let's start with some setup. The file used here isn't pushed because it's large, but its creation can [be seen here](https://github.com/mrcaseb/fastscraper/blob/ben-upstream/data-raw/MODELS.R). +```{r setup, results = 'hide', message = FALSE} +set.seed(2013) #GoHawks +library(tidyverse) +library(xgboost) + +#some helper files are in these +source('../R/helper_add_nflscrapr_mutations.R') +source('../R/helper_add_ep_wp.R') +source('../R/helper_add_cp_cpoe.R') + +pbp_data <- readRDS('cal_data.rds') +model_data <- pbp_data %>% + #in '../R/helper_add_nflscrapr_mutations.R' + make_model_mutations() %>% + mutate( + label = case_when( + Next_Score_Half == "Touchdown" ~ 0, + Next_Score_Half == "Opp_Touchdown" ~ 1, + Next_Score_Half == "Field_Goal" ~ 2, + Next_Score_Half == "Opp_Field_Goal" ~ 3, + Next_Score_Half == "Safety" ~ 4, + Next_Score_Half == "Opp_Safety" ~ 5, + Next_Score_Half == "No_Score" ~ 6 + ), + label = as.factor(label), + #use nflscrapR weights + Drive_Score_Dist = Drive_Score_Half - drive, + Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) / + (max(Drive_Score_Dist) - min(Drive_Score_Dist)), + ScoreDiff_W = (max(abs(score_differential), na.rm=T) - abs(score_differential)) / + (max(abs(score_differential), na.rm=T) - min(abs(score_differential), na.rm=T)), + Total_W = Drive_Score_Dist_W + ScoreDiff_W, + Total_W_Scaled = (Total_W - min(Total_W, na.rm=T)) / + (max(Total_W, na.rm=T) - min(Total_W, na.rm=T)) + ) %>% + filter( + !is.na(defteam_timeouts_remaining), !is.na(posteam_timeouts_remaining), + !is.na(yardline_100) + ) %>% + select( + label, + season, + half_seconds_remaining, + yardline_100, + home, + retractable, + dome, + outdoors, + ydstogo, + era0, era1, era2, era3, era4, + down1, down2, down3, down4, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + model_week, + Total_W_Scaled + ) + +#idk why this is all necessary for xgb but it is +model_data <- model_data %>% + mutate(label = as.numeric(label), + label = label - 1) + +rm(pbp_data) + +seasons <- unique(model_data$season) +``` + +Input the stuff we'll need to fit the model. The parameters were obtained from cross-validation, where each season was forced to be entirely contained in a given CV fold to prevent leakage in labels from one fold to another (for example, if a given drive were split up between folds). +```{r params} +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "multi:softprob", + eval_metric = c("mlogloss"), + num_class = 7, + eta = 0.2, + gamma = .2, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = .9 + ) +``` + +Now do the LOSO model fitting. +```{r loso_fit, results = 'hide'} +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -Total_W_Scaled)), + label = train_data$label, weight = train_data$Total_W_Scaled) + ep_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(ep_model, as.matrix(test_data %>% select(-label, -Total_W_Scaled))), ncol=7, byrow=TRUE) + ) + colnames(preds) <- c("Touchdown","Opp_Touchdown","Field_Goal","Opp_Field_Goal", + "Safety","Opp_Safety","No_Score") + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#get the BINS for the calibration plot +plot <- cv_results %>% + select(Touchdown, Opp_Touchdown, Field_Goal, Opp_Field_Goal, Safety, Opp_Safety, No_Score, label) %>% + pivot_longer(-label, names_to = 'type', values_to = 'pred_prob') %>% + mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>% + mutate(outcome = case_when( + label == 0 ~ "Touchdown", + label == 1 ~ "Opp_Touchdown", + label == 2 ~ "Field_Goal", + label == 3 ~ "Opp_Field_Goal", + label == 4 ~ "Safety", + label == 5 ~ "Opp_Safety", + label == 6 ~ "No_Score" + )) %>% + group_by(type, bin_pred_prob) %>% + mutate(correct = if_else(outcome == type, 1, 0)) %>% + summarize(n_plays = n(), + n_outcome = sum(correct), + bin_actual_prob = n_outcome / n_plays) +``` + +Here is the EP calibration plot. Points close to the diagonal dotted line are consistent with a well-calibrated model: + +```{r plot, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 700} +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + next_score_type = factor("No Score (0)")) +plot %>% + #about .75M plays in total + #filter(n_plays >= 50) %>% + ungroup() %>% + mutate(type = fct_relevel(type, + "Opp_Safety", "Opp_Field_Goal", + "Opp_Touchdown", "No_Score", "Safety", + "Field_Goal", "Touchdown" + ), + type = fct_recode(type, + "-Field Goal (-3)" = "Opp_Field_Goal", + "-Safety (-2)" = "Opp_Safety", + "-Touchdown (-7)" = "Opp_Touchdown", + "Field Goal (3)" = "Field_Goal", + "No Score (0)" = "No_Score", + "Touchdown (7)" = "Touchdown", + "Safety (2)" = "Safety")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated next score probability", + y = "Observed next score probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = c(1, .05), legend.justification = c(1, 0)) + + facet_wrap(~ type, ncol = 4) +``` + +There is some weirdness with the opponent safety predictions, but these dots represent an *extremely* small number of plays (10-50 plays out of about 750,000). + +Now let's get the calibration error using the measure developed in Yurko et al., and compare it to nflscrapR. First we need to get the nflscrapR predictions, which we have saved from the previous version of nflfastR which applied the nflscrapR models. + +```{r cal} +#calibration error +cv_cal_error <- plot %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(type) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_scoring_event = sum(n_outcome, na.rm = TRUE)) + +pbp_data <- readRDS('cal_data_nflscrapr.rds') +#nflscrapr calibration error +nflscrapr <- pbp_data %>% + select(td_prob, opp_td_prob, fg_prob, opp_fg_prob, safety_prob, opp_safety_prob, no_score_prob, Next_Score_Half) %>% + pivot_longer(-Next_Score_Half, names_to = 'type', values_to = 'pred_prob') %>% + mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>% + mutate(outcome = Next_Score_Half, + type = case_when( + type == "td_prob" ~ 'Touchdown', + type == 'fg_prob' ~ "Field_Goal", + type == "opp_td_prob" ~ "Opp_Touchdown", + type == 'opp_fg_prob' ~ "Opp_Field_Goal", + type == 'safety_prob' ~ "Safety", + type == 'opp_safety_prob' ~ "Opp_Safety", + type == "no_score_prob" ~ "No_Score" + )) %>% + group_by(type, bin_pred_prob) %>% + mutate(correct = if_else(outcome == type, 1, 0)) %>% + summarize(n_plays = n(), + n_outcome = sum(correct), + bin_actual_prob = n_outcome / n_plays) %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(type) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_scoring_event = sum(n_outcome, na.rm = TRUE)) +rm(pbp_data) + +message(glue::glue( +' +--CALIBRATION ERROR-- + +nflfastR: +{round(with(cv_cal_error, weighted.mean(weight_cal_error, n_scoring_event)), 4)} + +nflscrapR: +{round(with(nflscrapr, weighted.mean(weight_cal_error, n_scoring_event)), 4)} +' +)) +``` +We see that the new EP model is better calibrated. Note that nflscrapR reports a calibration error of 0.01309723. The number is higher here because of the additional seasons included outside of the time period nflscrapR was trained on, and the lack of era adjustment in nflscrapR. + +## WP Model Calibration Results + +As with EP, do some initial setup to get the data ready for fitting. +```{r wp_setup} +model_data <- readRDS('cal_data.rds') +model_data <- model_data %>% + make_model_mutations() %>% + prepare_wp_data() %>% + mutate(label = ifelse(posteam == Winner, 1, 0)) %>% + filter(!is.na(ep) & !is.na(score_differential) & !is.na(play_type) & !is.na(label)) %>% + select( + label, + receive_2h_ko, + spread_time, + half_seconds_remaining, + game_seconds_remaining, + ExpScoreDiff_Time_Ratio, + score_differential, + ep, + down, + ydstogo, + home, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + season, + #only needed for the plots here, not used in model + qtr + ) %>% + filter(qtr <= 4) + +nrounds = 65 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.2, + gamma = 0, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 1 + ) +``` + +Do the LOSO fitting: +``` {r wp_loso, results = 'hide'} +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr, -spread_time)), + label = train_data$label) + wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr, -spread_time)))) + ) %>% + dplyr::rename(wp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +wp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) + +``` + +The WP plot. Looks good! +```{r plot_wp, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 700} +# Create a label data frame for the chart: +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + qtr = factor("1st Quarter")) + +# Create the calibration chart: +wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2", + "3rd Quarter" = "3", "4th Quarter" = "4")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated win probability", + y = "Observed win probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ qtr, ncol = 4) +``` + +And get the WP calibration error: +```{r cal_error_wp} +# Calculate the calibration error values: +wp_cv_cal_error <- wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) + +#get nflscrapR to compare +pbp_data <- readRDS('cal_data_nflscrapr.rds') %>% + mutate(label = ifelse(posteam == Winner, 1, 0)) %>% + filter(qtr <= 4, !is.na(label), !is.na(posteam), !is.na(wp)) + +nflscrapR <- pbp_data %>% + # Create binned probability column: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) + +message(glue::glue( + '--CALIBRATION ERROR-- + +nflfastR: +{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)} + +nflscrapR: +{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}' +)) +``` +Again, the new WP model represents an improvement. + + +## WP Model Calibration Results: with point spread + +`nflfastR` has a secondary win probability model that also incorporates the pregame spread to more accurately reflect a team's chances of winning. Below are calibration results for this model. + +```{r wp_setup_spread} +nrounds = 170 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.075, + gamma = 3, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 5, + min_child_weight = .9 + ) +``` + +Do the LOSO fitting: +``` {r wp_loso_spread, results = 'hide'} +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr)), + label = train_data$label) + wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr)))) + ) %>% + dplyr::rename(wp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +wp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) + +``` + +The WP plot. +```{r plot_wp_spread, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 700} +# Create a label data frame for the chart: +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + qtr = factor("1st Quarter")) + +# Create the calibration chart: +wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2", + "3rd Quarter" = "3", "4th Quarter" = "4")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated win probability", + y = "Observed win probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ qtr, ncol = 4) +``` + +And get the WP calibration error: +```{r cal_error_wp_spread} +# Calculate the calibration error values: +wp_cv_cal_error <- wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) +message(glue::glue( + '--CALIBRATION ERROR-- + +nflfastR with Vegas line: +{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)} + +nflscrapR: +{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}' +)) +``` +Again, the new WP model is better calibrated than nflscrapR. In our testing, incorporating the spread substantially improved the performance of the model as measured by cross-validation classification accuracy (reduced error rate from 27% to 23%) and log loss (reduced from .52 to .45). We include a time-decaying function of spread on its own as including spread on its own increases the LOSO calibration error, especially in the fourth quarter. We also tried removing the `home` indicator in the spread model, but this worsened the calibration results. + +## CP Model Calibration Results + +By now, the process should be familiar. +``` {r cp-setup, results = 'hide'} +pbp <- readRDS('cal_data.rds') + +model_data <- pbp %>% + filter(season >= 2006) %>% + make_model_mutations() %>% + dplyr::mutate(receiver_player_name = + stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"), + pass_middle = dplyr::if_else(pass_location == 'middle', 1, 0), + air_is_zero= dplyr::if_else(air_yards == 0,1,0), + distance_to_sticks = air_yards - ydstogo + ) %>% + dplyr::filter(complete_pass == 1 | incomplete_pass == 1 | interception == 1) %>% + dplyr::filter(!is.na(air_yards) & air_yards >= -15 & air_yards <70 & !is.na(receiver_player_name) & !is.na(pass_location)) %>% + dplyr::select( + season, complete_pass, air_yards, yardline_100, ydstogo, + down1, down2, down3, down4, air_is_zero, pass_middle, + era2, era3, era4, qb_hit, home, model_week, + outdoors, retractable, dome, distance_to_sticks + ) +rm(pbp) + + +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.2, + gamma = 5, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 6, + base_score = mean(model_data$complete_pass) + ) + +cv_results <- map_dfr(2006:2019, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-complete_pass)), + label = train_data$complete_pass) + cp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(cp_model, as.matrix(test_data %>% select(-complete_pass)))) + ) %>% + dplyr::rename(cp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +cp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate( + bin_pred_prob = round(cp / 0.05) * .05, + distance = case_when( + air_yards < 5 ~ "Short", + air_yards >= 5 & air_yards < 15 ~ "Intermediate", + air_yards >= 15 ~ "Deep" + ) + ) %>% + # Group by both the qtr and bin_pred_prob: + group_by(distance, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_complete = length(which(complete_pass == 1)), + bin_actual_prob = n_complete / n_plays) + +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected") + ) +``` + +Plot the results: +```{r plot_cp, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 700} +cp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(distance = fct_relevel(distance, + "Short", "Intermediate", "Deep") + ) %>% + filter(n_plays > 10) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated completion percentage", + y = "Observed completion percentage") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 3) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ distance, ncol = 3) +``` + +And get the calibration error: +```{r cp-error} +cp_cv_cal_error <- cp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(distance) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_complete = sum(n_complete, na.rm = TRUE)) + +round(with(cp_cv_cal_error, weighted.mean(weight_cal_error, n_complete)), 4) +``` diff --git a/data-raw/MODEL-README.md b/data-raw/MODEL-README.md new file mode 100644 index 00000000..a7f8852a --- /dev/null +++ b/data-raw/MODEL-README.md @@ -0,0 +1,810 @@ +nflfastR EP, WP, and CP models +================ +Ben Baldwin + + - [About](#about) + - [Model features](#model-features) + - [EP Model Calibration Results](#ep-model-calibration-results) + - [WP Model Calibration Results](#wp-model-calibration-results) + - [WP Model Calibration Results: with point + spread](#wp-model-calibration-results-with-point-spread) + - [CP Model Calibration Results](#cp-model-calibration-results) + +## About + +This page describes the nflfastR Expected Points (EP), Win Probability +(WP), and Completion Percentage (CP) models before showing that they are +well calibrated using the procedure [introduced by Yurko, Ventura, and +Horowitz](https://arxiv.org/pdf/1802.00998.pdf). Because the 2020 season +will mark 22 seasons of nflfastR data, the main purpose behind creating +new models for EP and WP was to build in era adjustments to fascilitate +better cross-era comparisons. However, we also discovered that switching +to tree-based methods could improve model calibration, especially for +end-of-half situations with complicated nonlinear interactions between +variables. Because we are introducing new models, we compare our +calibation results to nflscrapR to show that these new models are +somewhat better calibrated. If they weren’t, there would be no point in +updating the models\! + +nflfastR switching from the nflscrapR EP and WP models to its own model +should not be thought of as a criticism of nflscrapR: the improvements +are relatively minor and nflscrapR provided the code base to perform +much of this analysis, breaking new ground in the process. + +## Model features + +The EP, WP, and CP models are trained using xgboost, which uses training +data to create [decision +trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html). + +**EP model features** + + - Seconds remaining in half + - Yard line + - Whether possession team is at home + - Roof type: retractable, dome, or outdoors + - Down + - Yards to go + - Era: 1999-2001 (pre-expansion), 2002-2005 (pre-CPOE), 2006-2013 + (pre-LOB rules change), 2014-2017, 2018 and beyond + - Week (the model is trained on regular season only, so for playoff + games, we impute week = 17) + - Timeouts remaining for each team + +**WP model features** + + - Seconds remaining in half + - Seconds remaining in game + - Yard line + - Expected Points + - Score differential + - Ratio of expected score differential (expected points + point + differential) to time remaining (feature borrowed from nflscrapR) + - Down + - Yards to go + - Timeouts remaining for each team + - Whether team will receive 2nd half kickoff + - Whether possession team is at home + - \[Model with Vegas line only: point spread \* log(3600 / (50 + + (seconds elapsed in game))\] + +**CP model features** + + - Yard line + - Whether possession team is at home + - Roof type: retractable, dome, or outdoors + - Down + - Yards to go + - Distance to sticks (air yards - yards to go) + - Era: 2006-2013, 2014-2017, 2018 and beyond (note that air yards data + only go back to 2006, so there is no CP for earlier years) + - Week (the model is trained on regular season only, so for playoff + games, we impute week = 17) + - Air yards + - Whether air yards is 0 (probably unnecessary with tree-based method + and a relic from earlier models where it was included because + completion percentage is much lower for 0 air yard passes) + - Pass location (binary: middle or not middle) + - Whether quarterback was hit on the play + +## EP Model Calibration Results + +The goal of this section is to show that the nflfastR EP model is well +calibrated. To measure calibration, we follow Yurko et al. and perform +leave-one-season-out (LOSO) calibration. In particular, for each of the +20 available seasons (2000-2019), we exclude one season, train the EP +model on the other 19 seasons, and then compare the model’s predictions +in the holdout season to what actually happened in that season. If the +model is well calibrated, we would expect that, for example, 50 percent +of plays with a touchdown probability of 50 percent prior to the play +would have the next score be a touchdown for the possession team. + +Let’s start with some setup. The file used here isn’t pushed because +it’s large, but its creation can [be seen +here](https://github.com/mrcaseb/fastscraper/blob/ben-upstream/data-raw/MODELS.R). + +``` r +set.seed(2013) #GoHawks +library(tidyverse) +library(xgboost) + +#some helper files are in these +source('../R/helper_add_nflscrapr_mutations.R') +source('../R/helper_add_ep_wp.R') +source('../R/helper_add_cp_cpoe.R') + +pbp_data <- readRDS('cal_data.rds') +model_data <- pbp_data %>% + #in '../R/helper_add_nflscrapr_mutations.R' + make_model_mutations() %>% + mutate( + label = case_when( + Next_Score_Half == "Touchdown" ~ 0, + Next_Score_Half == "Opp_Touchdown" ~ 1, + Next_Score_Half == "Field_Goal" ~ 2, + Next_Score_Half == "Opp_Field_Goal" ~ 3, + Next_Score_Half == "Safety" ~ 4, + Next_Score_Half == "Opp_Safety" ~ 5, + Next_Score_Half == "No_Score" ~ 6 + ), + label = as.factor(label), + #use nflscrapR weights + Drive_Score_Dist = Drive_Score_Half - drive, + Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) / + (max(Drive_Score_Dist) - min(Drive_Score_Dist)), + ScoreDiff_W = (max(abs(score_differential), na.rm=T) - abs(score_differential)) / + (max(abs(score_differential), na.rm=T) - min(abs(score_differential), na.rm=T)), + Total_W = Drive_Score_Dist_W + ScoreDiff_W, + Total_W_Scaled = (Total_W - min(Total_W, na.rm=T)) / + (max(Total_W, na.rm=T) - min(Total_W, na.rm=T)) + ) %>% + filter( + !is.na(defteam_timeouts_remaining), !is.na(posteam_timeouts_remaining), + !is.na(yardline_100) + ) %>% + select( + label, + season, + half_seconds_remaining, + yardline_100, + home, + retractable, + dome, + outdoors, + ydstogo, + era0, era1, era2, era3, era4, + down1, down2, down3, down4, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + model_week, + Total_W_Scaled + ) + +#idk why this is all necessary for xgb but it is +model_data <- model_data %>% + mutate(label = as.numeric(label), + label = label - 1) + +rm(pbp_data) + +seasons <- unique(model_data$season) +``` + +Input the stuff we’ll need to fit the model. The parameters were +obtained from cross-validation, where each season was forced to be +entirely contained in a given CV fold to prevent leakage in labels from +one fold to another (for example, if a given drive were split up between +folds). + +``` r +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "multi:softprob", + eval_metric = c("mlogloss"), + num_class = 7, + eta = 0.2, + gamma = .2, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = .9 + ) +``` + +Now do the LOSO model fitting. + +``` r +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -Total_W_Scaled)), + label = train_data$label, weight = train_data$Total_W_Scaled) + ep_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(ep_model, as.matrix(test_data %>% select(-label, -Total_W_Scaled))), ncol=7, byrow=TRUE) + ) + colnames(preds) <- c("Touchdown","Opp_Touchdown","Field_Goal","Opp_Field_Goal", + "Safety","Opp_Safety","No_Score") + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#get the BINS for the calibration plot +plot <- cv_results %>% + select(Touchdown, Opp_Touchdown, Field_Goal, Opp_Field_Goal, Safety, Opp_Safety, No_Score, label) %>% + pivot_longer(-label, names_to = 'type', values_to = 'pred_prob') %>% + mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>% + mutate(outcome = case_when( + label == 0 ~ "Touchdown", + label == 1 ~ "Opp_Touchdown", + label == 2 ~ "Field_Goal", + label == 3 ~ "Opp_Field_Goal", + label == 4 ~ "Safety", + label == 5 ~ "Opp_Safety", + label == 6 ~ "No_Score" + )) %>% + group_by(type, bin_pred_prob) %>% + mutate(correct = if_else(outcome == type, 1, 0)) %>% + summarize(n_plays = n(), + n_outcome = sum(correct), + bin_actual_prob = n_outcome / n_plays) +``` + +Here is the EP calibration plot. Points close to the diagonal dotted +line are consistent with a well-calibrated model: + +``` r +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + next_score_type = factor("No Score (0)")) +plot %>% + #about .75M plays in total + #filter(n_plays >= 50) %>% + ungroup() %>% + mutate(type = fct_relevel(type, + "Opp_Safety", "Opp_Field_Goal", + "Opp_Touchdown", "No_Score", "Safety", + "Field_Goal", "Touchdown" + ), + type = fct_recode(type, + "-Field Goal (-3)" = "Opp_Field_Goal", + "-Safety (-2)" = "Opp_Safety", + "-Touchdown (-7)" = "Opp_Touchdown", + "Field Goal (3)" = "Field_Goal", + "No Score (0)" = "No_Score", + "Touchdown (7)" = "Touchdown", + "Safety (2)" = "Safety")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated next score probability", + y = "Observed next score probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = c(1, .05), legend.justification = c(1, 0)) + + facet_wrap(~ type, ncol = 4) +``` + + + +There is some weirdness with the opponent safety predictions, but these +dots represent an *extremely* small number of plays (10-50 plays out of +about 750,000). + +Now let’s get the calibration error using the measure developed in Yurko +et al., and compare it to nflscrapR. First we need to get the nflscrapR +predictions, which we have saved from the previous version of nflfastR +which applied the nflscrapR models. + +``` r +#calibration error +cv_cal_error <- plot %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(type) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_scoring_event = sum(n_outcome, na.rm = TRUE)) + +pbp_data <- readRDS('cal_data_nflscrapr.rds') +#nflscrapr calibration error +nflscrapr <- pbp_data %>% + select(td_prob, opp_td_prob, fg_prob, opp_fg_prob, safety_prob, opp_safety_prob, no_score_prob, Next_Score_Half) %>% + pivot_longer(-Next_Score_Half, names_to = 'type', values_to = 'pred_prob') %>% + mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>% + mutate(outcome = Next_Score_Half, + type = case_when( + type == "td_prob" ~ 'Touchdown', + type == 'fg_prob' ~ "Field_Goal", + type == "opp_td_prob" ~ "Opp_Touchdown", + type == 'opp_fg_prob' ~ "Opp_Field_Goal", + type == 'safety_prob' ~ "Safety", + type == 'opp_safety_prob' ~ "Opp_Safety", + type == "no_score_prob" ~ "No_Score" + )) %>% + group_by(type, bin_pred_prob) %>% + mutate(correct = if_else(outcome == type, 1, 0)) %>% + summarize(n_plays = n(), + n_outcome = sum(correct), + bin_actual_prob = n_outcome / n_plays) %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(type) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_scoring_event = sum(n_outcome, na.rm = TRUE)) +rm(pbp_data) + +message(glue::glue( +' +--CALIBRATION ERROR-- + +nflfastR: +{round(with(cv_cal_error, weighted.mean(weight_cal_error, n_scoring_event)), 4)} + +nflscrapR: +{round(with(nflscrapr, weighted.mean(weight_cal_error, n_scoring_event)), 4)} +' +)) +#> --CALIBRATION ERROR-- +#> +#> nflfastR: +#> 0.0063 +#> +#> nflscrapR: +#> 0.017 +``` + +We see that the new EP model is better calibrated. Note that nflscrapR +reports a calibration error of 0.01309723. The number is higher here +because of the additional seasons included outside of the time period +nflscrapR was trained on, and the lack of era adjustment in nflscrapR. + +## WP Model Calibration Results + +As with EP, do some initial setup to get the data ready for fitting. + +``` r +model_data <- readRDS('cal_data.rds') +model_data <- model_data %>% + make_model_mutations() %>% + prepare_wp_data() %>% + mutate(label = ifelse(posteam == Winner, 1, 0)) %>% + filter(!is.na(ep) & !is.na(score_differential) & !is.na(play_type) & !is.na(label)) %>% + select( + label, + receive_2h_ko, + spread_time, + half_seconds_remaining, + game_seconds_remaining, + ExpScoreDiff_Time_Ratio, + score_differential, + ep, + down, + ydstogo, + home, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + season, + #only needed for the plots here, not used in model + qtr + ) %>% + filter(qtr <= 4) + +nrounds = 65 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.2, + gamma = 0, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 1 + ) +``` + +Do the LOSO fitting: + +``` r +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr, -spread_time)), + label = train_data$label) + wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr, -spread_time)))) + ) %>% + dplyr::rename(wp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +wp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) +``` + +The WP plot. Looks good\! + +``` r +# Create a label data frame for the chart: +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + qtr = factor("1st Quarter")) + +# Create the calibration chart: +wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2", + "3rd Quarter" = "3", "4th Quarter" = "4")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated win probability", + y = "Observed win probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ qtr, ncol = 4) +``` + + + +And get the WP calibration error: + +``` r +# Calculate the calibration error values: +wp_cv_cal_error <- wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) + +#get nflscrapR to compare +pbp_data <- readRDS('cal_data_nflscrapr.rds') %>% + mutate(label = ifelse(posteam == Winner, 1, 0)) %>% + filter(qtr <= 4, !is.na(label), !is.na(posteam), !is.na(wp)) + +nflscrapR <- pbp_data %>% + # Create binned probability column: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) + +message(glue::glue( + '--CALIBRATION ERROR-- + +nflfastR: +{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)} + +nflscrapR: +{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}' +)) +#> --CALIBRATION ERROR-- +#> +#> nflfastR: +#> 0.0059 +#> +#> nflscrapR: +#> 0.0397 +``` + +Again, the new WP model represents an improvement. + +## WP Model Calibration Results: with point spread + +`nflfastR` has a secondary win probability model that also incorporates +the pregame spread to more accurately reflect a team’s chances of +winning. Below are calibration results for this model. + +``` r +nrounds = 170 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.075, + gamma = 3, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 5, + min_child_weight = .9 + ) +``` + +Do the LOSO fitting: + +``` r +cv_results <- map_dfr(seasons, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr)), + label = train_data$label) + wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr)))) + ) %>% + dplyr::rename(wp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +wp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate(bin_pred_prob = round(wp / 0.05) * .05) %>% + # Group by both the qtr and bin_pred_prob: + group_by(qtr, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_wins = length(which(label == 1)), + bin_actual_prob = n_wins / n_plays) +``` + +The WP plot. + +``` r +# Create a label data frame for the chart: +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected"), + qtr = factor("1st Quarter")) + +# Create the calibration chart: +wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2", + "3rd Quarter" = "3", "4th Quarter" = "4")) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated win probability", + y = "Observed win probability") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ qtr, ncol = 4) +``` + + + +And get the WP calibration error: + +``` r +# Calculate the calibration error values: +wp_cv_cal_error <- wp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(qtr) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_wins = sum(n_wins, na.rm = TRUE)) +message(glue::glue( + '--CALIBRATION ERROR-- + +nflfastR with Vegas line: +{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)} + +nflscrapR: +{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}' +)) +#> --CALIBRATION ERROR-- +#> +#> nflfastR with Vegas line: +#> 0.0059 +#> +#> nflscrapR: +#> 0.0397 +``` + +Again, the new WP model is better calibrated than nflscrapR. In our testing, +incorporating the spread substantially improved the performance of the +model as measured by cross-validation classification accuracy (reduced +error rate from 27% to 23%) and log loss (reduced from .52 to .45). We +include a time-decaying function of spread on its own as including +spread on its own increases the LOSO calibration error, especially in +the fourth quarter. We also tried removing the `home` indicator in the +spread model, but this worsened the calibration results. + +## CP Model Calibration Results + +By now, the process should be familiar. + +``` r +pbp <- readRDS('cal_data.rds') + +model_data <- pbp %>% + filter(season >= 2006) %>% + make_model_mutations() %>% + dplyr::mutate(receiver_player_name = + stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"), + pass_middle = dplyr::if_else(pass_location == 'middle', 1, 0), + air_is_zero= dplyr::if_else(air_yards == 0,1,0), + distance_to_sticks = air_yards - ydstogo + ) %>% + dplyr::filter(complete_pass == 1 | incomplete_pass == 1 | interception == 1) %>% + dplyr::filter(!is.na(air_yards) & air_yards >= -15 & air_yards <70 & !is.na(receiver_player_name) & !is.na(pass_location)) %>% + dplyr::select( + season, complete_pass, air_yards, yardline_100, ydstogo, + down1, down2, down3, down4, air_is_zero, pass_middle, + era2, era3, era4, qb_hit, home, model_week, + outdoors, retractable, dome, distance_to_sticks + ) +rm(pbp) + + +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.2, + gamma = 5, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 6, + base_score = mean(model_data$complete_pass) + ) + +cv_results <- map_dfr(2006:2019, function(x) { + + test_data <- model_data %>% + filter(season == x) %>% + select(-season) + train_data <- model_data %>% + filter(season != x) %>% + select(-season) + + full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-complete_pass)), + label = train_data$complete_pass) + cp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + preds <- as.data.frame( + matrix(predict(cp_model, as.matrix(test_data %>% select(-complete_pass)))) + ) %>% + dplyr::rename(cp = V1) + + cv_data <- bind_cols(test_data, preds) %>% mutate(season = x) + return(cv_data) + +}) + +#TIME FOR BINNING +cp_cv_loso_calibration_results <- cv_results %>% + # Create BINS for wp: + mutate( + bin_pred_prob = round(cp / 0.05) * .05, + distance = case_when( + air_yards < 5 ~ "Short", + air_yards >= 5 & air_yards < 15 ~ "Intermediate", + air_yards >= 15 ~ "Deep" + ) + ) %>% + # Group by both the qtr and bin_pred_prob: + group_by(distance, bin_pred_prob) %>% + # Calculate the calibration results: + summarize(n_plays = n(), + n_complete = length(which(complete_pass == 1)), + bin_actual_prob = n_complete / n_plays) + +ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25), + lab = c("More times\nthan expected", "Fewer times\nthan expected") + ) +``` + +Plot the results: + +``` r +cp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(distance = fct_relevel(distance, + "Short", "Intermediate", "Deep") + ) %>% + filter(n_plays > 10) %>% + ggplot() + + geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) + + geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") + + geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) + + coord_equal() + + scale_x_continuous(limits = c(0,1)) + + scale_y_continuous(limits = c(0,1)) + + labs(size = "Number of plays", + x = "Estimated completion percentage", + y = "Observed completion percentage") + + geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 3) + + theme_bw() + + theme(plot.title = element_text(hjust = 0.5), + strip.background = element_blank(), + strip.text = element_text(size = 12), + axis.title = element_text(size = 12), + axis.text.y = element_text(size = 12), + axis.text.x = element_text(size = 10, angle = 90), + legend.title = element_text(size = 12), + legend.text = element_text(size = 12), + legend.position = "bottom") + + facet_wrap(~ distance, ncol = 3) +``` + + + +And get the calibration error: + +``` r +cp_cv_cal_error <- cp_cv_loso_calibration_results %>% + ungroup() %>% + mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>% + group_by(distance) %>% + summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE), + n_complete = sum(n_complete, na.rm = TRUE)) + +round(with(cp_cv_cal_error, weighted.mean(weight_cal_error, n_complete)), 4) +#> [1] 0.0059 +``` diff --git a/data-raw/MODELS.R b/data-raw/MODELS.R new file mode 100644 index 00000000..7f687b2d --- /dev/null +++ b/data-raw/MODELS.R @@ -0,0 +1,220 @@ +################################################################################ +# Author: Ben Baldwin +# Purpose: Estimate nflfastR models for EP, CP, Field Goals, and WP +################################################################################ + +library(tidyverse) +library(xgboost) +source('R/helper_add_ep_wp.R') +source('R/helper_add_cp_cpoe.R') +source('R/helper_add_nflscrapr_mutations.R') + +set.seed(2013) #GoHawks + +################################################################################ +# Estimate EP model +################################################################################ + +pbp_data <- readRDS('data-raw/cal_data.rds') + +#function in helper_add_ep_wp.R +model_data <- pbp_data %>% + make_model_mutations() %>% + mutate( + label = case_when( + Next_Score_Half == "Touchdown" ~ 0, + Next_Score_Half == "Opp_Touchdown" ~ 1, + Next_Score_Half == "Field_Goal" ~ 2, + Next_Score_Half == "Opp_Field_Goal" ~ 3, + Next_Score_Half == "Safety" ~ 4, + Next_Score_Half == "Opp_Safety" ~ 5, + Next_Score_Half == "No_Score" ~ 6 + ), + label = as.factor(label), + # Calculate the drive difference between the next score drive and the + # current play drive: + Drive_Score_Dist = Drive_Score_Half - drive, + # Create a weight column based on difference in drives between play and next score: + Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) / + (max(Drive_Score_Dist) - min(Drive_Score_Dist)), + # Create a weight column based on score differential: + ScoreDiff_W = (max(abs(score_differential), na.rm=T) - abs(score_differential)) / + (max(abs(score_differential), na.rm=T) - min(abs(score_differential), na.rm=T)), + # Add these weights together and scale again: + Total_W = Drive_Score_Dist_W + ScoreDiff_W, + Total_W_Scaled = (Total_W - min(Total_W, na.rm=T)) / + (max(Total_W, na.rm=T) - min(Total_W, na.rm=T)) + ) %>% + filter( + !is.na(defteam_timeouts_remaining), !is.na(posteam_timeouts_remaining), + !is.na(yardline_100) + ) %>% + select( + label, + half_seconds_remaining, + yardline_100, + home, + retractable, + dome, + outdoors, + ydstogo, + era0, era1, era2, era3, era4, + down1, down2, down3, down4, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + model_week, + Total_W_Scaled + ) + +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "multi:softprob", + eval_metric = c("mlogloss"), + num_class = 7, + eta = 0.2, + gamma = .2, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = .9 + ) + +model_data <- model_data %>% + mutate(label = as.numeric(label), + label = label - 1) + +full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = model_data %>% select(-label, -Total_W_Scaled)), + label = model_data$label, weight = model_data$Total_W_Scaled) +ep_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + +################################################################################ +# Estimate FG model +################################################################################ + +fg_model_data <- pbp_data %>% + filter(play_type %in% c("field_goal","extra_point","run") & + (!is.na(extra_point_result) | !is.na(field_goal_result))) %>% + make_model_mutations() + +#estimate model +fg_model <- mgcv::bam(sp ~ s(yardline_100, by = interaction(era, model_roof)) + model_roof + era, + data = fg_model_data, family = "binomial") + +################################################################################ +# Estimate CP model +################################################################################ + +model_vars <- pbp_data %>% + filter(season >= 2006) %>% + make_model_mutations() %>% + prepare_cp_data() %>% + filter(valid_pass == 1) %>% + select(-valid_pass) + +nrounds = 70 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.2, + gamma = 5, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 6, + base_score = mean(model_vars$complete_pass) + ) + +full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = model_vars %>% dplyr::select(-complete_pass)), + label = model_vars$complete_pass) +cp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + +################################################################################ +# Estimate WP model: spread +################################################################################ + +model_data <- pbp_data %>% + make_model_mutations() %>% + prepare_wp_data() %>% + mutate(label = ifelse(posteam == Winner, 1, 0)) %>% + filter(qtr <= 4 & !is.na(ep) & !is.na(score_differential) & !is.na(play_type) & !is.na(label)) %>% + select( + label, + receive_2h_ko, + spread_time, + half_seconds_remaining, + game_seconds_remaining, + ExpScoreDiff_Time_Ratio, + ep, + score_differential, + down, + ydstogo, + home, + posteam_timeouts_remaining, + defteam_timeouts_remaining + ) + + +nrounds = 170 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("logloss"), + eta = 0.075, + gamma = 3, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 5, + min_child_weight = .9 + ) + + +full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = model_data %>% select(-label)), + label = model_data$label) +wp_model_spread <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + +#importance <- xgboost::xgb.importance(feature_names = colnames(wp_model_spread), model = wp_model_spread) +#xgboost::xgb.ggplot.importance(importance_matrix = importance) + +#xgboost::xgb.plot.tree(model = wp_model_spread, trees = 1, show_node_id = TRUE) + + +################################################################################ +# Estimate WP model: no spread +################################################################################ + +model_data <- model_data %>% + select( + -spread_time + ) + +nrounds = 65 +params <- + list( + booster = "gbtree", + objective = "binary:logistic", + eval_metric = c("error", "logloss"), + eta = 0.2, + gamma = 0, + subsample=0.8, + colsample_bytree=0.8, + max_depth = 4, + min_child_weight = 1 + ) + + +full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = model_data %>% select(-label)), + label = model_data$label) +wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2) + + +# save models to use in package +usethis::use_data(ep_model, wp_model, wp_model_spread, fg_model, cp_model, internal = TRUE, overwrite = TRUE) + + + diff --git a/data-raw/MODELS_DATA.R b/data-raw/MODELS_DATA.R new file mode 100644 index 00000000..710b97c2 --- /dev/null +++ b/data-raw/MODELS_DATA.R @@ -0,0 +1,154 @@ +################################################################################ +# Author: Ben Baldwin +# Purpose: Prepare data for nflfastR models for EP, CP, Field Goals, and WP +# This takes a long time (especially finding next score half) +# Save a pre-prepared df +################################################################################ + +library(tidyverse) +source('data-raw/EP_functions.R') + +################################################################################ +# DATA PREP +################################################################################ + +#read in data from data repo +pbp_data <- purrr::map_df(1999 : 2019, function(x) { + readRDS( + url( + glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds") + ) + ) %>% filter(season_type == 'REG') +}) %>% + mutate( + Winner = if_else(home_score > away_score, home_team, + if_else(home_score < away_score, away_team, "TIE")) + ) + +#get next score half using the provided function +pbp_next_score_half <- map_dfr(unique(pbp_data$game_id), + function(x) { + pbp_data %>% + filter(game_id == x) %>% + find_game_next_score_half() + }) + +#bind to original df +pbp_data <- bind_cols(pbp_data, pbp_next_score_half) + +#for estimating the models, apply some filters +pbp_data <- pbp_data %>% + filter(Next_Score_Half %in% c("Opp_Field_Goal", "Opp_Safety", "Opp_Touchdown", + "Field_Goal", "No_Score", "Safety", "Touchdown") & + play_type %in% c("field_goal", "no_play", "pass", "punt", "run", + "qb_spike") & is.na(two_point_conv_result) & is.na(extra_point_result) & + !is.na(down) & !is.na(game_seconds_remaining)) %>% + #to keep file size manageable + select( + game_id, + Next_Score_Half, + Drive_Score_Half, + play_type, + game_seconds_remaining, + half_seconds_remaining, + yardline_100, + roof, + posteam, + defteam, + home_team, + ydstogo, + season, + qtr, + down, + week, + drive, + ep, + score_differential, + posteam_timeouts_remaining, + defteam_timeouts_remaining, + desc, + receiver_player_name, + pass_location, + air_yards, + complete_pass, incomplete_pass, interception, + qb_hit, + extra_point_result, + field_goal_result, + sp, + Winner, + spread_line + ) + +#for doing calibation etc +saveRDS(pbp_data, 'data-raw/cal_data.rds') + +#fix roof types +#now fixing spread_line +#delete this after re-scraping and updating +games_data <- readRDS(url("https://github.com/leesharpe/nfldata/blob/master/data/games.rds?raw=true")) %>% + select(home_team, season, week, spread_line) %>% + mutate( + home_team = case_when( + home_team == 'STL' ~ 'LA', + home_team == 'SD' ~ 'LAC', + home_team == 'OAK' ~ 'LV', + TRUE ~ home_team + ) + ) + +pbp_data <- readRDS('data-raw/cal_data.rds') +pbp_data <- pbp_data %>% + left_join(games_data, by = c('home_team', 'season', 'week')) +saveRDS(pbp_data, 'data-raw/cal_data.rds') + + +################################################################################ +# DATA PREP FOR NFLSCRAPR COMPARISON +# This is only used in the readme describing the models +# Where we compare nflscrapR and nflfastR calibration errors +################################################################################ + +pbp_data <- purrr::map_df(2000 : 2019, function(x) { + readRDS( + url( + glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/legacy-data/play_by_play_{x}.rds") + ) + ) %>% filter(season_type == 'REG') +}) + +games <- readRDS(url("http://www.habitatring.com/games.rds")) %>% + filter(!is.na(result)) %>% + mutate( + game_id = as.numeric(old_game_id), + Winner = if_else(home_score > away_score, home_team, + if_else(home_score < away_score, away_team, "TIE")) + ) %>% + select(game_id, Winner, result, roof) + +pbp_data <- pbp_data %>% + left_join( + games, by = c('game_id') + ) + +#get next score half using the provided function +pbp_next_score_half <- map_dfr(unique(pbp_data$game_id), + function(x) { + pbp_data %>% + filter(game_id == x) %>% + find_game_next_score_half() + }) + +#bind to original df +pbp_data <- bind_cols(pbp_data, pbp_next_score_half) + +#apply filters +pbp_data <- pbp_data %>% + filter(Next_Score_Half %in% c("Opp_Field_Goal", "Opp_Safety", "Opp_Touchdown", + "Field_Goal", "No_Score", "Safety", "Touchdown") & + play_type %in% c("field_goal", "no_play", "pass", "punt", "run", + "qb_spike") & is.na(two_point_conv_result) & is.na(extra_point_result) & + !is.na(down) & !is.na(game_seconds_remaining)) %>% + select(posteam, wp, qtr, Winner, td_prob, opp_td_prob, fg_prob, opp_fg_prob, safety_prob, opp_safety_prob, no_score_prob, Next_Score_Half) + +#for doing calibation etc +saveRDS(pbp_data, 'data-raw/cal_data_nflscrapr.rds') diff --git a/data-raw/man/figures/README-plot-1.png b/data-raw/man/figures/README-plot-1.png new file mode 100644 index 00000000..efcee8bc Binary files /dev/null and b/data-raw/man/figures/README-plot-1.png differ diff --git a/data-raw/man/figures/README-plot_cp-1.png b/data-raw/man/figures/README-plot_cp-1.png new file mode 100644 index 00000000..1ccba2d4 Binary files /dev/null and b/data-raw/man/figures/README-plot_cp-1.png differ diff --git a/data-raw/man/figures/README-plot_wp-1.png b/data-raw/man/figures/README-plot_wp-1.png new file mode 100644 index 00000000..943ebe74 Binary files /dev/null and b/data-raw/man/figures/README-plot_wp-1.png differ diff --git a/data-raw/man/figures/README-plot_wp_spread-1.png b/data-raw/man/figures/README-plot_wp_spread-1.png new file mode 100644 index 00000000..ebdf9402 Binary files /dev/null and b/data-raw/man/figures/README-plot_wp_spread-1.png differ diff --git a/data-raw/package_building_workflow.R b/data-raw/package_building_workflow.R index 05f67737..c80e3885 100644 --- a/data-raw/package_building_workflow.R +++ b/data-raw/package_building_workflow.R @@ -13,13 +13,12 @@ usethis::use_package("jsonlite", type = "Imports", min_version = NULL) usethis::use_package("lubridate", type = "Imports", min_version = NULL) usethis::use_package("magrittr", type = "Imports", min_version = NULL) usethis::use_package("mgcv", type = "Imports", min_version = NULL) -usethis::use_package("nflscrapR", type = "Imports", min_version = NULL) usethis::use_package("purrr", type = "Imports", min_version = NULL) +usethis::use_package("progressr", type = "Imports", min_version = NULL) usethis::use_package("stringr", type = "Imports", min_version = NULL) usethis::use_package("tibble", type = "Imports", min_version = NULL) usethis::use_package("tidyr", type = "Imports", min_version = NULL) usethis::use_package("tidyselect", type = "Imports", min_version = NULL) -usethis::use_package("xml2", type = "Imports", min_version = NULL) usethis::use_tidy_description() # add license diff --git a/data-raw/variable_explanation.xlsx b/data-raw/variable_explanation.xlsx index c3d784fe..075a3316 100644 Binary files a/data-raw/variable_explanation.xlsx and b/data-raw/variable_explanation.xlsx differ diff --git a/man/fix_fumbles.Rd b/man/add_qb_epa.Rd similarity index 73% rename from man/fix_fumbles.Rd rename to man/add_qb_epa.Rd index 1a762dcd..b3588958 100644 --- a/man/fix_fumbles.Rd +++ b/man/add_qb_epa.Rd @@ -1,13 +1,13 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/helper_additional_functions.R -\name{fix_fumbles} -\alias{fix_fumbles} +\name{add_qb_epa} +\alias{add_qb_epa} \title{Compute QB epa} \usage{ -fix_fumbles(d) +add_qb_epa(d) } \arguments{ -\item{d}{is a dataframe of play-by-play data scraped using \code{\link{fast_scraper}}.} +\item{d}{is a Data frame of play-by-play data scraped using \code{\link{fast_scraper}}.} } \description{ Compute QB epa diff --git a/man/calculate_expected_points.Rd b/man/calculate_expected_points.Rd new file mode 100644 index 00000000..84e875be --- /dev/null +++ b/man/calculate_expected_points.Rd @@ -0,0 +1,38 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/ep_calculator.R +\name{calculate_expected_points} +\alias{calculate_expected_points} +\title{Compute expected points} +\usage{ +calculate_expected_points(pbp_data) +} +\arguments{ +\item{pbp_data}{Play-by-play dataset to estimate expected points for.} +} +\value{ +The original pbp_data with the following columns appended to it: +\itemize{ +\item{ep} - expected points. +\item{no_score_prob} - probability of no more scoring this half. +\item{opp_fg_prob} - probability next score opponent field goal this half. +\item{opp_safety_prob} - probability next score opponent safety this half. +\item{opp_td_prob} - probability of next score opponent touchdown this half. +\item{fg_prob} - probability next score field goal this half. +\item{safety_prob} - probability next score safety this half. +\item{td_prob} - probability text score touchdown this half. +} +} +\description{ +for provided plays. Returns the data with +probabilities of each scoring event and EP added. The following columns +must be present: season, week, home_team, posteam, roof (coded as 'open', +'closed', or 'retractable'), half_seconds_remaining, yardline_100, +ydstogo, posteam_timeouts_remaining, defteam_timeouts_remaining +} +\details{ +Computes expected points for provided plays. Returns the data with +probabilities of each scoring event and EP added. The following columns +must be present: season, week, home_team, posteam, roof (coded as 'open', +'closed', or 'retractable'), half_seconds_remaining, yardline_100, +ydstogo, posteam_timeouts_remaining, defteam_timeouts_remaining +} diff --git a/man/clean_pbp.Rd b/man/clean_pbp.Rd index 653376d0..c28a65ce 100644 --- a/man/clean_pbp.Rd +++ b/man/clean_pbp.Rd @@ -7,19 +7,37 @@ clean_pbp(pbp) } \arguments{ -\item{pbp}{is a dataframe of play-by-play data scraped using \code{\link{fast_scraper}}.} +\item{pbp}{is a Data frame of play-by-play data scraped using \code{\link{fast_scraper}}.} +} +\value{ +The input Data Frame of the paramter 'pbp' with the following columns +added: +\itemize{ +\item{success} - Binary indicator wheter epa > 0 in the given play. +\item{passer} - Name of the dropback player (scrambles included) including plays with penalties. +\item{rusher} - Name of the rusher (no scrambles) including plays with penalties. +\item{receiver} - Name of the receiver including plays with penalties. +\item{pass} - Binary indicator if the play was a pass play (sacks and scrambles included). +\item{rush} - Binary indicator if the play was a rushing play. +\item{special} - Binary indicator if the play was a special teams play. +\item{first_down} - Binary indicator if the play ended in a first down. +\item{play} - Binary indicator: 1 if the play was a 'normal' play (including penalties), 0 otherwise. +\item{passer_id} - ID of the player in the 'passer' column (NOTE: ids vary pre and post 2011) +\item{rusher_id} - ID of the player in the 'rusher' column (NOTE: ids vary pre and post 2011) +\item{receiver_id} - ID of the player in the 'receiver' column (NOTE: ids vary pre and post 2011) +\item{name} - Name of the 'passer' if it is not 'NA', or name of the 'rusher' otherwise. +\item{id} - ID of the player in the 'name' column. +} } \description{ Clean Play by Play Data } \details{ Build columns that capture what happens on all plays, including -penalties, using string extraction from play description. The created 'name' -column denotes the dropback player on dropbacks or the rusher on rush attempts. +penalties, using string extraction from play description. Loosely based on Ben's nflscrapR guide (https://gist.github.com/guga31bb/5634562c5a2a7b1e9961ac9b6c568701) but updated to work with the RS data, which has a different player format in the play description; e.g. 24-M.Lynch instead of M.Lynch. The function also standardizes team abbreviations so that, for example, the Chargers are always represented by 'LAC' regardless of which year it was. -Also creates a 'play' column denoting 'normal' plays (Ie pass play or run play) } diff --git a/man/fast_scraper.Rd b/man/fast_scraper.Rd index 07617d74..bab43a76 100644 --- a/man/fast_scraper.Rd +++ b/man/fast_scraper.Rd @@ -4,24 +4,26 @@ \alias{fast_scraper} \title{Get NFL Play by Play Data} \usage{ -fast_scraper(game_ids, source = "rs", pp = FALSE) +fast_scraper(game_ids, source = "nfl", pp = FALSE) } \arguments{ -\item{game_ids}{Vector of numeric or character ids (see details for further information)} +\item{game_ids}{Vector of character ids (see details for further information)} -\item{source}{Character - either "rs" or "gc" (see details for further information)} +\item{source}{Character - must now be \code{nfl} or unspecified (see details for further information)} \item{pp}{Logical - either \code{TRUE} or \code{FALSE} (see details for further information)} } \value{ Data frame where each individual row represents a single play for -all passed game_ids scraped from the choosen source containing the following -detailed information (description mostly extracted from nflscrapR): +all passed game_ids containing the following +detailed information (description partly extracted from nflscrapR): \itemize{ \item{play_id} - Numeric play id that when used with game_id and drive provides the unique identifier for a single play. \item{game_id} - Ten digit identifier for NFL game. \item{home_team} - String abbreviation for the home team. \item{away_team} - String abbreviation for the away team. +\item{season_type} - 'REG' or 'POST' indicating if the game belongs to regular or post season. +\item{week} - Season week. \item{posteam} - String abbreviation for the team with possession. \item{posteam_type} - String indicating whether the posteam team is home or away. \item{defteam} - String abbreviation for the team on defense. @@ -112,6 +114,8 @@ detailed information (description mostly extracted from nflscrapR): \item{wpa} - Win probability added (WPA) for the posteam. \item{home_wp_post} - Estimated win probability for the home team at the start of the play. \item{away_wp_post} - Estimated win probability for the away team at the start of the play. +\item{vegas_wp} - Estimated win probabiity for the posteam given the current situation at the start of the given play, incorporating pre-game Vegas line. +\item{vegas_home_wp} - Estimated win probability for the home team incorporating pre-game Vegas line. \item{total_home_rush_wpa} - Cumulative total rushing WPA for the home team in the game so far. \item{total_away_rush_wpa} - Cumulative total rushing WPA for the away team in the game so far. \item{total_home_pass_wpa} - Cumulative total passing WPA for the home team in the game so far. @@ -137,8 +141,8 @@ detailed information (description mostly extracted from nflscrapR): \item{fourth_down_converted} - Binary indicator for if the first down was converted on fourth down. \item{fourth_down_failed} - Binary indicator for if the posteam failed to convert first down on fourth down. \item{incomplete_pass} - Binary indicator for if the pass was incomplete. -\item{interception} - Binary indicator for if the pass was intercepted. \item{touchback} - Binary indicator for if a touchback occurred on the play. +\item{interception} - Binary indicator for if the pass was intercepted. \item{punt_inside_twenty} - Binary indicator for if the punt ended inside the twenty yard line. \item{punt_in_endzone} - Binary indicator for if the punt was in the endzone. \item{punt_out_of_bounds} - Binary indicator for if the punt went out of bounds. @@ -275,42 +279,52 @@ detailed information (description mostly extracted from nflscrapR): \item{defensive_extra_point_attempt} - Binary indicator whether or not the defense was able to have an attempt on an extra point attempt, this results following a blocked attempt that the defense recovers the ball. \item{defensive_extra_point_conv} - Binary indicator whether or not the defense successfully scored on an extra point attempt. \item{season} - 4 digit number indicating to which season the game belongs to. -\item{cp} - Numeric value indicationg the probability for a complete pass based on air yards, field position, down, yards to go, pass location, and season based on comparable game situations. +\item{cp} - Numeric value indicating the probability for a complete pass based on comparable game situations. \item{cpoe} - For a single pass play this is 1 - cp when the pass was completed or 0 - cp when the pass was incomplete. Analyzed for a whole game or season an indicator for the passer how much over or under expectation his completion percentage was. -\item{season_type} - 'REG' or 'POST' indicating if the game belongs to regular or post season. -\item{week} - Season week. -\item{game_key} - RS feed game identifier. -\item{game_time_eastern} - Kickoff time in eastern time zone. -\item{game_time_local} - Kickoff time in local time zone. -\item{iso_time} - Kickoff time according ISO 8601. -\item{game_type} - One of 'REG', 'WC', 'DIV', 'CON', 'SB' indicating if a game was a regular season game or one of the playoff rounds. -\item{site_id} - RS feed id for game site. -\item{site_city} - Game site city. -\item{site_fullname} - Game site name. -\item{site_state} - Game site state. -\item{roof_type} - Game site roof type. -\item{drive_start_time} - Game time at the beginning of a given drive. -\item{drive_end_time} - Game time at the end of a given drive. -\item{drive_start_yardline} - String indicating where a given drive started consisting of team half and yard line number. -\item{drive_end_yardline} - String indicating where a given drive ended consisting of team half and yard line number. -\item{drive_how_started} - String indicating how the offense got the ball. -\item{drive_how_ended} - String indicating how the offense lost the ball. +\item{series} - Starts at 1, each new first down increments, numbers shared across both teams NA: kickoffs, extra point/two point conversion attempts, non-plays, no posteam +\item{series_success} - 1: scored touchdown, gained enough yards for first down 0: punt, interception, fumble lost, turnover on downs, FG attempt NA: series is NA, series contains QB spike/kneel +\item{start_time} - Kickoff time in eastern time zone. +\item{stadium} - Game site name. +\item{weather} - String describing the weather including temperature, humidity and wind (direction and speed). Doesn't change during the game! +\item{nfl_api_id} - UUID of the game in the new NFL API. +\item{play_clock} - Time on the playclock when the ball was snapped. +\item{play_deleted} - Binary indicator for deleted plays. +\item{play_type_nfl} - Play type as listed in the NFL source. Slightly different to the regular play_type variable. +\item{end_clock_time} - Game time at the end of a given play. +\item{end_yard_line} - String indicating the yardline at the end of the given play consisting of team half and yard line number. +\item{drive_real_start_time} - Local day time when the drive started (currently not used by the NFL and therefore mostly 'NA'). \item{drive_play_count} - Numeric value of how many regular plays happened in a given drive. -\item{drive_yards_penalized} - Numeric value of how many yards the offense gained or lost through penalties. \item{drive_time_of_possession} - Time of possession in a given drive. -\item{drive_inside20} - Binary indicator if the offense was able to get inside the opponents 20 yard line. \item{drive_first_downs} - Number of forst downs in a given drive. -\item{drive_possession_team_abbr} - Abbreviation of the possession team in a given drive. -\item{scoring_team_abbr} - Abbreviation of the scoring team if the play was a scoring play. -\item{scoring_type} - String indicating the scoring type. One of 'FG', 'TD', 'PAT', 'SFTY', 'PAT2'. -\item{alert_play_type} - String describing the play type of a play the NFL has listed as alert play. For most of those plays there are highlight clips available through \code{\link{fast_scraper_clips}}. -\item{play_type_nfl} - Play type as listed in the rs feed. Slightly different to the regular play_type variable. -\item{time_of_day} - Local time at the beginning of the play. -\item{yards} - Analogue yards_gained but with the kicking team being the possession team (which means that there are many yards gained through kickoffs and punts). -\item{end_yardline_side} - String indicating the side of the field at the end of the given play. -\item{end_yardline_number} - Yardline number within the above given side at the end of the given play. -\item{series} - Starts at 1, each new first down increments, numbers shared across both teams. Is NA for: kickoffs, extra point/two point conversion attempts, no posteam. -\item{series_success} - 1 when scored touchdown, gained enough yards for first down. 0 when punt, interception, fumble lost, turnover on downs, 4th down FG attempt. NA when series is NA, series contains QB spike/kneel. +\item{drive_inside20} - Binary indicator if the offense was able to get inside the opponents 20 yard line. +\item{drive_ended_with_score} - Binary indicator the drive ended with a score. +\item{drive_quarter_start} - Numeric value indicating in which quarter the given drive has started. +\item{drive_quarter_end} - Numeric value indicating in which quarter the given drive has ended. +\item{drive_yards_penalized} - Numeric value of how many yards the offense gained or lost through penalties in the given drive. +\item{drive_start_transition} - String indicating how the offense got the ball. +\item{drive_end_transition} - String indicating how the offense lost the ball. +\item{drive_game_clock_start} - Game time at the beginning of a given drive. +\item{drive_game_clock_end} - Game time at the end of a given drive. +\item{drive_start_yard_line} - String indicating where a given drive started consisting of team half and yard line number. +\item{drive_end_yard_line} - String indicating where a given drive ended consisting of team half and yard line number. +\item{drive_play_id_started} - Play_id of the first play in the given drive. +\item{drive_play_id_ended} - Play_id of the last play in the given drive. +\item{away_score} - Total points scored by the away team. +\item{home_score} - Total points scored by the home team. +\item{location} - Either 'Home' o 'Neutral' indicating if the home team played at home or at a neutral site. +\item{result} - Equals home_score - away_score and means the game outcome from the perspective of the home team. +\item{total} - Equals home_score + away_score and means the total points scored in the given game. +\item{spread_line} - The closing spread line for the game. A positive number means the home team was favored by that many points, a negative number means the away team was favored by that many points. (Source: Pro-Football-Reference) +\item{total_line} - The closing total line for the game. (Source: Pro-Football-Reference) +\item{div_game} - Binary indicator for if the given game was a division game. +\item{roof} - One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) +\item{surface} - What type of ground the game was played on. (Source: Pro-Football-Reference) +\item{temp} - The temperature at the stadium only for 'roof' = 'outdoors' or 'open'.(Source: Pro-Football-Reference) +\item{wind} - The speed of the wind in miles/hour only for 'roof' = 'outdoors' or 'open'. (Source: Pro-Football-Reference) +\item{home_coach} - First and last name of the home team coach. (Source: Pro-Football-Reference) +\item{away_coach} - First and last name of the away team coach. (Source: Pro-Football-Reference) +\item{stadium_id} - ID of the stadium the game was played in. (Source: Pro-Football-Reference) +\item{game_stadium} - Name of the stadium the game was played in. (Source: Pro-Football-Reference) } } \description{ @@ -320,23 +334,16 @@ Get NFL Play by Play Data To load valid game_ids please use the package function \code{\link{fast_scraper_schedules}}. The \code{source} parameter controls from which source the data is being -scraped. The following sources are available: -\itemize{ -\item{\code{rs}} - the RS Feed on NFL.com. It is being considered the more complete - source (data available back to 2000) but is not able to scrape live games. -\item{\code{gc}} - the gamecenter. It is less complete (back to 2009) and includes -less variables but is able to scrape live games. -} +scraped. The old parameters \code{rs} as well as \code{gc} +are not valid anymore. Please use \code{nfl} or leave unspecified. The \code{pp} parameter controls if the scraper should use parallel processing. Please note that the initiating process takes a few seconds which means it may be better to set \code{pp = FALSE} if you are scraping just a few games. } \examples{ -# Get pbp data for two 2006 games using the rs feed and parallel processing -# game_ids <- c("2006091009", "2006123103") -# pbp <- fast_scraper(game_ids, source = "rs", pp = TRUE) - -# Get pbp data for two 2019 games using gamecenter and no parallel processing -# game_ids <- c("2019090804", "2019101700") -# pbp <- fast_scraper(game_ids, source = "gc", pp = FALSE) +\dontrun{ +# Get pbp data for two games using parallel processing +game_ids <- c("2019_01_GB_CHI", "2013_21_SEA_DEN") +pbp <- fast_scraper(game_ids, pp = TRUE) +} } diff --git a/man/fast_scraper_clips.Rd b/man/fast_scraper_clips.Rd deleted file mode 100644 index b2bb6f56..00000000 --- a/man/fast_scraper_clips.Rd +++ /dev/null @@ -1,32 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/top-level_scraper.R -\name{fast_scraper_clips} -\alias{fast_scraper_clips} -\title{Get NFL Play by Play Highlight Clips} -\usage{ -fast_scraper_clips(game_ids, pp = FALSE) -} -\arguments{ -\item{game_ids}{Vector of numeric or character ids} - -\item{pp}{Logical - either \code{TRUE} or \code{FALSE} (see details for further information)} -} -\value{ -Data frame containing game_id, play_id for all plays with available -highlightclip and the clip url -} -\description{ -Get NFL Play by Play Highlight Clips -} -\details{ -To load valid game_ids please use the package function \code{\link{fast_scraper_schedules}}. -The \code{pp} parameter controls if the scraper should use parallel processing. -Please note that the initiating process takes a few seconds which means it -may be better to set \code{pp = FALSE} if you are scraping just a few games. -} -\examples{ - -# Get highlight clips for two 2019 games using parallel processing -# game_ids <- c("2019090804", "2019101700") -# clips <- fast_scraper_clips(game_ids, pp = TRUE) -} diff --git a/man/fast_scraper_roster.Rd b/man/fast_scraper_roster.Rd deleted file mode 100644 index 81192abf..00000000 --- a/man/fast_scraper_roster.Rd +++ /dev/null @@ -1,69 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/top-level_scraper.R -\name{fast_scraper_roster} -\alias{fast_scraper_roster} -\title{Get team rosters for multiple seasons and teams} -\usage{ -fast_scraper_roster(team_ids, seasons, pp = FALSE) -} -\arguments{ -\item{team_ids}{A string vector containing the IDs for NFL Team(s) -(see details for more information)} - -\item{seasons}{A string vector of 4-digit years associated with given NFL seasons} - -\item{pp}{Logical - either \code{TRUE} or \code{FALSE} (see details for further information)} -} -\value{ -Data frame where each individual row represents a player in -the roster of the given team and season listed by the NFL -containing the following information: -\itemize{ -\item{team.season} -\item{teamPlayers.displayName} -\item{teamPlayers.firstName} -\item{teamPlayers.middleName} -\item{teamPlayers.lastName} -\item{teamPlayers.suffix} -\item{teamPlayers.status} -\item{teamPlayers.position} -\item{teamPlayers.positionGroup} -\item{teamPlayers.nflId} -\item{teamPlayers.esbId} -\item{teamPlayers.gsisId} -\item{teamPlayers.birthDate} -\item{teamPlayers.homeTown} -\item{teamPlayers.collegeId} -\item{teamPlayers.collegeName} -\item{teamPlayers.jerseyNumber} -\item{teamPlayers.height} -\item{teamPlayers.weight} -\item{team.teamId} -\item{team.abbr} -\item{team.cityState} -\item{team.fullName} -\item{team.nick} -\item{team.conferenceAbbr} -\item{team.divisionAbbr} -\item{teamPlayers.headshot_url} -\item{teamPlayers.profile_url} -} -} -\description{ -Given team_ids and years, return a dataset with each -player the NFL has listed as part of the roster. -} -\details{ -To find team associated Team IDs use the \code{\link{teams_colors_logos}} -dataset stored in this package! -The \code{pp} parameter controls if the scraper should use parallel processing. -Please note that the initiating process takes a few seconds which means it -may be better to set \code{pp = FALSE} if you are scraping just a few teams/seasons. -} -\examples{ -# Roster of Steelers in 2018, no parallel processing -# rosters <- fast_scraper_roster("3900", 2018, pp = FALSE) - -# Roster of Steelers and Seahawks in 2016 & 2019 using parallel processing -# rosters <- fast_scraper_roster(c("3900", "4600"), c("2016", "2019"), pp = TRUE) -} diff --git a/man/fast_scraper_schedules.Rd b/man/fast_scraper_schedules.Rd index 9e865a7c..5a1103a5 100644 --- a/man/fast_scraper_schedules.Rd +++ b/man/fast_scraper_schedules.Rd @@ -12,45 +12,42 @@ fast_scraper_schedules(seasons, pp = FALSE) \item{pp}{Logical - either \code{TRUE} or \code{FALSE} (see details for further information)} } \value{ -Data frame containing the follwoing detailed game information: +Data frame containing the following detailed game information: \itemize{ +\item{game_id} - Character identifier including season, week, away team and home team \item{season} - 4 digit season year. -\item{season_type} - Either 'PRE', 'REG', 'POST', 'PRO'. -\item{week} - Numeric week number. -\item{game_id} - Unique game identifier. -\item{alt_game_id} - Alternative and much more intuitive identifier introduced by Lee Sharpe and set to \code{NA} for Hall of Fame Week, Preseason and Pro Bowl -\item{game_date} - Game date in format dd/mm/yyyy. -\item{game_time_eastern} - Kickoff time in eastern time zone. -\item{game_time_local} - Kickoff time in local time zone. -\item{iso_time} - Kickoff time according ISO 8601. -\item{home_team} - Home team abbreviation -\item{away_team} - Away team abbreviation -\item{home_team_name} - Home team full name -\item{away_team_name} - Away team full name -\item{home_nickname} - Home team nick name -\item{away_nickname} - Away team nick name -\item{home_team_id} - Home team id (can be used with the package function \code{\link{fast_scraper_roster}}). -\item{away_team_id} - Away team id (can be used with the package function \code{\link{fast_scraper_roster}}). \item{game_type} - One of 'REG', 'WC', 'DIV', 'CON', 'SB' indicating if a game was a regular season game or one of the playoff rounds. -\item{week_name} - Full description of week -\item{site_city} - Game site city. -\item{site_fullname} - Game site name. -\item{site_state} - Game site state. -\item{site_roof_type} - Game site roof type. -\item{network_channel} - Name of broadcasting network channel. +\item{week} - Numeric week number. +\item{gameday} - Game date in format yyyy/mm/dd. +\item{weekday} - The day of the week on which the game occcured. +\item{gametime} - The kickoff time of the game. This is represented in 24-hour time and the Eastern time zone, regardless of what time zone the game was being played in. +\item{away_team} - Away team abbreviation. +\item{home_team} - Home team abbreviation. +\item{away_score} - The number of points the away team scored. Is 'NA' for games which haven't yet been played. +\item{home_score} - The number of points the home team scored. Is 'NA' for games which haven't yet been played. +\item{home_result} - Equals home_score - away_score and means the game outcome from the perspective of the home team. +\item{stadium} - Name of the stadium the game was or will be played in. (Source: Pro-Football-Reference) +\item{location} - Either 'Home' o 'Neutral' indicating if the home team played at home or at a neutral site. +\item{roof} - One of 'dome', 'outdoors', 'closed', 'open' indicating indicating the roof status of the stadium the game was played in. (Source: Pro-Football-Reference) +\item{surface} - What type of ground the game was played on. (Source: Pro-Football-Reference) +\item{old_game_id} - Unique game identifier of the old NFL API. } } \description{ Get NFL Season Schedules } \details{ +This functions now incorporates the games file provided and maintained +by Lee Sharpe. + The \code{pp} parameter controls if the scraper should use parallel processing. Please note that the initiating process takes a few seconds which means it may be better to set \code{pp = FALSE} if you are scraping less than 10 seasons. } \examples{ - +\dontrun{ # Get schedules for the whole 2015 - 2018 seasons -# seasons <- 2015:2018 -# schedules <- fast_scraper_schedules(seasons) +seasons <- 2015:2018 +schedules <- fast_scraper_schedules(seasons) +} } diff --git a/man/figures/README-ex5-1.png b/man/figures/README-ex5-1.png new file mode 100644 index 00000000..afbd4dc3 Binary files /dev/null and b/man/figures/README-ex5-1.png differ diff --git a/man/pipe.Rd b/man/pipe.Rd deleted file mode 100644 index 0eec7526..00000000 --- a/man/pipe.Rd +++ /dev/null @@ -1,12 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/utils-pipe.R -\name{\%>\%} -\alias{\%>\%} -\title{Pipe operator} -\usage{ -lhs \%>\% rhs -} -\description{ -See \code{magrittr::\link[magrittr:pipe]{\%>\%}} for details. -} -\keyword{internal} diff --git a/man/teams_colors_logos.Rd b/man/teams_colors_logos.Rd index 45417d7e..b3bcf9ba 100644 --- a/man/teams_colors_logos.Rd +++ b/man/teams_colors_logos.Rd @@ -19,6 +19,9 @@ information, including franchises in multiple cities: \item{team_logo_wikipedia}{Url to Team logo on wikipedia} \item{team_logo_espn}{Url to higher quality logo on espn} } +The colors are taken from Lee Sharpe's teamcolors.csv who has taken them from the +`teamcolors` package created by Ben Baumer and Gregory Matthews. +The Wikipeadia logo urls are taken from Lee Sharpe's logos.csv } \usage{ teams_colors_logos