According to Yahoo Finance, the sport analytics market is projected to reach US$31.4 billion by 2034. This growth reflects the potential of sports analytics to improve athletic performance, team strategy, and the business side of the sports industry.
The proposed machine learning model includes:
- Feature selection for calculating swing probability
- Swing probability prediction
- Hit probability prediction
- Analysis of pitcher and batter performance
First, the model identifies the most important features for calculating swing probability, drawing on both game situation (e.g., score, inning, runners, count, outs, pitch hand, and bat side) and pitch features (e.g., pitch location, release point, and Statcast metrics). Feature selection uses Recursive Feature Elimination (RFE) with a logistic regression estimator, which repeatedly fits the estimator and removes the least important features until only the most relevant remain. The calculated swing probability is then used, along with the pitch features, as an input for calculating hit probability. Finally, logistic regression produces the swing and hit probabilities, and each model is evaluated by accuracy and F1 score.
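The pipeline described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the code from the project notebooks: the feature matrix, the number of features kept (3), the train/test split, and every variable name are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the pitch data (assumption): 2000 pitches, 8 features.
# Only columns 0-2 drive the simulated swing decision.
n_pitches, n_features = 2000, 8
X = rng.normal(size=(n_pitches, n_features))
swing_logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2]
swing = (rng.random(n_pitches) < 1 / (1 + np.exp(-swing_logit))).astype(int)
# Simulated hit label: only possible on a swing, and influenced by column 3.
hit_chance = 1 / (1 + np.exp(-(X[:, 3] - 1.0)))
hit = (swing * (rng.random(n_pitches) < hit_chance)).astype(int)

X_train, X_test, swing_train, swing_test, hit_train, hit_test = train_test_split(
    X, swing, hit, test_size=0.25, random_state=0
)

# Scale features so the logistic regression coefficients are comparable.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 1: RFE with a logistic regression estimator keeps the 3 strongest features.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_train_s, swing_train)
selected = np.flatnonzero(selector.support_)

# Step 2: swing probability model on the selected features.
swing_model = LogisticRegression(max_iter=1000)
swing_model.fit(X_train_s[:, selected], swing_train)
swing_prob_train = swing_model.predict_proba(X_train_s[:, selected])[:, 1]
swing_prob_test = swing_model.predict_proba(X_test_s[:, selected])[:, 1]
swing_pred = swing_model.predict(X_test_s[:, selected])

# Step 3: hit probability model that takes the swing probability as an extra input.
hit_X_train = np.column_stack([X_train_s, swing_prob_train])
hit_X_test = np.column_stack([X_test_s, swing_prob_test])
hit_model = LogisticRegression(max_iter=1000).fit(hit_X_train, hit_train)
hit_pred = hit_model.predict(hit_X_test)

# Step 4: evaluate both models by accuracy and F1 score.
swing_accuracy = accuracy_score(swing_test, swing_pred)
swing_f1 = f1_score(swing_test, swing_pred, zero_division=0)
hit_accuracy = accuracy_score(hit_test, hit_pred)
hit_f1 = f1_score(hit_test, hit_pred, zero_division=0)

print("selected feature columns:", selected.tolist())
print(f"swing accuracy={swing_accuracy:.3f} F1={swing_f1:.3f}")
print(f"hit   accuracy={hit_accuracy:.3f} F1={hit_f1:.3f}")
```

Because hits are much rarer than non-hits, accuracy alone can look good even when few hits are predicted, which is one reason to report F1 alongside it.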
Logistic regression applies a sigmoid function to represent the probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where z is the output of the linear equation, w are the weights, x are the feature data, and b is the bias:

$$z = w \cdot x + b$$
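As a small worked example of the sigmoid and the linear equation described above, the sketch below computes a probability from a weight vector, a feature vector, and a bias. The function names and the numbers are illustrative assumptions, not values from the trained models.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    # Branch on the sign of z so math.exp never receives a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, features, bias):
    """Compute z = w . x + b, then pass z through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Illustrative values (assumptions): z = 0.5*2.0 + (-0.25)*4.0 + 0.0 = 0.0
probability = predict_probability([0.5, -0.25], [2.0, 4.0], 0.0)
print(probability)  # 0.5, since sigmoid(0) = 0.5
```

A score of z = 0 sits exactly on the decision boundary; positive scores push the probability toward 1 and negative scores toward 0.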
Using this information, both pitcher and batter performance can be analyzed in comparison to these probabilities. By predicting a favorable pitch or hit from various features, we could anticipate the likely outcome of a play before it happens, which could help improve athletic performance and game strategy.
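One possible way to compare a player against these probabilities is to subtract the sum of predicted hit probabilities from the actual hit total for each batter. The sketch below assumes this "hits above expected" reading, which is not stated in the project itself; the record layout and the key names `batter_id`, `hit`, and `hit_probability` are hypothetical.

```python
from collections import defaultdict


def hits_above_expected(pitches):
    """Return actual hits minus summed predicted hit probability, per batter."""
    totals = defaultdict(lambda: {"actual": 0, "expected": 0.0})
    for pitch in pitches:
        batter_totals = totals[pitch["batter_id"]]
        batter_totals["actual"] += pitch["hit"]
        batter_totals["expected"] += pitch["hit_probability"]
    return {
        batter: t["actual"] - t["expected"] for batter, t in totals.items()
    }


# Made-up records for illustration only.
sample_pitches = [
    {"batter_id": "A", "hit": 1, "hit_probability": 0.25},
    {"batter_id": "A", "hit": 0, "hit_probability": 0.25},
    {"batter_id": "B", "hit": 0, "hit_probability": 0.5},
]

print(hits_above_expected(sample_pitches))  # {'A': 0.5, 'B': -0.5}
```

A positive value suggests the batter got more hits than the model expected from the pitches seen; the same aggregation keyed on a pitcher identifier would give the pitcher's view.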
Yahoo Finance, https://finance.yahoo.com/news/sport-analytics-market-expected-reach-180000614.html
Data was made available by Major League Baseball. It is a combination of game details and Statcast pitch metrics. Pitch location is originally provided as coordinates. This was converted to a grid as defined here:
The 2023 MLB regular season pitch data set is available here: https://drive.google.com/file/d/1VdqB_q9YrbgDUZklgIYjYFNeFRTIUJ4-/view?usp=drive_link
Sample videos to illustrate pitch zone location are available here: https://drive.google.com/drive/folders/1ytJQ8hHBSVvnIjO1c7S1ykq3eqEmdHYY?usp=sharing
Both the swing probability and hit probability models were created in a Jupyter notebook (swing-probabilities-cm.ipynb).
Preliminary research was done in LogisticRegression.ipynb.
Unique identifier for a game
Uniquely identifies a given play. Plays generally begin with a pitch and continue until play stops. Plays may include batted balls and runner movement. In certain situations, a play may be associated with a non-pitch event such as a pickoff attempt or runner advancement.
Numeric value representing the month of the game date
Numeric value representing the four digit year of the game date
The original date a game was scheduled to be played. The actual game date may differ if the game was delayed.
A sequential number associated with each "at bat" during a game
A sequential number associated with each pitch of an at bat
Unique identifier associated with a pitcher
Unique identifier associated with the team a pitcher plays for
Character value either L or R indicating the hand the pitcher uses to throw (meaning Left or Right)
Unique identifier associated with a batter
Unique identifier associated with the team a batter plays for
Character value either L or R indicating the side of the plate the batter stands during the at bat (meaning Left or Right)
A value of 1 for a right-handed batter or -1 for a left-handed batter. The value is used to normalize the pitch zone location
A character string that indicates the numeric value for balls and strikes, separated by a hyphen. For example, a count with 2 balls and 1 strike is '2-1'
Numeric value indicating the number of outs at the time of the pitch
A character string that indicates runner positions on the bases. The first character is first base, the second character is second base, and the third character is third base. If the base is occupied, a 1 is assigned. If the base is not occupied, a 0 is assigned. For example, a runner on second and third is '011'.
Numeric value indicating the inning
Numeric value indicating the score of the defensive team
Numeric value indicating the score of the offensive team
If the ball is in play, a value of 0 or 1 indicating whether the scorer ruled it a hit. A value of 1 means a scored hit, not merely that the ball was put into play.
A value of 0 or 1 indicating whether or not the batter swung at the pitch
A value of 0 or 1 indicating whether or not the batter swung and missed. Bunts are not included.
A value of 0 or 1 indicating whether or not the batter made contact with the pitch.
A categorical value indicating what happened on the pitch. Typical values include: Ball, Called Strike, Swinging Strike, Hit Into Play No Outs, Hit Into Play Outs
A value of 0 or 1 indicating if the pitch was a strike (called, swinging, foul)
A value of 0 or 1 indicating if the pitch was called a ball
Two character code used to indicate the type of pitch thrown. pitch_type_desc expands the name.
A value of 0 or 1 indicating if the batter swung and put the ball in play. Foul balls do not count unless the foul ball is caught for an out.
Floating point value indicating the number of revolutions per minute the pitch spins during flight.
Floating point value indicating the distance at which the pitcher releases the ball. The pitcher's plate is located 60' 6" from the apex of home plate. Extension is measured as a y value perpendicular to a straight line between home and the pitcher's plate.
Floating point value indicating the pitch speed in miles per hour when the ball left the pitcher's hand
Floating point value that identifies the angle of the axis at which the pitch is spinning during flight
Floating point value that indicates the distance (in inches) the pitch moved from a line it would have traveled with only gravity affecting it
Floating point value that indicates the distance (in inches) the pitch moved on a horizontal plane
Floating point value that indicates the speed the ball is traveling when it reaches home plate (in miles per hour)
Floating point value that is the difference between 'pitch speed' and 'plate speed'
Floating point value that indicates the number of revolutions the ball spins on a horizontal axis
Floating point value that indicates the number of revolutions the ball spins on a vertical axis
Floating point value that indicates the number of revolutions the ball spins on a 45 degree axis
Floating point value that indicates the distance (in inches) from the center of the strike zone as the ball crosses home plate
Floating point value that indicates the distance (in inches) on a horizontal plane from the center of home plate
Floating point value that indicates the distance (in inches) on a vertical plane from the center of the strike zone
This value is calculated by multiplying bat_side_multiplier by relative_strike_zone_location_x to normalize pitch zone location
Floating point value that indicates the distance (in inches) on a horizontal plane from the center of home plate
Floating point value that indicates the distance (in inches) on a vertical plane from the ground
Seems to be the same as inferred_backspin_rate
Seems to be the same as inferred_sidespin_rate
Seems to be the same as inferred_gyrospin_rate
Floating point value that indicates the distance (in feet) from a perpendicular line from the apex of home plate through the center of the pitcher's plate in which the ball is released
Floating point value that indicates the distance from the ground (in feet) where the ball is released
Floating point value that indicates the distance from the apex of home plate (in feet) where the ball was released
Normalized zone describing where the pitch crosses the plate. A visual representation is available here: https://docs.google.com/document/d/1eRzfZ7Q4lfMKI-P5wE7UvGP3jzJEbd10sX2-Bh9jK6Y/edit?usp=sharing
Text field that describes the type of pitch thrown
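Several of the fields described above are encoded values that usually need decoding before modeling: the count string, the base runner string, and the bat side multiplier used to normalize horizontal pitch location. The helpers below are a minimal sketch of that decoding. The function names and dictionary keys are hypothetical; only `bat_side_multiplier` and `relative_strike_zone_location_x` are names taken from the descriptions above.

```python
def parse_count(count: str) -> tuple[int, int]:
    """Split a count such as '2-1' into (balls, strikes)."""
    balls, strikes = count.split("-")
    return int(balls), int(strikes)


def parse_runners(runners: str) -> dict[str, bool]:
    """Decode a three-character base string such as '011' into base occupancy."""
    # Characters are ordered first base, second base, third base; '1' = occupied.
    first, second, third = (ch == "1" for ch in runners)
    return {"on_first": first, "on_second": second, "on_third": third}


def bat_side_multiplier(bat_side: str) -> int:
    """Return 1 for a right-handed batter ('R') or -1 for a left-handed one ('L')."""
    return 1 if bat_side == "R" else -1


def normalized_zone_x(bat_side: str, relative_strike_zone_location_x: float) -> float:
    """Multiply the horizontal location by the bat side multiplier."""
    return bat_side_multiplier(bat_side) * relative_strike_zone_location_x


print(parse_count("2-1"))            # (2, 1)
print(parse_runners("011"))          # runners on second and third
print(normalized_zone_x("L", 3.5))   # -3.5
```

Normalizing this way means that, for example, an inside pitch has the same sign for left-handed and right-handed batters, so one model can treat both sides consistently.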