This repository contains the data, models, and web app for my paper Trouble with the Curve: Predicting Future MLB Players Using Scouting Reports.
To the best of my knowledge, this is the only existing dataset of its kind for baseball prospect profiles. Almost 10,000 profiles were acquired from MLB.com and FanGraphs containing players' scouting reports and 20-80 scale grades, as well as select metadata.
With the above data, an obvious question arises: Can we predict if a player will make the major leagues? We use a variety of deep learning methods to attempt to answer this question, and achieve a strong "maybe". We also present an analysis of the language variations within the reports between successful players, as well as between positions.
Model | Accuracy | F1 |
---|---|---|
Bag-Of-Embeddings | 64.65% | 53.78% |
TextCNN | 69.02% | 56.42% |
LSTM+SelfAttn | 68.64% | 54.65% |
BCN | 73.52% | 43.33% |
HAN | 66.00% | 54.07% |
A Hierarchical Attention Network is trained as part of the above question, allowing not only a demonstration of the research problem, but also an interpretable visualization for each prediction using attention weights.