A machine learning project that predicts English Premier League match outcomes using Random Forest and Gradient Boosting classifiers. The model predicts whether a team will win based on historical match statistics and rolling performance metrics.
This project uses historical EPL match data to build predictive models that forecast match outcomes. The approach combines:
- Feature Engineering: Converting categorical data and extracting time-based features
- Rolling Averages: Capturing team form using 3-match rolling statistics
- Ensemble Methods: Random Forest and Gradient Boosting classifiers
- Dual Perspective Analysis: Merging predictions from both teams to identify high-confidence predictions
epl-match-prediction/
├── prediction.ipynb # Main Jupyter notebook with ML pipeline
├── matches.csv # EPL match data (not included)
├── requirements.txt # Python dependencies list
└── README.md
- Clone the repository: `git clone https://github.com/yourusername/epl-match-prediction.git`, then `cd epl-match-prediction`
- Create a virtual environment: `python -m venv .venv`, then `source .venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Add the dataset: place a `matches.csv` file in the project root directory. The data can be obtained from FBRef (Premier League match logs). The dataset should contain EPL match data with the following columns:
  - `date`: Match date
  - `time`: Match start time
  - `venue`: Home or Away
  - `team`: Team name
  - `opponent`: Opponent team name
  - `result`: Match result (W/D/L)
  - `gf`, `ga`: Goals for/against
  - `sh`, `sot`: Shots, shots on target
  - `dist`: Average shot distance
  - `fk`: Free kicks
  - `pk`, `pkatt`: Penalties scored, penalty attempts
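Before running the notebook, it can help to confirm that the file has the expected columns. The following is a minimal sketch, not part of the project itself: the names `REQUIRED_COLUMNS` and `missing_columns` are introduced here for illustration, and a one-row in-memory CSV with made-up values stands in for `matches.csv`.

```python
import io

import pandas as pd

# Columns listed above; REQUIRED_COLUMNS is a hypothetical name used only in this sketch.
REQUIRED_COLUMNS = {
    "date", "time", "venue", "team", "opponent", "result",
    "gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt",
}


def missing_columns(df: pd.DataFrame) -> set:
    """Return the required columns that are absent from df."""
    return REQUIRED_COLUMNS - set(df.columns)


# Tiny in-memory stand-in for matches.csv (values are made up for the demo).
sample_csv = io.StringIO(
    "date,time,venue,team,opponent,result,gf,ga,sh,sot,dist,fk,pk,pkatt\n"
    "2021-08-14,15:00,Home,Team A,Team B,W,2,0,12,5,17.3,1,0,0\n"
)
matches = pd.read_csv(sample_csv)

# An empty set means every expected column is present.
print(missing_columns(matches))
```

With the real file, `pd.read_csv("matches.csv")` would replace the in-memory sample.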
Match data sourced from FBRef, which provides comprehensive football statistics and match data.
| Feature | Description |
|---|---|
| `venue_code` | Home (0) vs Away (1) |
| `opponent_code` | Numeric code for opponent team |
| `hour` | Match start hour |
| `day_code` | Day of week (0=Monday, 6=Sunday) |
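One way these four features could be derived with pandas is sketched below. This is an illustration under assumptions rather than the notebook's exact code: the explicit `Home`/`Away` mapping is chosen to match the coding in the table, the `HH:MM` time format is assumed, and the two sample rows are made up.

```python
import pandas as pd

# Two made-up rows standing in for matches.csv.
matches = pd.DataFrame({
    "date": ["2021-08-14", "2021-08-22"],
    "time": ["15:00", "17:30"],
    "venue": ["Home", "Away"],
    "opponent": ["Team B", "Team C"],
})

matches["date"] = pd.to_datetime(matches["date"])

# Explicit mapping so that Home=0 and Away=1, as in the table (assumed encoding).
matches["venue_code"] = matches["venue"].map({"Home": 0, "Away": 1})

# Category codes assign one integer per distinct opponent name.
matches["opponent_code"] = matches["opponent"].astype("category").cat.codes

# Keep only the hour part of an "HH:MM" kickoff time (format assumed).
matches["hour"] = matches["time"].str.split(":").str[0].astype(int)

# pandas uses Monday=0 ... Sunday=6, matching the table.
matches["day_code"] = matches["date"].dt.dayofweek

print(matches[["venue_code", "opponent_code", "hour", "day_code"]])
```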
| Feature | Description |
|---|---|
| `gf_rolling` | Goals scored |
| `ga_rolling` | Goals conceded |
| `sh_rolling` | Total shots |
| `sot_rolling` | Shots on target |
| `dist_rolling` | Average shot distance |
| `fk_rolling` | Free kicks |
| `pk_rolling` | Penalties scored |
| `pkatt_rolling` | Penalty attempts |
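The rolling features can be computed per team with a 3-match window that excludes the current match. The sketch below is one possible implementation, assuming a reasonably recent pandas in which `closed="left"` works on fixed-size windows. The helper name `rolling_averages` and the sample data are illustrative, and only `gf` and `ga` are shown.

```python
import pandas as pd


def rolling_averages(group: pd.DataFrame, cols: list, new_cols: list) -> pd.DataFrame:
    """Add 3-match rolling means that use only matches played before each row."""
    group = group.sort_values("date")
    # closed="left" drops the current row from its own window, so the average
    # for a match is built from the three previous matches only.
    rolling_stats = group[cols].rolling(3, closed="left").mean()
    rolling_stats.columns = new_cols
    group = pd.concat([group, rolling_stats], axis=1)
    # The first three matches of each team have no full window and are dropped.
    return group.dropna(subset=new_cols)


# Made-up data: one team, five consecutive matches.
matches = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-08-14", "2021-08-21", "2021-08-28", "2021-09-11", "2021-09-18"]
    ),
    "team": ["Team A"] * 5,
    "gf": [1, 2, 3, 4, 5],
    "ga": [0, 0, 3, 0, 3],
})

cols = ["gf", "ga"]
new_cols = [f"{c}_rolling" for c in cols]

# Apply the helper team by team, then stitch the results back together.
parts = [rolling_averages(g, cols, new_cols) for _, g in matches.groupby("team")]
matches_rolling = pd.concat(parts).reset_index(drop=True)

print(matches_rolling[["date", "gf", "gf_rolling", "ga_rolling"]])
```

In this sample the fourth match gets `gf_rolling = 2.0`, the mean of the first three matches (1, 2, 3); its own 4 goals are not included.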
- Random Forest
  - Configuration: 50 estimators, `min_samples_split=10`
  - Purpose: Baseline model with rolling features
- Gradient Boosting
  - Configuration: 100 estimators, `learning_rate=0.1`, `max_depth=3`
  - Enhancement: Probability thresholding (≥55%) for confident predictions
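The two configurations above map onto scikit-learn roughly as follows. This is a sketch under assumptions: the data is synthetic, `random_state=1` is added here only for reproducibility, and the way the 55% threshold is applied (on the predicted win probability) is one reading of the description rather than confirmed notebook code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the feature matrix and the binary win target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Random Forest with the stated hyperparameters (random_state is an assumption).
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

# Gradient Boosting with the stated hyperparameters.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1
)
gb.fit(X_train, y_train)

# Predict a win only when the estimated win probability is at least 55%.
win_proba = gb.predict_proba(X_test)[:, 1]
confident_wins = (win_proba >= 0.55).astype(int)

print(rf_preds.sum(), confident_wins.sum())
```

Raising the threshold above the default 0.5 trades recall for precision: fewer matches are flagged as wins, but those that are flagged carry a higher estimated probability.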
- Data Preprocessing
  - Convert dates and extract temporal features
  - Encode categorical variables (venue, opponent)
  - Create binary target variable (Win=1, Not Win=0)
- Feature Engineering
  - Calculate 3-match rolling averages using `closed='left'` to prevent data leakage (each average uses only earlier matches)
  - Normalize team names for consistent merging
- Training Strategy
  - Time-based split at January 1, 2022 (not random) to simulate real prediction scenarios
  - Train on historical data, predict future matches
  - Data from the 2021-2022 season onward is used for testing, as shown in `prediction.ipynb`
- Confidence Filtering
  - Merge home and away perspectives for each match
  - Filter for matches where Team A is predicted to win AND Team B is predicted to lose
  - These "confident" predictions yield higher precision
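The confidence filtering step can be sketched as a self-merge of the prediction table, pairing each team's row with its opponent's row for the same date. Everything below is illustrative: the `NAME_MAP` entry, the column names `new_team` and `predicted`, and the four sample rows are assumptions, not taken from the notebook.

```python
import pandas as pd

# Hypothetical mapping: the "team" column may spell a club differently
# from the "opponent" column, so names are normalized before merging.
NAME_MAP = {"Manchester United": "Manchester Utd"}


def normalize(name: str) -> str:
    """Map a team name to the spelling used in the opponent column."""
    return NAME_MAP.get(name, name)


# Made-up predictions: each match appears once per team.
preds = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-15"] * 2 + ["2022-01-16"] * 2),
    "team": ["Manchester United", "Aston Villa", "Arsenal", "Burnley"],
    "opponent": ["Aston Villa", "Manchester Utd", "Burnley", "Arsenal"],
    "predicted": [1, 0, 1, 1],  # 1 = predicted win, 0 = predicted not win
})
preds["new_team"] = preds["team"].map(normalize)

# Pair each row (suffix _x) with the opponent's row for the same match (suffix _y).
merged = preds.merge(
    preds,
    left_on=["date", "new_team"],
    right_on=["date", "opponent"],
    suffixes=("_x", "_y"),
)

# Keep matches where one side is predicted to win and the other is not.
confident = merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]

print(confident[["date", "team_x", "team_y"]])
```

In this sample only the Manchester United row survives the filter; the Arsenal vs Burnley match is dropped because both sides were predicted to win, which is a contradictory signal.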
The primary evaluation metric is precision: of the matches predicted as wins, the fraction that were actually won. This optimizes for confident win predictions.
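As a small worked example of how precision could be measured on the time-based split, the sketch below uses scikit-learn's `precision_score`. The six rows and the `predicted` column are made-up placeholders rather than model output, and treating January 1, 2022 itself as part of the test set is an assumption.

```python
import pandas as pd
from sklearn.metrics import precision_score

# Made-up matches with a true target and placeholder predictions.
matches = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-11-01", "2021-12-15", "2022-01-02",
        "2022-02-10", "2022-03-05", "2022-04-01",
    ]),
    "target": [1, 0, 1, 0, 1, 0],
    "predicted": [1, 0, 1, 1, 0, 0],
})

# Time-based split: everything before the cutoff is training data.
split_date = pd.Timestamp("2022-01-01")
train = matches[matches["date"] < split_date]
test = matches[matches["date"] >= split_date]

# Precision = true positives / all predicted positives, on the test period only.
precision = precision_score(test["target"], test["predicted"])
print(precision)
```

Here two test matches are predicted as wins and one of them was actually won, so the precision is 0.5.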
Open the Jupyter notebook and run all cells: `jupyter notebook prediction.ipynb`
Alternatively, open the notebook in VS Code with the Jupyter extension installed.