EPL Match Prediction

A machine learning project that predicts English Premier League match outcomes using Random Forest and Gradient Boosting classifiers. The model predicts whether a team will win based on historical match statistics and rolling performance metrics.

Overview

This project uses historical EPL match data to build predictive models that forecast match outcomes. The approach combines:

- Feature Engineering: Converting categorical data and extracting time-based features
- Rolling Averages: Capturing team form using 3-match rolling statistics
- Ensemble Methods: Random Forest and Gradient Boosting classifiers
- Dual Perspective Analysis: Merging predictions from both teams to identify high-confidence predictions

Project Structure

epl-match-prediction/
├── prediction.ipynb    # Main Jupyter notebook with ML pipeline
├── matches.csv         # EPL match data (not included)
├── requirements.txt    # Python dependencies list
└── README.md

Installation

Clone the repository

```
git clone https://github.com/yourusername/epl-match-prediction.git
cd epl-match-prediction
```

Create a virtual environment

```
python -m venv .venv
source .venv/bin/activate
```

Install dependencies

```
pip install -r requirements.txt
```

Add the dataset

Place a matches.csv file in the project root directory. The data can be obtained from FBRef (Premier League match logs). The dataset should contain EPL match data with the following columns:
- date: Match date
- time: Match start time
- venue: Home or Away
- team: Team name
- opponent: Opponent team name
- result: Match result (W/D/L)
- gf, ga: Goals for/against
- sh, sot: Shots, shots on target
- dist: Average shot distance
- fk: Free kicks
- pk, pkatt: Penalties scored, penalty attempts
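
Before running the notebook, it can help to confirm that the file has the columns listed above. The following is a minimal sketch, not part of the notebook: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and a small in-memory frame stands in for `pd.read_csv("matches.csv")`.

```python
import pandas as pd

# Columns listed above; the notebook itself may read more than these.
REQUIRED_COLUMNS = [
    "date", "time", "venue", "team", "opponent", "result",
    "gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt",
]


def missing_columns(df):
    """Return the required columns that are absent from df, in listed order."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Stand-in frame; in practice this would be pd.read_csv("matches.csv").
sample = pd.DataFrame(columns=["date", "time", "venue", "team", "opponent", "result"])
missing = missing_columns(sample)
print(missing)
```

An empty list means every listed column is present.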

Data Source

Match data is sourced from FBRef, which publishes football statistics and match logs.

Features

Input Features

| Feature | Description |
| --- | --- |
| `venue_code` | Home (0) vs Away (1) |
| `opponent_code` | Numeric code for opponent team |
| `hour` | Match start hour |
| `day_code` | Day of week (0=Monday, 6=Sunday) |
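
One way these four features could be derived with pandas is sketched below. The rows are made up, and the explicit venue mapping is an assumption chosen to match the codes listed above; plain category codes would sort alphabetically and give Away=0, Home=1 instead.

```python
import pandas as pd

# Hypothetical rows in the expected schema (not real match data).
matches = pd.DataFrame({
    "date": ["2021-08-14", "2021-08-21", "2021-08-28"],
    "time": ["15:00", "17:30", "12:30"],
    "venue": ["Home", "Away", "Home"],
    "opponent": ["Arsenal", "Chelsea", "Arsenal"],
})

matches["date"] = pd.to_datetime(matches["date"])
# Explicit mapping (assumption) so the codes match the listing: Home=0, Away=1.
matches["venue_code"] = matches["venue"].map({"Home": 0, "Away": 1})
# Category codes are assigned in sorted order of the opponent names.
matches["opponent_code"] = matches["opponent"].astype("category").cat.codes
# Keep only the hour part of the kick-off time, e.g. "17:30" becomes 17.
matches["hour"] = matches["time"].str.split(":").str[0].astype(int)
# pandas numbers weekdays Monday=0 through Sunday=6.
matches["day_code"] = matches["date"].dt.dayofweek

print(matches[["venue_code", "opponent_code", "hour", "day_code"]])
```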

Rolling Statistics (3-match window)

| Feature | Description |
| --- | --- |
| `gf_rolling` | Goals scored |
| `ga_rolling` | Goals conceded |
| `sh_rolling` | Total shots |
| `sot_rolling` | Shots on target |
| `dist_rolling` | Average shot distance |
| `fk_rolling` | Free kicks |
| `pk_rolling` | Penalties scored |
| `pkatt_rolling` | Penalty attempts |
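
The rolling columns can be computed per team along these lines. This is a sketch under assumptions: `rolling_averages` is a hypothetical helper, the data is invented, and only two of the eight statistics are shown. The notebook may organise this step differently.

```python
import pandas as pd


def rolling_averages(group, cols, window=3):
    """Add <col>_rolling columns holding the mean of the previous `window` matches."""
    group = group.sort_values("date").copy()
    # closed="left" leaves the current match out of its own window.
    rolled = group[cols].rolling(window, closed="left").mean()
    for col in cols:
        group[f"{col}_rolling"] = rolled[col]
    # A team's first `window` matches have no complete history, so drop them.
    return group.dropna(subset=[f"{col}_rolling" for col in cols])


# Invented data: team A has five matches, team B has four.
matches = pd.DataFrame({
    "team": ["A"] * 5 + ["B"] * 4,
    "date": pd.to_datetime([
        "2021-08-14", "2021-08-21", "2021-08-28", "2021-09-11", "2021-09-18",
        "2021-08-14", "2021-08-21", "2021-08-28", "2021-09-11",
    ]),
    "gf": [1, 2, 3, 4, 5, 0, 3, 3, 9],
    "ga": [0, 1, 0, 1, 0, 1, 1, 1, 1],
})
cols = ["gf", "ga"]

# Process each team separately so one team's form never mixes with another's.
with_form = pd.concat([rolling_averages(g, cols) for _, g in matches.groupby("team")])
print(with_form[["team", "date", "gf", "gf_rolling"]])
```

Team A's fourth match gets a `gf_rolling` of 2.0, the mean of its first three matches (1, 2, 3); the 4 goals scored in that match itself do not enter the average.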

Models

1. Random Forest Classifier

- Configuration: 50 estimators, min_samples_split=10
- Purpose: Baseline model with rolling features
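
In scikit-learn, this configuration might look as follows. The sketch trains on synthetic data with 12 columns (standing in for the 4 input features plus 8 rolling statistics); `random_state=1` is an assumption added for reproducibility and is not stated in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and win/not-win target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 50 trees; a node is split only if it holds at least 10 samples.
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
rf.fit(X[:150], y[:150])      # earlier rows act as the training set
preds = rf.predict(X[150:])   # later rows act as the test set
print(preds[:10])
```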

2. Gradient Boosting Classifier

- Configuration: 100 estimators, learning_rate=0.1, max_depth=3
- Enhancement: Probability thresholding, so that a win is predicted only when the predicted win probability is at least 55%
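
The thresholding step can be illustrated with `predict_proba`. This is a sketch on synthetic data, assuming the threshold is applied to the probability of the win class; `random_state=1` and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real features and win/not-win target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
X_train, X_test, y_train = X[:200], X[200:], y[:200]

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1
)
gb.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (win).
win_proba = gb.predict_proba(X_test)[:, 1]
default_pred = gb.predict(X_test)                # standard decision rule
confident_pred = (win_proba >= 0.55).astype(int)  # stricter 55% rule

print(default_pred.sum(), confident_pred.sum())
```

Raising the threshold can only remove win predictions, never add them, which trades recall for (hopefully) higher precision.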

Methodology

Data Preprocessing
- Convert dates and extract temporal features
- Encode categorical variables (venue, opponent)
- Create binary target variable (Win=1, Not Win=0)
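
The binary target can be derived from the `result` column in one line. A small sketch with made-up results:

```python
import pandas as pd

# Made-up results: win, draw, loss, win.
results = pd.Series(["W", "D", "L", "W"], name="result")

# Draws and losses both map to 0, so the model answers "win or not?".
target = (results == "W").astype(int)
print(target.tolist())
```
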
Feature Engineering
- Calculate 3-match rolling averages using closed='left' to prevent data leakage
- Normalize team names for consistent merging
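
Name normalization matters because the same club can be spelled differently in the `team` and `opponent` columns. One possible approach is sketched below; the mapping entries are illustrative only, and `MissingDict` is a hypothetical helper, so the actual spellings should be checked against the dataset.

```python
import pandas as pd


class MissingDict(dict):
    """dict that returns the key itself when no mapping is defined."""

    def __missing__(self, key):
        return key


# Illustrative entries only; real spellings depend on the dataset.
TEAM_NAME_MAP = MissingDict({
    "Manchester United": "Manchester Utd",
    "Tottenham Hotspur": "Tottenham",
})

teams = pd.Series(["Manchester United", "Arsenal", "Tottenham Hotspur"])
# Series.map falls back to __missing__ for names without an entry.
normalized = teams.map(TEAM_NAME_MAP)
print(normalized.tolist())
```
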
Training Strategy
- Split by date at January 1, 2022 rather than at random, to simulate predicting future matches
- Train on matches before the cutoff and test on the matches that follow
- The test set therefore covers the 2021-22 season from the cutoff onward, plus any later matches in the dataset, as shown in prediction.ipynb
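
A date-based split of this kind can be written as two boolean filters. The sketch below uses invented rows; which side a match played exactly on the cutoff date falls on is an assumption here.

```python
import pandas as pd

# Invented rows straddling the cutoff.
matches = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2021-12-30", "2022-01-02", "2022-01-15"]),
    "target": [1, 0, 0, 1],
})

CUTOFF = pd.Timestamp("2022-01-01")
train = matches[matches["date"] < CUTOFF]
# Assumption: a match on the cutoff date itself would go to the test set.
test = matches[matches["date"] >= CUTOFF]

print(len(train), len(test))
```
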
Confidence Filtering
- Merge home and away perspectives for each match
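- Keep only matches where one team is predicted to win and its opponent is predicted not to win (the target is binary, so there is no separate "lose" class)
- These "confident" predictions yield higher precision than predictions taken from a single perspective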
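
The merge-and-filter step can be sketched as a self-join on date and team. Column names such as `predicted` and `actual` are assumptions for illustration, and the rows are invented; each match appears twice, once from each team's perspective.

```python
import pandas as pd

# Invented predictions: two matches, each seen from both sides.
preds = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-15"] * 4),
    "team": ["Arsenal", "Chelsea", "Leeds", "Everton"],
    "opponent": ["Chelsea", "Arsenal", "Everton", "Leeds"],
    "actual": [1, 0, 0, 0],
    "predicted": [1, 0, 1, 1],
})

# Pair each row with the same match seen from the opponent's side.
merged = preds.merge(
    preds,
    left_on=["date", "team"],
    right_on=["date", "opponent"],
    suffixes=("_x", "_y"),
)

# Keep rows where this team is predicted to win and the opponent is not.
confident = merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]
print(confident[["date", "team_x", "team_y", "actual_x"]])
```

In this toy data only the Arsenal row survives: the model contradicts itself on Leeds vs Everton by predicting a win for both sides, so that match is filtered out.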

Evaluation

The primary evaluation metric is precision, the share of predicted wins that turn out to be actual wins:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
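
As a worked example of the formula, the sketch below counts true and false positives by hand on invented labels. Returning 0.0 when no wins are predicted is a convention chosen here, not something this README specifies.

```python
def precision(y_true, y_pred):
    """Share of predicted wins (1) that were actual wins."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    # Convention (assumption): no predicted wins gives a precision of 0.0.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Invented labels: three wins predicted, two of them correct.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0]
score = precision(y_true, y_pred)
print(round(score, 3))
```

Here there are 2 true positives and 1 false positive, so precision is 2 / (2 + 1), roughly 0.667. The missed win in the last position lowers recall but does not affect precision.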

Usage

Open the Jupyter notebook and run all cells:

jupyter notebook prediction.ipynb

Alternatively, open the notebook in VS Code with the Jupyter extension installed.
