This project aims to improve a course recommendation system currently in production.
Emagister's objective is to be a meeting point between students and course providers, helping people find the right training. The recommender system is therefore one of the most important parts of the website, and improving it is the primary motivation of this project.
An essential part of the company's business model is cost per lead: users generate leads on courses offered by centres. Users can also rate a course from 1 to 10.
Therefore, I use this data to measure the popularity of courses based on two metrics: the number of leads generated and the average rating.
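A minimal sketch of how such a popularity score could be computed. The column names, the normalisation, and the 50/50 weighting here are illustrative assumptions, not the production formula:

```python
import pandas as pd

def popularity_scores(courses: pd.DataFrame) -> pd.DataFrame:
    """Rank courses by a blend of lead volume and average rating.

    Expects a `leads` column (lead count) and a `rating` column
    (mean rating on the 1-10 scale). The equal weighting is illustrative.
    """
    df = courses.copy()
    # Normalise each metric to [0, 1] so they are comparable
    df["leads_norm"] = df["leads"] / df["leads"].max()
    df["rating_norm"] = (df["rating"] - 1) / 9  # ratings run from 1 to 10
    df["popularity"] = 0.5 * df["leads_norm"] + 0.5 * df["rating_norm"]
    return df.sort_values("popularity", ascending=False)

# Invented sample data for illustration
courses = pd.DataFrame({
    "course": ["Excel Basics", "Python 101", "Digital Marketing"],
    "leads": [120, 300, 45],
    "rating": [8.2, 9.1, 7.5],
})
print(popularity_scores(courses)[["course", "popularity"]])
```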
The data available for this project is real data extracted from the Emagister UK database. As an employee of Emagister, I requested authorization from the company to use the data; after consulting its lawyers, the company granted me permission.
For security and legal reasons, user data is encrypted.
I use the following four methods of recommendations:
- Knowledge-based recommendations
- Content-based filtering
- Neighbourhood-based collaborative filtering
- Model-based collaborative filtering
The project is divided into five parts:
- ETL pipeline
- Exploratory data analysis
- Models creation
- Make recommendations
- A demo web application
The pipeline retrieves raw data from the database, performs data wrangling on it, and finally loads the resulting clean data into the database and into files, ready to be used in the web application.
The code is in this notebook: 1_Extract_transform_load.ipynb
I have taken the code written in this section and arranged it into several classes and a script, which automates the ETL process. To execute the ETL pipeline script, run the following commands:
$ cd automate/
$ python etl.py <username> <password>
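For illustration, the extract-transform-load shape of such a pipeline can be sketched as below. The connection URL, table names, and column names are invented placeholders; the real logic lives in automate/etl.py:

```python
import sys
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Wrangle raw course rows: drop incomplete records, tidy the text."""
    clean = raw.dropna(subset=["title", "description"]).copy()
    clean["title"] = clean["title"].str.strip()
    return clean

def run_etl(username: str, password: str) -> None:
    """Extract raw courses, transform them, and load the clean result."""
    # Imported here so the transform step is usable without a database driver
    import sqlalchemy as sa

    engine = sa.create_engine(
        f"mysql+pymysql://{username}:{password}@localhost/emagister"
    )
    raw = pd.read_sql("SELECT id, title, description FROM courses", engine)
    clean = transform(raw)
    # Persist the clean data to both the database and a file
    clean.to_sql("courses_clean", engine, if_exists="replace", index=False)
    clean.to_csv("courses_clean.csv", index=False)

if __name__ == "__main__":
    run_etl(sys.argv[1], sys.argv[2])
```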
Once the data has been cleaned, it is time to perform an exploratory data analysis. I search for patterns and trends in the data and create visualizations of it as well.
The code is in this notebook: 2_Exploratory_data_analysis.ipynb
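As an example of the kind of first-pass analysis done in that notebook, the sketch below summarises and plots a distribution of the 1-10 ratings; the column name and the sample values are assumptions:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def summarise_ratings(ratings: pd.DataFrame) -> pd.Series:
    """Return basic distribution statistics for the 1-10 course ratings."""
    return ratings["rating"].describe()

# Invented sample ratings for illustration
ratings = pd.DataFrame({"rating": [7, 9, 8, 10, 6, 9, 8]})
print(summarise_ratings(ratings))

# A histogram is a typical first look at how ratings are distributed
ratings["rating"].plot.hist(bins=10, range=(1, 10), title="Rating distribution")
plt.savefig("rating_distribution.png")
```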
In this part, I create models from which I make the recommendations.
The code is in this notebook: 3_Create_models.ipynb
I have taken the code written in this section and arranged it into a class and a script, which automates the creation of the models. To execute the process, run the following commands:
$ cd automate/
$ python model.py <username> <password>
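As a sketch of one of the model types listed earlier, content-based filtering can be built by vectorising course descriptions with TF-IDF and computing pairwise cosine similarity. The sample descriptions are invented, and the notebook's actual preprocessing (e.g. via texcptulz) may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity_matrix(descriptions):
    """Fit TF-IDF over course descriptions and return pairwise cosine similarity."""
    tfidf = TfidfVectorizer(stop_words="english")
    vectors = tfidf.fit_transform(descriptions)
    return cosine_similarity(vectors)

# Invented sample descriptions for illustration
descriptions = [
    "Learn Python programming from scratch",
    "Advanced Python programming and data analysis",
    "Fundamentals of digital marketing and SEO",
]
sim = build_similarity_matrix(descriptions)
```

Courses with overlapping vocabulary (the two Python courses) end up with a higher similarity than unrelated ones, which is the structure the recommendations are drawn from.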
After the exploratory data analysis, it is time to play around with the structures created in the previous parts and try to make recommendations.
The code is in this notebook: 4_Make_recommendations.ipynb
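A minimal example of turning such a structure into recommendations: given a precomputed similarity matrix (hand-written here for illustration), return the top-n most similar courses:

```python
import numpy as np

def recommend(course_index, similarity, titles, n=2):
    """Return the n course titles most similar to the given course."""
    scores = similarity[course_index].copy()
    scores[course_index] = -np.inf  # never recommend the course itself
    top = np.argsort(scores)[::-1][:n]
    return [titles[i] for i in top]

# Invented titles and similarity values for illustration
titles = ["Python Basics", "Advanced Python", "Digital Marketing"]
similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
print(recommend(0, similarity, titles, n=2))
```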
The observations and models derived from the ETL and analysis phases of the project are implemented in a web application that can be accessed here. You can get more information about this demo web application here.
To run this project properly, you need the following:
- Python >=3.5
- numpy 1.18.1
- pandas 0.24.2
- scikit-learn 0.20.3
- scipy 1.2.1
- sqlalchemy 1.3.2
- matplotlib 3.0.3
- halo 0.0.28 (terminal spinner, PyPI)
- pymysql 0.9.3 (Python MySQL client library, PyPI)
Also, you need to install texcptulz, a library designed to transform the raw ingested text into a form that is ready for calculation and modelling. I developed this library for this project. To install texcptulz, run the following command:
pip install texcptulz
More information here.