
Courses recommendation

This project aims to improve a course recommendation system currently in production.

Table of Contents

  1. Introduction
  2. Dependencies
  3. References

Introduction

Emagister is a company whose objective is to be a meeting point for students and course providers, helping people find the right training. The recommender system is therefore one of the most important parts of the website, and the primary motivation of this project is to improve it.

An essential part of the company's business model is cost per lead: users generate leads on courses offered by centres. Users can also rate a course from 1 to 10.

I therefore use this data to measure the popularity of the courses based on two metrics: the number of leads generated and the ratings received.
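
As an illustration, a popularity score could blend these two metrics as follows. This is a minimal sketch: the column names, the normalisation and the equal weighting are assumptions for illustration, not the project's actual formula.

import pandas as pd

# Hypothetical schema: one row per course, with its lead count and average rating.
courses = pd.DataFrame({
    "course_id": [1, 2, 3],
    "n_leads": [120, 45, 300],
    "avg_rating": [8.4, 9.1, 7.2],  # ratings range from 1 to 10
})

# Normalise each metric to [0, 1] so they can be combined on an equal footing.
leads_norm = courses["n_leads"] / courses["n_leads"].max()
rating_norm = (courses["avg_rating"] - 1) / 9  # map 1..10 onto 0..1

# Simple weighted blend; the 50/50 weights are an illustrative choice.
courses["popularity"] = 0.5 * leads_norm + 0.5 * rating_norm
print(courses.sort_values("popularity", ascending=False))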

Data used for the project

The data available for this project are real data extracted from the Emagister UK database. As an employee of Emagister, I requested authorization to use the data; after consulting with its lawyers, the company granted permission.

For security and legal reasons, user data is encrypted.

The recommender system

I use the following four recommendation methods; a brief sketch of the content-based approach follows the list:

  • Knowledge-based recommendations
  • Content-based filtering
  • Neighbourhood-based collaborative filtering
  • Model-based collaborative filtering
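
To give a flavour of one of these methods, here is a minimal content-based sketch using scikit-learn's TF-IDF vectoriser. The toy descriptions and the choice of features are assumptions for illustration, not the notebook's actual code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy course descriptions; in the project these come from the database.
descriptions = [
    "Introduction to Python programming",
    "Advanced Python for data analysis",
    "Digital marketing fundamentals",
]

# Represent each course by the TF-IDF vector of its description.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(descriptions)

# Pairwise cosine similarity between courses: courses similar to one a
# user showed interest in become candidate recommendations.
similarity = cosine_similarity(matrix)
print(similarity.round(2))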

The project is divided into five parts:

  1. ETL pipeline
  2. Exploratory data analysis
  3. Models creation
  4. Make recommendations
  5. A demo web application

ETL pipeline

The pipeline retrieves raw data from the database, performs data wrangling on it, and finally loads the resulting clean data into the database and files, ready to be used in the web application.

The code is in this notebook: 1_Extract_transform_load.ipynb

I have taken the code written in this section and arranged it into several classes and a script, which allows you to automate the ETL process. To execute the ETL pipeline script, run the following commands:

$ cd automate/
$ python etl.py <username> <password>

[Figure: ETL pipeline diagram]
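
In outline, the script follows the classic extract/transform/load shape. The sketch below is a minimal illustration: the table names, the connection string and the wrangling steps are assumptions, not the actual implementation.

import sys
import pandas as pd
from sqlalchemy import create_engine

# The real script takes <username> and <password> as arguments, as shown above;
# the host and database names here are hypothetical.
username, password = sys.argv[1], sys.argv[2]
engine = create_engine(f"mysql+pymysql://{username}:{password}@localhost/emagister")

# Extract: pull raw lead data from the database.
raw = pd.read_sql("SELECT * FROM leads", engine)

# Transform: a stand-in for the wrangling done in the notebook
# (deduplication, dropping incomplete rows, type fixes, ...).
clean = raw.drop_duplicates().dropna()

# Load: write the clean data back to the database and to a file
# so the web application can use it.
clean.to_sql("leads_clean", engine, if_exists="replace", index=False)
clean.to_csv("data/leads_clean.csv", index=False)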

Exploratory data analysis

Once the data has been cleaned, it is time to perform an exploratory data analysis. I search for patterns and trends in the data and create visualizations of them.

The code is in this notebook: 2_Exploratory_data_analysis.ipynb
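
As a flavour of this analysis, one of the first plots in such a study is typically the distribution of ratings. A minimal sketch, with the file and column names assumed for illustration:

import pandas as pd
import matplotlib.pyplot as plt

# Load the clean data produced by the ETL pipeline
# (the file name is a hypothetical example).
ratings = pd.read_csv("data/ratings_clean.csv")

# A typical first look: how are the 1-10 ratings distributed?
ratings["rating"].hist(bins=10, range=(1, 10))
plt.xlabel("Rating")
plt.ylabel("Number of ratings")
plt.title("Distribution of course ratings")
plt.show()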

Models creation

In this part, I create models from which I make the recommendations.

The code is in this notebook: 3_Create_models.ipynb

I have taken the code written in this section and arranged it into a class and a script, which automates the creation of the models. To execute the process, run the following commands:

$ cd automate/
$ python model.py <username> <password>
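
One common model-based collaborative filtering technique is matrix factorisation of the user-item rating matrix. Here is a minimal sketch with SciPy's truncated SVD; the toy matrix and the choice of k=2 latent dimensions are illustrative assumptions, not the project's actual model.

import numpy as np
from scipy.sparse.linalg import svds

# Toy user-course rating matrix (0.0 = no rating); in the project this
# would be built from the real leads and ratings data.
ratings = np.array([
    [9.0, 0.0, 7.0, 0.0],
    [8.0, 6.0, 0.0, 5.0],
    [0.0, 7.0, 9.0, 0.0],
], dtype=float)

# Low-rank factorisation: users and courses are embedded in a shared
# latent space of k dimensions.
u, s, vt = svds(ratings, k=2)

# Reconstruct the matrix to predict the missing ratings.
predicted = u @ np.diag(s) @ vt
print(predicted.round(1))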

Make recommendations

After the exploratory data analysis, it is time to play around with the structures created in the first part and try to make recommendations.

The code is in this notebook: 4_Make_recommendations.ipynb
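
To illustrate how recommendations can be read off a similarity matrix like the ones built earlier, here is a minimal top-N helper. The function and the toy matrix are hypothetical, not code from the notebook.

import numpy as np

def top_n_similar(similarity: np.ndarray, course_idx: int, n: int = 3):
    """Return the indices of the n courses most similar to course_idx,
    excluding the course itself."""
    scores = similarity[course_idx].copy()
    scores[course_idx] = -np.inf  # never recommend the course itself
    return np.argsort(scores)[::-1][:n]

# Toy similarity matrix (e.g. the cosine similarities computed earlier).
similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
print(top_n_similar(similarity, course_idx=0, n=2))  # -> [1 2]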

Demo web

The observations and models derived from the ETL and analysis phases of the project are implemented in a web application that can be accessed here. You can find more information about this demo web application here.

Dependencies

To run this project properly, you need the following:

  • Python >=3.5
  • numpy 1.18.1
  • pandas 0.24.2
  • scikit-learn 0.20.3
  • scipy 1.2.1
  • sqlalchemy 1.3.2
  • matplotlib 3.0.3
  • halo 0.0.28 (spinner for the terminal; PyPI)
  • pymysql 0.9.3 (Python MySQL client library; PyPI)

You also need to install texcptulz, a library designed to transform raw ingested text into a form that is ready for computation and modelling. I developed this library for this project. To install texcptulz, run the following command:

pip install texcptulz

More information here.

References

Books

Online resources