A recommendation system by Mau Hernandes, Ph.D. - September 2020
Item Recommendation System for users of a (web) app.
Given the artificial files `item_history.tsv`, `user_master.tsv` and `target_users.tsv` (see the 'how to download' section), the goal is to build a recommendation system that maximizes the accuracy of the recommendations while keeping a certain level of variety (entropy) in the selection.
This repo contains 3 main components:
- a PDF report with the key concepts and ideas for this project,
- Jupyter notebooks going through the steps of building the system, and
- a few Python files with functions and classes used by some of the notebooks.
The final recommendation model is in the notebook called *5. Model*.
First check the PDF for an overview of the system. Start with the abstract and introduction, but then feel free to jump to the last section, 'Our Results', which presents numerical results on the performance of the system. Then open the last Jupyter notebook (The Model) to see the recommender system in action.
- Spark 3.0.0
- Python 3.8
- tested on Ubuntu 20
A quick tutorial on how to install Spark on Ubuntu: https://medium.com/solving-the-human-problem/installing-spark-on-ubuntu-20-on-digital-ocean-in-2020-a7e4b5b65ffb
- Cleaner: for cleaning the data for training and testing used by the other notebooks.
- Feature Exploration: a few histograms of the distribution of some of the features in the `user_master.tsv` file.
- ALS Training: an Alternating Least Squares training notebook for collaborative filtering, including grid search and evaluation with nDCG.
- kMeans: a k-means training notebook for content-based filtering, including grid search and evaluation with nDCG.
- Evaluation: a notebook for quickly trying and evaluating different models.
- Model: the notebook with the recommendation system proposed by this project, a mix of ALS + k-means.
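As a rough illustration of the ALS + k-means mix, the final ranking can be thought of as a weighted blend of the two models' scores per item. This is a minimal sketch, not the project's actual implementation: the function name `blend_scores` and the weight `alpha` are hypothetical, and the real mixing strategy is described in the PDF report.

```python
def blend_scores(als_scores, kmeans_scores, alpha=0.7, k=10):
    """Rank items by a weighted mix of collaborative (ALS) and
    content-based (k-means) scores; `alpha` weights the ALS side."""
    items = set(als_scores) | set(kmeans_scores)
    mixed = {
        item: alpha * als_scores.get(item, 0.0)
              + (1 - alpha) * kmeans_scores.get(item, 0.0)
        for item in items
    }
    # Highest blended score first; keep only the top-k items
    return sorted(mixed, key=mixed.get, reverse=True)[:k]
```

Blending a content-based signal into the collaborative ranking is one simple way to trade a little accuracy for more variety in the selection.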
The PDF file is an overview of the methods and technologies applied in the development of the recommendation system. It includes some mathematical discussion of key concepts, some plots and tables from our benchmark tests, and an extensive description and diagrams of how our system works.
- model:
  - als_trainer.py: Convenience functions for cleaning the data and training an ALS model
  - kmeans_trainer.py: Convenience functions for cleaning the data and fitting a k-means cluster
- utils:
  - evaluate.py: Contains our `evaluate` function to benchmark different models
  - make_Y.py: File for cleaning the data to make training and testing data
  - metrics.py: Functions to calculate the nDCG metric
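For reference, the nDCG metric used in the evaluations can be sketched in plain Python with binary relevance (1 if a recommended item appears in the user's held-out history). This is a minimal sketch, not the exact implementation in `metrics.py`.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance at rank i is discounted by log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(recommended, held_out, k=10):
    # Binary relevance against the held-out items, truncated at rank k
    rels = [1.0 if item in held_out else 0.0 for item in recommended[:k]]
    # Ideal ranking: all relevant items at the top of the list
    ideal = [1.0] * min(len(held_out), k)
    idcg = dcg(ideal)
    return dcg(rels) / idcg if idcg else 0.0
```

A perfect ranking scores 1.0; relevant items pushed further down the list are penalized logarithmically.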
A dataset for the code can be downloaded here
- data: Contains all the data (`.tsv` files) that the different functions and methods read and write to.
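The notebooks read these `.tsv` files with Spark, but for a quick look at the data outside of Spark a minimal stdlib reader is enough. This sketch assumes the files have a header row; the column names used in practice come from the files themselves.

```python
import csv

def read_tsv(path):
    # Minimal reader for the repo's tab-separated files; assumes a header row,
    # and returns one dict per data row keyed by the header's column names.
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

This is handy for spot-checking a few rows before spinning up a Spark session.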