This repository demonstrates the usage of Collaborative Filtering (CF) on a dataset of ratings for musical instrument (source: http://jmcauley.ucsd.edu/data/amazon/). CF relies on finding similarity between users and items to recommend a product to user A based on the preference of a similar user (user B). Alternatively, there is another method called Content-based filtering which uses similarity between item characteristics to recommend a new item based on what the user liked previously. Unlike CF, this latter method has the advantage of not needing data from other users, but it requires a lot of domain knowledge to select or engineer features to track.
This dataset is used only for CF, it contains 4 columns, user, item, rating, timestamp.
UPDATE: I have added the code for Alternating Least Square (ALS) recommender using Spark MLlib's API. This was written to be ran on Amazon's EMR cluster. The performance of ALS did not match that of the matrix factorization by Stochastic Gradient Descend.
The python notebook provides a quick comparison of algorithms for Memory based CF, Model based CF (rank factorization via SVD), as well as Stochastic Gradient Descent for solving the rank factorization. The script is based on the algorithms presented on these two blogs: http://online.cambridgecoding.com/notebooks/mhaller/implementing-your-own-recommender-systems-in-python-using-stochastic-gradient-descent-4, http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/, but some details were modified to accomodate the dataset and streamline the process.
Memory based CF
In Collaborative Filtering, Memory based CF algorithm look for similarity between users or between items. In user-user filter, cosine similarity is calculated between every pair of users within the data set resulting in a similarity matrix that's n_users X n_users. Similary for item-item, the cosine similarity is calculated between items. The similiarity matrix is the weight that will yield model prediction.
Model based CF (SVD)
Memory based CF algorithm can be very computationally expensive and is known to have issue with scalability, plus it doesn't learn well with highly sparse matrix like the one used in this python example. A useful alternative is model based CF using SVD for matrix decomposition. This method breaks down the user-item matrix into small matrices with latent-feature vectors, representing implicit characteristics for the user and items. However, SVD can also become computationally expensive with large matrix.
Model based CF (Stochastic Gradient Descend)
Another way to solve the matrix factorization is by using Stochastic Gradient Descend(SGD) to estimate the low-rank matrices corresponding to user and item. This method will save on memory thus more computationally efficient because it incrementally updates the model to find a better fit that minimizes the regularized squared error loss function.
After comparing the three algorithms, SGD matrix factorization yielded the best result with 100 iterations, learning rate of 0.01 to calculate the optimal latent feature matrices (see ipynb file). It would be interesting to use the same data set and run it with Alternative Least Squre (ALS) on Spark since it's an API within the current spark.ml package.
References
1)Koren et al. (2009) Koren, Y., Bell, R.M., Volinsky, C.: Matrix factorization techniques for recommender systems. IEEE Computer 42(8), 30–37 (2009) 32 Francesco Ricci, Lior Rokach and Bracha Shapira
2) Yu, H.F., Hsieh, C. J., Si, S., Dhillon, I.: Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In IEEE 12th International Conference on Data Mining, pp. 765–774 (2012)