Accompanying source code to the paper "Matrix Factorization for Collaborative Filtering is just Solving an Adjoint Latent Dirichlet Allocation Model After All" by Florian Wilhelm and "An Interpretable Model for Collaborative Filtering Using an Extended Latent Dirichlet Allocation Approach" by Florian Wilhelm, Marisa Mohr and Lien Michiels. Check out git tag v1.0 for the former and v2.0 for the latter.
The preprint of "Matrix Factorization for Collaborative Filtering is just Solving an Adjoint Latent Dirichlet Allocation Model After All" can be found here along with the following statement:
"© Florian Wilhelm 2021. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in RecSys '21: Fifteenth ACM Conference on Recommender Systems Proceedings, https://doi.org/10.1145/3460231.3474266."
The preprint of "An Interpretable Model for Collaborative Filtering Using an Extended Latent Dirichlet Allocation Approach" can be found here and the final paper here.
In order to set up the necessary environment:

- review and uncomment what you need in `environment.yml` and create an environment `lda4rec` with the help of conda:

  ```
  conda env create -f environment.yml
  ```

- activate the new environment with:

  ```
  conda activate lda4rec
  ```

- (optionally) get a free neptune.ai account for experiment tracking and save the api token under `~/.neptune_api_token` (default).
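Since experiment tracking is optional, the token only needs to be picked up if that file exists. A minimal sketch of reading such a token file (the path is the default mentioned above; the function name is illustrative and not part of the package's actual API):

```python
from pathlib import Path


def read_neptune_token(path="~/.neptune_api_token"):
    """Return the neptune.ai API token from the given file, or None if absent."""
    token_file = Path(path).expanduser()
    if not token_file.exists():
        return None  # tracking is optional: simply run without neptune
    return token_file.read_text().strip()
```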
First check out and adapt the default experiment config `configs/default.yaml` and run it with:

```
lda4rec -c configs/default.yaml run
```

A config like `configs/default.yaml` can also be used as a template to create an experiment set with:

```
lda4rec -c configs/default.yaml create
```

Check out `cli.py` for more details.
Commands for setting up an Ubuntu 20.10 VM with at least 20 GiB of disk space, e.g. on a GCP c2-standard-30 instance:

```
tmux
sudo apt-get install -y build-essential
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
cargo install pueue
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O
sh Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
git clone https://github.com/FlorianWilhelm/lda4rec.git
cd lda4rec
conda env create -f environment.yml
conda activate lda4rec
vim ~/.neptune_api_token  # and copy the token over
```
Then create and run all experiments, with full control over parallelism thanks to pueue:

```
pueued -d  # only once, to start the daemon
pueue parallel 10
export OMP_NUM_THREADS=4  # to limit the number of threads per model
lda4rec -c configs/default.yaml create  # to create the config files
find ./configs -maxdepth 1 -name "exp_*.yaml" -exec pueue add "lda4rec -c {} run" \; -exec sleep 30 \;
```

Remark: `-exec sleep 30` avoids a race condition when reading the datasets if parallelism is too high.
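The `find`/`pueue` one-liner above can also be written as a small Python helper that makes the staggered submission explicit. The `pueue add` command, the `exp_*.yaml` pattern, and the 30-second delay are taken from the commands above; the helper itself (its name and the injectable `submit` callable) is an illustrative sketch, not part of the package:

```python
import subprocess
import time
from pathlib import Path


def submit_experiments(config_dir, submit=None, delay=30):
    """Queue one `lda4rec ... run` job per experiment config via pueue,
    sleeping between submissions so jobs don't read datasets concurrently."""
    if submit is None:
        # default: hand each command over to the pueue daemon
        submit = lambda cmd: subprocess.run(["pueue", "add", cmd], check=True)
    configs = sorted(Path(config_dir).glob("exp_*.yaml"))
    for cfg in configs:
        submit(f"lda4rec -c {cfg} run")
        time.sleep(delay)
    return len(configs)
```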
- Always keep your abstract (unpinned) dependencies updated in `environment.yml` and eventually in `setup.cfg` if you want to ship and install your package via `pip` later on.
- Create concrete dependencies as `environment.lock.yml` for the exact reproduction of your environment with:

  ```
  conda env export -n lda4rec -f environment.lock.yml
  ```

  For multi-OS development, consider using `--no-builds` during the export.
- Update your current environment with respect to a new `environment.lock.yml` using:

  ```
  conda env update -f environment.lock.yml --prune
  ```
```
├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data                    <- Downloaded datasets will be stored here.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── logs                    <- Generated logs are collected here.
├── results                 <- Results as exported from neptune.ai.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- Use `python setup.py develop` to install for development
│                              or create a distribution with `python setup.py bdist_wheel`.
├── src
│   └── lda4rec             <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `py.test`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
```
Please cite LDA4Rec/LDAext if it helps your research. You can use the following BibTeX entries:
```bibtex
@inproceedings{wilhelm2021lda4rec,
  author    = {Wilhelm, Florian},
  title     = {Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint Latent Dirichlet Allocation Model After All},
  year      = {2021},
  month     = sep,
  isbn      = {978-1-4503-8458-2/21/09},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3460231.3474266},
  doi       = {10.1145/3460231.3474266},
  booktitle = {Fifteenth ACM Conference on Recommender Systems},
  numpages  = {8},
  location  = {Amsterdam, Netherlands},
  series    = {RecSys '21}
}

@article{Wilhelm_Mohr_Michiels_2022,
  title        = {An Interpretable Model for Collaborative Filtering Using an Extended Latent Dirichlet Allocation Approach},
  volume       = {35},
  url          = {https://journals.flvc.org/FLAIRS/article/view/130567},
  doi          = {10.32473/flairs.v35i.130567},
  abstractNote = {With the increasing use of AI and ML-based systems, interpretability is becoming an increasingly important issue to ensure user trust and safety. This also applies to the area of recommender systems, where methods based on matrix factorization (MF) are among the most popular methods for collaborative filtering tasks with implicit feedback. Despite their simplicity, the latent factors of users and items lack interpretability in the case of the effective, unconstrained MF-based methods. In this work, we propose an extended latent Dirichlet Allocation model (LDAext) that has interpretable parameters such as user cohorts of item preferences and the affiliation of a user with different cohorts. We prove a theorem on how to transform the factors of an unconstrained MF model into the parameters of LDAext. Using this theoretical connection, we train an MF model on different real-world data sets, transform the latent factors into the parameters of LDAext and test their interpretation in several experiments for plausibility. Our experiments confirm the interpretability of the transformed parameters and thus demonstrate the usefulness of our proposed approach.},
  journal      = {The International FLAIRS Conference Proceedings},
  author       = {Wilhelm, Florian and Mohr, Marisa and Michiels, Lien},
  year         = {2022},
  month        = may
}
```
This source code is AGPL-3.0-only licensed. If you require a more permissive license, e.g. for commercial reasons, please contact me to obtain one for your business.
Special thanks go to Du Phan and Fritz Obermeyer from the (Num)Pyro project for their kind support and helpful comments on my code.
This project has been set up using PyScaffold 4.0 and the dsproject extension 0.6. Some source code was taken from Spotlight (MIT-licensed) by Maciej Kula as well as lrann (MIT-licensed) by Florian Wilhelm and Marcel Kurovski.