Code for the paper "Leveraging Pre-trained Autoencoders for Interpretable Prototype Learning of Music Audio."
The autoencoder code is available in this repository.
Sonification results are available on the companion website.
If results, insights, or code developed within this project are useful to you, please consider citing our work:
@inproceedings{alonso2024leveraging,
    author = "Alonso-Jim\'{e}nez, Pablo and Pepino, Leonardo and Batlle-Roca, Roser and Zinemanas, Pablo and Bogdanov, Dmitry and Serra, Xavier and Rocamora, Mart\'{i}n",
    title = "Leveraging Pre-trained Autoencoders for Interpretable Prototype Learning of Music Audio",
    maintitle = "IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
    booktitle = "ICASSP Workshop on Explainable AI for Speech and Audio (XAI-SA)",
    year = 2024,
}
- Create a virtual environment (recommended):
python -m venv venv && source venv/bin/activate
- Initialize submodules and install dependencies:
./setup.sh
NOTE
The setup script was only tested with Python 3.11 on CentOS 7.5; it may not work in other environments.
- Download a dataset:
python src/download.py --dataset gtzan
Note: download functionality is currently implemented only for GTZAN; support for Medley-solos-DB may be added in the future. A quick way to check the downloaded files is sketched below.
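As a quick sanity check of the download, the following minimal sketch (not part of the repository) counts the audio files under audio/gtzan/, the directory used by the feature-extraction step below. The per-genre subdirectory layout and the file extensions are assumptions; adapt them to how the download script actually organizes the data.

```python
from collections import Counter
from pathlib import Path

# Location used by the feature-extraction step below; adjust if you
# downloaded the data elsewhere. GTZAN is distributed both as .wav and
# .au files, so we look for either.
audio_dir = Path("audio/gtzan")

files = sorted(audio_dir.rglob("*.wav")) or sorted(audio_dir.rglob("*.au"))
print(f"Found {len(files)} audio files under {audio_dir}")

# Rough per-genre breakdown, assuming one subdirectory per genre.
counts = Counter(f.parent.name for f in files)
for genre, count in sorted(counts.items()):
    print(f"{genre}: {count}")
```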
- Extract the EnCodecMAE-based features:
python src/encode_encodecmae.py audio/gtzan/ feats/gtzan/ --model diffusion_4s
The available options are: base, large, diffusion_1s, diffusion_4s, and diffusion_10s.
base and large are EnCodecMAE embeddings (not intended to operate with the diffusion decoder).
diffusion_4s is the model used in the paper, and diffusion_10s is a newer version that was not included in the paper, for which we provide sonification examples on the companion website. A quick way to inspect the extracted features is sketched below.
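The on-disk format of the extracted features is determined by src/encode_encodecmae.py. As a rough illustration only, the sketch below assumes one NumPy .npy array per track under feats/gtzan/ and prints the shape of a few embeddings; check the script if your output files differ.

```python
from pathlib import Path

import numpy as np

# Assumed output directory from the extraction command above. The per-track
# .npy layout is an assumption; inspect src/encode_encodecmae.py for the
# actual serialization format.
feats_dir = Path("feats/gtzan")

for feat_file in sorted(feats_dir.rglob("*.npy"))[:5]:
    embedding = np.load(feat_file)
    print(feat_file.name, embedding.shape, embedding.dtype)
```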
- We provide a script to train the PECMAE model with the GTZAN dataset:
./scripts/train_pecmae_5_gtzan.sh
The parameters in this script can easily be modified to train with other configurations.
- Train the baseline models
TODO
To use PECMAE with your custom dataset, follow these steps:
- Given an audio dataset located at /your/dataset/, extract the EnCodecMAE features:
python src/encode_encodecmae.py /your/dataset/ feats/your_dataset/ --model diffusion_4s --format .your_format
- Create a training script similar to ./scripts/train_pecmae_5_gtzan.sh.
You should modify the --data-dir, --metadata-file-train, --metadata-file-val, and --metadata-file-test fields to point to your dataset and ground-truth files.
Have a look at groundtruth/ to see examples of the expected format; a sketch of how such a file could be generated is shown below.
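If you prefer to generate the metadata programmatically, the snippet below is purely illustrative: it walks an audio tree whose subdirectory names act as class labels and writes one tab-separated filename/label row per track. The TSV layout, the output filename, and the label-from-directory assumption are all hypothetical; match whatever format you find in groundtruth/.

```python
import csv
from pathlib import Path

# Hypothetical paths and format: adapt both to your dataset and to the
# actual ground-truth layout shown in groundtruth/.
audio_dir = Path("/your/dataset")
output_file = Path("groundtruth/your_dataset_train.tsv")
output_file.parent.mkdir(parents=True, exist_ok=True)

with output_file.open("w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    for audio_file in sorted(audio_dir.rglob("*.wav")):
        # Assume the parent directory name encodes the class label.
        writer.writerow([str(audio_file.relative_to(audio_dir)), audio_file.parent.name])
```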
TODO