If you use any part of this code for your work, we ask that you include the following citation:
@article{koo2025smitin,
title = {{SMITIN}: Self-Monitored Inference-Time INtervention for Generative Music Transformers},
author = {Koo, Junghyun and Wichern, Gordon and Germain, Fran\c{c}ois G. and Khurana, Sameer and {Le Roux}, Jonathan},
journal = {IEEE Open Journal of Signal Processing},
year = 2025
}
- Environment Setup
- Run SMITIN with Pre-computed Weights
- Overall Procedure of SMITIN: Instrument Addition
- Building Your Own Knob
- Contributing
- License
The code has been tested with Python 3.9 on both Linux and macOS.
The code depends on audiocraft, which at the time of this code release had broken dependency requirements, so the working versions of the audiocraft dependencies must be installed first. The remaining dependencies for SMITIN can then be installed using the included requirements.txt:
pip install -r requirements_audiocraft.txt
pip install -r requirements.txt
We provide pre-computed probe weights for testing SMITIN, organized in the following structure:
./activations
│ ### Instrument Recognition ###
│ # drums
├── pos_mdrums_neg_rest_musdb/large/
│ └── weight file(s)
│ # bass
├── pos_mbass_neg_rest_musdb/large/
│ └── weight file(s)
│ # guitar
├── pos_mguitar_neg_rest_moisesdb/large/
│ └── weight file(s)
│ # piano
├── pos_mpiano_neg_rest_moisesdb/large/
│ └── weight file(s)
│
│ ### Real vs. Fake music ###
├── pos_DISCOXacc_neg_MusicGenall/large/
│
│ ### More Cowbell ###
├── pos_morecowbell_neg_mixture/large/
│ └── weight file(s)
└── ...
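Each directory name encodes the probe's positive and negative classes in the form pos_&lt;POS&gt;_neg_&lt;NEG&gt;. As a quick illustration, the naming convention can be parsed as follows (this helper is hypothetical and not part of the repository):

```python
import re

def parse_probe_dirname(name: str) -> dict:
    """Split a probe directory name of the form pos_<POS>_neg_<NEG>
    into its positive and negative class labels."""
    m = re.fullmatch(r"pos_(.+)_neg_(.+)", name)
    if m is None:
        raise ValueError(f"not a probe directory name: {name!r}")
    return {"positive": m.group(1), "negative": m.group(2)}

# example: the drums probe trained on MUSDB
print(parse_probe_dirname("pos_mdrums_neg_rest_musdb"))
```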
You can try SMITIN with the Text-to-Music generation approach without acquiring any audio data (text prompts will be randomly selected from the MusicCaps dataset). Below are example scripts:
# add drums
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 5
# add guitar
python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 5
# add realism
python smitin/iti.py --task pos_DISCOXacc_neg_MusicGenall --duration_sec 1.0 --tgt_observe_sec 10.0 --audio_continuation f --intervention_strength 10.0 --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5
# more cowbell
python smitin/iti.py --task pos_morecowbell_neg_mixture --duration_sec 10.0 --audio_continuation f --tgt_dataset musiccaps --intervention_strength 10.0 --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5
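Under the hood, inference-time intervention shifts the outputs of selected attention heads along each probe's direction, scaled by --intervention_strength. A minimal, hypothetical sketch of that single update (not MusicGen's actual code; it assumes a non-zero probe direction):

```python
def intervene(head_output, probe_direction, alpha):
    """Shift one attention head's output along the (unit-normalized)
    probe direction by strength alpha -- the core ITI update.
    Assumes probe_direction is non-zero."""
    norm = sum(d * d for d in probe_direction) ** 0.5
    unit = [d / norm for d in probe_direction]
    return [h + alpha * u for h, u in zip(head_output, unit)]

# shifting a zero activation along direction (3, 4) with alpha = 10
print(intervene([0.0, 0.0], [3.0, 4.0], 10.0))
```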
You can also try SMITIN with your own text prompts:
# add drums
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset text_prompts --text_prompts_ttm "rock music" "calm lofi-music" "jazzy holiday vibe house music" --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 3
# more cowbell
python smitin/iti.py --task pos_morecowbell_neg_mixture --duration_sec 10.0 --audio_continuation f --tgt_dataset text_prompts --text_prompts_ttm "rock music" "calm lofi-music" "jazzy holiday vibe house music" --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 3
This section outlines the overall procedure of SMITIN, focusing specifically on the task of adding a desired instrument. To follow the steps in this section, you need to download the MUSDB18 and MoisesDB datasets.
- Download the MUSDB18 and MoisesDB datasets and place the uncompressed folders (e.g., musdb/ and moisesdb/moisesdb_v0.1/) in the data/ directory.
- Run
python smitin/track_trim_silence.py --data_dir_path_musdb PATH_TO_MUSDB --data_dir_path_moisesdb PATH_TO_MOISESDB
If you only want to try a single instrument (e.g., drums), you can instead run
python smitin/track_trim_silence.py --preprocess_moisesdb false --data_dir_path_musdb PATH_TO_MUSDB --instruments_musdb drums
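The preprocessing step above removes silent regions from the stems before activation extraction. For intuition only, energy-based silence trimming can be sketched as follows (a self-contained toy version; the repository's script may differ):

```python
def trim_silence(samples, sample_rate, frame_sec=0.05, threshold=1e-4):
    """Drop leading/trailing frames whose mean energy falls below threshold.
    samples: a mono waveform as a list of floats."""
    frame = max(1, int(frame_sec * sample_rate))
    n_frames = len(samples) // frame
    energies = [
        sum(x * x for x in samples[i * frame:(i + 1) * frame]) / frame
        for i in range(n_frames)
    ]
    active = [i for i, e in enumerate(energies) if e >= threshold]
    if not active:
        return []  # fully silent input
    return samples[active[0] * frame:(active[-1] + 1) * frame]
```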
Extract activations: After data preprocessing, we first need to retrieve MusicGen's intermediate activations.
# retrieve activations of MUSDB
python smitin/get_activations.py -t musdb -d ./data/musdbhq/silence_trimmed/
# retrieve activations of MoisesDB
python smitin/get_activations.py -t moisesdb -d ./data/moisesdb/silence_trimmed/
If you wish to only retrieve activations for MusicGen's large model configuration, you can simply add the -m large option.
MusicGen probing: With the extracted activations, we can examine MusicGen's head-wise probe accuracy. The results will be saved in the probes_acc/ directory.
Some example scripts for probing are as follows:
# probing binary instrument recognition - drums and bass
python smitin/search_probes.py -t pos_mdrums_neg_rest_musdb pos_mbass_neg_rest_musdb
# probing binary instrument recognition - guitar (only with MusicGen large model)
python smitin/search_probes.py -t pos_mguitar_neg_rest_moisesdb -m large
# probing binary instrument recognition - piano (setting number of probing data)
python smitin/search_probes.py -t pos_mpiano_neg_rest_moisesdb --num_data_points 100
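Each head-wise probe is a binary linear classifier trained on one attention head's activations; its test accuracy indicates how strongly that head encodes the target factor. For intuition, a toy logistic-regression probe might look like this (a hypothetical sketch, not the repository's implementation):

```python
import math

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Tiny logistic-regression probe: X holds one feature vector per
    activation, y holds 0/1 labels. Returns (weights, bias)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - yi  # derivative of log-loss w.r.t. z
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of samples the probe classifies correctly."""
    hits = sum(
        ((sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0) == (yi == 1)
        for xi, yi in zip(X, y)
    )
    return hits / len(y)
```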
- Script examples of the Audio Continuation generation approach:
# add drums - base SMITIN configuration
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --tgt_dataset musdb_inst --tgt_inst drums --input_posneg neg --generate_seconds 30.0 --batch_size 5 --num_steps_to_test 5
# add guitar
### SMITIN: alpha=10.0 and number of heads to intervene = full - 128
### also generate text-conditioned method "add guitar"
python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --tgt_dataset moisesdb_inst --tgt_inst guitar --input_posneg neg --intervention_strength 10.0 --num_to_intervene -128 --generate_seconds 30.0 --batch_size 5 --num_steps_to_test 5 --compare_text_gen t --text_prompt_cond "add guitar"
- Script examples of the Text-to-Music generation approach:
# add drums
### SMITIN: base configuration
### also generate text + ", add drums"
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --compare_text_gen t --text_prompt_cond ", add drums"
# add guitar
### SMITIN: sparse addition (s) = 1
python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --sparse_add_freq 1 --generate_seconds 30.0 --num_steps_to_test 5
Success Rate
- Examine the success rate using the monitored results saved during ITI:
python smitin/compute_success_rate.py -d output_results/monitored_results/{SAVE_NAME}/
# for the guitar example above
python smitin/compute_success_rate.py -d output_results/smitin/large/pos_mguitar_neg_rest_moisesdb_3.0sec/audio_continuation/K-128_exp_A10.0_neg_sparse5_coef_autoK16dlin/
- Or run monitoring with another probe to compute the success rate corresponding to that monitoring probe:
python smitin/run_monitoring.py -a ./output_results/smitin/large/{TASK_NAME}/audio_continuation/{EXP_NAME}/ -t {MONITOR_TASK_NAME} -o ./output_results/monitored_results/{SAVE_NAME}/
# for the guitar example above
python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_mguitar_neg_rest_moisesdb_3.0sec/audio_continuation/K-128_exp_A10.0_neg_sparse5_coef_autoK16dlin/ -t pos_mguitar_neg_rest_moisesdb -o ./output_results/monitored_results/audio_continuation/pos_mguitar_neg_rest_moisesdb
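The monitored results record, for each generated clip, the probe's estimate that the target factor is present. One simple way to turn such sequences into a success rate is sketched below (an illustrative definition with a hypothetical threshold; see the paper and compute_success_rate.py for the exact criterion used):

```python
def success_rate(probe_probs, threshold=0.5):
    """probe_probs: one sequence of monitored probe probabilities per
    generated clip (probability that the target factor is present).
    A clip counts as a success if its final probability exceeds the
    threshold."""
    successes = sum(1 for probs in probe_probs if probs[-1] > threshold)
    return successes / len(probe_probs)

# four clips; the first and third end above the threshold
print(success_rate([[0.1, 0.9], [0.2, 0.3], [0.4, 0.8], [0.6, 0.4]]))
```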
FAD
We use the fadtk package to compute the FAD score. Install it with pip install fadtk and run the evaluation following the fadtk documentation.
Similarity
We use the pre-trained model weights and inference code from the e2e_music_remastering_system repository. Please follow the instructions from there.
You can also try SMITIN on your own custom dataset; the procedure is similar to Overall Procedure of SMITIN: Instrument Addition. This repository includes a small number of data points for the task of adding drums, and this section provides examples of performing ITI with that custom dataset.
- Prepare your own custom data in the data/ directory.
- Place positive and negative samples in the following structure.
./data
├── YOUR_DATA
│   ├── positive
│   │   ├── audio_sample_#1.wav
│   │   ├── audio_sample_#2.wav
│   │   └── ...
│   └── negative
│       ├── audio_sample_#1.wav
│       ├── audio_sample_#2.wav
│       └── ...
└── ...
- You do not need to split the dataset into training and test subsets, but if you do, the structure should look like this:
./data
├── YOUR_DATA
│   ├── train
│   │   ├── positive
│   │   │   ├── audio_sample_#1.wav
│   │   │   └── ...
│   │   └── negative
│   │       ├── audio_sample_#1.wav
│   │       └── ...
│   └── test
│       ├── positive
│       │   ├── audio_sample_#1.wav
│       │   └── ...
│       └── negative
│           ├── audio_sample_#1.wav
│           └── ...
└── ...
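Before extracting activations, it can help to sanity-check that your custom data follows the expected positive/negative layout. A small hypothetical helper (not part of the repository):

```python
from pathlib import Path

def check_custom_dataset(root, splits=(None,), extension=".wav"):
    """Count audio files under the expected positive/negative layout.
    root points at data/YOUR_DATA; pass splits=("train", "test") if the
    dataset is split. Raises if a label folder is missing or empty."""
    counts = {}
    for split in splits:
        base = Path(root) / split if split else Path(root)
        for label in ("positive", "negative"):
            files = sorted((base / label).glob(f"*{extension}"))
            if not files:
                raise FileNotFoundError(f"no {extension} files in {base / label}")
            counts[(split, label)] = len(files)
    return counts
```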
Extract activations
- Extracting activations on the custom dataset:
# only extracting 'large' model configuration
python smitin/get_activations.py -t custom -d ./data/custom/ --sample_rate 44100 -m large
- Other script examples:
# audio data extension = '.wav', no data splits
python smitin/get_activations.py -t custom -d ./data/PATH_TO_CUSTOM_DATA/ --sample_rate CUSTOM_SAMPLE_RATE
# audio data extension = '.mp3', activation duration = 10 seconds, data splits = [train, test]
python smitin/get_activations.py -t custom -d ./data/PATH_TO_CUSTOM_DATA/ --sample_rate CUSTOM_SAMPLE_RATE --audio_extension mp3 --segment_durations 10.0 --data_splits train test
- This will save activations in the activations/pos_custom_neg_custom/ directory.
Probing (optional)
- Examine how well MusicGen captures the musical factor represented by your custom data:
# only probing the 'large' model configuration, without train / test split validation
python smitin/search_probes.py -t pos_custom_neg_custom --split_test False -m large
- Other script examples:
# when you do not have [train, test] data subset splits
python smitin/search_probes.py -t pos_custom_neg_custom --split_test False
# [train, test] splits, only examine MusicGen_large, activation duration = 10 seconds
python smitin/search_probes.py -t pos_custom_neg_custom -m large --segment_durations 10.0
SMITIN
- Script to run the Audio Continuation generation approach with the custom probes:
python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1
(output results will be saved in the ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ directory)
Other script examples:
python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5
# audio continuation with SMITIN + text prompt
python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --audio_continuation_w_text true --text_prompt_cond "YOUR TEXT PROMPT" --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1
- Script to run the Text-to-Music generation approach with the custom probes:
# using MusicCaps text aspect_list
python smitin/iti.py --task pos_custom_neg_custom --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1
(output results will be saved in the ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ directory)
Evaluation - Success Rate (optional)
- Compute the success rate with the monitored outputs saved with the custom probes:
# evaluating audio_continuation results
python smitin/compute_success_rate.py -d ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_custom_neg_custom
# evaluating text-to-music results
python smitin/compute_success_rate.py -d ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_custom_neg_custom
- Or evaluate with the probes fitted on more drum-addition data to compute a more reliable success rate:
# evaluating audio continuation results
python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_mdrums_neg_rest_musdb -o ./output_results/monitored_results/audio_continuation/custom_add_drums/
# evaluating text-to-music results
python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_mdrums_neg_rest_musdb -o ./output_results/monitored_results/text-to-music/custom_add_drums/
See CONTRIBUTING.md for our policy on contributions.
Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.
All files, except as noted below:
Copyright (c) 2023-2025 Mitsubishi Electric Research Laboratories (MERL)
SPDX-License-Identifier: AGPL-3.0-or-later
The following files:
smitin/musicgen/__init__.py
smitin/musicgen/builders.py
smitin/musicgen/lm.py
smitin/musicgen/loaders.py
smitin/musicgen/musicgen.py
smitin/musicgen/transformer.py
were adapted from https://github.com/facebookresearch/audiocraft (license included in LICENSES/MIT.md):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) Meta Platforms, Inc. and affiliates.
The following file:
smitin/iti_utils.py
was adapted from https://github.com/likenneth/honest_llama (license included in LICENSES/MIT.md):
Copyright (c) 2023-2025 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2023 Kenneth Li