SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers


If you use any part of this code for your work, we ask that you include the following citation:

@article{koo2025smitin,
    title   = {{SMITIN}: Self-Monitored Inference-Time INtervention for Generative Music Transformers},
    author  = {Koo, Junghyun and Wichern, Gordon and Germain, Fran\c{c}ois G. and Khurana, Sameer and {Le Roux}, Jonathan},
    journal = {IEEE Open Journal of Signal Processing},
    year    = 2025
}

Table of contents

  1. Environment Setup
  2. Run SMITIN with Pre-computed Weights
  3. Overall Procedure of SMITIN: Instrument Addition
  4. Building Your Own Knob
  5. Contributing
  6. License

Environment Setup

The code has been tested with Python 3.9 on both Linux and macOS. The code depends on audiocraft, which at the time of this code release had broken dependency requirements, so the working versions of the audiocraft dependencies must be installed first. The remaining dependencies for SMITIN can then be installed using the included requirements.txt:

pip install -r requirements_audiocraft.txt
pip install -r requirements.txt

Run SMITIN with Pre-computed Weights

We provide pre-computed probe weights for testing SMITIN, organized in the following structure:

./activations
│   ### Instrument Recognition ###
│     # drums
├── pos_mdrums_neg_rest_musdb/large/
│   └── weight file(s)
│     # bass
├── pos_mbass_neg_rest_musdb/large/
│   └── weight file(s)
│     # guitar
├── pos_mguitar_neg_rest_moisesdb/large/
│   └── weight file(s)
│     # piano
├── pos_mpiano_neg_rest_moisesdb/large/
│   └── weight file(s)
│
│   ### Real vs. Fake music ###
├── pos_DISCOXacc_neg_MusicGenall/large/
│
│   ### More Cowbell ###
├── pos_morecowbell_neg_mixture/large/
│   └── weight file(s)
└── ...

You can try SMITIN with the Text-to-Music generation approach without acquiring any audio data (text prompts are randomly selected from the MusicCaps dataset). Example scripts:

# add drums
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 5

# add guitar
python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 5

# add realism
python smitin/iti.py --task pos_DISCOXacc_neg_MusicGenall --duration_sec 1.0 --tgt_observe_sec 10.0 --audio_continuation f --intervention_strength 10.0 --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5

# more cowbell
python smitin/iti.py --task pos_morecowbell_neg_mixture --duration_sec 10.0 --audio_continuation f --tgt_dataset musiccaps --intervention_strength 10.0 --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5

You can also try SMITIN with your own text prompts:

# add drums
python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset text_prompts --text_prompts_ttm "rock music" "calm lofi-music" "jazzy holiday vibe house music" --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 3

# more cowbell
python smitin/iti.py --task pos_morecowbell_neg_mixture --duration_sec 10.0 --audio_continuation f --tgt_dataset text_prompts --text_prompts_ttm "rock music" "calm lofi-music" "jazzy holiday vibe house music" --intervention_strength 10.0 --generate_seconds 30.0 --num_steps_to_test 3

Overall Procedure of SMITIN: Instrument Addition

This section outlines the overall procedure of SMITIN, focusing specifically on the task of adding a desired instrument. To follow the steps in this section, you need to download the MUSDB18 and MoisesDB datasets.

Data Preprocessing (Silence Trimming)

  1. Download the MUSDB18 and MoisesDB datasets and place the uncompressed folders (e.g., musdb/ and moisesdb/moisesdb_v0.1/) in the data/ directory.

  2. Run

    python smitin/track_trim_silence.py --data_dir_path_musdb PATH_TO_MUSDB --data_dir_path_moisesdb PATH_TO_MOISESDB

    If you only want to try a single instrument (e.g., drums), you can instead run

    python smitin/track_trim_silence.py --preprocess_moisesdb false --data_dir_path_musdb PATH_TO_MUSDB --instruments_musdb drums
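The idea behind silence trimming can be sketched with a simple frame-level energy threshold. This is an illustrative numpy sketch with made-up threshold and frame-length values, not the repository's exact preprocessing in track_trim_silence.py:

```python
import numpy as np

def trim_silence(audio, sample_rate, threshold_db=-40.0, frame_sec=0.05):
    """Trim leading/trailing frames whose RMS energy falls below a dB threshold."""
    frame_len = int(sample_rate * frame_sec)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    db = 20.0 * np.log10(rms + 1e-12)
    active = np.where(db > threshold_db)[0]   # indices of non-silent frames
    if active.size == 0:
        return audio[:0]                      # everything was silence
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return audio[start:end]

# Example: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence at 16 kHz
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
audio = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = trim_silence(audio, sr)
```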

Probing MusicGen

Extract activations: After data preprocessing, we first need to retrieve MusicGen's intermediate activations.

# retrieve activations of MUSDB
python smitin/get_activations.py -t musdb -d ./data/musdbhq/silence_trimmed/
# retrieve activations of MoisesDB
python smitin/get_activations.py -t moisesdb -d ./data/moisesdb/silence_trimmed/

To retrieve activations only for MusicGen's large model configuration, add the -m large option.
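Conceptually, each activation used for probing comes from splitting a layer's attention output along the model dimension into per-head vectors and pooling over time steps. A minimal numpy sketch with made-up dimensions (the actual extraction happens inside get_activations.py):

```python
import numpy as np

# Hypothetical MusicGen-style dimensions (illustrative only)
n_layers, n_heads, head_dim, n_steps = 4, 8, 16, 50
d_model = n_heads * head_dim

rng = np.random.default_rng(0)
# Simulated per-layer attention outputs: (n_layers, n_steps, d_model)
layer_outputs = rng.standard_normal((n_layers, n_steps, d_model))

# Split the model dimension into heads and average over time,
# yielding one feature vector per (layer, head) to feed the probes.
per_head = layer_outputs.reshape(n_layers, n_steps, n_heads, head_dim)
activations = per_head.mean(axis=1)   # shape: (n_layers, n_heads, head_dim)
```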

MusicGen probing: With the extracted activations, we can examine MusicGen's head-wise probe accuracy. The results will be saved in the probes_acc/ directory.

Some example scripts for probing are as follows:

# probing binary instrument recognition - drums and bass
python smitin/search_probes.py -t pos_mdrums_neg_rest_musdb pos_mbass_neg_rest_musdb

# probing binary instrument recognition - guitar (only with MusicGen large model)
python smitin/search_probes.py -t pos_mguitar_neg_rest_moisesdb -m large

# probing binary instrument recognition - piano (setting number of probing data)
python smitin/search_probes.py -t pos_mpiano_neg_rest_moisesdb --num_data_points 100
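Head-wise probing amounts to fitting one small linear classifier per (layer, head) on the extracted activations and recording its accuracy. A self-contained numpy sketch on synthetic positive/negative activations (the repository's probe training in search_probes.py may differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, n_pos, n_neg = 16, 200, 200

# Synthetic head activations: the two classes differ along one direction
direction = rng.standard_normal(head_dim)
direction /= np.linalg.norm(direction)
X = np.vstack([
    rng.standard_normal((n_pos, head_dim)) + 1.5 * direction,  # positive class
    rng.standard_normal((n_neg, head_dim)) - 1.5 * direction,  # negative class
])
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

def fit_probe(X, y, lr=0.1, n_iter=500):
    """Plain logistic regression by gradient descent: one linear probe."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = fit_probe(X, y)
acc = np.mean(((X @ w + b) > 0) == (y == 1))  # head-wise probe accuracy
```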

Inference-Time Intervention on MusicGen

  • Script examples of Audio Continuation generation approach:

    # add drums - base SMITIN configuration
    python smitin/iti.py --task pos_mdrums_neg_rest_musdb --tgt_dataset musdb_inst --tgt_inst drums --input_posneg neg --generate_seconds 30.0 --batch_size 5 --num_steps_to_test 5
    
    # add guitar
    ### SMITIN: alpha=10.0 and number of heads to intervene = full - 128
    ### also generate text-conditioned method "add guitar"
    python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --tgt_dataset moisesdb_inst --tgt_inst guitar --input_posneg neg --intervention_strength 10.0 --num_to_intervene -128 --generate_seconds 30.0 --batch_size 5 --num_steps_to_test 5 --compare_text_gen t --text_prompt_cond "add guitar"
  • Script examples of Text-to-Music generation approach:

    # add drums
    ### SMITIN: base configuration
    ### also generate text + ", add drums"
    python smitin/iti.py --task pos_mdrums_neg_rest_musdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --compare_text_gen t --text_prompt_cond ", add drums"
    
    # add guitar
    ### SMITIN: sparse addition (s) = 1
    python smitin/iti.py --task pos_mguitar_neg_rest_moisesdb --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --sparse_add_freq 1 --generate_seconds 30.0 --num_steps_to_test 5
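At a high level, inference-time intervention shifts the outputs of the heads whose probes are most accurate along their probe direction, scaled by the intervention strength (cf. --intervention_strength) and the head's activation scale. A hedged numpy sketch with made-up shapes; the head selection and scaling details in iti.py may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, head_dim = 8, 16

# Hypothetical unit probe directions and per-head probe accuracies
directions = rng.standard_normal((n_heads, head_dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
probe_acc = rng.uniform(0.5, 1.0, n_heads)

def intervene(head_outputs, directions, probe_acc, alpha=10.0, top_k=4):
    """Shift the top-k most accurate heads along their probe direction,
    scaled by alpha and each head's activation standard deviation."""
    out = head_outputs.copy()
    top = np.argsort(probe_acc)[-top_k:]      # heads with the best probes
    for h in top:
        sigma = head_outputs[h].std()         # match the activation scale
        out[h] = head_outputs[h] + alpha * sigma * directions[h]
    return out

head_outputs = rng.standard_normal((n_heads, head_dim))
shifted = intervene(head_outputs, directions, probe_acc)
```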

Objective Evaluation

Success Rate

  • Examine the success rate using the monitoring results saved during ITI.

    python smitin/compute_success_rate.py -d output_results/monitored_results/{SAVE_NAME}/
    
    # for the guitar example above
    python smitin/compute_success_rate.py -d output_results/smitin/large/pos_mguitar_neg_rest_moisesdb_3.0sec/audio_continuation/K-128_exp_A10.0_neg_sparse5_coef_autoK16dlin/
  • Or run monitoring with another probe to compute the success rate according to that monitoring probe:

    python smitin/run_monitoring.py -a ./output_results/smitin/large/{TASK_NAME}/audio_continuation/{EXP_NAME}/ -t {MONITOR_TASK_NAME} -o ./output_results/monitored_results/{SAVE_NAME}/
    
    # for the guitar example above
    python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_mguitar_neg_rest_moisesdb_3.0sec/audio_continuation/K-128_exp_A10.0_neg_sparse5_coef_autoK16dlin/ -t pos_mguitar_neg_rest_moisesdb -o ./output_results/monitored_results/audio_continuation/pos_mguitar_neg_rest_moisesdb
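The success-rate idea: the monitoring probe produces a probability for the target attribute at each generation step, and a generation counts as a success when that probability ends up on the positive side. A sketch under the assumption that "success" means the final monitored probability exceeds 0.5 (the exact criterion in compute_success_rate.py may differ):

```python
import numpy as np

def success_rate(probe_probs, threshold=0.5):
    """Fraction of generations whose monitoring-probe probability
    for the target attribute ends above the threshold."""
    final = probe_probs[:, -1]
    return float(np.mean(final > threshold))

# Hypothetical monitored probabilities: 5 generations x 10 time steps
probs = np.stack([
    np.linspace(0.2, 0.9, 10),   # success
    np.linspace(0.3, 0.4, 10),   # failure
    np.linspace(0.1, 0.8, 10),   # success
    np.linspace(0.4, 0.7, 10),   # success
    np.linspace(0.5, 0.3, 10),   # failure
])
rate = success_rate(probs)        # 3 of 5 end above 0.5
```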

FAD

We use the fadtk package to compute the FAD score. Install it with pip install fadtk and run the evaluation with it.
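FAD is the Fréchet distance between two Gaussians fitted to embeddings of the reference and generated audio: ||mu1 - mu2||^2 + Tr(Sigma1 + Sigma2 - 2 (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2}). A numpy-only sketch of this formula for reference; fadtk handles both the embedding extraction and this computation for you:

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical embedding distributions give a distance of zero
mu = np.zeros(4)
sigma = np.eye(4)
d0 = frechet_audio_distance(mu, sigma, mu, sigma)
```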

Similarity

We use the pre-trained model weights and inference code from the e2e_music_remastering_system repository. Please follow the instructions in that repository.


Building Your Own Knob

You can run SMITIN on your own custom dataset; the procedure is similar to Overall Procedure of SMITIN: Instrument Addition.

This repository includes a small number of data points for the task of adding drums. This section provides examples of performing ITI using this custom dataset.

  1. Prepare your own custom data in the data/ directory.

    • Place positive and negative samples in the following structure:

      ./data
      ├── YOUR_DATA
      │   ├── positive
      │   │   ├── audio_sample_#1.wav
      │   │   ├── audio_sample_#2.wav
      │   │   └── ...
      │   └── negative
      │       ├── audio_sample_#1.wav
      │       ├── audio_sample_#2.wav
      │       └── ...
      └── ...
      
    • You do not need to split the dataset into training and test subsets, but if you do, the dataset structure should look like this:

      ./data
      ├── YOUR_DATA
      │   ├── train
      │   │   ├── positive
      │   │   │   ├── audio_sample_#1.wav
      │   │   │   └── ...
      │   │   └── negative
      │   │       ├── audio_sample_#1.wav
      │   │       └── ...
      │   └── test
      │       ├── positive
      │       │   ├── audio_sample_#1.wav
      │       │   └── ...
      │       └── negative
      │           ├── audio_sample_#1.wav
      │           └── ...
      └── ...
      
  2. Extract activations

    • extracting activations on the custom dataset

      # only extracting 'large' model configuration
      python smitin/get_activations.py -t custom -d ./data/custom/ --sample_rate 44100 -m large
    • other script examples

      # audio data extension = '.wav', no data splits
        python smitin/get_activations.py -t custom -d ./data/PATH_TO_CUSTOM_DATA/ --sample_rate CUSTOM_SAMPLE_RATE
      
      # audio data extension = '.mp3', activation duration = 10 seconds, data splits = [train, test]
        python smitin/get_activations.py -t custom -d ./data/PATH_TO_CUSTOM_DATA/ --sample_rate CUSTOM_SAMPLE_RATE --audio_extension mp3  --segment_durations 10.0 --data_splits train test
    • This will save activations in the activations/pos_custom_neg_custom/ directory.

  3. Probing (optional)

    • Examine how well MusicGen comprehends the musical factor represented by the custom data:

      # only probing 'large' model configuration, not performing training / test split validation
      python smitin/search_probes.py -t pos_custom_neg_custom --split_test False -m large
    • other script examples:

      # when you do not have a [train, test] data subset splits
      python smitin/search_probes.py -t pos_custom_neg_custom --split_test False
      
      # [train, test] splits, only examine MusicGen_large, activation duration = 10 seconds
      python smitin/search_probes.py -t pos_custom_neg_custom -m large --segment_durations 10.0
  4. SMITIN

    • Script to run Audio Continuation generation approach with the custom probes:

       python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1

      (output results will be saved in the ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ directory)

      Other script examples:

      python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5
      
      # audio continuation with SMITIN + text prompt
      python smitin/iti.py --task pos_custom_neg_custom --tgt_dataset custom --input_posneg neg --audio_continuation_w_text true --text_prompt_cond "YOUR TEXT PROMPT" --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1
    • Script to run Text-to-Music generation approach with the custom probes:

      # using MusicCaps text aspect_list
       python smitin/iti.py --task pos_custom_neg_custom --audio_continuation f --tgt_dataset musiccaps --input_posneg neg --generate_seconds 30.0 --num_steps_to_test 5 --batch_size 1

      (output results will be saved in the ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ directory)

  5. Evaluation - Success Rate (optional)

    1. Compute success rate with the monitored outputs saved with custom probes:

      # evaluating audio_continuation results
       python smitin/compute_success_rate.py -d ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_custom_neg_custom
      
      # evaluating text-to-music results
       python smitin/compute_success_rate.py -d ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_custom_neg_custom
    2. Or evaluate with the probes fitted on more data for adding drums to compute a "more correct" success rate:

       # evaluating audio continuation results
       python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/audio_continuation/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_mdrums_neg_rest_musdb -o ./output_results/monitored_results/audio_continuation/custom_add_drums/
       # evaluating text-to-music results
       python smitin/run_monitoring.py -a ./output_results/smitin/large/pos_custom_neg_custom_3.0sec/text-to-music/K-1_exp_A5.0_neg_sparse5_coef_autoK16dlin/ -t pos_mdrums_neg_rest_musdb -o ./output_results/monitored_results/text-to-music/custom_add_drums/

Contributing

See CONTRIBUTING.md for our policy on contributions.


Copyright and License

Released under AGPL-3.0-or-later license, as found in the LICENSE.md file.

All files, except as noted below:

Copyright (c) 2023-2025 Mitsubishi Electric Research Laboratories (MERL)

SPDX-License-Identifier: AGPL-3.0-or-later

The following files:

  • smitin/musicgen/__init__.py
  • smitin/musicgen/builders.py
  • smitin/musicgen/lm.py
  • smitin/musicgen/loaders.py
  • smitin/musicgen/musicgen.py
  • smitin/musicgen/transformer.py

were adapted from https://github.com/facebookresearch/audiocraft (license included in LICENSES/MIT.md):

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) Meta Platforms, Inc. and affiliates.

The following file:

  • smitin/iti_utils.py

was adapted from https://github.com/likenneth/honest_llama (license included in LICENSES/MIT.md):

Copyright (c) 2023-2025 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2023 Kenneth Li