This is the official repository for the paper "Disentangling Textual and Acoustic Features of Neural Speech Representations"; it contains the code and model checkpoints. The paper proposes a disentanglement framework based on the Information Bottleneck principle, which separates the entangled representations of neural speech models into distinct textual and acoustic components. The framework retains only the features relevant to the target task, improving interpretability while maintaining the model's original performance. It also provides a route to disentangled feature attribution, revealing the most significant speech frame representations from both textual and acoustic perspectives.
- In stage 1, we train a decoder with two objectives: to map the internal representations of an existing speech model to text, and to minimize the presence of irrelevant information in those representations. The goal is to ensure that the latent representation $z^{\text{textual}}$ retains only the speech features necessary for accurate transcription while filtering out any extraneous characteristics (a minimal sketch of this two-term objective appears after this list).
- In stage 2, we train a second decoder on the same speech representations. This decoder also has access to the latent 'textual' representation learned in stage 1 and is again trained with two objectives: to predict our target task, and to minimize the amount of information encoded in the latent vector. This objective ensures that the latent representation $z^{\text{acoustic}}$ learned in stage 2 avoids encoding textual information, since the decoder already has access to it and the information-minimization term discourages redundancy. Instead, it is expected to capture additional acoustic characteristics that are beneficial for the target task.
- The attention layer in stage 2 of our framework can be used to identify the frames in the original audio input whose latent representations contribute most to our target tasks. Crucially, the disentanglement mechanism allows us to clearly separate the contributions of acoustic features from those of textual features, providing insight into their individual roles.
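To make the two-term objective above concrete, here is a minimal PyTorch sketch of a variational information bottleneck loss of the kind described: a task term plus a beta-weighted compression term. The function and variable names (`vib_loss`, `reparameterize`, `mu`, `logvar`) are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients flow through the stochastic latent."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vib_loss(task_logits, targets, mu, logvar, beta):
    """Two-term objective: predict the task while compressing the latent.

    task_logits: decoder predictions (transcription in stage 1, target task in stage 2;
                 a sequence-level loss would replace cross-entropy for transcription)
    mu, logvar:  parameters of the Gaussian posterior q(z | h) over the latent
    beta:        weight of the compression term (a constant, or ramped up incrementally)
    """
    # Term 1: retain the information needed for the prediction.
    task_term = F.cross_entropy(task_logits, targets)
    # Term 2: KL(q(z|h) || N(0, I)) penalizes encoding anything the task does not need.
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return task_term + beta * kl_term
```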
To start the training process, use the vib/training.py script with the following command:
python vib/training.py --STAGE ["1" or "2"] --LEARNING_RATE [e.g., "0.0001"] --BETA_S1 ["incremental" or a constant coefficient] --BETA_S2 ["incremental" or a constant coefficient if you set stage to "2"] --DATA_S1 ["data_for_stage1"] --DATA_S2 ["data_for_stage2" if you set stage to "2"] --LATENT_DIM [the encoder bottleneck dimension, e.g., "128"] --MODEL_NAME [an existing speech model] --LAYER_S1 ["all" for layer averaging, otherwise specify the model layer number] --LAYER_S2 ["all" for layer averaging, otherwise specify the model layer number] --SEED [e.g., "12"] --SEED [e.g., "12"]
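For example, a stage-1 run with layer averaging might look like the following (the learning rate, seed, and model name are illustrative; substitute the speech model you want to analyze):
python vib/training.py --STAGE "1" --LEARNING_RATE "0.0001" --BETA_S1 "incremental" --DATA_S1 "CommonVoice_LibriSpeech" --LATENT_DIM "128" --MODEL_NAME "facebook/hubert-base-ls960" --LAYER_S1 "all" --SEED "12"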
The disentangled models will be saved to directory/models/vib.
The textual latent representations produced in stage 1 are independent of the target task. So, you can load the textual encoder from directory/models/vib/1/CommonVoice_LibriSpeech/ and use it to directly start your training from stage 2 on new downstream tasks, using the same speech model:
python vib/training.py --STAGE "2" --LEARNING_RATE "0.0001" --BETA_S1 "incremental" --BETA_S2 "incremental" --DATA_S1 "CommonVoice_LibriSpeech" --DATA_S2 ["data_for_stage2"] --LATENT_DIM "128" --MODEL_NAME [the same speech model as stage 1] --LAYER_S1 "all" --LAYER_S2 ["all" for layer averaging, otherwise specify the model layer number] --SEED [e.g., "12"]
In the probing folder, you will find scripts for evaluating the latent and original representations to verify whether the latent representations are truly disentangled.
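As a rough illustration of what such a probe looks like (a hedged sketch, not the repository's probing scripts), one can fit a linear classifier on frozen representations and compare how well a textual versus an acoustic property can be recovered from $z^{\text{textual}}$, $z^{\text{acoustic}}$, and the original representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(features: np.ndarray, labels: np.ndarray, seed: int = 12) -> float:
    """Fit a linear probe on frozen features and return held-out accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return probe.score(x_test, y_test)

# Expectation if the latents are disentangled: a textual property (e.g., word identity)
# should be recoverable from the textual latents but not from the acoustic ones, while
# an acoustic property (e.g., speaker or emotion) should show the opposite pattern.
```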
The analysis folder contains code to extract textual and acoustic attention weights in stage 2 of the framework. These scores offer deeper insights than gradient-based feature attribution methods by identifying the most salient speech frame representations from both the textual and acoustic perspectives.
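The snippet below is a hedged sketch of how such attention scores can be turned into frame-level saliency rankings; the tensor names and shapes are assumptions, not the analysis code's actual interface:

```python
import torch

def top_frames(attention: torch.Tensor, k: int = 5):
    """Return the indices and weights of the k most-attended frame representations.

    attention: a (num_frames,) attention distribution over speech frames, taken from
               either the textual or the acoustic stream of the stage-2 attention layer.
    """
    weights, indices = torch.topk(attention, k)
    return list(zip(indices.tolist(), weights.tolist()))

# Placeholder distributions standing in for the extracted attention weights.
textual_attention = torch.softmax(torch.randn(100), dim=0)
acoustic_attention = torch.softmax(torch.randn(100), dim=0)
print("most salient frames (textual):", top_frames(textual_attention))
print("most salient frames (acoustic):", top_frames(acoustic_attention))
```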
@misc{mohebbi2024disentangling,
title={Disentangling Textual and Acoustic Features of Neural Speech Representations},
author={Hosein Mohebbi and Grzegorz Chrupała and Willem Zuidema and Afra Alishahi and Ivan Titov},
year={2024},
eprint={2410.03037},
archivePrefix={arXiv},
primaryClass={cs.CL}
}