Paper: data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup, accepted at ICASSP 2023 (arXiv link).
data2vec-aqc is a Self-Supervised Learning (SSL) algorithm for learning speech representations from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited. Building on the recently introduced data2vec, we add modules to the data2vec framework that leverage data augmentations, quantized representations, and clustering. Together, these modules let the model optimize a cross-contrastive loss as an additional self-supervised objective.
Primary Contributions:
- We make data2vec simultaneously solve a masked-acoustic-modeling-based cross-contrastive task between the student and teacher networks by passing a randomly augmented version of the same audio sample through each network (see the sketch after this list).
- We add a quantizer module similar to the one in wav2vec 2.0, as sampling negatives from the quantized representations has proven effective.
- We also introduce a clustering module from ccc-wav2vec 2.0 that clusters the quantized representations and diminishes the effect of those negatives in the contrastive loss that fall into the same cluster as the positive.
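As a rough illustration, the following is a minimal PyTorch sketch of this cross-contrastive objective, not the repo's exact implementation: each network receives a different augmentation of the same utterance, and an InfoNCE-style loss (as in wav2vec 2.0) is computed cross-wise between one view's predictions and the quantized targets derived from the other view. All helper names (`contrastive_loss`, `augment`, `student`, `quantize_and_sample_negatives`) are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(preds, pos, negs, temperature=0.1):
    """InfoNCE-style contrastive loss as in wav2vec 2.0 (illustrative).

    preds: (T, D) student predictions at masked time steps
    pos:   (T, D) positive (quantized) targets
    negs:  (T, K, D) sampled distractors
    """
    sim_pos = F.cosine_similarity(preds, pos, dim=-1)                # (T,)
    sim_neg = F.cosine_similarity(preds.unsqueeze(1), negs, dim=-1)  # (T, K)
    logits = torch.cat([sim_pos.unsqueeze(1), sim_neg], dim=1) / temperature
    # The positive always sits at index 0 of the logits.
    target = logits.new_zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

# view_a, view_b = augment(audio), augment(audio)      # two random augmentations
# preds_a, preds_b = student(view_a), student(view_b)  # masked predictions per view
# q_a, negs_a = quantize_and_sample_negatives(view_a)
# q_b, negs_b = quantize_and_sample_negatives(view_b)
# Cross terms: predictions on one view are scored against targets of the other.
# loss = contrastive_loss(preds_a, q_b, negs_b) + contrastive_loss(preds_b, q_a, negs_a)
```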
The data2vec-aqc BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks from the SUPERB benchmark. The proposed method comprehensively outperforms the baseline data2vec BASE model across the array of downstream tasks in SUPERB.
The WERs specified are without the use of any language model.
| Model | Pre-training data | Fine-tuning data | Model Link | WER (test-clean) | WER (test-other) |
|---|---|---|---|---|---|
| wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | download | --- | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | download | 7.5 | 20.2 |
| data2vec Base | LibriSpeech-360h | No fine-tuning | download | --- | --- |
| data2vec Base | LibriSpeech-360h | LibriSpeech-100h | download | 6.4 | 17.7 |
| data2vec-aqc Base | LibriSpeech-360h | No fine-tuning | download | --- | --- |
| data2vec-aqc Base | LibriSpeech-360h | LibriSpeech-100h | download | 5.5 | 14.0 |
| data2vec-aqc Base | LibriSpeech-960h | No fine-tuning | download | --- | --- |
| data2vec-aqc Base | LibriSpeech-960h | LibriSpeech-100h | download | 4.8 | 9.5 |
| data2vec-aqc Base SUPERB | LibriSpeech-960h | No fine-tuning | SUPERB benchmark submission | --- | --- |
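Once a checkpoint has been downloaded, it can be loaded with fairseq's standard checkpoint utilities. A minimal sketch, assuming the model exposes the wav2vec 2.0-style `extract_features` API (the checkpoint path is a placeholder):

```python
import torch
from fairseq import checkpoint_utils

# Load a pre-trained data2vec-aqc checkpoint (path is a placeholder).
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/data2vec_aqc_base.pt"]
)
model = models[0].eval()

# Extract contextual features from 10 seconds of dummy 16 kHz audio.
wav = torch.randn(1, 160000)
with torch.no_grad():
    res = model.extract_features(wav, padding_mask=None, mask=False)
features = res["x"]  # (batch, frames, dim) contextual representations
```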
- Pre-training and fine-tuning procedures can be found here.
- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq with data2vec-aqc and develop locally:

  ```bash
  git clone https://github.com/Speech-Lab-IITM/data2vec-aqc
  cd data2vec-aqc
  pip install --editable ./
  ```
- For faster training install NVIDIA's apex library:

  ```bash
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
    --global-option="--fast_multihead_attn" ./
  ```
- For large datasets install PyArrow:

  ```bash
  pip install pyarrow
  ```

- If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command line options to `nvidia-docker run`.
- For the augmentations to work, install torchaudio-augmentations (a usage sketch follows this step):

  ```bash
  git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
  cd torchaudio-augmentations
  pip install --editable ./
  ```
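A quick usage sketch of the augmentation library, assuming the Speech-Lab-IITM fork keeps the upstream torchaudio-augmentations interface; the transforms and probabilities below are illustrative, not the exact chain used by data2vec-aqc:

```python
import torchaudio
from torchaudio_augmentations import Compose, Gain, Noise, PolarityInversion, RandomApply

# An illustrative augmentation chain; the exact transforms used during
# data2vec-aqc pre-training are defined inside the repo.
transform = Compose([
    RandomApply([PolarityInversion()], p=0.5),
    RandomApply([Gain()], p=0.5),
    RandomApply([Noise(min_snr=0.001, max_snr=0.005)], p=0.5),
])

audio, sr = torchaudio.load("sample.wav")  # (channels, num_samples)
augmented = transform(audio)
```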
- The clustering module runs on the GPU and requires fast-pytorch-kmeans (a usage sketch follows this step):

  ```bash
  pip install fast-pytorch-kmeans
  ```
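A minimal sketch of what the clustering module does with the quantized representations, using the fast-pytorch-kmeans API; the cluster count, tensor shapes, and masking strategy here are illustrative, not the repo's exact logic:

```python
import torch
from fast_pytorch_kmeans import KMeans

# Cluster the quantized representations of a batch (shapes are illustrative).
quantized = torch.randn(2048, 256, device="cuda")  # (num_frames, dim)
kmeans = KMeans(n_clusters=64, mode="cosine")      # cluster count is illustrative
labels = kmeans.fit_predict(quantized)             # (num_frames,) cluster ids

# For a positive at some frame with K sampled negatives, drop (or down-weight)
# any negative that falls into the same cluster as the positive, so the
# contrastive loss does not push apart near-identical quantized units.
pos_label = labels[10]  # cluster of the positive (frame index illustrative)
neg_idx = torch.randint(0, quantized.size(0), (100,), device="cuda")
same_cluster = labels[neg_idx] == pos_label
valid_negatives = quantized[neg_idx[~same_cluster]]
```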
- The `cluster_factor` and `scale_factor` parameters of the clustering module can be modified in the `model` section of the pre-training configs.
- The augmentations used by data2vec-aqc require the noise subset of the MUSAN dataset. Its path must be specified in the `path_to_musan_noise_set` variable of the `__getitem__` method in the `raw_audio_dataset` file (see the illustrative mixing sketch below).
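For intuition, noise augmentation of this kind mixes a MUSAN noise clip into the speech waveform at a target signal-to-noise ratio. A generic, self-contained sketch (not the repo's code, which lives in the dataset/augmentation pipeline):

```python
import random
import torch

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a (1-D, mono) noise clip into a speech waveform at a target SNR in dB."""
    if noise.numel() < speech.numel():
        # Tile the noise clip until it covers the utterance.
        reps = speech.numel() // noise.numel() + 1
        noise = noise.repeat(reps)[: speech.numel()]
    else:
        # Take a random crop of the noise clip.
        start = random.randint(0, noise.numel() - speech.numel())
        noise = noise[start : start + speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```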
- data2vec-aqc is built on top of fairseq, the Facebook AI Research Sequence-to-Sequence Toolkit written in Python.