
# data2vec-aqc

Code and models from the paper "data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup".


The paper was accepted at ICASSP 2023 (arXiv link).

data2vec-aqc is a Self-Supervised Learning (SSL) algorithm for speech representation learning from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited. Building on the recently introduced data2vec, we add modules to the data2vec framework that leverage data augmentations, quantized representations, and clustering. The interaction between these modules yields a cross-contrastive loss as an additional self-supervised objective.

## Primary Contributions

- We make data2vec simultaneously solve a masked acoustic modeling based cross-contrastive task between the student and teacher networks, by passing a differently augmented version of the same audio sample through each network (see the sketch below).
- We add a quantizer module similar to wav2vec 2.0, as sampling negatives from the quantized representations has proven effective.
- We additionally introduce a clustering module from ccc-wav2vec 2.0 to cluster the quantized representations and diminish the effect of those negatives in the contrastive loss that fall into the same cluster as the positive.
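
To make the cross-contrastive objective concrete, here is a minimal PyTorch sketch of the idea, not the repository's implementation: student predictions computed on one augmented view are contrasted against the teacher's quantized targets from the other view, and vice versa, while negatives sharing a cluster with the positive are masked out (the paper diminishes their effect; full masking is a simplification here). All tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(preds, targets, cluster_ids, num_negatives=100, temperature=0.1):
    """InfoNCE loss in the style of wav2vec 2.0: each prediction must pick its
    own target out of the target plus sampled negatives. Negatives in the same
    cluster as the positive are masked out here (a simplification; the paper
    diminishes their effect). preds/targets: (T, D); cluster_ids: (T,)."""
    T = targets.size(0)
    neg_idx = torch.randint(0, T, (T, num_negatives))          # sample negatives across time
    candidates = torch.cat([targets.unsqueeze(1), targets[neg_idx]], dim=1)  # (T, 1+K, D)
    logits = F.cosine_similarity(preds.unsqueeze(1), candidates, dim=-1) / temperature
    same_cluster = cluster_ids[neg_idx] == cluster_ids.unsqueeze(1)          # (T, K)
    mask = torch.cat([torch.zeros(T, 1, dtype=torch.bool), same_cluster], dim=1)
    logits = logits.masked_fill(mask, float("-inf"))
    return F.cross_entropy(logits, torch.zeros(T, dtype=torch.long))         # positive at index 0

def cross_contrastive_loss(student_a, teacher_b, student_b, teacher_a, clusters_a, clusters_b):
    """Cross term: student outputs for augmented view A are contrasted against
    the teacher's quantized targets for view B, and vice versa."""
    return 0.5 * (contrastive_loss(student_a, teacher_b, clusters_b)
                  + contrastive_loss(student_b, teacher_a, clusters_a))
```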

## SUPERB Benchmark

The data2vec-aqc BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks from the SUPERB benchmark. The proposed method comprehensively outperforms the baseline data2vec BASE model across the array of downstream tasks presented in SUPERB.

## Models

The reported WERs are obtained without the use of any language model.

| Model | Pre-training data | Fine-tuning data | Model Link | WER (test-clean \| test-other) |
| --- | --- | --- | --- | --- |
| wav2vec Base | LibriSpeech-360h | No fine-tuning | download | --- |
| wav2vec Base | LibriSpeech-360h | LibriSpeech-100h | download | 7.5 \| 20.2 |
| data2vec Base | LibriSpeech-360h | No fine-tuning | download | --- |
| data2vec Base | LibriSpeech-360h | LibriSpeech-100h | download | 6.4 \| 17.7 |
| data2vec-aqc Base | LibriSpeech-360h | No fine-tuning | download | --- |
| data2vec-aqc Base | LibriSpeech-360h | LibriSpeech-100h | download | 5.5 \| 14.0 |
| data2vec-aqc Base | LibriSpeech-960h | No fine-tuning | download | --- |
| data2vec-aqc Base | LibriSpeech-960h | LibriSpeech-100h | download | 4.8 \| 9.5 |
| data2vec-aqc Base (SUPERB) | LibriSpeech-960h | No fine-tuning | SUPERB benchmark submission | --- |
- Pre-training and fine-tuning procedures can be found here.
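
As a quick sanity check after downloading one of the checkpoints above, something like the following should extract frame-level features using fairseq's standard checkpoint utilities. Treat it as a sketch rather than a guaranteed API: the checkpoint path is a placeholder, and the forward keyword arguments follow the usual fairseq wav2vec 2.0/data2vec convention.

```python
import torch
from fairseq import checkpoint_utils

# Placeholder path; substitute a checkpoint downloaded from the table above.
ckpt_path = "/path/to/data2vec_aqc_base.pt"

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()

wav = torch.randn(1, 16000)  # 1 s of 16 kHz mono audio, as the models expect
with torch.no_grad():
    out = model(wav, features_only=True, mask=False)
print(out["x"].shape)  # (batch, frames, feature_dim)
```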

## Requirements and Installation

- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq with data2vec-aqc and develop locally:

  ```
  git clone https://github.com/Speech-Lab-IITM/data2vec-aqc
  cd data2vec-aqc
  pip install --editable ./
  ```

- For faster training, install NVIDIA's apex library:

  ```
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
    --global-option="--fast_multihead_attn" ./
  ```

- For large datasets install PyArrow: `pip install pyarrow`

- If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.

- For augmentations to work, install torchaudio-augmentations:

  ```
  git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
  cd torchaudio-augmentations
  pip install --editable ./
  ```
- The clustering module runs on GPU and requires fast-pytorch-kmeans: `pip install fast-pytorch-kmeans`
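
For reference, fast-pytorch-kmeans exposes a scikit-learn-like interface. Below is a toy example of the kind of GPU clustering the module performs; the shapes and cluster count are arbitrary, not the repository's settings.

```python
import torch
from fast_pytorch_kmeans import KMeans

# Cluster quantized representations on the GPU (illustrative shapes).
quantized = torch.randn(4096, 256, device="cuda")  # e.g. (frames, codebook_dim)
kmeans = KMeans(n_clusters=16, mode="cosine")
cluster_ids = kmeans.fit_predict(quantized)        # (4096,) cluster id per frame
```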

## Parameters of interest

- The `cluster_factor` and `scale_factor` parameters (for the clustering module) can be modified in the model section of the pre-training configs.
- The augmentations used by data2vec-aqc require the noise subset of the MUSAN dataset. Its path must be specified in the `path_to_musan_noise_set` variable of the `__getitem__` method in the `raw_audio_dataset` file (the sketch below shows the kind of mixing this enables).
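
For intuition, additive-noise augmentation with MUSAN typically looks like the following. This is an illustrative sketch with placeholder path handling and SNR range, not the code in `raw_audio_dataset`.

```python
import glob, os, random
import torch
import torchaudio

def add_musan_noise(wav, path_to_musan_noise_set, snr_db_range=(5, 15)):
    """Mix a random MUSAN noise clip into a mono waveform at a random SNR.
    Illustrative only; the repository's dataset code handles this internally."""
    noise_files = glob.glob(os.path.join(path_to_musan_noise_set, "**", "*.wav"),
                            recursive=True)
    noise, _ = torchaudio.load(random.choice(noise_files))
    noise = noise[0]                                 # take the first channel
    if noise.numel() < wav.numel():                  # loop short clips
        noise = noise.repeat(wav.numel() // noise.numel() + 1)
    noise = noise[: wav.numel()]
    snr_db = random.uniform(*snr_db_range)
    # Scale the noise so that the signal-to-noise ratio equals snr_db.
    scale = torch.sqrt(wav.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return wav + scale * noise
```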

## Reference Code

1. fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
