Aditya Singh edited this page May 12, 2021 · 4 revisions

Documentation: EE698R DEC based Diarization Model

This speaker diarization model uses Deep Embedding Clustering (DEC), with a deep neural network initialized via a Residual Autoencoder, to assign speaker labels to segments of the raw audio signal. Clustering is performed on x-vectors extracted using Desplanques et al.'s ECAPA-TDNN framework. We use Silero-VAD for voice activity detection.
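For readers unfamiliar with DEC, the core clustering step can be sketched as follows. This is a minimal pure-Python illustration of the standard DEC formulation (Student's t soft assignment, sharpened target distribution, KL-divergence objective), not the code used in the notebooks. The function names, the toy 2-D embeddings, and `alpha=1.0` are assumptions made for illustration; the actual model operates on x-vectors passed through the Residual Autoencoder's encoder.

```python
import math


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedding i to centroid j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # each row sums to 1
    return q


def target_distribution(q):
    """Sharpened targets: p[i][j] proportional to q[i][j]**2 / f[j],
    where f[j] = sum_i q[i][j] is the soft cluster frequency."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters (the DEC loss)."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Toy data (hypothetical): two embeddings near one centroid, one near the other.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]  # hard speaker label per segment
loss = kl_divergence(p, q)
```

In training, the loss would be minimized with respect to both the encoder weights and the centroids; here it is only evaluated once to show how the quantities relate.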

Baseline Model: Spectral clustering is used to assign speaker labels to audio segments.

DataSet

The model is tested on the VoxConverse dataset (216 audio files in total). We randomly split the dataset into two parts, ‘train’ and ‘test’, with the test set containing 50 audio files.
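A random split of this kind can be sketched as below. The file identifiers, the `split_dataset` helper, and the seed are hypothetical; the notebooks may perform the split differently, and only the 216/50 counts come from the description above.

```python
import random


def split_dataset(file_ids, n_test=50, seed=0):
    """Shuffle the file IDs reproducibly and hold out n_test of them."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    shuffled = list(file_ids)
    rng.shuffle(shuffled)
    test_ids = sorted(shuffled[:n_test])
    train_ids = sorted(shuffled[n_test:])
    return train_ids, test_ids


# Placeholder names standing in for the 216 VoxConverse audio files.
file_ids = [f"audio_{i:03d}" for i in range(216)]
train_ids, test_ids = split_dataset(file_ids)
print(len(train_ids), len(test_ids))  # 166 50
```

Fixing the seed keeps the train/test partition identical across notebook runs, which makes DER scores comparable between the DEC and baseline models.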

ipynb Notebook Files

  • DEC_ResAE.ipynb: Evaluates the DER (diarization error rate) score for the DEC models described in the report. Use the link in the Tutorial section to open it in Google Colab.
  • ExtractVAD.ipynb: Extracts and saves the VAD mappings for the audio files in the VoxConverse dataset.
  • ExtractXvectors.ipynb: Precomputes x-vectors for the audio files in the VoxConverse dataset and saves them to a zip file for use in DiarizationDataset.
  • Baseline.ipynb: Evaluates the DER score for the baseline models described in the report. Use the link in the Tutorial section to open it in Google Colab.
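The DER reported by the evaluation notebooks follows the usual definition: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below only illustrates that formula; the function name and the example durations are made up, and the notebooks may compute DER with a dedicated metrics library and settings (such as a forgiveness collar) not shown here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations for a single recording.
der = diarization_error_rate(
    missed=3.0, false_alarm=2.0, confusion=5.0, total_speech=100.0
)
print(der)  # 0.1
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 1.0 for very poor systems.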

Tutorial

DEC Speaker Diarization
Open In Colab

Baseline Speaker Diarization
Open In Colab

API Documentation

Index