Home

Documentation: EE698R DEC based Diarization Model

This speaker diarization model uses Deep Embedding Clustering (DEC), with a deep neural network initialized via a Residual Autoencoder, to assign speaker labels to segments of the raw audio signal. Clustering is performed on x-vectors extracted with the ECAPA-TDNN framework of Desplanques et al. We use Silero-VAD for voice activity detection.
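To make the clustering step concrete, here is a minimal NumPy sketch of the soft assignment and target distribution used in the standard DEC formulation of Xie et al. (2016). It is an illustration under assumptions, not the code from the notebooks: the function names, the `alpha` default, and the random arrays standing in for encoded x-vectors and cluster centroids are all hypothetical.

```python
import numpy as np


def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of each embedding to each centroid."""
    # Squared Euclidean distance between every embedding and every centroid: (n, k)
    dist_sq = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize so each row is a distribution over clusters
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened target p_ij: square q and divide by the soft cluster frequency."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_loss(p, q):
    """KL(P || Q), the clustering loss minimized during DEC refinement."""
    return float(np.sum(p * np.log(p / q)))


# Hypothetical stand-ins: 20 encoded segments of dimension 8, 3 speaker centroids
rng = np.random.default_rng(0)
z = rng.normal(size=(20, 8))
centroids = rng.normal(size=(3, 8))

q = soft_assign(z, centroids)
p = target_distribution(q)
labels = q.argmax(axis=1)  # hard speaker label per segment
loss = kl_loss(p, q)
```

In a full DEC setup the encoder weights and the centroids would be updated by gradient descent on this loss, with the centroids typically initialized by k-means on the autoencoder's latent codes.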

Baseline Model: Spectral clustering is used for audio-label assignment.
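As a rough illustration of the baseline idea, the sketch below performs a simplified two-speaker spectral partition in NumPy. It is not the baseline notebook's implementation: the function name, the cosine-based affinity, and the restriction to two clusters (splitting on the sign of the Fiedler vector) are assumptions made to keep the example short, and the synthetic embeddings are stand-ins for x-vectors.

```python
import numpy as np


def spectral_bipartition(embeddings):
    """Split segment embeddings into two groups using the Fiedler vector."""
    # Cosine similarity mapped to [0, 1] so the affinity is non-negative
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = (unit @ unit.T + 1.0) / 2.0
    np.fill_diagonal(affinity, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized

    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)


# Hypothetical data: 10 segments per speaker around two different directions
rng = np.random.default_rng(0)
center_a = np.zeros(8)
center_a[0] = 3.0
center_b = np.zeros(8)
center_b[1] = 3.0
segments = np.vstack([
    center_a + 0.3 * rng.normal(size=(10, 8)),
    center_b + 0.3 * rng.normal(size=(10, 8)),
])

labels = spectral_bipartition(segments)
```

For more than two speakers, the usual approach is to take the first k eigenvectors and run k-means on their rows instead of thresholding a single eigenvector.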

DataSet

The model is tested on the VoxConverse dataset (216 audio files in total). We randomly split the dataset into two parts, 'train' and 'test', with the test set containing 50 audio files.
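A random split of this kind can be sketched as follows. The function name, the fixed seed, and the placeholder file identifiers are hypothetical; the only figures taken from the description above are the 216 files and the 50-file test set.

```python
import random


def split_files(file_ids, n_test=50, seed=0):
    """Shuffle the file list reproducibly and hold out n_test files for testing."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split repeatable
    shuffled = list(file_ids)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers standing in for the 216 VoxConverse audio files
file_ids = [f"audio_{i:03d}" for i in range(216)]
train_ids, test_ids = split_files(file_ids)
```

With these numbers the training set contains the remaining 166 files.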

ipynb Notebook Files

DEC_ResAE.ipynb: Evaluates the diarization error rate (DER) for the DEC models described in the report. Use the link in the Tutorial section to open it in Google Colab.
ExtractVAD.ipynb: Extracts and saves the VAD mappings for all audio files in the VoxConverse dataset.
ExtractXvectors.ipynb: Precomputes x-vectors for the audio files in the VoxConverse dataset and saves them to a zip file for use in the DiarizationDataset.
Baseline.ipynb: Evaluates the DER for the baseline models described in the report. Use the link in the Tutorial section to open it in Google Colab.
