Home
Aditya Singh edited this page May 12, 2021
This speaker diarization model uses Deep Embedding Clustering (DEC), with a deep neural network initialized via a residual autoencoder, to assign speaker labels to segments of the raw audio signal. Clustering is performed on x-vectors extracted using Desplanques et al.'s ECAPA-TDNN framework. We use Silero-VAD for voice activity detection.
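To make the clustering step more concrete, here is a minimal sketch of the soft assignment and target distribution used in the standard DEC formulation (Xie et al., 2016). It assumes plain Python lists as embeddings and hypothetical function names; the code in DEC.py may be organized differently and operates on learned network outputs rather than toy vectors.

```python
import math

def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedding i to centroid j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # normalize over clusters
    return q

def target_distribution(q):
    """Sharpened targets: p[i][j] proportional to q[i][j]**2 / f[j], with f[j] = sum_i q[i][j]."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all segments; DEC minimizes this quantity."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )

# Toy 2-D "embeddings" standing in for x-vectors (illustrative values only).
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(toy_embeddings, toy_centroids)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]  # hard speaker label per segment
```

In a full DEC model the KL divergence is backpropagated through the encoder and the centroids, so both the embeddings and the cluster centers move during training; the sketch only evaluates the quantities for fixed inputs.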
Baseline Model: Spectral clustering is used for audio-label assignment.
The model is tested on the VoxConverse dataset (216 audio files in total). We randomly split the dataset into two parts, ‘train’ and ‘test’, with the test set containing 50 audio files.
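A random split of this kind can be sketched as follows. The file identifiers, the function name, and the seed are placeholders chosen for illustration; the actual split used in the notebooks is not reproduced here.

```python
import random

def split_files(file_ids, n_test=50, seed=0):
    """Shuffle the file IDs and hold out n_test of them for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(file_ids)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical placeholder IDs standing in for the 216 VoxConverse files.
file_ids = [f"audio_{i:03d}" for i in range(216)]
train_ids, test_ids = split_files(file_ids)
print(len(train_ids), len(test_ids))  # 166 50
```

Fixing the seed means the same 50 files are held out on every run, which keeps DER numbers comparable between the baseline and DEC notebooks.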
- DEC_ResAE.ipynb: Evaluates the DER score for the DEC models described in the report. Use the link in the Tutorial section to open it in Google Colab.
- ExtractVAD.ipynb: Extracts and saves the VAD mappings for all audio files in the VoxConverse dataset.
- ExtractXvectors.ipynb: Precomputes x-vectors for the audio files in the VoxConverse dataset and saves them to a zip file for use in the DiarizationDataset.
- Baseline.ipynb: Evaluates the DER score for the baseline models described in the report. Use the link in the Tutorial section to open it in Google Colab.
- Defined in: utils.py
- Defined in: baselineMethods.py
- Defined in: optimumSpeaker.py
- Defined in: DEC.py
- Defined in: colab_demo_utils.py
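The notebooks above report the diarization error rate (DER), which is the sum of missed speech, false-alarm speech, and speaker confusion divided by the total reference speech. The sketch below is a simplified frame-level version, assuming one label per frame, `None` for non-speech, no overlapping speakers, and no forgiveness collar; the function name is hypothetical and this is not the evaluation code used in the repository.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level DER for single-speaker frames; None marks non-speech."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_speech = sum(r is not None for r in ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    miss = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Count frames where each (reference, hypothesis) speaker pair co-occurs.
    overlap = {}
    for r, h in zip(ref, hyp):
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1

    ref_spk = list(dict.fromkeys(r for r in ref if r is not None))
    hyp_spk = list(dict.fromkeys(h for h in hyp if h is not None))

    # Brute-force the best one-to-one speaker mapping (fine for few speakers).
    if len(hyp_spk) <= len(ref_spk):
        candidates = (zip(perm, hyp_spk)
                      for perm in permutations(ref_spk, len(hyp_spk)))
    else:
        candidates = (zip(ref_spk, perm)
                      for perm in permutations(hyp_spk, len(ref_spk)))
    best_match = max(sum(overlap.get(pair, 0) for pair in cand)
                     for cand in candidates)

    confusion = sum(overlap.values()) - best_match
    return (miss + false_alarm + confusion) / ref_speech

ref = ["A", "A", "A", "B", "B", None, "B", "B"]
hyp = [1, 1, 2, 2, 2, 2, None, 2]
der = frame_der(ref, hyp)  # 1 miss + 1 false alarm + 1 confusion over 7 speech frames
```

The mapping step matters because cluster indices are arbitrary: a hypothesis that swaps the names of two speakers but segments them perfectly should still score a DER of zero.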