A repository for comparing potential speaker diarization tools to be used in the MEXCA pipeline.
The repository contains subdirectories for different parts of the experiment:
automatic-speech-recognition/
: Contains files for exploring automatic speech recognition in Dutch on the DED21 dataset

speaker-diarization/
: Contains all files for the speaker diarization part

  embeddings/
  : Contains the encoded speaker embeddings as .pt files

  results/
  : Contains the .rttm files with speaker annotations

  clustering.py
  : Script for clustering the speaker embeddings and assigning speaker labels to speaker segments

  compare_sd.ipynb
  : Notebook for comparing the speaker diarization approaches

  default_parser.py
  : Helper functions for argument parsing

  plot_pipelines_results.R
  : Script for visualizing the pipeline comparison

  pyannote_sd_compare.ipynb
  : Notebook for analyzing the results of the pyannote.audio pipeline

  sd_*.py
  : Scripts for applying the respective speaker encoding models

  sd_pipeline_ded21_performance.ipynb
  : Notebook for analyzing the results of the most promising pipelines on the DED21 dataset

  speaker_diarization.py
  : Script to run all speaker encoding scripts one after another

  speaker_representation.py
  : Helper functions for performing speaker diarization
speaker-segmentation/
: Contains all files for the speaker segmentation part

  results/
  : Contains the .rttm files with speech segments

  seg_pyannote.py
  : Script for applying speaker segmentation using the pyannote.audio package
voice-activity-detection/
: Contains all files for the voice activity detection part

  results/
  : Contains the .rttm files with speech segments

  compare_vad.ipynb
  : Notebook for comparing the voice activity detection approaches

  custom.conf
  : Configuration file for the openSMILE feature extractor

  opensmile_helper_functions
  : Helper functions for extracting openSMILE voice activity features

  vad_*.py
  : Scripts for applying the voice activity detection models
create_ded21_corpus.ipynb
: Notebook for creating and exploring the DED21 dataset

explore_ami_corpus.ipynb
: Notebook for exploring the properties of the AMI corpus

rttm.py
: Functions for creating, reading, modifying, and writing .rttm files and objects

rttm_test.py
: Preliminary test suite for rttm.py
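The .rttm files in the results/ directories follow the standard Rich Transcription Time Marked format: one space-separated SPEAKER record per speaker turn, whose ten fields include the file ID, channel, onset, duration, and speaker label. A minimal reader in that spirit (the class and function names here are illustrative sketches, not the actual API of rttm.py):

```python
from dataclasses import dataclass

@dataclass
class RttmSegment:
    file_id: str
    onset: float
    duration: float
    speaker: str

def read_rttm(lines):
    """Parse SPEAKER records from .rttm lines into segment objects.

    RTTM fields: type file chan onset duration ortho stype name conf slat
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker record types
        segments.append(
            RttmSegment(
                file_id=fields[1],
                onset=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return segments

# Example: two speaker turns from a hypothetical recording
example = [
    "SPEAKER meeting_a 1 0.00 4.25 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER meeting_a 1 4.25 2.10 <NA> <NA> spk1 <NA> <NA>",
]
segments = read_rttm(example)
print(segments[0].speaker, segments[0].duration)  # spk0 4.25
```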
We compare multiple pipelines for voice activity detection (VAD), speaker segmentation, and speaker diarization, and explore automatic speech recognition. We apply these tools to one public dataset (the AMI corpus; single-channel microphone recordings; test set) using default parameters and minimal postprocessing. The second dataset (Dutch Election Debate 2021, DED21) is not yet openly available due to copyright restrictions.
The results of our pipeline comparisons are shown in the comparison notebooks (compare_vad.ipynb and compare_sd.ipynb).
In the speaker diarization comparison, the pyannote.audio pipeline outperformed all other candidates, achieving an average diarization error rate (DER) of 0.32 on the AMI test set (0.21 when speech overlap is excluded) and DERs of 0.35 and 0.38 on the two parts of the DED21 dataset (overlap excluded).
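The DER quoted above is the fraction of reference speech time that is misattributed: missed speech, false alarms, and speaker confusions, normalized by the total amount of reference speech. A simplified frame-based sketch of the metric (illustrative only; a full scorer such as pyannote.metrics additionally finds an optimal mapping between reference and hypothesis speaker labels and can apply a forgiveness collar around turn boundaries):

```python
def der(reference, hypothesis, step=0.01):
    """Frame-based diarization error rate over (start, end, speaker) tuples.

    Per frame, the error is max(|ref|, |hyp|) - |ref & hyp| speakers,
    i.e. missed speech, false alarms, and confusions each counted once;
    labels are assumed to be already aligned between the two annotations.
    """
    end = max(seg[1] for seg in reference + hypothesis)
    n_frames = round(end / step)
    errors, total = 0, 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # frame midpoint
        ref = {spk for s, e, spk in reference if s <= t < e}
        hyp = {spk for s, e, spk in hypothesis if s <= t < e}
        total += len(ref)
        errors += max(len(ref), len(hyp)) - len(ref & hyp)
    return errors / total if total else 0.0

# Toy example: the hypothesis hands over the turn one second too late
ref = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hyp = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
print(der(ref, hyp))  # 0.1 (one of ten seconds of speech is misattributed)
```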
Bredin, H., et al. (2020). pyannote.audio: Neural building blocks for speaker diarization. In Proc. ICASSP 2020, pp. 7124-7128. URL

Carletta, J. (2006). Announcing the AMI meeting corpus. The ELRA Newsletter, 11(1), pp. 3-5. URL

Schumacher, G., Homan, M., & Pipal, C. (2021, March). Welke lijsttrekker lacht het meest? En hoe? [Which party leader laughs the most? And how?]. Stuk Rood Vlees. URL