Awesome Speaker Diarization

Overview

This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.

The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.

To add items to this page, simply send a pull request. (contributing guide)

Publications

Special topics

Review & survey papers

Large language model (LLM)

Supervised diarization

Joint diarization and ASR

Online speaker diarization

Challenges

Audio-Visual Speaker Diarization

Other

2021

2020

2019

2018

2017

2016

2015

2014

2013

2011

2009

2008

2006

Software

Framework

| Link | Language | Description |
| --- | --- | --- |
| FunASR | Python & PyTorch | FunASR is an open-source speech toolkit based on PyTorch that aims to bridge the gap between academic research and industrial applications. |
| MiniVox | MATLAB | MiniVox is an open-source evaluation system for the online speaker diarization task. |
| SpeechBrain | Python & PyTorch | SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. |
| SIDEKIT for diarization (s4d) | Python | An open-source package extension of SIDEKIT for speaker diarization. |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
| AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
| LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java and includes the most recent developments in the domain (as of 2013). |
| kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
| Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg provides the tools for speaker diarization. |
| pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
| pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
| Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
| EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
| VBx | Python | Variational Bayes HMM over x-vectors diarization. x-vector extractor recipe |
| RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when. |
| StreamingSpeakerDiarization | Python | Streaming speaker diarization that extends pyannote.audio to online processing. |
| simple_diarizer | Python | Simplified diarization pipeline using some pretrained models. Made to be as simple as possible to go from an input audio file to diarized segments. |
| Picovoice Falcon | C & Python | A lightweight, accurate, and fast speaker diarization engine written in C and available in Python, running on CPU with minimal overhead. |
| DiaPer | Python | PyTorch implementation of DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors, including models pre-trained on free and public data. |

Evaluation

| Link | Language | Description |
| --- | --- | --- |
| pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
| SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
| NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
| dscore | Python & Perl | Diarization scoring tools. |
| Sequence Match Accuracy | Python | Match the accuracy of two sequences with the Hungarian algorithm. |
| spyder | Python & C++ | Simple Python package for fast DER computation. |
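
As a rough illustration of what the DER tools above compute, here is a minimal frame-level sketch. It is not the implementation used by any of the listed packages: it assumes one label per fixed-length frame (`None` for non-speech), ignores overlapped speech and forgiveness collars, and finds the best speaker mapping by brute force over permutations rather than with the Hungarian algorithm, so it is only practical for a handful of speakers. The function name `frame_der` is hypothetical.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level Diarization Error Rate.

    Each argument is a list with one speaker label per frame; None marks
    non-speech. Assumption: at most one speaker per frame (no overlap).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    ref_speech = sum(1 for r in reference if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    # Missed speech: reference has a speaker, hypothesis says non-speech.
    miss = sum(1 for r, h in pairs if r is not None and h is None)
    # False alarm: hypothesis has a speaker, reference says non-speech.
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)

    # Count co-occurrences of (reference speaker, hypothesis speaker).
    overlap = {}
    both_speech = 0
    for r, h in pairs:
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
            both_speech += 1

    # Unique labels in first-seen order, padded so both sides have equal size.
    ref_labels = list(dict.fromkeys(r for r in reference if r is not None))
    hyp_labels = list(dict.fromkeys(h for h in hypothesis if h is not None))
    size = max(len(ref_labels), len(hyp_labels))
    ref_labels += [None] * (size - len(ref_labels))
    hyp_labels += [None] * (size - len(hyp_labels))

    # Brute-force search for the one-to-one mapping with the most agreement.
    best_match = 0
    for perm in permutations(hyp_labels):
        matched = sum(overlap.get((r, h), 0) for r, h in zip(ref_labels, perm))
        best_match = max(best_match, matched)

    # Speaker confusion: frames where both say speech but the mapping disagrees.
    confusion = both_speech - best_match
    return (miss + false_alarm + confusion) / ref_speech
```

For example, with reference `["A", "A", "A", "B", "B", None]` and hypothesis `["x", "x", "y", "y", "y", "y"]`, there is one false-alarm frame and one confused frame over five reference speech frames, so the sketch returns 0.4. For real evaluations, prefer one of the tools in the table, which work on timestamps and handle overlap and collars.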

Clustering

| Link | Language | Description |
| --- | --- | --- |
| uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
| uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
| DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
| SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations, auto-tune, and speaker turn constraints. |
| sklearn.cluster | Python | scikit-learn clustering algorithms. |
| PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
| PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
| Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |

Speaker embedding

| Link | Method | Language | Description |
| --- | --- | --- | --- |
| resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
| Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
| PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
| Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
| deep-speaker | d-vector | Python & Keras | Third-party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
| x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
| kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
| voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
| pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
| ASVtorch | i-vector | Python & PyTorch | ASVtorch is a toolkit for automatic speaker recognition. |
| asv-subtools | i-vector & x-vector | Kaldi & PyTorch | ASV-Subtools is developed based on PyTorch and Kaldi for the task of speaker recognition, language identification, etc. The 'sub' of 'subtools' means that there are many modular tools and the parts constitute the whole. |
| WeSpeaker | x-vector & r-vector | Python & C++ & PyTorch | WeSpeaker is a research and production oriented speaker verification, recognition and diarization toolkit, which supports very strong recipes with on-the-fly data preparation, model training and evaluation, as well as runtime C++ codes. |

Speaker change detection

| Link | Language | Description |
| --- | --- | --- |
| change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |

Audio feature extraction

| Link | Language | Description |
| --- | --- | --- |
| LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
| python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
| pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |

Audio data augmentation

| Link | Language | Description |
| --- | --- | --- |
| pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
| gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration. |
| rir_simulator_python | Python | Room impulse response simulator using Python. |
| WavAugment | Python & PyTorch | WavAugment performs data augmentation on audio data. The audio data is represented as PyTorch tensors. |
| EEND_dataprep | Bash & Python | Recipes for generating simulated conversations used to train end-to-end diarization models. |

Other software

| Link | Language | Description |
| --- | --- | --- |
| VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
| DOVER-Lap | Python | Python package for combining diarization system outputs. |

Datasets

Diarization datasets

| Audio | Diarization ground truth | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- |
| 2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
| 2003 NIST Rich Transcription Evaluation Data | Included with the audio | en, ar, zh | $2000.00 | Telephone speech, broadcast news |
| CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
| The ICSI Meeting Corpus | Included with the audio | en | Free | License |
| The AMI Meeting Corpus | Included with the audio (needs to be processed) | Multiple | Free | License |
| Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
| Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
| VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos. |
| MiniVox Benchmark | MiniVox Benchmark | en | Free | MiniVox is an automatic framework to transform any speaker-labelled dataset into a continuous speech datastream with episodically revealed label feedback. |
| The AliMeeting Corpus | Included with the audio | zh | Free | |

Speaker embedding training sets

| Name | Utterances | Speakers | Language | Pricing | Additional information |
| --- | --- | --- | --- | --- | --- |
| TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
| VCTK | 43K+ | 109 | en | Free | Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
| LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
| Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
| LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
| VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
| The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
| CN-Celeb | 130K+ | 1K | zh | Free | A free Chinese speaker recognition corpus released by CSLT@Tsinghua University. |
| BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos (YouTube videos where people share their opinions on books). The dataset can be downloaded using BookTubeSpeech-download. |
| DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
| NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, ...) as well as regional information and linguistic information. |

Augmentation noise sources

| Name | Utterances | Pricing | Additional information |
| --- | --- | --- | --- |
| AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
| MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
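
Noise corpora like these are typically used by adding a noise clip to clean speech at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that mixing step only, assuming both signals are plain lists of float samples of equal length at the same sample rate. The function name `mix_at_snr` is hypothetical and does not come from any of the packages listed on this page.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the result has the requested SNR in dB.

    Assumption: both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")

    # Average power of each signal (mean of squared samples).
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

In practice the noise clip is usually cropped or looped to match the speech length before mixing, and the SNR is sampled at random per utterance. Those steps are left out here.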

Conferences

| Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
| --- | --- | --- | --- | --- |
| ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
| InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
| Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
| SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
| ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
| WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |
| IJCB | Annual | 8 | IEEE & IAPR TC-4 | Yes |

Other learning materials

Online courses

Books

Tech blogs

Video tutorials

Products

| Company | Product |
| --- | --- |
| Google | Recorder app |
| Google | Google Cloud Speech-to-Text API |
| Amazon | Amazon Transcribe |
| IBM | Watson Speech To Text API |
| DeepAffects | Speaker Diarization API |
| Alibaba | Tingwu (听悟) |
| Microsoft | Azure Conversation Transcription API |
