- HuggingFace Space for Audio Transcription (File, Microphone and YouTube)
- Pretrained models available in 14+ languages
- Automatic Speech Recognition (ASR)
  - Supported ASR models:
    - Jasper, QuartzNet, CitriNet, ContextNet
    - Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
    - Squeezeformer-CTC and Squeezeformer-Transducer
    - LSTM-Transducer (RNNT) and LSTM-CTC
  - Supports the following decoders/losses:
    - CTC
    - Transducer/RNNT
    - Hybrid Transducer/CTC
    - NeMo Original Multi-blank Transducers and Token-and-Duration Transducers (TDT)
  - Streaming/Buffered ASR (CTC/Transducer) - Chunked Inference Examples
  - Cache-aware Streaming Conformer with multiple lookaheads (including a microphone streaming tutorial)
  - Beam Search decoding
  - Language Modelling for ASR (CTC and RNNT): N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
  - Support for long audio with Conformer using memory-efficient local attention
- Speech Classification, Speech Command Recognition and Language Identification: MatchboxNet (Command Recognition), AmberNet (LangID)
- Voice Activity Detection (VAD): MarbleNet
  - ASR with VAD Inference - Example
- Speaker Recognition: TitaNet, ECAPA_TDNN, SpeakerNet
- Speaker Diarization
  - Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
  - Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
- Speech Intent Detection and Slot Filling: Conformer-Transformer
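To give a feel for what the CTC decoder listed above does, here is a minimal, self-contained sketch of CTC greedy decoding (an illustrative toy, not NeMo's implementation): the per-frame argmax labels are collapsed by removing consecutive repeats and then dropping blank tokens. The `ctc_greedy_decode` function and the tiny vocabulary are hypothetical names for illustration.

```python
BLANK = 0  # assume index 0 is the CTC blank token (an assumption for this sketch)

def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse a per-frame label sequence into a transcript:
    skip labels that repeat the previous frame, then skip blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

vocab = {1: "c", 2: "a", 3: "t"}
# frames: c c <blank> a a <blank> t  ->  "cat"
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3], vocab))  # -> cat
```

Beam Search decoding (also listed above) replaces this single greedy path with a pruned search over many candidate label sequences, which is where the N-gram LM fusion comes in.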
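For intuition on the VAD task above, the sketch below uses a simple per-frame RMS energy threshold to flag speech frames. This heuristic is only illustrative; NeMo's MarbleNet is a trained neural model, and the function name, frame length, and threshold here are assumptions made up for the example.

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.02):
    """Return one True/False speech flag per non-overlapping frame,
    based on the frame's RMS energy (illustrative heuristic only)."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags

# one silent frame followed by one "loud" frame
audio = [0.0] * 160 + [0.1] * 160
print(energy_vad(audio))  # -> [False, True]
```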
You can also get a high-level overview of NeMo ASR by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022.