- HuggingFace Space for Audio Transcription (File, Microphone and YouTube)
- Pretrained models available in 14+ languages
- Automatic Speech Recognition (ASR)
  - Supported ASR models:
    - Jasper, QuartzNet, CitriNet, ContextNet
    - Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
    - Squeezeformer-CTC and Squeezeformer-Transducer
    - LSTM-Transducer (RNNT) and LSTM-CTC
  - Supports the following decoders/losses:
    - CTC
    - Transducer/RNNT
    - Hybrid Transducer/CTC
    - NeMo Original Multi-blank Transducers and Token-and-Duration Transducers (TDT)
  - Streaming/Buffered ASR (CTC/Transducer) - Chunked Inference Examples
  - Cache-aware Streaming Conformer with multiple lookaheads (including a microphone streaming tutorial)
  - Beam Search decoding
  - Language Modelling for ASR (CTC and RNNT): N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
  - Support for long audio with Conformer using memory-efficient local attention
- Speech Classification, Speech Command Recognition and Language Identification: MatchboxNet (Command Recognition), AmberNet (LangID)
- Voice Activity Detection (VAD): MarbleNet
  - ASR with VAD Inference - Example
- Speaker Recognition: TitaNet, ECAPA_TDNN, SpeakerNet
- Speaker Diarization
  - Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
  - Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
- Speech Intent Detection and Slot Filling: Conformer-Transformer
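To give a feel for what the CTC decoder listed above does, here is a minimal, self-contained sketch of CTC greedy decoding (an illustrative toy, not NeMo's implementation): the per-frame argmax labels are collapsed by removing consecutive repeats and then dropping blank tokens. The `ctc_greedy_decode` function and the tiny vocabulary are hypothetical names for illustration.

```python
BLANK = 0  # assume index 0 is the CTC blank token (an assumption for this sketch)

def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse a per-frame label sequence into a transcript:
    skip labels that repeat the previous frame, then skip blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

vocab = {1: "c", 2: "a", 3: "t"}
# frames: c c <blank> a a <blank> t  ->  "cat"
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3], vocab))  # -> cat
```

Beam Search decoding (also listed above) replaces this single greedy path with a pruned search over many candidate label sequences, which is where the N-gram LM fusion comes in.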
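For intuition on the VAD task above, the sketch below uses a simple per-frame RMS energy threshold to flag speech frames. This heuristic is only illustrative; NeMo's MarbleNet is a trained neural model, and the function name, frame length, and threshold here are assumptions made up for the example.

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.02):
    """Return one True/False speech flag per non-overlapping frame,
    based on the frame's RMS energy (illustrative heuristic only)."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        flags.append(rms > threshold)
    return flags

# one silent frame followed by one "loud" frame
audio = [0.0] * 160 + [0.1] * 160
print(energy_vad(audio))  # -> [False, True]
```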
You can also get a high-level overview of NeMo ASR by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022.