v0.6.0: Meeting mode, SenseVoice, and 4 new ASR engines #210

Merged
peteonrails merged 22 commits into main from release/0.6.0
Feb 17, 2026

Conversation

@peteonrails
Owner

Summary

  • Meeting mode: Continuous transcription with speaker attribution, pause/resume, export (text/markdown/JSON/SRT/VTT), and AI summarization via Ollama
  • 4 new ONNX engines: SenseVoice (zh/en/ja/ko/yue with auto language detection), Paraformer, Dolphin, Omnilingual (50+ languages), plus shared fbank/CTC infrastructure
  • ML speaker diarization: ONNX embedding-based speaker identification as alternative to simple energy-based attribution
  • Dual audio capture: Mic + system audio loopback for capturing remote meeting participants
  • Binary rename: voxtype-parakeet-* binaries renamed to voxtype-onnx-* (now include all ONNX engines, not just Parakeet)
  • Removed: FireRedASR (license incompatible), Pro license gate
  • Updated: Docker build configs for all ONNX engines, smoke test procedures, model selection setup

21 commits, 65 files changed, ~12,700 lines added across meeting mode, new engines, shared preprocessing, CLI commands, configuration, and tests.

Test plan

  • cargo test passes (526 tests)
  • cargo clippy clean
  • All 7 binary variants built and version-verified
  • AVX-512 instruction validation (AVX2 and Vulkan binaries are clean)
  • Multi-engine transcription smoke tests (Whisper EN, SenseVoice EN, SenseVoice ZH)
  • Signal handling and daemon lifecycle tests
  • CLI commands and config compatibility tests
  • Extended manual testing of meeting mode across real meetings
  • SenseVoice English accuracy comparison against sherpa-onnx reference

The ONNX binaries now include both Parakeet and Moonshine engines, so
the "parakeet" name no longer fits. Binary names change from
voxtype-parakeet-{avx2,avx512,cuda,rocm} to
voxtype-onnx-{avx2,avx512,cuda,rocm}.

Backward compatible: symlink detection checks both old and new names,
and gpu setup looks for both voxtype-onnx-* and voxtype-parakeet-*
files on disk so existing v0.5.6 installations keep working.

Cargo features, engine config, and CLI commands (setup parakeet) are
unchanged.
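
The dual-name lookup described above can be sketched roughly as follows. This is an illustration only, not the project's code: `candidate_binary_names` and `find_installed_binary` are hypothetical helpers, and the `exists` callback stands in for whatever filesystem check the real gpu setup performs.

```rust
/// CPU/GPU variants named in the commit message above.
const VARIANTS: [&str; 4] = ["avx2", "avx512", "cuda", "rocm"];

/// Candidate file names for a variant: the new name first, the legacy name second.
/// (Hypothetical helper, not taken from the voxtype source.)
fn candidate_binary_names(variant: &str) -> Option<Vec<String>> {
    if !VARIANTS.contains(&variant) {
        return None;
    }
    Some(vec![
        format!("voxtype-onnx-{}", variant),
        format!("voxtype-parakeet-{}", variant),
    ])
}

/// Returns the first candidate for which `exists` reports true.
/// `exists` abstracts the on-disk check so the sketch stays testable.
fn find_installed_binary<F: Fn(&str) -> bool>(variant: &str, exists: F) -> Option<String> {
    candidate_binary_names(variant)?
        .into_iter()
        .find(|name| exists(name.as_str()))
}
```

Preferring the new name and falling back to the old one means an install that only has `voxtype-parakeet-cuda` on disk still resolves.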

Integrate Alibaba's SenseVoice model via ONNX Runtime for local
transcription. SenseVoice is a CTC encoder-only model supporting
zh/en/ja/ko/yue with a single forward pass. The preprocessing pipeline
converts audio to 80-dim Fbank features, stacks via LFR to 560-dim,
then CMVN-normalizes before ONNX inference.
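
A minimal sketch of the LFR stacking step, under stated assumptions: going from 80 to 560 dimensions implies stacking 7 frames; the stride of 6 used in the usage note and the tail handling (repeating the last frame) are assumptions, and the left padding that some reference implementations apply is omitted. `apply_lfr` is a hypothetical name.

```rust
/// Low frame rate (LFR) stacking: concatenate `m` consecutive feature frames
/// into one wide frame, then advance by `n` frames.
/// Simplified sketch: no left padding; the last frame is repeated at the tail.
fn apply_lfr(frames: &[Vec<f32>], m: usize, n: usize) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    if frames.is_empty() || m == 0 || n == 0 {
        return out;
    }
    let last = frames.len() - 1;
    let mut start = 0;
    while start < frames.len() {
        let mut stacked = Vec::with_capacity(m * frames[0].len());
        for k in 0..m {
            // Clamp the index so windows that run past the end reuse the last frame.
            let idx = (start + k).min(last);
            stacked.extend_from_slice(&frames[idx]);
        }
        out.push(stacked);
        start += n;
    }
    out
}
```

With 80-dim input frames, `apply_lfr(&frames, 7, 6)` yields 560-dim frames at roughly one sixth of the original frame rate.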

Cherry-picked from spike/sensevoice-onnx (3320058).
Cherry-picked from spike/sensevoice-onnx (5ae82b8).
Cherry-picked from spike/sensevoice-onnx (33ec709).

Move Fbank feature extraction from sensevoice.rs into shared
fbank.rs with parameterizable FbankConfig (window type, frame
length/shift, pre-emphasis). Add CTC greedy decoder in ctc.rs.
Both modules will be reused by Paraformer, Dolphin, Omnilingual,
and FireRedASR engines.
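
For readers unfamiliar with CTC greedy decoding, the idea is: take the argmax token per frame, collapse consecutive repeats, and drop blanks. The sketch below illustrates that technique; the function name and signature are assumptions and may not match what ctc.rs actually exposes.

```rust
/// CTC greedy decode: per-frame argmax, collapse consecutive repeats, drop blanks.
/// `logits` is [frames][vocab]; `blank_id` is the CTC blank token index.
fn ctc_greedy_decode(logits: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut ids = Vec::new();
    let mut prev: Option<usize> = None;
    for frame in logits {
        if frame.is_empty() {
            continue;
        }
        // Argmax over the vocabulary dimension.
        let mut best = 0;
        let mut best_val = f32::NEG_INFINITY;
        for (i, &v) in frame.iter().enumerate() {
            if v > best_val {
                best_val = v;
                best = i;
            }
        }
        // Emit only when the token changes and is not the blank.
        if Some(best) != prev && best != blank_id {
            ids.push(best);
        }
        prev = Some(best);
    }
    ids
}
```

A blank between two identical tokens separates them, so the frame sequence `1 1 0 1 2 2 0` with blank `0` decodes to `1 1 2`.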

Consolidate ONNX dependencies under onnx-common feature flag.
Add TranscriptionEngine variants, config structs, CLI parsing,
daemon match arms, VAD auto-selection, and notification icons
for Paraformer, Dolphin, Omnilingual, and FireRedASR engines.

Four new ONNX-based transcription backends using the shared fbank
and CTC decoder infrastructure:

- Paraformer: FunASR CTC encoder (zh/en), same preprocessing as
  SenseVoice with LFR from model metadata
- Dolphin: dictation-optimized CTC encoder with Hann window,
  31.25ms frame, no LFR or pre-emphasis
- Omnilingual: FunASR 50+ language model with 20ms frame shift
  and per-utterance instance normalization
- FireRedASR: autoregressive encoder-decoder (sherpa-onnx exports)
  following the Moonshine pattern for greedy decoding
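
As an illustration of the per-utterance instance normalization mentioned for Omnilingual, a common form is zero-mean, unit-variance scaling over the whole utterance. Whether the engine applies this to raw samples or to features, and the epsilon value, are assumptions here; `instance_normalize` is a hypothetical name.

```rust
/// Per-utterance instance normalization: shift to zero mean and scale to
/// unit variance across the whole buffer. The epsilon guards against
/// division by zero on silent input (value chosen for illustration).
fn instance_normalize(samples: &mut [f32]) {
    if samples.is_empty() {
        return;
    }
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    let var = samples.iter().map(|s| (s - mean) * (s - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + 1e-5).sqrt();
    for s in samples.iter_mut() {
        *s = (*s - mean) * inv_std;
    }
}
```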

Introduces continuous meeting transcription with chunked processing,
speaker attribution, and export capabilities (Pro feature).

New modules: meeting/chunk.rs, meeting/data.rs, meeting/state.rs,
meeting/storage.rs, meeting/export/{json,markdown,txt}.rs

CLI: voxtype meeting {start,stop,pause,resume,status,list,show,export}

Adapted to multi-engine architecture (accepts Config instead of
WhisperConfig for engine-agnostic transcriber creation).

Adds meeting lifecycle management to the daemon: start, stop, pause,
resume, and chunk processing. Uses file-based IPC for state
communication with CLI.

Adapted send_notification calls for multi-engine signature and
MeetingDaemon::new to accept full Config.
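
The lifecycle commands above suggest a small state machine. The following is a sketch of one plausible set of transitions, not the contents of meeting/state.rs; the state names, command names, and error type are all assumptions.

```rust
/// Hypothetical meeting states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeetingState {
    Idle,
    Recording,
    Paused,
}

/// Hypothetical lifecycle commands, mirroring the CLI verbs.
#[derive(Debug, Clone, Copy)]
enum MeetingCommand {
    Start,
    Stop,
    Pause,
    Resume,
}

/// Applies a command to a state, rejecting transitions that make no sense
/// (for example, pausing a meeting that was never started).
fn transition(state: MeetingState, cmd: MeetingCommand) -> Result<MeetingState, String> {
    match (state, cmd) {
        (MeetingState::Idle, MeetingCommand::Start) => Ok(MeetingState::Recording),
        (MeetingState::Recording, MeetingCommand::Pause) => Ok(MeetingState::Paused),
        (MeetingState::Paused, MeetingCommand::Resume) => Ok(MeetingState::Recording),
        (MeetingState::Recording, MeetingCommand::Stop)
        | (MeetingState::Paused, MeetingCommand::Stop) => Ok(MeetingState::Idle),
        (s, c) => Err(format!("cannot apply {:?} while {:?}", c, s)),
    }
}
```

Keeping the transition table in one pure function makes it easy to unit test independently of the file-based IPC layer.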

Introduces loopback audio capture alongside microphone for
You/Remote speaker attribution. Adds diarization module
with simple energy-based speaker detection.
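
One simple way to do energy-based You/Remote attribution is to compare the RMS energy of the microphone and loopback streams for each chunk. The sketch below assumes that approach; the `Speaker` enum, the silence threshold, and the tie-breaking rule are illustrative choices rather than the diarization module's actual behavior.

```rust
/// Hypothetical attribution result for one audio chunk.
#[derive(Debug, PartialEq)]
enum Speaker {
    You,
    Remote,
    Silence,
}

/// Root-mean-square energy of a buffer of samples.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Attributes a chunk to whichever stream is louder, or to silence when
/// both fall below `silence_threshold`. Ties go to the microphone.
fn attribute_chunk(mic: &[f32], loopback: &[f32], silence_threshold: f32) -> Speaker {
    let mic_energy = rms(mic);
    let loopback_energy = rms(loopback);
    if mic_energy < silence_threshold && loopback_energy < silence_threshold {
        Speaker::Silence
    } else if mic_energy >= loopback_energy {
        Speaker::You
    } else {
        Speaker::Remote
    }
}
```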

Adds ML-based speaker embedding extraction using ONNX Runtime
for improved speaker diarization accuracy. Includes spectral
clustering for speaker assignment.

Uses existing onnx-common deps via ml-diarization feature flag.
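
To show how speaker embeddings can drive assignment, here is a deliberately simpler stand-in for the spectral clustering described above: greedy assignment by cosine similarity against known speaker centroids. It is not the clustering this commit implements; the function names and the threshold are assumptions.

```rust
/// Cosine similarity between two embedding vectors (0.0 if either is all zeros).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Greedy assignment: reuse the most similar known speaker if it clears
/// `threshold`, otherwise register the embedding as a new speaker.
/// Returns the speaker index.
fn assign_speaker(embedding: &[f32], centroids: &mut Vec<Vec<f32>>, threshold: f32) -> usize {
    let mut best: Option<(usize, f32)> = None;
    for (i, c) in centroids.iter().enumerate() {
        let sim = cosine_similarity(embedding, c);
        if best.map_or(true, |(_, s)| sim > s) {
            best = Some((i, sim));
        }
    }
    match best {
        Some((i, sim)) if sim >= threshold => i,
        _ => {
            centroids.push(embedding.to_vec());
            centroids.len() - 1
        }
    }
}
```

Spectral clustering works on the full pairwise similarity matrix instead of one embedding at a time, which is why it needs the whole meeting (or a large window) before assigning speakers.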

Adds meeting summarization using local Ollama LLM: key points,
action items, and speaker attribution. Includes configurable
prompt templates and async processing.
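
A rough sketch of how a configurable prompt template could be turned into a request body for Ollama's `/api/generate` endpoint, which accepts `model`, `prompt`, and `stream` fields. The `{transcript}` placeholder and the helper names are assumptions, and the hand-rolled JSON escaping only exists to keep the sketch dependency-free; real code would more likely use a JSON library.

```rust
/// Substitutes the transcript into a template. The `{transcript}` placeholder
/// is a hypothetical convention, not necessarily the one voxtype uses.
fn render_prompt(template: &str, transcript: &str) -> String {
    template.replace("{transcript}", transcript)
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a non-streaming request body for POST /api/generate.
fn build_generate_body(model: &str, prompt: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"prompt\":\"{}\",\"stream\":false}}",
        json_escape(model),
        json_escape(prompt)
    )
}
```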

Comprehensive tests for meeting data types, storage, state
transitions, chunk processing, and export formats. Updates
smoke test documentation with meeting mode test procedures.

Auto-fix push_str single-char to push, unneeded returns,
derivable impls, collapsed if-else, redundant closures.
Manual fixes: rename ExportFormat::from_str to parse,
remove wildcard-with-pattern in match arm.
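
For context on the rename: clippy's `should_implement_trait` lint fires when an inherent method is named `from_str`, since that name suggests the `std::str::FromStr` trait. A sketch of the renamed shape follows; the variant list is taken from the export formats in the PR summary, while the accepted spellings and the `Option` return type are assumptions.

```rust
/// Export formats listed in the PR summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExportFormat {
    Text,
    Markdown,
    Json,
    Srt,
    Vtt,
}

impl ExportFormat {
    /// Inherent parser named `parse` rather than `from_str`, which avoids
    /// shadowing the name of the `FromStr` trait method.
    /// Accepted spellings are illustrative.
    fn parse(s: &str) -> Option<ExportFormat> {
        match s.trim().to_lowercase().as_str() {
            "text" | "txt" => Some(ExportFormat::Text),
            "markdown" | "md" => Some(ExportFormat::Markdown),
            "json" => Some(ExportFormat::Json),
            "srt" => Some(ExportFormat::Srt),
            "vtt" => Some(ExportFormat::Vtt),
            _ => None,
        }
    }
}
```

The alternative fix would be to implement `FromStr` itself, which also makes `"srt".parse::<ExportFormat>()` work.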

…edASR

The setup model command only knew about Whisper, Parakeet, Moonshine, and
SenseVoice. Add model catalog entries, download logic, and interactive
menu sections for the four remaining ONNX engines.

Uses generic shared handlers (validate_onnx_ctc_model, download_onnx_model,
handle_onnx_engine_selection, update_config_engine) to avoid duplicating
~200 lines per engine.
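
The deduplication idea can be illustrated with a table-driven catalog and one generic validator. Everything below is hypothetical: the struct, the file names (`model.onnx`, `tokens.txt`), and the function names are placeholders, not the project's real catalog or the handlers named above. Only the three CTC engines that ship in this release are listed.

```rust
/// Hypothetical catalog entry: an engine name plus the files it needs on disk.
struct OnnxModelEntry {
    engine: &'static str,
    required_files: &'static [&'static str],
}

/// Placeholder catalog; file names are illustrative only.
static CATALOG: [OnnxModelEntry; 3] = [
    OnnxModelEntry { engine: "paraformer", required_files: &["model.onnx", "tokens.txt"] },
    OnnxModelEntry { engine: "dolphin", required_files: &["model.onnx", "tokens.txt"] },
    OnnxModelEntry { engine: "omnilingual", required_files: &["model.onnx", "tokens.txt"] },
];

fn find_entry(engine: &str) -> Option<&'static OnnxModelEntry> {
    CATALOG.iter().find(|e| e.engine == engine)
}

/// One validator shared by every engine: look the engine up in the catalog
/// and report any missing files. `exists` abstracts the filesystem check.
fn validate_model_dir<F: Fn(&str) -> bool>(engine: &str, exists: F) -> Result<(), String> {
    let entry = find_entry(engine).ok_or_else(|| format!("unknown engine: {}", engine))?;
    let missing: Vec<&str> = entry
        .required_files
        .iter()
        .cloned()
        .filter(|f| !exists(f))
        .collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("{} is missing: {}", engine, missing.join(", ")))
    }
}
```

Adding an engine then means adding one catalog row instead of another copy of the validation and download code.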

FireRedASR dropped: its autoregressive encoder-decoder architecture adds
more complexity than it is worth, v2 has no ONNX export, and the 1.74GB
model is a niche, Chinese-primary model.

Replaced SenseVoice, Paraformer, Dolphin, Omnilingual, CTC, and Fbank
implementations with improved versions that extract shared preprocessing
into dedicated modules (ctc.rs for CTC decoding, fbank.rs for mel
filterbank extraction).

Meeting mode ships as a standard feature, not Pro-gated. Remove
license.rs module, Pro feature checks, and related error variants.

Also incorporates code quality improvements from meeting mode review:
storage path handling, summary module cleanup, VAD integration test
coverage for all ONNX engine variants.

Cherry-picked from feature/meeting-mode (22873ef, 51548ab).

Dolphin: Add Fbank preprocessing pipeline. The model expects [N,T,80]
Fbank features, not raw waveform. Add CMVN normalization from model
metadata (mean/invstd keys with already-negated values). Fix input
tensor name (x_len) and type (i64). Add lob_probs output name.
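
The "already-negated" detail matters for the sign of the CMVN step: if the stored mean is already negated, it is added rather than subtracted. A minimal sketch under that reading follows; the function and parameter names are assumptions, and how the values are parsed out of the model metadata is not shown.

```rust
/// CMVN with a pre-negated mean: out = (x + neg_mean) * inv_std, per dimension.
/// `neg_mean` holds the negated per-dimension mean and `inv_std` the inverse
/// standard deviation, both as read from model metadata.
fn apply_cmvn(frames: &mut [Vec<f32>], neg_mean: &[f32], inv_std: &[f32]) {
    for frame in frames.iter_mut() {
        for ((x, m), s) in frame.iter_mut().zip(neg_mean).zip(inv_std) {
            // Adding the negated mean is the same as subtracting the mean.
            *x = (*x + m) * s;
        }
    }
}
```

Subtracting a pre-negated mean by mistake would shift every feature by twice the mean, which tends to produce plausible-looking but wrong transcriptions rather than an outright failure.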

Paraformer: Fix BPE marker stripping for 3D logits path. Add
ctc_decode_to_ids() that returns token IDs, then route through
tokens_to_text() which handles @@ marker removal and special token
filtering. Previously the CTC greedy decode path left @@ artifacts
and </s> tokens in the output.
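
A sketch of the kind of cleanup described for the Paraformer path, assuming the common BPE convention that a trailing `@@` means "this piece continues into the next token" and that special tokens look like `<...>`. The signature is hypothetical (the real tokens_to_text may take IDs and a vocabulary), and joining words with spaces is only appropriate for space-delimited languages.

```rust
/// Joins BPE pieces into text: a trailing "@@" glues a piece to the next one,
/// and angle-bracketed special tokens such as </s> are dropped.
fn tokens_to_text(tokens: &[&str]) -> String {
    let mut text = String::new();
    for tok in tokens {
        // Skip special tokens like <s>, </s>, <unk>.
        if tok.starts_with('<') && tok.ends_with('>') {
            continue;
        }
        if let Some(stem) = tok.strip_suffix("@@") {
            // Word continues: append without a separator.
            text.push_str(stem);
        } else {
            // Word ends here: append and add a separator.
            text.push_str(tok);
            text.push(' ');
        }
    }
    text.trim_end().to_string()
}
```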

Both engines now pass smoke tests with correct transcription output.

Cover all 7 engine variants (Whisper, Parakeet, Moonshine, SenseVoice,
Paraformer, Dolphin, Omnilingual) with quick validation, daemon
integration, error handling, and performance comparison procedures.

Dockerfile.onnx and Dockerfile.onnx-cuda now build with all ONNX engines
(sensevoice, paraformer, dolphin, omnilingual) instead of just parakeet +
moonshine. Added .worktrees/ to .dockerignore to prevent 16GB of worktree
data from being sent as build context over SSH.

The ONNX binaries now include five engines beyond Parakeet, so the
command name should reflect that. Existing scripts that use
`voxtype setup parakeet` will continue to work via the hidden alias.

@peteonrails merged commit dd23552 into main on Feb 17, 2026
5 checks passed