v0.6.0 docs, website, and setup onnx rename #212

Closed
peteonrails wants to merge 23 commits into main from release/0.6.0

@peteonrails
Owner

Summary

Follow-up to #210. These changes were pushed to release/0.6.0 after that PR was merged.

  • Rename setup parakeet to setup onnx, with a hidden backward-compatible alias
  • Full documentation pass: model selection guide rewrite, user manual update, new meeting mode guide, README update
  • Website: v0.6.0 news article, model grid with all ONNX engines, download URLs bumped to 0.6.0

Test plan

  • cargo test passes
  • cargo clippy clean
  • Website renders correctly (news article, model grid, download URLs)

The ONNX binaries now include both Parakeet and Moonshine engines, so
the "parakeet" name no longer fits. Binary names change from
voxtype-parakeet-{avx2,avx512,cuda,rocm} to
voxtype-onnx-{avx2,avx512,cuda,rocm}.

Backward compatible: symlink detection checks both old and new names,
and gpu setup looks for both voxtype-onnx-* and voxtype-parakeet-*
files on disk so existing v0.5.6 installations keep working.
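The dual-name lookup described above could be sketched roughly as follows. This is a std-only illustration, not the project's code: `BINARY_PREFIXES`, `onnx_variant`, and `find_onnx_binary` are hypothetical names, and only the two binary prefixes come from this PR.

```rust
use std::path::{Path, PathBuf};

/// Prefixes to accept, new name first. The two prefixes are the ones named in
/// this PR; everything else in this sketch is hypothetical.
const BINARY_PREFIXES: [&str; 2] = ["voxtype-onnx-", "voxtype-parakeet-"];

/// Returns the variant suffix ("avx2", "cuda", ...) if `file_name` follows
/// either the new or the legacy naming scheme.
fn onnx_variant(file_name: &str) -> Option<&str> {
    BINARY_PREFIXES
        .iter()
        .find_map(|prefix| file_name.strip_prefix(*prefix))
        .filter(|variant| !variant.is_empty())
}

/// Picks the first existing binary for `variant` in `dir`, preferring the new name.
fn find_onnx_binary(dir: &Path, variant: &str) -> Option<PathBuf> {
    BINARY_PREFIXES
        .iter()
        .map(|prefix| dir.join(format!("{prefix}{variant}")))
        .find(|candidate| candidate.is_file())
}
```

Listing the new prefix first means a directory containing both names resolves to the renamed binary, while a v0.5.6 install with only the old name still resolves.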

Cargo features, engine config, and CLI commands (setup parakeet) are
unchanged.

Integrate Alibaba's SenseVoice model via ONNX Runtime for local
transcription. SenseVoice is a CTC encoder-only model supporting
zh/en/ja/ko/yue with a single forward pass. The preprocessing pipeline
converts audio to 80-dim Fbank features, stacks via LFR to 560-dim,
then CMVN-normalizes before ONNX inference.
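The LFR stacking step can be illustrated with a small sketch. Going from 80 to 560 dimensions implies a window of 7 frames; the stride of 6 used in the note below is an assumption (a common setting in FunASR-style configs), and the tail-padding policy here is a simplification that may differ from the actual sensevoice.rs code. `apply_lfr` is a hypothetical name.

```rust
/// Stacks `m` consecutive feature frames into one, advancing `n` frames per
/// output (low frame rate, LFR). Past the end of the input, the last frame is
/// repeated as padding. Padding policy is an assumption of this sketch.
fn apply_lfr(frames: &[Vec<f32>], m: usize, n: usize) -> Vec<Vec<f32>> {
    assert!(m > 0 && n > 0, "window and stride must be positive");
    let Some(last) = frames.last() else {
        return Vec::new();
    };
    let mut stacked = Vec::new();
    let mut start = 0;
    while start < frames.len() {
        let mut out = Vec::with_capacity(m * last.len());
        for offset in 0..m {
            // Reuse the final frame when the window runs past the input.
            let frame = frames.get(start + offset).unwrap_or(last);
            out.extend_from_slice(frame);
        }
        stacked.push(out);
        start += n;
    }
    stacked
}
```

With `m = 7` and `n = 6`, thirteen 80-dim frames become three 560-dim frames starting at input frames 0, 6, and 12.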

Cherry-picked from spike/sensevoice-onnx (3320058).
Cherry-picked from spike/sensevoice-onnx (5ae82b8).
Cherry-picked from spike/sensevoice-onnx (33ec709).

Move Fbank feature extraction from sensevoice.rs into shared
fbank.rs with parameterizable FbankConfig (window type, frame
length/shift, pre-emphasis). Add CTC greedy decoder in ctc.rs.
Both modules will be reused by Paraformer, Dolphin, Omnilingual,
and FireRedASR engines.
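
For readers unfamiliar with CTC greedy decoding, a minimal sketch of the technique follows. It is not the contents of ctc.rs; the function name, the `[time][vocab]` layout, and the `blank_id` parameter are assumptions made for illustration.

```rust
/// Greedy CTC decoding: take the arg-max token per frame, collapse consecutive
/// repeats, then drop blanks. `logits` is laid out as [time][vocab].
fn ctc_greedy_decode(logits: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut ids = Vec::new();
    let mut prev: Option<usize> = None;
    for frame in logits {
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(idx, _)| idx);
        // Skip empty frames rather than panicking.
        let Some(best) = best else { continue };
        // Emit only when the label changes and is not the blank symbol.
        if Some(best) != prev && best != blank_id {
            ids.push(best);
        }
        prev = Some(best);
    }
    ids
}
```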
Consolidate ONNX dependencies under onnx-common feature flag.
Add TranscriptionEngine variants, config structs, CLI parsing,
daemon match arms, VAD auto-selection, and notification icons
for Paraformer, Dolphin, Omnilingual, and FireRedASR engines.

Four new ONNX-based transcription backends using the shared fbank
and CTC decoder infrastructure:

- Paraformer: FunASR CTC encoder (zh/en), same preprocessing as
  SenseVoice with LFR from model metadata
- Dolphin: dictation-optimized CTC encoder with Hann window,
  31.25ms frame, no LFR or pre-emphasis
- Omnilingual: FunASR 50+ language model with 20ms frame shift
  and per-utterance instance normalization
- FireRedASR: autoregressive encoder-decoder (sherpa-onnx exports)
  following the Moonshine pattern for greedy decoding
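
The per-engine framing differences listed above suggest a config shape along these lines. This is a hypothetical sketch: the field names, the `WindowType` enum, and the no-padding frame count are assumptions, not the real FbankConfig in fbank.rs. It mainly shows how millisecond settings map to sample counts; at an assumed 16 kHz sample rate, a 31.25 ms frame is 500 samples and a 20 ms shift is 320 samples.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum WindowType {
    Hamming,
    Hann,
}

/// Hypothetical shape for a parameterizable Fbank configuration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FbankConfig {
    window: WindowType,
    frame_length_ms: f32,
    frame_shift_ms: f32,
    /// `None` disables pre-emphasis.
    pre_emphasis: Option<f32>,
}

impl FbankConfig {
    fn frame_length_samples(&self, sample_rate: u32) -> usize {
        (self.frame_length_ms * sample_rate as f32 / 1000.0).round() as usize
    }

    fn frame_shift_samples(&self, sample_rate: u32) -> usize {
        (self.frame_shift_ms * sample_rate as f32 / 1000.0).round() as usize
    }

    /// Number of full frames that fit without padding (an assumption; real
    /// feature extractors may pad the edges instead).
    fn num_frames(&self, num_samples: usize, sample_rate: u32) -> usize {
        let len = self.frame_length_samples(sample_rate);
        let shift = self.frame_shift_samples(sample_rate);
        if shift == 0 || num_samples < len {
            0
        } else {
            1 + (num_samples - len) / shift
        }
    }
}
```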
Introduces continuous meeting transcription with chunked processing,
speaker attribution, and export capabilities (Pro feature).

New modules: meeting/chunk.rs, meeting/data.rs, meeting/state.rs,
meeting/storage.rs, meeting/export/{json,markdown,txt}.rs

CLI: voxtype meeting {start,stop,pause,resume,status,list,show,export}

Adapted to multi-engine architecture (accepts Config instead of
WhisperConfig for engine-agnostic transcriber creation).

Adds meeting lifecycle management to the daemon: start, stop, pause,
resume, and chunk processing. Uses file-based IPC for state
communication with CLI.
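
The lifecycle commands map naturally onto a small state machine. The sketch below is one plausible shape, assuming three states and that invalid transitions are rejected; the type names and the exact transition table are hypothetical and may not match meeting/state.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeetingState {
    Idle,
    Recording,
    Paused,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MeetingCommand {
    Start,
    Stop,
    Pause,
    Resume,
}

/// Returns the next state, or an error message for a command that is not
/// valid in the current state.
fn transition(state: MeetingState, cmd: MeetingCommand) -> Result<MeetingState, String> {
    use MeetingCommand::*;
    use MeetingState::*;
    match (state, cmd) {
        (Idle, Start) => Ok(Recording),
        (Recording, Pause) => Ok(Paused),
        (Paused, Resume) => Ok(Recording),
        (Recording | Paused, Stop) => Ok(Idle),
        (state, cmd) => Err(format!("cannot apply {cmd:?} while {state:?}")),
    }
}
```

Keeping the transition function pure makes it easy to unit-test separately from the file-based IPC that carries the commands.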

Adapted send_notification calls for multi-engine signature and
MeetingDaemon::new to accept full Config.

Introduces loopback audio capture alongside microphone for
You/Remote speaker attribution. Adds diarization module
with simple energy-based speaker detection.
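
Energy-based You/Remote attribution can be sketched by comparing the RMS level of the microphone and loopback streams for the same time slice. The `Silence` variant, the threshold parameter, and the tie-breaking rule are assumptions of this sketch, not details taken from the diarization module.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Speaker {
    You,
    Remote,
    Silence,
}

/// Root-mean-square level of a block of samples (0.0 for an empty block).
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Attributes a chunk to whichever stream is louder; ties go to the microphone.
fn attribute_chunk(mic: &[f32], loopback: &[f32], silence_threshold: f32) -> Speaker {
    let (mic_rms, loop_rms) = (rms(mic), rms(loopback));
    if mic_rms.max(loop_rms) < silence_threshold {
        Speaker::Silence
    } else if mic_rms >= loop_rms {
        Speaker::You
    } else {
        Speaker::Remote
    }
}
```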
Adds ML-based speaker embedding extraction using ONNX Runtime
for improved speaker diarization accuracy. Includes spectral
clustering for speaker assignment.
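
Spectral clustering starts from a pairwise affinity matrix over the speaker embeddings. The sketch below covers only that first step and assumes cosine similarity as the affinity measure; the eigendecomposition and cluster assignment are omitted, and the function names are hypothetical.

```rust
/// Cosine similarity between two embeddings; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Symmetric matrix of pairwise similarities, one row per embedding.
fn affinity_matrix(embeddings: &[Vec<f32>]) -> Vec<Vec<f32>> {
    embeddings
        .iter()
        .map(|a| embeddings.iter().map(|b| cosine_similarity(a, b)).collect())
        .collect()
}
```

Segments from the same speaker should show affinities near 1, which is what the later clustering step relies on.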

Uses existing onnx-common deps via ml-diarization feature flag.

Adds meeting summarization using local Ollama LLM: key points,
action items, and speaker attribution. Includes configurable
prompt templates and async processing.

Comprehensive tests for meeting data types, storage, state
transitions, chunk processing, and export formats. Updates
smoke test documentation with meeting mode test procedures.

Auto-fix push_str single-char to push, unneeded returns,
derivable impls, collapsed if-else, redundant closures.
Manual fixes: rename ExportFormat::from_str to parse,
remove wildcard-with-pattern in match arm.
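
Two of these fixes are easy to illustrate. An inherent method named `from_str` trips clippy's `should_implement_trait` lint, which is presumably why the method was renamed, and `push_str` with a one-character string trips `single_char_add_str`. The variants below follow the export formats named in this PR; the `Option` return type and the helper function are assumptions of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExportFormat {
    Json,
    Markdown,
    Txt,
}

impl ExportFormat {
    /// Named `parse` rather than `from_str`, so clippy does not ask for a
    /// `std::str::FromStr` implementation instead.
    fn parse(s: &str) -> Option<Self> {
        match s.to_ascii_lowercase().as_str() {
            "json" => Some(Self::Json),
            "markdown" => Some(Self::Markdown),
            "txt" => Some(Self::Txt),
            _ => None,
        }
    }
}

/// Hypothetical helper showing the single-character fix.
fn join_with_newlines(lines: &[&str]) -> String {
    let mut out = String::new();
    for line in lines {
        out.push_str(line);
        out.push('\n'); // instead of out.push_str("\n")
    }
    out
}
```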
…edASR

The setup model command only knew about Whisper, Parakeet, Moonshine, and
SenseVoice. Add model catalog entries, download logic, and interactive
menu sections for the four remaining ONNX engines.

Uses generic shared handlers (validate_onnx_ctc_model, download_onnx_model,
handle_onnx_engine_selection, update_config_engine) to avoid duplicating
~200 lines per engine.

FireRedASR dropped: the autoregressive encoder-decoder architecture is too
complex for the value, v2 has no ONNX export, and the 1.74GB model
serves a Chinese-primary niche.

Replaced SenseVoice, Paraformer, Dolphin, Omnilingual, CTC, and Fbank
implementations with improved versions that extract shared preprocessing
into dedicated modules (ctc.rs for CTC decoding, fbank.rs for mel
filterbank extraction).

Meeting mode ships as a standard feature, not Pro-gated. Remove
license.rs module, Pro feature checks, and related error variants.

Also incorporates code quality improvements from meeting mode review:
storage path handling, summary module cleanup, VAD integration test
coverage for all ONNX engine variants.

Cherry-picked from feature/meeting-mode (22873ef, 51548ab).

Dolphin: Add Fbank preprocessing pipeline. The model expects [N,T,80]
Fbank features, not raw waveform. Add CMVN normalization from model
metadata (mean/invstd keys with already-negated values). Fix input
tensor name (x_len) and type (i64). Add lob_probs output name.
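
Because the stored mean is already negated, normalization adds it rather than subtracting. A minimal sketch of that arithmetic follows, assuming per-dimension vectors read from the model metadata; the function and parameter names are hypothetical.

```rust
/// Applies CMVN in place: `x = (x + neg_mean) * inv_std` per feature dimension.
/// `neg_mean` holds the already-negated means, so it is added, not subtracted.
fn apply_cmvn(frames: &mut [Vec<f32>], neg_mean: &[f32], inv_std: &[f32]) {
    assert_eq!(neg_mean.len(), inv_std.len(), "stat vectors must match");
    for frame in frames.iter_mut() {
        assert_eq!(frame.len(), neg_mean.len(), "feature dimension mismatch");
        for ((x, m), s) in frame.iter_mut().zip(neg_mean).zip(inv_std) {
            *x = (*x + m) * s;
        }
    }
}
```

Subtracting an already-negated mean would shift every feature by twice the mean, which is an easy mistake to make here.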

Paraformer: Fix BPE marker stripping for 3D logits path. Add
ctc_decode_to_ids() that returns token IDs, then route through
tokens_to_text() which handles @@ marker removal and special token
filtering. Previously the CTC greedy decode path left @@ artifacts
and </s> tokens in the output.
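
The marker handling can be sketched as follows. This is a hypothetical re-implementation for illustration, not the project's tokens_to_text(): it assumes a trailing `@@` means "join with the next token", that other tokens are space-separated, and that `<s>` and `<blank>` are filtered alongside `</s>`.

```rust
/// Tokens dropped from the output. Only `</s>` is named in the commit message;
/// the other two are assumptions.
const SPECIAL_TOKENS: [&str; 3] = ["<s>", "</s>", "<blank>"];

/// Joins BPE pieces: a trailing "@@" glues a piece to the following token.
fn join_bpe_tokens(tokens: &[&str]) -> String {
    let mut text = String::new();
    let mut glue_next = true; // no leading space before the first piece
    for token in tokens {
        if SPECIAL_TOKENS.contains(token) {
            continue;
        }
        if !glue_next {
            text.push(' ');
        }
        match token.strip_suffix("@@") {
            Some(piece) => {
                text.push_str(piece);
                glue_next = true;
            }
            None => {
                text.push_str(token);
                glue_next = false;
            }
        }
    }
    text
}
```

Space-joining is only appropriate for the English side of the vocabulary; Chinese output would typically be concatenated without spaces.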

Both engines now pass smoke tests with correct transcription output.

Cover all 7 engine variants (Whisper, Parakeet, Moonshine, SenseVoice,
Paraformer, Dolphin, Omnilingual) with quick validation, daemon
integration, error handling, and performance comparison procedures.

Dockerfile.onnx and Dockerfile.onnx-cuda now build with all ONNX engines
(sensevoice, paraformer, dolphin, omnilingual) instead of just parakeet +
moonshine. Added .worktrees/ to .dockerignore to prevent 16GB of worktree
data from being sent as build context over SSH.

The ONNX binaries now include six engines, not just Parakeet, so the
command name should reflect that. Existing scripts that use
`voxtype setup parakeet` will continue to work via the hidden alias.
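
The hidden-alias behavior amounts to resolving the old name to the new subcommand while leaving it out of help output. The std-only sketch below shows the idea; it is not how the project's CLI parser is wired up, and the `Subcommand` struct, the table contents, and the function names are hypothetical.

```rust
/// A subcommand with aliases that resolve but are not listed in help.
struct Subcommand {
    name: &'static str,
    hidden_aliases: &'static [&'static str],
}

const SETUP_SUBCOMMANDS: &[Subcommand] = &[
    Subcommand { name: "onnx", hidden_aliases: &["parakeet"] },
    Subcommand { name: "model", hidden_aliases: &[] },
];

/// Maps user input to the canonical subcommand name, accepting hidden aliases.
fn resolve(input: &str) -> Option<&'static str> {
    SETUP_SUBCOMMANDS
        .iter()
        .find(|c| c.name == input || c.hidden_aliases.iter().any(|a| *a == input))
        .map(|c| c.name)
}

/// Names shown in help output; hidden aliases are deliberately left out.
fn help_names() -> Vec<&'static str> {
    SETUP_SUBCOMMANDS.iter().map(|c| c.name).collect()
}
```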
Rewrite model selection guide for all 7 engines with decision tree,
per-engine details, hardware recommendations, and troubleshooting.

Update user manual with all engine sections, meeting mode commands,
setup onnx documentation, and config examples.

Add meeting mode guide covering commands, configuration, storage,
speaker diarization, AI summarization, and export formats.

Update README with engine comparison table, meeting mode usage,
architecture diagram, and --engine CLI flag.

Add v0.6.0 news article to website with engine table, meeting mode,
speaker attribution, export formats, and setup onnx rename.

Update website homepage model grid with all 6 ONNX engines, remove
"experimental" from Parakeet, bump download URLs to v0.6.0.