
Voxtype 0.6.0 is a major release. Five new ONNX-based transcription engines bring support for Chinese, Japanese, Korean, Cantonese, and 1600+ additional languages. Meeting mode adds continuous transcription for meetings with speaker attribution, export to multiple formats, and AI-generated summaries. The ONNX binary variants now ship with every ONNX engine included.


## Five New Transcription Engines

Voxtype previously offered two transcription engines: Whisper and Parakeet. This release adds five more, all running locally via ONNX Runtime. Each engine has different strengths and language coverage.

| Engine | Languages | Best For |
|---|---|---|
| SenseVoice | Chinese, English, Japanese, Korean, Cantonese | CJK transcription with auto-detection |
| Paraformer | Chinese + English (bilingual), Chinese + Cantonese + English (trilingual) | Mixed Chinese/English speech |
| Dolphin | 40 languages + 22 Chinese dialects | Chinese dialects and Eastern languages |
| Omnilingual | 1600+ languages | Low-resource and rare languages |
| Moonshine | English (+ Japanese, Mandarin, Korean, Arabic) | Fast CPU inference |

Why use them: If you speak a CJK language and want a model trained specifically for it, SenseVoice and Paraformer will outperform Whisper on that task. If you need a rare language that Whisper doesn't cover well, Omnilingual supports over 1600 languages via Meta's MMS model. Dolphin covers 40 languages plus 22 Chinese dialects, which is useful if your dialect isn't well-served by general-purpose models.


All five engines are CTC-based, meaning they run in a single forward pass with no autoregressive decoder loop. In practice, this makes them fast on CPU.
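The decoding step is simple enough to sketch. A minimal illustration of greedy CTC decoding (not Voxtype's actual implementation, which is shared Rust code): take the per-frame argmax token ids, collapse consecutive repeats, and drop the blank token, all in a single pass.

```python
# Greedy CTC decoding sketch. The blank id of 0 is an assumed
# convention; real models define their own vocabulary.
BLANK = 0

def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, then drop blanks, in one linear pass."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# e.g. per-frame ids [1, 1, 0, 2, 2, 0, 2] decode to tokens [1, 2, 2]
```

Because there is no autoregressive loop feeding each output back into the model, the whole utterance decodes in one forward pass plus this linear scan.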

Example `config.toml` snippets. The blocks below are alternatives, since `engine` can only be set to one value per config:

```toml
# SenseVoice for CJK + English
engine = "sensevoice"

[sensevoice]
model = "SenseVoiceSmall"
language = "auto"    # auto, zh, en, ja, ko, yue
```

```toml
# Paraformer for bilingual Chinese/English
engine = "paraformer"

[paraformer]
model = "paraformer-zh"
```

```toml
# Omnilingual for 1600+ languages
engine = "omnilingual"

[omnilingual]
model = "mms-1b-all"
```

Download models with `voxtype setup model` and select from the interactive menu. The new engines share infrastructure with Parakeet and Moonshine: shared Fbank feature extraction, shared CTC decoding, and shared ONNX Runtime session management.


## All ONNX Engines in Every ONNX Binary

Previously, the ONNX release binaries only included Parakeet. Starting with v0.6.0, all four ONNX binary variants (`onnx-avx2`, `onnx-avx512`, `onnx-cuda`, `onnx-rocm`) include every ONNX engine: Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, and Omnilingual. The three Whisper-only binaries (`avx2`, `avx512`, `vulkan`) remain Whisper-only. Switch engines by changing a single config line.


The release binary naming also changed from `voxtype-*-parakeet-*` to `voxtype-*-onnx-*` to reflect this broader scope.


## Meeting Mode

Meeting mode is a new way to use Voxtype: continuous transcription for meetings, calls, and lectures. Instead of push-to-talk, it records continuously in chunks and transcribes each chunk as it completes.

```bash
# Start a meeting
voxtype meeting start --title "Weekly standup"

# Pause and resume
voxtype meeting pause
voxtype meeting resume

# Stop when done
voxtype meeting stop

# Export the transcript
voxtype meeting export latest --format markdown
voxtype meeting export latest --format srt
```

Why use it: Push-to-talk is great for dictation, but meetings need continuous recording. Meeting mode handles that with automatic chunking (default 30-second chunks), so memory usage stays bounded even for long sessions up to 3 hours.
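The memory argument is easy to see with a sketch. This is illustrative only (the 16 kHz sample rate is an assumption typical for ASR; only the 30-second default comes from the release notes), not Voxtype's recorder:

```python
# Fixed-size chunking bounds memory: only one chunk of samples is
# materialized at a time, regardless of session length.
SAMPLE_RATE = 16_000   # Hz, assumed typical ASR rate
CHUNK_SECONDS = 30     # matches the default chunk length

def chunks(samples, chunk_seconds=CHUNK_SECONDS, rate=SAMPLE_RATE):
    """Yield successive fixed-length slices of an audio sample buffer."""
    size = chunk_seconds * rate
    for start in range(0, len(samples), size):
        yield samples[start:start + size]
```

At these numbers, a 3-hour session is 360 chunks of 480,000 samples each, and each chunk can be transcribed and discarded before the next one completes.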


### Speaker Attribution

Meeting mode can identify who is speaking. Two approaches are available:

- **Simple attribution**: Uses dual audio capture (microphone + system loopback) to label segments as "You" or "Remote". Good for 1-on-1 calls where you just need to distinguish yourself from the other person.
- **ML diarization**: Uses ONNX-based speaker embeddings to cluster speech by speaker and assign labels like `SPEAKER_00`, `SPEAKER_01`. Works for multi-person meetings. You can rename speakers after the fact with `voxtype meeting label`.
```bash
# Label auto-generated speaker IDs with real names
voxtype meeting label latest SPEAKER_00 "Alice"
voxtype meeting label latest SPEAKER_01 "Bob"
```
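The clustering idea behind ML diarization can be sketched in a few lines. This toy version (threshold value, labeling scheme, and the no-centroid-update simplification are all illustrative assumptions, not Voxtype's pipeline) assigns each segment embedding to the most similar existing speaker, or starts a new one:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(embeddings, threshold=0.7):
    """Greedy threshold clustering; real diarizers are more elaborate."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:                 # no speaker is similar enough:
            centroids.append(list(emb))  # start a new one
            best = len(centroids) - 1
        labels.append(f"SPEAKER_{best:02d}")
    return labels
```

Two segments from the same voice produce nearby embeddings and land in the same cluster; a new voice falls below the similarity threshold and gets a fresh `SPEAKER_NN` label.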

### Meeting Export Formats

Meeting transcripts can be exported in five formats: plain text, Markdown, JSON, SRT subtitles, and VTT subtitles. The subtitle formats include timestamps, so you can use them with video recordings of the same meeting.

```bash
voxtype meeting export latest --format text
voxtype meeting export latest --format markdown
voxtype meeting export latest --format json
voxtype meeting export latest --format srt
voxtype meeting export latest --format vtt
```
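The two subtitle formats differ mainly in timestamp punctuation: SRT uses a comma before the milliseconds, WebVTT a period. A small sketch of that formatting (these helpers are illustrative, not Voxtype's exporter):

```python
def srt_ts(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_ts(seconds):
    """Format seconds as a WebVTT timestamp: HH:MM:SS.mmm."""
    return srt_ts(seconds).replace(",", ".")

# srt_ts(3661.5) -> "01:01:01,500"
# vtt_ts(3661.5) -> "01:01:01.500"
```

Because both formats carry millisecond timestamps, either export can be dropped next to a video recording of the same meeting and stay in sync.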

### AI Meeting Summaries

After a meeting, you can generate a summary with key points, action items, and decisions using a local LLM via Ollama or a remote API endpoint.

```bash
# Generate a summary using Ollama
voxtype meeting summarize latest

# Output as JSON for programmatic use
voxtype meeting summarize latest --format json
```

This requires Ollama running locally or a configured remote summarization endpoint. The summary extracts action items, key decisions, and a concise overview from the full transcript.
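For a sense of what the Ollama path involves, here is a hedged sketch of building a request for Ollama's `/api/generate` endpoint. The model name and prompt wording are placeholders, and this is not Voxtype's actual request code:

```python
import json

def build_summary_request(transcript, model="llama3.1"):
    """Build a JSON body for Ollama's /api/generate endpoint.

    The prompt below is illustrative; Voxtype's real prompt differs.
    """
    prompt = (
        "Summarize this meeting transcript. List the key points, "
        "action items, and decisions.\n\n" + transcript
    )
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

# POST the returned body to http://localhost:11434/api/generate
# (11434 is Ollama's default port) and read the "response" field.
```

A remote summarization endpoint would receive an analogous request, just pointed at a different host.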


## Setup ONNX Replaces Setup Parakeet

The `voxtype setup parakeet` command has been renamed to `voxtype setup onnx` to reflect that it now manages all ONNX-based engines, not just Parakeet. The old `setup parakeet` command still works as a hidden alias, so existing scripts and muscle memory won't break.

```bash
# New name
voxtype setup onnx

# Old name still works
voxtype setup parakeet
```

## Other Changes
- Shared Fbank feature extraction and CTC decoder modules reduce code duplication across ONNX engines
- Multi-engine transcription smoke tests for automated regression testing
- Fix Dolphin and Paraformer transcription backends after the initial implementation
- Fix clippy warnings across the codebase