Nemotron streaming transcription (real-time text output)#155
lokkju wants to merge 23 commits into peteonrails:main from
Conversation
Implement parallel transcription of audio chunks during recording, reducing perceived latency on slower machines or for long recordings.

New config options (all off by default):
- eager_processing: enable/disable the feature
- eager_chunk_secs: audio chunk duration (default 5.0s)
- eager_overlap_secs: overlap between chunks (default 0.5s)

New CLI flags:
- --eager-processing
- --eager-chunk-secs <SECS>
- --eager-overlap-secs <SECS>

The implementation chunks audio during recording, transcribes chunks in parallel background tasks, then combines results with word-level deduplication at chunk boundaries when recording stops.
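The chunking step described above can be sketched as follows. This is a hypothetical helper, not the PR's actual code; the config names `eager_chunk_secs` / `eager_overlap_secs` come from the commit message, and the word-level deduplication step is omitted:

```rust
/// Split a mono sample buffer into overlapping chunks. Each chunk
/// overlaps the previous one so that word-level deduplication can
/// stitch the per-chunk transcripts back together at the boundaries.
fn split_into_chunks(
    samples: &[f32],
    sample_rate: usize,
    chunk_secs: f32,
    overlap_secs: f32,
) -> Vec<Vec<f32>> {
    let chunk_len = (chunk_secs * sample_rate as f32) as usize;
    let overlap = (overlap_secs * sample_rate as f32) as usize;
    // Advance by chunk length minus overlap; never step by zero.
    let step = chunk_len.saturating_sub(overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + chunk_len).min(samples.len());
        chunks.push(samples[start..end].to_vec());
        if end == samples.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With the defaults (5.0s chunks, 0.5s overlap at 16 kHz), each chunk shares 8,000 samples with its neighbor.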
- Add MEDIA, RECORD, REWIND, FASTFORWARD key names to evdev_listener.rs parse_key_name()
- Support prefixed numeric keycodes: WEV_/X11_/XEV_ for XKB keycodes (offset by 8), EVTEST_ for kernel keycodes
- Support hex values after the prefix (e.g. WEV_0xEA)
- Reject bare numeric keycodes as ambiguous, with a helpful error message explaining the prefix requirement
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with new key names and numeric keycode documentation
- Update --hotkey CLI help text in src/cli.rs
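The prefix rules above can be illustrated with a small sketch. This is a hypothetical helper, not the actual parse_key_name() code; the prefix names and the +8 XKB offset come from the commit message:

```rust
/// Parse a prefixed numeric keycode. WEV_/X11_/XEV_ take XKB keycodes
/// (kernel code + 8); EVTEST_ takes raw kernel keycodes. Hex after the
/// prefix (e.g. WEV_0xEA) is accepted; bare numbers are rejected as
/// ambiguous.
fn parse_numeric_keycode(name: &str) -> Result<u16, String> {
    fn parse_num(s: &str) -> Result<u32, String> {
        if let Some(hex) = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
            u32::from_str_radix(hex, 16).map_err(|e| e.to_string())
        } else {
            s.parse::<u32>().map_err(|e| e.to_string())
        }
    }
    for prefix in ["WEV_", "X11_", "XEV_"] {
        if let Some(rest) = name.strip_prefix(prefix) {
            // XKB keycodes are kernel keycodes offset by 8.
            let n = parse_num(rest)?
                .checked_sub(8)
                .ok_or_else(|| "XKB keycode must be at least 8".to_string())?;
            return Ok(n as u16);
        }
    }
    if let Some(rest) = name.strip_prefix("EVTEST_") {
        return Ok(parse_num(rest)? as u16);
    }
    if !name.is_empty() && name.chars().all(|c| c.is_ascii_digit()) {
        return Err("bare numeric keycodes are ambiguous; use a WEV_/X11_/XEV_ or EVTEST_ prefix".into());
    }
    Err(format!("unknown key name: {name}"))
}
```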
- Add src/output/eitype.rs implementing the TextOutput trait using the eitype CLI tool (libei/EI protocol for Wayland)
- Add Eitype variant to the OutputDriver enum in src/config.rs with Display, FromStr, and serde support
- Add EitypeNotFound error variant in src/error.rs
- Insert eitype after wtype in the default fallback chain in src/output/mod.rs (wtype -> eitype -> dotool -> ydotool -> clipboard -> xclip)
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with eitype driver documentation and a GNOME/KDE compatibility table

eitype uses the Emulated Input protocol supported by GNOME and KDE, where wtype's virtual-keyboard protocol does not work.
- Add eitype to the OutputChainStatus struct and detection logic in src/setup/mod.rs
- Display eitype status in print_output_chain_status() between wtype and ydotool
- Include eitype in primary method detection (after wtype, before ydotool)
- Add an eitype install suggestion when no output method is available on Wayland
- Bump parakeet-rs from 0.2.9 to 0.3.1 in Cargo.toml/Cargo.lock (maps the parakeet-rocm feature to migraphx, matching the upstream rename)
- Add Nemotron variant to the ParakeetModelType enum in config.rs
- Add streaming config option (auto-enabled for Nemotron models)
- Add StreamingTranscriber trait in transcribe/mod.rs with transcribe_chunk, flush, reset, get_transcript, chunk_size methods
- Implement NemotronStreamingTranscriber in transcribe/parakeet.rs wrapping parakeet_rs::Nemotron with delta-based text output
- Extend detect_model_type to recognize the Nemotron file structure (encoder.onnx + decoder_joint.onnx + tokenizer.model)
- Add non-streaming Nemotron support via transcribe_audio in ParakeetTranscriber for batch transcription fallback
- Add State::StreamingRecording variant in state.rs with audio_buffer and text_output_so_far tracking
- Integrate streaming into the daemon.rs main event loop:
  - Channel architecture: audio_rx -> chunk_tx -> blocking task -> text_tx
  - Live text output via output_with_fallback during recording
  - Flush on hotkey release to drain remaining decoder state
  - Support in push-to-talk, toggle, and external trigger modes
  - Streaming cleanup in cancel, timeout, and shutdown handlers
- Add tests for Nemotron model detection and streaming state
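The StreamingTranscriber trait above can be sketched like this. The method names come from the commit message, but the signatures are assumptions, and the toy implementation exists only to show the intended call pattern:

```rust
/// Sketch of the streaming transcriber interface (assumed signatures).
trait StreamingTranscriber {
    /// Feed one fixed-size chunk; returns only the newly decoded text.
    fn transcribe_chunk(&mut self, samples: &[f32]) -> String;
    /// Drain remaining decoder state at the end of a recording.
    fn flush(&mut self) -> String;
    fn reset(&mut self);
    fn get_transcript(&self) -> String;
    /// Number of samples the model expects per chunk.
    fn chunk_size(&self) -> usize;
}

/// Toy implementation: reports sample counts instead of real text.
struct ToyTranscriber {
    transcript: String,
    pending: usize,
}

impl StreamingTranscriber for ToyTranscriber {
    fn transcribe_chunk(&mut self, samples: &[f32]) -> String {
        self.pending += samples.len();
        let delta = format!("[{} samples]", samples.len());
        self.transcript.push_str(&delta);
        delta
    }
    fn flush(&mut self) -> String {
        let delta = format!("[flush:{}]", self.pending);
        self.transcript.push_str(&delta);
        self.pending = 0;
        delta
    }
    fn reset(&mut self) {
        self.transcript.clear();
        self.pending = 0;
    }
    fn get_transcript(&self) -> String {
        self.transcript.clone()
    }
    fn chunk_size(&self) -> usize {
        8960 // 560 ms at 16 kHz, per the PR description
    }
}
```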
- Add Nemotron model type to the CONFIGURATION.md model_type values
- Add an auto-detection table for model file structures
- Document the streaming config option in CONFIGURATION.md
- Add a complete Nemotron example alongside the TDT example
- Add a "Nemotron Streaming" section to USER_MANUAL.md explaining real-time text output during recording
Run cargo fmt on the files modified for Nemotron streaming support.
- Add nemotron-speech-streaming-en-0.6b to PARAKEET_MODELS in setup/model.rs, downloadable via `voxtype setup model`
- Add a huggingface_path field to ParakeetModelInfo for repos where model files are in a subdirectory (Nemotron uses the altunenes/parakeet-rs repo with files under nemotron-speech-streaming-en-0.6b/)
- Update the resolve_model_path error message in parakeet.rs to mention the Nemotron download link and the `voxtype setup model` command
- Update test_parakeet_models_have_files to accept tokenizer.model as a valid tokenizer file (Nemotron uses SentencePiece, not vocab.txt)
- Add nemotron to the is_parakeet_model test
- Add a contrib/nemotron-streaming-test-config.toml sample config
Nemotron models use non-hyphenated filenames (encoder.onnx, decoder_joint.onnx) and tokenizer.model instead of vocab.txt. Without this, downloaded Nemotron models would not show [installed].
should_use_streaming() was defaulting to TDT when model_type wasn't explicitly set in config, instead of detecting from model files on disk. Now uses resolve_model_path + detect_model_type to check the actual model directory, enabling streaming automatically for Nemotron models.
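The on-disk detection can be sketched as below. Only the Nemotron file set (encoder.onnx + decoder_joint.onnx + tokenizer.model) is confirmed by this PR; the TDT/CTC file patterns here are placeholder assumptions, not voxtype's actual checks:

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum ParakeetModelType {
    Tdt,
    Ctc,
    Nemotron,
}

/// Detect the model type from the files present in a model directory.
fn detect_model_type(dir: &Path) -> Option<ParakeetModelType> {
    let has = |f: &str| dir.join(f).exists();
    if has("encoder.onnx") && has("decoder_joint.onnx") && has("tokenizer.model") {
        // Nemotron: non-hyphenated filenames plus a SentencePiece tokenizer.
        Some(ParakeetModelType::Nemotron)
    } else if has("encoder-model.onnx") && has("decoder_joint-model.onnx") {
        // Placeholder TDT pattern (hyphenated filenames).
        Some(ParakeetModelType::Tdt)
    } else if has("model.onnx") && has("vocab.txt") {
        // Placeholder CTC pattern.
        Some(ParakeetModelType::Ctc)
    } else {
        None
    }
}
```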
The Nemotron streaming model was being loaded fresh on every recording start (~3s), causing the hotkey release to fire before the model was ready. Now the streaming transcriber is loaded once at daemon startup and persists across recordings via a command channel (Audio/Flush/Reset/Shutdown). Recording start just sends a Reset command (instant).
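The command-channel pattern above looks roughly like this. The command names come from the commit message; the real daemon uses an async blocking task, so a plain thread with std mpsc channels here is a stand-in to show the ownership pattern (model loaded once, inside the task):

```rust
use std::sync::mpsc;
use std::thread;

/// Commands sent to the persistent streaming task.
enum StreamingCommand {
    Audio(Vec<f32>),
    Flush,
    Reset,
    Shutdown,
}

/// Spawn the long-lived task once at startup. The expensive model load
/// would happen inside the thread, before the command loop.
fn spawn_streaming_task() -> (mpsc::Sender<StreamingCommand>, mpsc::Receiver<String>) {
    let (cmd_tx, cmd_rx) = mpsc::channel();
    let (text_tx, text_rx) = mpsc::channel();
    thread::spawn(move || {
        let mut chunks_seen = 0usize; // stand-in for decoder state
        for cmd in cmd_rx {
            match cmd {
                StreamingCommand::Audio(samples) => {
                    chunks_seen += 1;
                    let _ = text_tx.send(format!("delta {} ({} samples)", chunks_seen, samples.len()));
                }
                StreamingCommand::Flush => {
                    let _ = text_tx.send("flushed".to_string());
                }
                StreamingCommand::Reset => chunks_seen = 0,
                StreamingCommand::Shutdown => break,
            }
        }
    });
    (cmd_tx, text_rx)
}
```

Because the task outlives any single recording, starting a new recording only needs a cheap `Reset` instead of a ~3s model load.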
- Don't drop streaming_text_rx between recordings — the task persists and needs both channel ends alive across sessions
- stop_streaming() now waits for flush text with a timeout and returns it, instead of the caller doing try_recv (which raced with flush processing)
- Store streaming_chunk_size in the Daemon struct instead of hardcoding it
- Use StreamingCommand::Shutdown in the daemon shutdown path
parakeet-rs transcribe_chunk() already returns only the new tokens (delta), not the full cumulative transcript. Our wrapper was treating the return as cumulative and trying to extract a delta by tracking last_transcript_len, which produced garbled output.
parakeet-rs 0.3.1 renamed ExecutionProvider::ROCm to ExecutionProvider::MIGraphX. The Cargo feature was already updated but the enum variant in the code was missed.
When the hotkey is released, audio chunks may still be queued for
processing. The old stop_streaming() only waited for flush output
and broke early, losing text from in-flight chunks.
Now the flush handler sends a sentinel ("\0") after completing, and
stop_streaming() collects all text deltas (from pending chunks AND
flush) until it receives the sentinel. Timeout bumped to 30s to
accommodate slow CPU inference of queued chunks.
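The sentinel-based drain can be sketched as follows. This is a hypothetical helper using std mpsc (the real code is async); the "\0" sentinel and the drain-until-sentinel behavior come from the commit message:

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

/// Collect every pending text delta until the "\0" sentinel (sent by
/// the flush handler after it completes) or until the timeout expires.
/// Deltas from in-flight chunks arrive before the sentinel, so none
/// of their text is lost.
fn drain_until_sentinel(rx: &mpsc::Receiver<String>, timeout: Duration) -> String {
    let mut out = String::new();
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(s) if s == "\0" => break, // flush finished; queue fully drained
            Ok(s) => out.push_str(&s),
            Err(_) => break, // timeout or sender disconnected
        }
    }
    out
}
```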
Add nemotron-speech-streaming-en-0.6b-int8 and int4 to PARAKEET_MODELS in src/setup/model.rs. These are quantized variants hosted under lokkju/ on HuggingFace that use the same encoder.onnx/decoder_joint.onnx filenames (required by parakeet-rs from_pretrained). Update related tests.
I've created int4 and int8 quants, and they do provide speed and size advantages while seemingly having the same quality.
The Parakeet model detection in `voxtype setup check` only looked for directories containing "parakeet" in the name and checked for TDT/CTC file patterns. Nemotron models were missed entirely. Replace the ad-hoc directory scan with a loop over known model names using validate_parakeet_model(), which already handles all model file structures (TDT, CTC, Nemotron).
One thing for us to think about is that most of our users assign modifier key sequences like SUPER-SHIFT-X or RIGHTALT to toggle recording, so streaming output may not appear, and in the worst-case scenario it might trigger other actions. Personally, I chose HOME and made SCROLLOCK the default, and those keys would not collide with streaming output. I haven't tested this out yet, but it's something to look at. Mostly leaving this as a note for myself.
I'm not sure how the modifier key sequences would affect output? At least when using Wayland's virtual keyboard protocol or libei compat such as in Mutter, it should be treated as a separate input device, so the modifier keys on your actual keyboard shouldn't affect the input coming from wtype/eitype; is that what you were potentially seeing as a problem? My main daily driver is a Framework 13, and its F12 function key is a gear icon that maps to KEY_MEDIA; absolutely perfect for the hold-to-talk key. Right now, with the int4 quant, it's still just slightly too slow for my preferences, but you do get the typing as you speak, which is quite nice. I'm exploring the webgpu backend, but so far its limited feature set seems to actually cause slower processing than AVX-512. Have you tried ncnn at all? Supposedly ONNX can be converted, and it has great Vulkan backend support.
Yes, when Omarchy incorporated Voxtype, users would report seeing wild results when unkeying SUPER-SHIFT-X, because the keystrokes started before the user finished unkeying - I want to make sure I don't set folks up for that. But if you are right about it being a separate input device, then it sounds like the diagnosis may have been off - I'll have to look into that some more. Have not tried ncnn but will allocate a little time to look into it. Looking forward to getting Nemotron merged soon - even if experimental!
It's looking like the webgpu experiments won't help much here, nor Vulkan backends; the streaming nature causes too much transfer between device and host memory. As it is, fp32 is 3x realtime, with int8 being 8x, and int4 4x (which doesn't make sense, but I haven't debugged it yet).
That is unfortunate - I hope to get some numbers for you on my testing workstation this weekend: here's hoping they show some promise |
Add probe_output_chain() and output_with_cached_index() to src/output/mod.rs. These allow probing the output driver chain once and reusing the cached index for subsequent output calls, skipping redundant is_available() subprocess spawns per delta.

- probe_output_chain(): walks the chain, returns the index of the first available driver
- output_with_cached_index(): outputs directly via the cached index, falls back to a full probe if the cached driver fails
- MockTextOutput test helper with call counters
- Tests verifying the cached path skips is_available() entirely
- Latency baseline test: 15 mock deltas in <50ms
In src/daemon.rs, add streaming_output_chain and streaming_output_index fields to Daemon. These are set once when a streaming session begins (begin_streaming_session) and reused for every text delta, eliminating per-delta calls to create_output_chain() and is_available() subprocess probes.

- Set the cache in begin_streaming_session() after probing once
- Use output_with_cached_index() in the text delta handler
- Take the cached chain for final flush output at session end
- Clear the cache in cancel_streaming() on session cancel
- Fall back to a full probe if the cache is missing or the driver fails

Reduces ~45ms of subprocess overhead across a typical 15-delta streaming session to ~3ms (single probe at session start).
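The probe-once / cached-index idea from these two commits can be sketched with a mock driver. The function and trait names mirror the commit messages, but the signatures and the mock are assumptions, not voxtype's actual API:

```rust
/// Minimal stand-in for the output driver interface.
trait TextOutput {
    fn is_available(&mut self) -> bool;
    fn output(&mut self, text: &str) -> bool;
}

/// Mock driver with a probe counter, like the MockTextOutput helper
/// the commit describes.
struct MockOutput {
    available: bool,
    probes: usize,
    written: String,
}

impl TextOutput for MockOutput {
    fn is_available(&mut self) -> bool {
        self.probes += 1; // each probe stands in for a subprocess spawn
        self.available
    }
    fn output(&mut self, text: &str) -> bool {
        if self.available {
            self.written.push_str(text);
        }
        self.available
    }
}

/// Walk the chain once; return the index of the first available driver.
fn probe_output_chain(chain: &mut [Box<dyn TextOutput>]) -> Option<usize> {
    chain.iter_mut().position(|d| d.is_available())
}

/// Output via the cached index; on failure, fall back to a full probe.
fn output_with_cached_index(chain: &mut [Box<dyn TextOutput>], cached: usize, text: &str) -> bool {
    if chain[cached].output(text) {
        return true;
    }
    match probe_output_chain(chain) {
        Some(i) => chain[i].output(text),
        None => false,
    }
}
```

The cached path calls output() directly, so per-delta availability probes (subprocess spawns in the real chain) only happen once per session, or again only when the cached driver fails.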
So at this point, it's not the models; it's the rest of the system around it. I've optimized the typing-tool check to cache on launch, and that saves 5ms twice per second, which adds up. Similar small gains elsewhere as well. I think the next step is to use eitype/wtype as a library rather than launching it as a tool; I'll be trying that tonight or tomorrow. Regardless, it's usable right now, and a better experience than typing for a while and then waiting for the whole transcription, for sure.
I'm almost ready with 0.6.0: as soon as I ship that I'll start integrating this branch. Thank you again for this really solid contribution! |
Summary
Adds Nemotron Speech Streaming EN 0.6B support (issue #47) to the Parakeet engine, enabling real-time text output during recording. Text is typed incrementally as you speak rather than waiting until recording stops.
- `ParakeetModelType::Nemotron`, auto-detected from model files
- `StreamingTranscriber` for incremental chunk-based transcription
- `State::StreamingRecording` with live text output during recording
- Nemotron downloadable via `voxtype setup model` (option 13)
- parakeet-rs bump (rocm → migraphx)

How it works
Audio chunks (560ms / 8960 samples at 16kHz) are fed to the Nemotron model via a persistent blocking task. Text deltas are sent back through a channel and typed immediately via the output chain. On recording stop, remaining audio is flushed through silence padding.
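The chunk size above is just duration times sample rate; a one-line helper makes the arithmetic explicit (hypothetical function, not part of the PR):

```rust
/// Samples per chunk for mono audio: sample_rate * chunk_ms / 1000.
fn chunk_samples(sample_rate: u32, chunk_ms: u32) -> u32 {
    sample_rate * chunk_ms / 1000
}
```

560 ms at 16 kHz gives the 8960 samples the Nemotron model expects per chunk.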
Config
Status: Not Ready
This branch is functional but has a significant limitation:
Inference speed on CPU is too slow for real-time use. The 0.6B fp32 model takes longer than 560ms to process each chunk on CPU, so text output lags behind speech. It works, but with noticeable delay.
Path forward
- CUDA (--features parakeet,parakeet-cuda) would make this real-time on NVIDIA GPUs but hasn't been tested yet.
- ROCm (--features parakeet,parakeet-rocm) for AMD discrete GPUs — compiles but untested.

Test plan
- cargo build --features parakeet compiles
- cargo build without parakeet compiles (no regressions)
- cargo test — 282 tests pass
- voxtype setup model