
Nemotron streaming transcription (real-time text output) #155

Draft
lokkju wants to merge 23 commits into peteonrails:main from lokkju:research/parakeet-streaming-output

Conversation

@lokkju lokkju commented Jan 30, 2026

Summary

Adds Nemotron Speech Streaming EN 0.6B support (issue #47) to the Parakeet engine, enabling real-time text output during recording. Text is typed incrementally as you speak rather than waiting until recording stops.

  • New model type: ParakeetModelType::Nemotron, auto-detected from model files
  • New trait: StreamingTranscriber for incremental chunk-based transcription
  • New state: State::StreamingRecording with live text output during recording
  • Persistent model: loaded once at daemon startup, reused across recordings
  • Model download: available via voxtype setup model (option 13)
  • parakeet-rs bump: 0.2.9 → 0.3.1 (adds Nemotron API, renames rocmmigraphx)
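The StreamingTranscriber trait listed above can be sketched roughly as follows. The method names come from the PR's commit messages, but the exact signatures are assumptions, and the implementation here is a deliberately trivial stand-in, not the real Nemotron wrapper:

```rust
// Sketch of the StreamingTranscriber trait; method names follow the PR's
// commit messages, but these exact signatures are assumptions.
trait StreamingTranscriber {
    /// Feed one fixed-size chunk of 16 kHz mono samples and get back only
    /// the new text (a delta, not the cumulative transcript).
    fn transcribe_chunk(&mut self, samples: &[f32]) -> String;
    /// Drain remaining decoder state (the real code pads with silence).
    fn flush(&mut self) -> String;
    /// Clear state so the transcriber can be reused for the next recording.
    fn reset(&mut self);
    /// Full transcript accumulated so far.
    fn get_transcript(&self) -> &str;
    /// Samples per chunk the model expects (8960 = 560 ms at 16 kHz).
    fn chunk_size(&self) -> usize;
}

// Toy implementation that just counts chunks, to exercise the trait shape.
struct CountingTranscriber {
    transcript: String,
    chunks: usize,
}

impl StreamingTranscriber for CountingTranscriber {
    fn transcribe_chunk(&mut self, _samples: &[f32]) -> String {
        self.chunks += 1;
        let delta = format!("chunk{} ", self.chunks);
        self.transcript.push_str(&delta);
        delta
    }
    fn flush(&mut self) -> String {
        String::new()
    }
    fn reset(&mut self) {
        self.transcript.clear();
        self.chunks = 0;
    }
    fn get_transcript(&self) -> &str {
        &self.transcript
    }
    fn chunk_size(&self) -> usize {
        8_960
    }
}

fn main() {
    let mut t = CountingTranscriber { transcript: String::new(), chunks: 0 };
    let n = t.chunk_size();
    assert_eq!(t.transcribe_chunk(&vec![0.0; n]), "chunk1 ");
    assert_eq!(t.get_transcript(), "chunk1 ");
    t.reset();
    assert_eq!(t.get_transcript(), "");
}
```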

How it works

Audio chunks (560ms / 8960 samples at 16kHz) are fed to the Nemotron model via a persistent blocking task. Text deltas are sent back through a channel and typed immediately via the output chain. On recording stop, remaining audio is flushed through silence padding.
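A minimal sketch of that pipeline, using plain std threads and channels in place of the daemon's async tasks (the model call is a stand-in, not the parakeet-rs API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for Nemotron inference: here it just reports how
// many samples it consumed.
fn transcribe_chunk(samples: &[f32]) -> String {
    format!("[{} samples] ", samples.len())
}

const CHUNK_SAMPLES: usize = 8_960; // 560 ms at 16 kHz

fn main() {
    let (chunk_tx, chunk_rx) = mpsc::channel::<Vec<f32>>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // Persistent blocking task: owns the model, turns chunks into deltas.
    let worker = thread::spawn(move || {
        for chunk in chunk_rx {
            let delta = transcribe_chunk(&chunk);
            if text_tx.send(delta).is_err() {
                break;
            }
        }
    });

    // Recording loop: feed two fixed-size chunks, then drop the sender to
    // signal end of audio (the real code flushes with silence padding).
    for _ in 0..2 {
        chunk_tx.send(vec![0.0; CHUNK_SAMPLES]).unwrap();
    }
    drop(chunk_tx);

    // In the daemon each delta is typed immediately; here we just collect.
    let transcript: String = text_rx.iter().collect();
    worker.join().unwrap();
    assert_eq!(transcript, "[8960 samples] [8960 samples] ");
}
```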

Config

engine = "parakeet"

[parakeet]
model = "nemotron-speech-streaming-en-0.6b"
# streaming is auto-enabled for Nemotron models
# streaming = false  # to force batch mode instead

Status: Not Ready

This branch is functional but has a significant limitation:

Inference speed on CPU is too slow for real-time use. The 0.6B fp32 model takes longer than 560ms to process each chunk on CPU, so text output lags behind speech. It works, but with noticeable delay.

Path forward

  • int8/int4 quantized models are being worked on. Dynamic int8 quantization should bring the model from ~2.6 GB to ~670 MB with proportionally faster inference. int4 would be even smaller/faster. This is the most likely fix for CPU users.
  • CUDA acceleration (--features parakeet,parakeet-cuda) would make this real-time on NVIDIA GPUs but hasn't been tested yet.
  • MIGraphX (--features parakeet,parakeet-rocm) for AMD discrete GPUs — compiles but untested.

Test plan

  • cargo build --features parakeet compiles
  • cargo build without parakeet compiles (no regressions)
  • cargo test — 282 tests pass
  • Manual: Nemotron model appears in voxtype setup model
  • Manual: text appears live during recording
  • Manual: second recording works without errors
  • Manual: batch transcription still works for CTC/TDT models
  • Performance: real-time streaming on CPU with quantized model
  • Test CUDA acceleration

peteonrails and others added 19 commits January 29, 2026 16:49
Implement parallel transcription of audio chunks during recording,
reducing perceived latency on slower machines or for long recordings.

New config options (all off by default):
- eager_processing: enable/disable the feature
- eager_chunk_secs: audio chunk duration (default 5.0s)
- eager_overlap_secs: overlap between chunks (default 0.5s)

New CLI flags:
- --eager-processing
- --eager-chunk-secs <SECS>
- --eager-overlap-secs <SECS>

The implementation chunks audio during recording, transcribes chunks in
parallel background tasks, then combines results with word-level
deduplication at chunk boundaries when recording stops.
- Add MEDIA, RECORD, REWIND, FASTFORWARD key names to
  evdev_listener.rs parse_key_name()
- Support prefixed numeric keycodes: WEV_/X11_/XEV_ for
  XKB keycodes (offset by 8), EVTEST_ for kernel keycodes
- Support hex values after prefix (e.g. WEV_0xEA)
- Reject bare numeric keycodes as ambiguous with a helpful
  error message explaining the prefix requirement
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with
  new key names and numeric keycode documentation
- Update --hotkey CLI help text in src/cli.rs
- Add src/output/eitype.rs implementing TextOutput trait
  using eitype CLI tool (libei/EI protocol for Wayland)
- Add Eitype variant to OutputDriver enum in src/config.rs
  with Display, FromStr, and serde support
- Add EitypeNotFound error variant in src/error.rs
- Insert eitype after wtype in default fallback chain in
  src/output/mod.rs (wtype -> eitype -> dotool -> ydotool
  -> clipboard -> xclip)
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with
  eitype driver documentation and GNOME/KDE compatibility
  table

eitype uses the Emulated Input protocol supported by GNOME
and KDE, where wtype's virtual-keyboard protocol does not
work.
- Add eitype to OutputChainStatus struct and detection
  logic in src/setup/mod.rs
- Display eitype status in print_output_chain_status()
  between wtype and ydotool
- Include eitype in primary method detection (after wtype,
  before ydotool)
- Add eitype install suggestion when no output method is
  available on Wayland
- Bump parakeet-rs from 0.2.9 to 0.3.1 in Cargo.toml/Cargo.lock
  (maps parakeet-rocm feature to migraphx, matching upstream rename)
- Add Nemotron variant to ParakeetModelType enum in config.rs
- Add streaming config option (auto-enabled for Nemotron models)
- Add StreamingTranscriber trait in transcribe/mod.rs with
  transcribe_chunk, flush, reset, get_transcript, chunk_size methods
- Implement NemotronStreamingTranscriber in transcribe/parakeet.rs
  wrapping parakeet_rs::Nemotron with delta-based text output
- Extend detect_model_type to recognize Nemotron file structure
  (encoder.onnx + decoder_joint.onnx + tokenizer.model)
- Add non-streaming Nemotron support via transcribe_audio in
  ParakeetTranscriber for batch transcription fallback
- Add State::StreamingRecording variant in state.rs with audio_buffer
  and text_output_so_far tracking
- Integrate streaming into daemon.rs main event loop:
  - Channel architecture: audio_rx -> chunk_tx -> blocking task -> text_tx
  - Live text output via output_with_fallback during recording
  - Flush on hotkey release to drain remaining decoder state
  - Support in push-to-talk, toggle, and external trigger modes
  - Streaming cleanup in cancel, timeout, and shutdown handlers
- Add tests for Nemotron model detection and streaming state
- Add Nemotron model type to CONFIGURATION.md model_type values
- Add auto-detection table for model file structures
- Document streaming config option in CONFIGURATION.md
- Add Nemotron complete example alongside TDT example
- Add "Nemotron Streaming" section to USER_MANUAL.md
  explaining real-time text output during recording
Run cargo fmt on the files modified for Nemotron streaming support.
- Add nemotron-speech-streaming-en-0.6b to PARAKEET_MODELS in
  setup/model.rs, downloadable via `voxtype setup model`
- Add huggingface_path field to ParakeetModelInfo for repos where
  model files are in a subdirectory (Nemotron uses altunenes/parakeet-rs
  repo with files under nemotron-speech-streaming-en-0.6b/)
- Update resolve_model_path error message in parakeet.rs to mention
  Nemotron download link and `voxtype setup model` command
- Update test_parakeet_models_have_files to accept tokenizer.model
  as a valid tokenizer file (Nemotron uses SentencePiece, not vocab.txt)
- Add nemotron to is_parakeet_model test
- Add contrib/nemotron-streaming-test-config.toml sample config
Nemotron models use non-hyphenated filenames (encoder.onnx,
decoder_joint.onnx) and tokenizer.model instead of vocab.txt.
Without this, downloaded Nemotron models would not show [installed].
should_use_streaming() was defaulting to TDT when model_type wasn't
explicitly set in config, instead of detecting from model files on disk.
Now uses resolve_model_path + detect_model_type to check the actual
model directory, enabling streaming automatically for Nemotron models.
The Nemotron streaming model was being loaded fresh on every recording
start (~3s), causing the hotkey release to fire before the model was
ready. Now the streaming transcriber is loaded once at daemon startup
and persists across recordings via a command channel (Audio/Flush/Reset/
Shutdown). Recording start just sends a Reset command (instant).
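The persistent-task design described in this commit can be sketched with std channels. The StreamingCommand variant names come from the message above; the payload types and the toy "inference" are assumptions:

```rust
use std::sync::mpsc;
use std::thread;

// Command channel for the long-lived streaming task; variant names follow
// the commit message, payload types are assumptions.
enum StreamingCommand {
    Audio(Vec<f32>),
    Flush,
    Reset,
    Shutdown,
}

fn main() {
    let (cmd_tx, cmd_rx) = mpsc::channel::<StreamingCommand>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // Loaded once at daemon startup; lives across recordings.
    let task = thread::spawn(move || {
        for cmd in cmd_rx {
            match cmd {
                // Stand-in for model inference on one 560 ms chunk.
                StreamingCommand::Audio(chunk) => {
                    let _ = text_tx.send(format!("[{}] ", chunk.len()));
                }
                StreamingCommand::Flush => {
                    let _ = text_tx.send("<flush>".to_string());
                }
                StreamingCommand::Reset => { /* clear decoder state */ }
                StreamingCommand::Shutdown => break,
            }
        }
    });

    cmd_tx.send(StreamingCommand::Reset).unwrap(); // recording start: instant
    cmd_tx.send(StreamingCommand::Audio(vec![0.0; 8_960])).unwrap();
    cmd_tx.send(StreamingCommand::Flush).unwrap();
    cmd_tx.send(StreamingCommand::Shutdown).unwrap();
    task.join().unwrap();

    // Receiver iteration ends once the task (and its sender) is gone.
    let out: String = text_rx.iter().collect();
    assert_eq!(out, "[8960] <flush>");
}
```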
- Don't drop streaming_text_rx between recordings — the task persists
  and needs both channel ends alive across sessions
- stop_streaming() now waits for flush text with timeout and returns it,
  instead of the caller doing try_recv (which raced with flush processing)
- Store streaming_chunk_size in Daemon struct instead of hardcoding
- Use StreamingCommand::Shutdown in daemon shutdown path
parakeet-rs transcribe_chunk() already returns only the new tokens
(delta), not the full cumulative transcript. Our wrapper was treating
the return as cumulative and trying to extract a delta by tracking
last_transcript_len, which produced garbled output.
parakeet-rs 0.3.1 renamed ExecutionProvider::ROCm to
ExecutionProvider::MIGraphX. The Cargo feature was already updated
but the enum variant in the code was missed.
When the hotkey is released, audio chunks may still be queued for
processing. The old stop_streaming() only waited for flush output
and broke early, losing text from in-flight chunks.

Now the flush handler sends a sentinel ("\0") after completing, and
stop_streaming() collects all text deltas (from pending chunks AND
flush) until it receives the sentinel. Timeout bumped to 30s to
accommodate slow CPU inference of queued chunks.
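The drain-until-sentinel loop might look roughly like this; the function shape and timeout handling are assumptions based on the description above:

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

// Collect pending chunk deltas plus flush output until the "\0" sentinel
// arrives, so text from in-flight chunks is not lost on hotkey release.
fn stop_streaming(text_rx: &mpsc::Receiver<String>) -> String {
    let mut text = String::new();
    loop {
        match text_rx.recv_timeout(Duration::from_secs(30)) {
            Ok(delta) if delta == "\0" => break, // flush handler finished
            Ok(delta) => text.push_str(&delta),
            // Give up on timeout or if the worker side went away.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    text
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("hello ".to_string()).unwrap();
    tx.send("world".to_string()).unwrap();
    tx.send("\0".to_string()).unwrap();
    assert_eq!(stop_streaming(&rx), "hello world");
}
```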
Add nemotron-speech-streaming-en-0.6b-int8 and int4 to
PARAKEET_MODELS in src/setup/model.rs. These are quantized
variants hosted under lokkju/ on HuggingFace that use the
same encoder.onnx/decoder_joint.onnx filenames (required by
parakeet-rs from_pretrained). Update related tests.

lokkju commented Jan 30, 2026

I've created int4 and int8 quants, and they do provide a speed and size advantage while seemingly retaining the same quality:
https://huggingface.co/lokkju/nemotron-speech-streaming-en-0.6b-int4
https://huggingface.co/lokkju/nemotron-speech-streaming-en-0.6b-int8

The Parakeet model detection in `voxtype setup check` only looked
for directories containing "parakeet" in the name and checked for
TDT/CTC file patterns. Nemotron models were missed entirely.

Replace the ad-hoc directory scan with a loop over known model
names using validate_parakeet_model(), which already handles all
model file structures (TDT, CTC, Nemotron).
@peteonrails
Owner

One thing for us to think about is that most of our users assign modifier key sequences like SUPER-SHIFT-X or RIGHTALT to toggle recording, so streaming output may not appear, and in the worst case it might trigger other actions.

Personally, I chose HOME and made SCROLLOCK the default, and those keys would not collide with streaming output.

I haven't tested this out yet, but it's something to look at. Mostly leaving this as a note for myself.

lokkju commented Jan 30, 2026

I'm not sure how the modifier key sequences would affect output? At least when using Wayland's virtual keyboard protocol or libei compatibility such as in Mutter, it should be treated as a separate input device, so the modifier keys on your actual keyboard shouldn't affect the input coming from wtype/eitype; is that what you were potentially seeing as a problem?

My main daily driver is a Framework 13, and its F12 function key is a gear icon that maps to KEY_MEDIA; absolutely perfect for the hold-to-talk key.

Right now, with the int4 quant, it's still just slightly too slow for my preferences, but you do get the typing as you speak, which is quite nice. I'm exploring the webgpu backend, but so far its limited feature set seems to actually cause slower processing than AVX-512. Have you tried ncnn at all? Supposedly ONNX can be converted, and it has great Vulkan backend support.

@peteonrails
Owner

@lokkju

so the modifier keys on your actual keyboard shouldn't affect the input coming from wtype/eitype; is that what you were potentially seeing as a problem?

Yes, when Omarchy incorporated Voxtype, users would report seeing wild results when unkeying SUPER-SHIFT-X because the keystrokes started before the user finished unkeying - I want to make sure I don't set folks up for that.

But if you are right about it being a separate input device, then it sounds like the diagnosis may have been off - I'll have to look into that some more.

Have not tried ncnn but will allocate a little time to look into it. Looking forward to getting Nemotron merged soon - even if experimental!

lokkju commented Jan 31, 2026

It's looking like the webgpu experiments won't help much here, nor Vulkan backends; the streaming nature causes too much transfer between device and host memory. As it is, fp32 is 3x realtime, with int8 being 8x and int4 4x (which doesn't make sense, but I haven't debugged it yet).

@peteonrails
Owner

@lokkju

It's looking like the webgpu experiments won't help much here, nor vulkan backends;

That is unfortunate - I hope to get some numbers for you on my testing workstation this weekend; here's hoping they show some promise.

Add probe_output_chain() and output_with_cached_index() to
src/output/mod.rs. These allow probing the output driver chain
once and reusing the cached index for subsequent output calls,
skipping redundant is_available() subprocess spawns per delta.

- probe_output_chain(): walks chain, returns index of first
  available driver
- output_with_cached_index(): outputs directly via cached index,
  falls back to full probe if cached driver fails
- MockTextOutput test helper with call counters
- Tests verifying cached path skips is_available() entirely
- Latency baseline test: 15 mock deltas in <50ms
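The probe-once / reuse pattern can be sketched like this. The function names match the commit, but the trait and signatures here are simplified stand-ins for the real TextOutput drivers:

```rust
// Minimal stand-in for the output driver trait; real drivers spawn a
// subprocess (wtype, eitype, ...) inside is_available().
trait TextOutput {
    fn is_available(&self) -> bool;
    fn output(&self, text: &str) -> Result<(), String>;
}

// Walk the chain once and remember the first available driver.
fn probe_output_chain(chain: &[Box<dyn TextOutput>]) -> Option<usize> {
    chain.iter().position(|d| d.is_available())
}

// Fast path uses the cached index; on failure, fall back to a full probe.
fn output_with_cached_index(
    chain: &[Box<dyn TextOutput>],
    cached: &mut Option<usize>,
    text: &str,
) -> Result<(), String> {
    if let Some(i) = *cached {
        if chain[i].output(text).is_ok() {
            return Ok(()); // no is_available() subprocess spawns per delta
        }
    }
    *cached = probe_output_chain(chain);
    match *cached {
        Some(i) => chain[i].output(text),
        None => Err("no output driver available".into()),
    }
}

// Mock driver for exercising the cached path.
struct Mock {
    avail: bool,
}

impl TextOutput for Mock {
    fn is_available(&self) -> bool {
        self.avail
    }
    fn output(&self, _text: &str) -> Result<(), String> {
        if self.avail { Ok(()) } else { Err("down".into()) }
    }
}

fn main() {
    let chain: Vec<Box<dyn TextOutput>> =
        vec![Box::new(Mock { avail: false }), Box::new(Mock { avail: true })];
    let mut cached = probe_output_chain(&chain);
    assert_eq!(cached, Some(1));
    assert!(output_with_cached_index(&chain, &mut cached, "hi").is_ok());
}
```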
In src/daemon.rs, add streaming_output_chain and
streaming_output_index fields to Daemon. These are set once
when a streaming session begins (begin_streaming_session) and
reused for every text delta, eliminating per-delta calls to
create_output_chain() and is_available() subprocess probes.

- Set cache in begin_streaming_session() after probing once
- Use output_with_cached_index() in the text delta handler
- Take cached chain for final flush output at session end
- Clear cache in cancel_streaming() on session cancel
- Fallback to full probe if cache is missing or driver fails

Reduces ~45ms of subprocess overhead across a typical 15-delta
streaming session to ~3ms (single probe at session start).

lokkju commented Jan 31, 2026

So at this point, it's not the models; it's the rest of the system around it. I've optimized the typing-tool check to cache on launch, and that saves 5ms twice per second, which adds up. Similar small gains elsewhere as well. I think the next step is to use eitype/wtype as a library rather than launching it as a tool; I'll be trying that tonight or tomorrow. Regardless, it's usable right now, and a better experience than speaking for a while, then waiting for the whole transcription, for sure.

@peteonrails
Owner

I'm almost ready with 0.6.0: as soon as I ship that I'll start integrating this branch. Thank you again for this really solid contribution!
