
Nemotron streaming transcription (real-time text output) #155

Draft
lokkju wants to merge 23 commits into peteonrails:main from lokkju:research/parakeet-streaming-output

Conversation

@lokkju lokkju commented Jan 30, 2026

Summary

Adds Nemotron Speech Streaming EN 0.6B support (issue #47) to the Parakeet engine, enabling real-time text output during recording. Text is typed incrementally as you speak rather than waiting until recording stops.

  • New model type: ParakeetModelType::Nemotron, auto-detected from model files
  • New trait: StreamingTranscriber for incremental chunk-based transcription
  • New state: State::StreamingRecording with live text output during recording
  • Persistent model: loaded once at daemon startup, reused across recordings
  • Model download: available via voxtype setup model (option 13)
  • parakeet-rs bump: 0.2.9 → 0.3.1 (adds Nemotron API, renames rocmmigraphx)
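The StreamingTranscriber trait listed above can be sketched roughly as follows. The method names come from the PR's commit messages, but the exact signatures are assumptions, and the implementation here is a deliberately trivial stand-in, not the real Nemotron wrapper:

```rust
// Sketch of the StreamingTranscriber trait; method names follow the PR's
// commit messages, but these exact signatures are assumptions.
trait StreamingTranscriber {
    /// Feed one fixed-size chunk of 16 kHz mono samples and get back only
    /// the new text (a delta, not the cumulative transcript).
    fn transcribe_chunk(&mut self, samples: &[f32]) -> String;
    /// Drain remaining decoder state (the real code pads with silence).
    fn flush(&mut self) -> String;
    /// Clear state so the transcriber can be reused for the next recording.
    fn reset(&mut self);
    /// Full transcript accumulated so far.
    fn get_transcript(&self) -> &str;
    /// Samples per chunk the model expects (8960 = 560 ms at 16 kHz).
    fn chunk_size(&self) -> usize;
}

// Toy implementation that just counts chunks, to exercise the trait shape.
struct CountingTranscriber {
    transcript: String,
    chunks: usize,
}

impl StreamingTranscriber for CountingTranscriber {
    fn transcribe_chunk(&mut self, _samples: &[f32]) -> String {
        self.chunks += 1;
        let delta = format!("chunk{} ", self.chunks);
        self.transcript.push_str(&delta);
        delta
    }
    fn flush(&mut self) -> String {
        String::new()
    }
    fn reset(&mut self) {
        self.transcript.clear();
        self.chunks = 0;
    }
    fn get_transcript(&self) -> &str {
        &self.transcript
    }
    fn chunk_size(&self) -> usize {
        8_960
    }
}

fn main() {
    let mut t = CountingTranscriber { transcript: String::new(), chunks: 0 };
    let n = t.chunk_size();
    assert_eq!(t.transcribe_chunk(&vec![0.0; n]), "chunk1 ");
    assert_eq!(t.get_transcript(), "chunk1 ");
    t.reset();
    assert_eq!(t.get_transcript(), "");
}
```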

How it works

Audio chunks (560ms / 8960 samples at 16kHz) are fed to the Nemotron model via a persistent blocking task. Text deltas are sent back through a channel and typed immediately via the output chain. On recording stop, remaining audio is flushed through silence padding.
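A minimal sketch of that pipeline, using plain std threads and channels in place of the daemon's async tasks (the model call is a stand-in, not the parakeet-rs API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for Nemotron inference: here it just reports how
// many samples it consumed.
fn transcribe_chunk(samples: &[f32]) -> String {
    format!("[{} samples] ", samples.len())
}

const CHUNK_SAMPLES: usize = 8_960; // 560 ms at 16 kHz

fn main() {
    let (chunk_tx, chunk_rx) = mpsc::channel::<Vec<f32>>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // Persistent blocking task: owns the model, turns chunks into deltas.
    let worker = thread::spawn(move || {
        for chunk in chunk_rx {
            let delta = transcribe_chunk(&chunk);
            if text_tx.send(delta).is_err() {
                break;
            }
        }
    });

    // Recording loop: feed two fixed-size chunks, then drop the sender to
    // signal end of audio (the real code flushes with silence padding).
    for _ in 0..2 {
        chunk_tx.send(vec![0.0; CHUNK_SAMPLES]).unwrap();
    }
    drop(chunk_tx);

    // In the daemon each delta is typed immediately; here we just collect.
    let transcript: String = text_rx.iter().collect();
    worker.join().unwrap();
    assert_eq!(transcript, "[8960 samples] [8960 samples] ");
}
```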

Config

engine = "parakeet"

[parakeet]
model = "nemotron-speech-streaming-en-0.6b"
# streaming is auto-enabled for Nemotron models
# streaming = false  # to force batch mode instead

Status: Not Ready

This branch is functional but has a significant limitation:

Inference speed on CPU is too slow for real-time use. The 0.6B fp32 model takes longer than 560ms to process each chunk on CPU, so text output lags behind speech. It works, but with noticeable delay.

Path forward

  • int8/int4 quantized models are being worked on. Dynamic int8 quantization should bring the model from ~2.6 GB to ~670 MB with proportionally faster inference. int4 would be even smaller/faster. This is the most likely fix for CPU users.
  • CUDA acceleration (--features parakeet,parakeet-cuda) would make this real-time on NVIDIA GPUs but hasn't been tested yet.
  • MIGraphX (--features parakeet,parakeet-rocm) for AMD discrete GPUs — compiles but untested.

Test plan

  • cargo build --features parakeet compiles
  • cargo build without parakeet compiles (no regressions)
  • cargo test — 282 tests pass
  • Manual: Nemotron model appears in voxtype setup model
  • Manual: text appears live during recording
  • Manual: second recording works without errors
  • Manual: batch transcription still works for CTC/TDT models
  • Performance: real-time streaming on CPU with quantized model
  • Test CUDA acceleration

peteonrails and others added 19 commits January 29, 2026 16:49
Implement parallel transcription of audio chunks during recording,
reducing perceived latency on slower machines or for long recordings.

New config options (all off by default):
- eager_processing: enable/disable the feature
- eager_chunk_secs: audio chunk duration (default 5.0s)
- eager_overlap_secs: overlap between chunks (default 0.5s)

New CLI flags:
- --eager-processing
- --eager-chunk-secs <SECS>
- --eager-overlap-secs <SECS>

The implementation chunks audio during recording, transcribes chunks in
parallel background tasks, then combines results with word-level
deduplication at chunk boundaries when recording stops.
- Add MEDIA, RECORD, REWIND, FASTFORWARD key names to
  evdev_listener.rs parse_key_name()
- Support prefixed numeric keycodes: WEV_/X11_/XEV_ for
  XKB keycodes (offset by 8), EVTEST_ for kernel keycodes
- Support hex values after prefix (e.g. WEV_0xEA)
- Reject bare numeric keycodes as ambiguous with a helpful
  error message explaining the prefix requirement
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with
  new key names and numeric keycode documentation
- Update --hotkey CLI help text in src/cli.rs
- Add src/output/eitype.rs implementing TextOutput trait
  using eitype CLI tool (libei/EI protocol for Wayland)
- Add Eitype variant to OutputDriver enum in src/config.rs
  with Display, FromStr, and serde support
- Add EitypeNotFound error variant in src/error.rs
- Insert eitype after wtype in default fallback chain in
  src/output/mod.rs (wtype -> eitype -> dotool -> ydotool
  -> clipboard -> xclip)
- Update docs/CONFIGURATION.md and docs/USER_MANUAL.md with
  eitype driver documentation and GNOME/KDE compatibility
  table

eitype uses the Emulated Input protocol supported by GNOME
and KDE, where wtype's virtual-keyboard protocol does not
work.
- Add eitype to OutputChainStatus struct and detection
  logic in src/setup/mod.rs
- Display eitype status in print_output_chain_status()
  between wtype and ydotool
- Include eitype in primary method detection (after wtype,
  before ydotool)
- Add eitype install suggestion when no output method is
  available on Wayland
- Bump parakeet-rs from 0.2.9 to 0.3.1 in Cargo.toml/Cargo.lock
  (maps parakeet-rocm feature to migraphx, matching upstream rename)
- Add Nemotron variant to ParakeetModelType enum in config.rs
- Add streaming config option (auto-enabled for Nemotron models)
- Add StreamingTranscriber trait in transcribe/mod.rs with
  transcribe_chunk, flush, reset, get_transcript, chunk_size methods
- Implement NemotronStreamingTranscriber in transcribe/parakeet.rs
  wrapping parakeet_rs::Nemotron with delta-based text output
- Extend detect_model_type to recognize Nemotron file structure
  (encoder.onnx + decoder_joint.onnx + tokenizer.model)
- Add non-streaming Nemotron support via transcribe_audio in
  ParakeetTranscriber for batch transcription fallback
- Add State::StreamingRecording variant in state.rs with audio_buffer
  and text_output_so_far tracking
- Integrate streaming into daemon.rs main event loop:
  - Channel architecture: audio_rx -> chunk_tx -> blocking task -> text_tx
  - Live text output via output_with_fallback during recording
  - Flush on hotkey release to drain remaining decoder state
  - Support in push-to-talk, toggle, and external trigger modes
  - Streaming cleanup in cancel, timeout, and shutdown handlers
- Add tests for Nemotron model detection and streaming state
- Add Nemotron model type to CONFIGURATION.md model_type values
- Add auto-detection table for model file structures
- Document streaming config option in CONFIGURATION.md
- Add Nemotron complete example alongside TDT example
- Add "Nemotron Streaming" section to USER_MANUAL.md
  explaining real-time text output during recording
Run cargo fmt on the files modified for Nemotron streaming support.
- Add nemotron-speech-streaming-en-0.6b to PARAKEET_MODELS in
  setup/model.rs, downloadable via `voxtype setup model`
- Add huggingface_path field to ParakeetModelInfo for repos where
  model files are in a subdirectory (Nemotron uses altunenes/parakeet-rs
  repo with files under nemotron-speech-streaming-en-0.6b/)
- Update resolve_model_path error message in parakeet.rs to mention
  Nemotron download link and `voxtype setup model` command
- Update test_parakeet_models_have_files to accept tokenizer.model
  as a valid tokenizer file (Nemotron uses SentencePiece, not vocab.txt)
- Add nemotron to is_parakeet_model test
- Add contrib/nemotron-streaming-test-config.toml sample config
Nemotron models use non-hyphenated filenames (encoder.onnx,
decoder_joint.onnx) and tokenizer.model instead of vocab.txt.
Without this, downloaded Nemotron models would not show [installed].
should_use_streaming() was defaulting to TDT when model_type wasn't
explicitly set in config, instead of detecting from model files on disk.
Now uses resolve_model_path + detect_model_type to check the actual
model directory, enabling streaming automatically for Nemotron models.
The Nemotron streaming model was being loaded fresh on every recording
start (~3s), causing the hotkey release to fire before the model was
ready. Now the streaming transcriber is loaded once at daemon startup
and persists across recordings via a command channel (Audio/Flush/Reset/
Shutdown). Recording start just sends a Reset command (instant).
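The persistent-task design described in this commit can be sketched with std channels. The StreamingCommand variant names come from the message above; the payload types and the toy "inference" are assumptions:

```rust
use std::sync::mpsc;
use std::thread;

// Command channel for the long-lived streaming task; variant names follow
// the commit message, payload types are assumptions.
enum StreamingCommand {
    Audio(Vec<f32>),
    Flush,
    Reset,
    Shutdown,
}

fn main() {
    let (cmd_tx, cmd_rx) = mpsc::channel::<StreamingCommand>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // Loaded once at daemon startup; lives across recordings.
    let task = thread::spawn(move || {
        for cmd in cmd_rx {
            match cmd {
                // Stand-in for model inference on one 560 ms chunk.
                StreamingCommand::Audio(chunk) => {
                    let _ = text_tx.send(format!("[{}] ", chunk.len()));
                }
                StreamingCommand::Flush => {
                    let _ = text_tx.send("<flush>".to_string());
                }
                StreamingCommand::Reset => { /* clear decoder state */ }
                StreamingCommand::Shutdown => break,
            }
        }
    });

    cmd_tx.send(StreamingCommand::Reset).unwrap(); // recording start: instant
    cmd_tx.send(StreamingCommand::Audio(vec![0.0; 8_960])).unwrap();
    cmd_tx.send(StreamingCommand::Flush).unwrap();
    cmd_tx.send(StreamingCommand::Shutdown).unwrap();
    task.join().unwrap();

    // Receiver iteration ends once the task (and its sender) is gone.
    let out: String = text_rx.iter().collect();
    assert_eq!(out, "[8960] <flush>");
}
```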
- Don't drop streaming_text_rx between recordings — the task persists
  and needs both channel ends alive across sessions
- stop_streaming() now waits for flush text with timeout and returns it,
  instead of the caller doing try_recv (which raced with flush processing)
- Store streaming_chunk_size in Daemon struct instead of hardcoding
- Use StreamingCommand::Shutdown in daemon shutdown path
parakeet-rs transcribe_chunk() already returns only the new tokens
(delta), not the full cumulative transcript. Our wrapper was treating
the return as cumulative and trying to extract a delta by tracking
last_transcript_len, which produced garbled output.
parakeet-rs 0.3.1 renamed ExecutionProvider::ROCm to
ExecutionProvider::MIGraphX. The Cargo feature was already updated
but the enum variant in the code was missed.
When the hotkey is released, audio chunks may still be queued for
processing. The old stop_streaming() only waited for flush output
and broke early, losing text from in-flight chunks.

Now the flush handler sends a sentinel ("\0") after completing, and
stop_streaming() collects all text deltas (from pending chunks AND
flush) until it receives the sentinel. Timeout bumped to 30s to
accommodate slow CPU inference of queued chunks.
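The drain-until-sentinel loop might look roughly like this; the function shape and timeout handling are assumptions based on the description above:

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

// Collect pending chunk deltas plus flush output until the "\0" sentinel
// arrives, so text from in-flight chunks is not lost on hotkey release.
fn stop_streaming(text_rx: &mpsc::Receiver<String>) -> String {
    let mut text = String::new();
    loop {
        match text_rx.recv_timeout(Duration::from_secs(30)) {
            Ok(delta) if delta == "\0" => break, // flush handler finished
            Ok(delta) => text.push_str(&delta),
            // Give up on timeout or if the worker side went away.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    text
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("hello ".to_string()).unwrap();
    tx.send("world".to_string()).unwrap();
    tx.send("\0".to_string()).unwrap();
    assert_eq!(stop_streaming(&rx), "hello world");
}
```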
Add nemotron-speech-streaming-en-0.6b-int8 and int4 to
PARAKEET_MODELS in src/setup/model.rs. These are quantized
variants hosted under lokkju/ on HuggingFace that use the
same encoder.onnx/decoder_joint.onnx filenames (required by
parakeet-rs from_pretrained). Update related tests.

lokkju commented Jan 30, 2026

I've created int4 and int8 quants, and they do provide a speed and size advantage while seemingly retaining the same quality:
https://huggingface.co/lokkju/nemotron-speech-streaming-en-0.6b-int4
https://huggingface.co/lokkju/nemotron-speech-streaming-en-0.6b-int8

The Parakeet model detection in `voxtype setup check` only looked
for directories containing "parakeet" in the name and checked for
TDT/CTC file patterns. Nemotron models were missed entirely.

Replace the ad-hoc directory scan with a loop over known model
names using validate_parakeet_model(), which already handles all
model file structures (TDT, CTC, Nemotron).
@peteonrails
Owner

One thing for us to think about is that most of our users assign modifier key sequences like SUPER-SHIFT-X or RIGHTALT to toggle recording, so streaming output may not appear, and in the worst case it might trigger other actions.

Personally, I chose HOME and made SCROLLOCK the default, and those keys would not collide with streaming output.

I haven't tested this out yet, but it's something to look at. Mostly leaving this as a note for myself.

lokkju commented Jan 30, 2026

I'm not sure how the modifier key sequences would affect output? At least when using Wayland's virtual keyboard protocol or libei compatibility such as in Mutter, it should be treated as a separate input device, so the modifier keys on your actual keyboard shouldn't affect the input coming from wtype/eitype; is that what you were potentially seeing as a problem?

My main daily driver is a Framework 13, and its F12 function key is a gear icon that maps to KEY_MEDIA; absolutely perfect for the hold-to-talk key.

Right now, with the int4 quant, it's still just slightly too slow for my preferences, but you do get the typing as you speak, which is quite nice. I'm exploring the webgpu backend, but so far its limited feature set seems to actually cause slower processing than AVX-512. Have you tried ncnn at all? Supposedly ONNX can be converted, and it has great Vulkan backend support.

@peteonrails
Owner

@lokkju

so the modifier keys on your actual keyboard shouldn't affect the input coming from wtype/eitype; is that what you were potentially seeing as a problem?

Yes, when Omarchy incorporated Voxtype, users would report seeing wild results when unkeying SUPER-SHIFT-X because the keystrokes started before the user finished unkeying - I want to make sure I don't set folks up for that.

But if you are right about it being a separate input device, then it sounds like the diagnosis may have been off - I'll have to look into that some more.

Have not tried ncnn but will allocate a little time to look into it. Looking forward to getting Nemotron merged soon - even if experimental!

lokkju commented Jan 31, 2026

It's looking like the webgpu experiments won't help much here, nor Vulkan backends; the streaming nature causes too much transfer between device and host memory. As it is, fp32 is 3x realtime, with int8 being 8x and int4 4x (which doesn't make sense, but I haven't debugged it yet).

@peteonrails
Owner

@lokkju

It's looking like the webgpu experiments won't help much here, nor vulkan backends;

That is unfortunate - I hope to get some numbers for you on my testing workstation this weekend; here's hoping they show some promise.

Add probe_output_chain() and output_with_cached_index() to
src/output/mod.rs. These allow probing the output driver chain
once and reusing the cached index for subsequent output calls,
skipping redundant is_available() subprocess spawns per delta.

- probe_output_chain(): walks chain, returns index of first
  available driver
- output_with_cached_index(): outputs directly via cached index,
  falls back to full probe if cached driver fails
- MockTextOutput test helper with call counters
- Tests verifying cached path skips is_available() entirely
- Latency baseline test: 15 mock deltas in <50ms
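The probe-once / reuse pattern can be sketched like this. The function names match the commit, but the trait and signatures here are simplified stand-ins for the real TextOutput drivers:

```rust
// Minimal stand-in for the output driver trait; real drivers spawn a
// subprocess (wtype, eitype, ...) inside is_available().
trait TextOutput {
    fn is_available(&self) -> bool;
    fn output(&self, text: &str) -> Result<(), String>;
}

// Walk the chain once and remember the first available driver.
fn probe_output_chain(chain: &[Box<dyn TextOutput>]) -> Option<usize> {
    chain.iter().position(|d| d.is_available())
}

// Fast path uses the cached index; on failure, fall back to a full probe.
fn output_with_cached_index(
    chain: &[Box<dyn TextOutput>],
    cached: &mut Option<usize>,
    text: &str,
) -> Result<(), String> {
    if let Some(i) = *cached {
        if chain[i].output(text).is_ok() {
            return Ok(()); // no is_available() subprocess spawns per delta
        }
    }
    *cached = probe_output_chain(chain);
    match *cached {
        Some(i) => chain[i].output(text),
        None => Err("no output driver available".into()),
    }
}

// Mock driver for exercising the cached path.
struct Mock {
    avail: bool,
}

impl TextOutput for Mock {
    fn is_available(&self) -> bool {
        self.avail
    }
    fn output(&self, _text: &str) -> Result<(), String> {
        if self.avail { Ok(()) } else { Err("down".into()) }
    }
}

fn main() {
    let chain: Vec<Box<dyn TextOutput>> =
        vec![Box::new(Mock { avail: false }), Box::new(Mock { avail: true })];
    let mut cached = probe_output_chain(&chain);
    assert_eq!(cached, Some(1));
    assert!(output_with_cached_index(&chain, &mut cached, "hi").is_ok());
}
```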
In src/daemon.rs, add streaming_output_chain and
streaming_output_index fields to Daemon. These are set once
when a streaming session begins (begin_streaming_session) and
reused for every text delta, eliminating per-delta calls to
create_output_chain() and is_available() subprocess probes.

- Set cache in begin_streaming_session() after probing once
- Use output_with_cached_index() in the text delta handler
- Take cached chain for final flush output at session end
- Clear cache in cancel_streaming() on session cancel
- Fallback to full probe if cache is missing or driver fails

Reduces ~45ms of subprocess overhead across a typical 15-delta
streaming session to ~3ms (single probe at session start).

lokkju commented Jan 31, 2026

So at this point, it's not the models; it's the rest of the system around it. I've optimized the typing-tool check to cache on launch, and that saves 5ms twice per second, which adds up. Similar small gains elsewhere as well. I think the next step is to use eitype/wtype as a library rather than launching it as a tool; I'll be trying that tonight or tomorrow. Regardless, it's usable right now, and a better experience than speaking for a while, then waiting for the whole transcription, for sure.

@peteonrails
Owner

I'm almost ready with 0.6.0: as soon as I ship that I'll start integrating this branch. Thank you again for this really solid contribution!
