The voice input tool Linux has been missing — AI-powered and open source.
🎤 Press hotkey → Speak → AI formats → Text appears in your app
macOS has polished voice input tools with AI post-processing, but Linux has had no equivalent — just raw STT with no intelligence behind it.
koe fills that gap:
- AI post-processing — raw transcription is refined by an LLM that understands your context
- Context-aware — reads the active window title/app to tailor output (code comments in an editor, natural prose in a doc)
- Fully local option — whisper.cpp + Ollama means zero data leaves your machine
- Built in Rust — single binary, low latency, minimal resource usage
- Dictionary support — domain-specific term correction for accurate technical vocabulary
| Feature | koe | nerd-dictation | Google Docs Voice | No voice input |
|---|---|---|---|---|
| AI post-processing | Yes (Claude / Ollama) | No | No | — |
| Context-aware formatting | Yes (active window) | No | No | — |
| Fully local operation | Yes (whisper.cpp + Ollama) | Yes | No | — |
| Custom dictionary | Yes (declarative TOML + AI-aware) | Yes (Python scripting) | No | — |
| Hotkey modes | Push-to-talk + Toggle | Push-to-talk | Button | — |
| Direct typing to any app | Yes | Yes | Google Docs only | — |
| Language | Any (Whisper) | Any (Vosk) | Many | — |
- Speech Recognition: Switch between whisper.cpp (local) and OpenAI Whisper API (cloud)
- AI Post-Processing: Switch between Claude API and Ollama (local LLM)
- Context Awareness: Active window info is sent to the AI for context-appropriate formatting
- Dictionary Management: Domain-specific term dictionaries improve recognition accuracy
- Hotkeys: Push-to-talk and toggle modes with configurable key bindings
- Direct Input: Types directly into the active window, with clipboard fallback
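Whisper models expect 16 kHz mono input, so captured audio is downmixed and resampled before recognition. Here is a minimal sketch of that conversion using naive linear interpolation; the function name and approach are illustrative, not koe's actual `src/audio.rs` implementation:

```rust
// Naive downmix + linear-interpolation resample to 16 kHz mono.
// Illustrative only; a real implementation would likely use a
// proper resampling filter to avoid aliasing.
fn to_16k_mono(samples: &[f32], channels: usize, src_rate: u32) -> Vec<f32> {
    // Average interleaved channels into a mono buffer.
    let mono: Vec<f32> = samples
        .chunks(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect();
    let ratio = src_rate as f64 / 16_000.0;
    let out_len = (mono.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            // Linear interpolation between the two nearest source samples.
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = mono[idx];
            let b = *mono.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}
```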
```mermaid
flowchart TD
    A["User presses hotkey\n(rdev)"] --> B{Mode}
    B -->|Push-to-Talk: key down| C["Start recording\n(cpal)"]
    B -->|Toggle: first press| C
    C --> D["Accumulate mic input\nas f32 PCM buffer"]
    D --> E["Release / re-press hotkey"]
    E --> F["Stop recording\nget AudioData"]
    F --> G["Resample to 16kHz mono"]
    G --> H{Speech Recognition Engine}
    H -->|whisper_local| I["whisper-rs\n(local whisper.cpp)"]
    H -->|openai_api| J["OpenAI Whisper API\n(reqwest multipart)"]
    I --> K["Raw text"]
    J --> K
    K --> L["Dictionary term replacement\n(dictionary.rs)"]
    L --> M["Corrected text"]
    M --> N["Get active window info\n(x11rb)"]
    N --> O["Window title + app name"]
    O --> P["Build AI post-processing prompt\ncorrected text + context + dictionary"]
    P --> Q{AI Engine}
    Q -->|claude| R["Claude API\n(reqwest)"]
    Q -->|ollama| S["Ollama API\n(reqwest)"]
    R --> T["Formatted text"]
    S --> T
    T --> U["Type into active window\n(enigo)"]
    U --> V{Input result}
    V -->|Success| W["Done → back to Idle"]
    V -->|Failure| X["Paste via clipboard\n(arboard + Ctrl+V)"]
    X --> W
```
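The prompt-building step combines the corrected text with the window context and dictionary hints. A rough sketch of what that assembly could look like; the function signature, field names, and wording are hypothetical, not koe's actual prompt format:

```rust
// Hypothetical prompt assembly for the AI post-processing step.
// The instruction wording is illustrative, not koe's real prompt.
fn build_prompt(text: &str, window_title: &str, app: &str, hints: &str) -> String {
    format!(
        "You are a dictation post-processor.\n\
         Active application: {app}\n\
         Window title: {window_title}\n\
         Domain hints: {hints}\n\
         Reformat the following transcription for this context, \
         fixing punctuation and terminology. Output only the text.\n\n{text}"
    )
}
```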
```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Recording : Hotkey pressed
    Recording --> Processing : Hotkey released / re-pressed
    Processing --> Typing : Recognition + AI processing complete
    Typing --> Idle : Input complete
    Processing --> Idle : Error / empty result
    Recording --> Idle : Recording error
```
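The state transitions above can be sketched as a small pure function. All names here are illustrative, not taken from `src/main.rs`:

```rust
// Hypothetical sketch of koe's state machine; the actual event loop
// in src/main.rs may model this differently.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Idle,
    Recording,
    Processing,
    Typing,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    HotkeyPressed,
    HotkeyReleased,
    ProcessingDone,
    InputDone,
    Error,
}

fn next(state: State, event: Event) -> State {
    use {Event::*, State::*};
    match (state, event) {
        (Idle, HotkeyPressed) => Recording,
        (Recording, HotkeyReleased) => Processing, // push-to-talk: key up
        (Recording, HotkeyPressed) => Processing,  // toggle: second press
        (Recording, Error) => Idle,
        (Processing, ProcessingDone) => Typing,
        (Processing, Error) => Idle, // error or empty result
        (Typing, InputDone) => Idle,
        (s, _) => s, // ignore events that don't apply in the current state
    }
}
```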
```
[Hotkey (rdev)] → [Audio Capture (cpal)] → [Speech Recognition] → [AI Post-Processing] → [Text Input (enigo)]
                                                  ↑                        ↑
                                        whisper-rs / OpenAI         Claude / Ollama
                                                                           ↑
                                                             [Context Capture (x11rb)]
                                                             [Dictionary Manager]
```
| File | Role |
|---|---|
| `src/main.rs` | Event loop, state management (Idle → Recording → Processing → Typing) |
| `src/config.rs` | TOML config file loading |
| `src/audio.rs` | Mic recording via cpal, 16kHz resampling, WAV encoding |
| `src/recognition/whisper_local.rs` | Local speech recognition via whisper-rs |
| `src/recognition/openai_api.rs` | Speech recognition via OpenAI Whisper API |
| `src/ai/claude.rs` | Text post-processing via Claude API |
| `src/ai/ollama.rs` | Text post-processing via Ollama |
| `src/context.rs` | Active window title/class capture via x11rb |
| `src/input.rs` | Direct typing via enigo + clipboard paste fallback |
| `src/hotkey.rs` | Global hotkey via rdev (Push-to-Talk / Toggle) |
| `src/dictionary.rs` | TOML dictionary loading and term replacement |
Download the latest pre-built binary from the Releases page.
Extract and place the `koe` binary somewhere in your `$PATH` (e.g. `~/.local/bin/`). Then skip ahead to Download Whisper Model.
```bash
sudo apt install -y libasound2-dev libclang-dev libxkbcommon-dev \
  libx11-dev libxi-dev libxext-dev libxtst-dev libxfixes-dev cmake \
  libgtk-4-dev libadwaita-1-dev libvulkan-dev
```

```bash
mkdir -p ~/.local/share/koe/models
wget -O ~/.local/share/koe/models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```

```bash
# If using Claude for AI post-processing
export ANTHROPIC_API_KEY="your-key-here"

# If using OpenAI Whisper API for recognition
export OPENAI_API_KEY="your-key-here"
```

```bash
cargo build --release
./target/release/koe
```

```toml
[recognition]
engine = "whisper_local" # "whisper_local" | "openai_api"

[recognition.whisper_local]
model_path = "~/.local/share/koe/models/ggml-large-v3.bin"
language = "ja"

[recognition.openai_api]
api_key_env = "OPENAI_API_KEY"
language = "ja"

[ai]
engine = "claude" # "claude" | "ollama"

[ai.claude]
api_key_env = "ANTHROPIC_API_KEY"
model = "claude-sonnet-4-6"

[ai.ollama]
host = "http://localhost:11434"
model = "qwen2.5:14b"

[hotkey]
mode = "push_to_talk" # "push_to_talk" | "toggle"
key = "Super_R"

[dictionaries]
paths = ["dictionaries/default.toml"]
```

`dictionaries/default.toml`:

```toml
[terms]
"ラスト" = "Rust"
"クロード" = "Claude"
"ウブンツ" = "Ubuntu"

[context_hints]
domain = "software development"
notes = "Prefer English for programming language and tool names"
```

| Function | Crate |
|---|---|
| Audio recording | cpal |
| Local speech recognition | whisper-rs (whisper.cpp bindings) |
| Global hotkey | rdev |
| Keyboard input | enigo |
| X11 window info | x11rb |
| Clipboard | arboard |
| HTTP client | reqwest (rustls) |
| Async runtime | tokio |
| Config file | serde + toml |
| Logging | tracing |
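As an illustration of how the `[terms]` dictionary could drive term replacement, here is a minimal sketch (function name and approach are hypothetical, not the actual `src/dictionary.rs` code):

```rust
use std::collections::BTreeMap;

// Minimal sketch of dictionary-based term replacement: the [terms]
// table maps misrecognized strings to corrections. Longer keys are
// applied first so overlapping terms don't clobber each other.
// Illustrative only; not koe's real implementation.
fn apply_terms(text: &str, terms: &BTreeMap<String, String>) -> String {
    let mut keys: Vec<&String> = terms.keys().collect();
    keys.sort_by_key(|k| std::cmp::Reverse(k.chars().count()));
    let mut out = text.to_string();
    for key in keys {
        out = out.replace(key.as_str(), &terms[key.as_str()]);
    }
    out
}
```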
koe's behavior depends on your configuration. Here's what data goes where:
| Configuration | Data sent externally | Data stays local |
|---|---|---|
| whisper-rs + Ollama (fully local) | Nothing | Audio, transcription, window context, all processing |
| OpenAI Whisper API + Ollama | Audio (to OpenAI for STT) | Window context, post-processing |
| whisper-rs + Claude API | Transcribed text + active window title/app name (to Anthropic) | Audio |
| OpenAI Whisper API + Claude API | Audio (to OpenAI) + transcribed text + window context (to Anthropic) | — |
Note on context awareness: When using cloud AI (Claude API), koe sends the active window title and application name along with the transcribed text. Window titles may contain sensitive information (file paths, URLs, email subjects). For maximum privacy, use the fully local setup (whisper-rs + Ollama).

