Voice roadmap: picker -> provider abstraction -> local stack -> fleet command centre #691

heavygee · 2026-05-25T17:52:08Z

heavygee
May 25, 2026
Collaborator

Hey all

I've been running Hapi for a little while as my daily remote for Claude Code and Gemini, and I've been doing a lot of thinking about the voice interface -- where it is, where it could go. Figured an Ideas thread was the right place rather than a stream of issues.

Where we just got to

PR #690 landed today -- a voice picker in Settings that fetches your account's voices dynamically from ElevenLabs (so your instant voice clones show up alongside pre-made ones), with a play button on each so you can hear before you choose. Small thing, but it felt wrong to hardcode a list when ElevenLabs accounts already carry personalised voices.

Issue #686 tracks it.

What I've been thinking about next, roughly in order

Provider abstraction

The current voice path is ElevenLabs ConvAI all the way down. That's fine if you have a key, but I'd like a VoiceTransportProvider interface that lets different backends plug in. ElevenLabs stays the default, nothing breaks for existing users.

Which brings me to PR #401 from @Overbaker. It adds Gemini Live and Qwen Realtime backends (both cloud APIs, from different vendors than ElevenLabs), runtime backend discovery via GET /api/voice/backend, and lazy loading of alternative sessions. The architecture is exactly right. The problem is it's been sitting with unresolved HAPI Bot findings since late April with no activity from the author.

Note

I've declared intent to take #401 over -- rebase it onto current main, work through the open issues. Forgiveness over permission at this point. If @Overbaker reappears, great, I'll coordinate; if not, the work needs to move.

Update: rebase filed as PR #692.

Local speech stack

Once a provider interface exists, I'd want to add a local backend. I run Speaches (STT) and Chatterbox (TTS) already on my home server -- both expose OpenAI-compatible endpoints. Pipecat is the framework I'm looking at for orchestrating the audio pipeline.
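To make "OpenAI-compatible" concrete, here is a minimal sketch of what the request plumbing for a local backend might look like. The paths follow the OpenAI audio API convention (/v1/audio/speech and /v1/audio/transcriptions); the hosts, ports, model names and voice name are placeholders I made up, and whether a given Speaches or Chatterbox deployment accepts exactly these fields is an assumption to verify against its own docs.

```typescript
// Sketch only: every host, model and voice value passed in is a placeholder.
interface LocalSpeechConfig {
  sttBaseUrl: string; // e.g. a Speaches instance
  ttsBaseUrl: string; // e.g. a Chatterbox instance
  sttModel: string;
  ttsModel: string;
  ttsVoice: string;
}

interface SpeechRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string; // JSON payload
}

interface TranscriptionRequest {
  url: string;
  method: "POST";
  // Multipart form fields; the caller attaches the audio under `fileField`.
  fields: Record<string, string>;
  fileField: "file";
}

// Join a base URL and a path without doubling the slash.
function joinUrl(base: string, path: string): string {
  return base.replace(/\/+$/, "") + path;
}

// Text-to-speech: OpenAI-style JSON body.
function buildSpeechRequest(cfg: LocalSpeechConfig, text: string): SpeechRequest {
  return {
    url: joinUrl(cfg.ttsBaseUrl, "/v1/audio/speech"),
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: cfg.ttsModel, voice: cfg.ttsVoice, input: text }),
  };
}

// Speech-to-text: OpenAI-style multipart upload, described as plain data.
function buildTranscriptionRequest(cfg: LocalSpeechConfig): TranscriptionRequest {
  return {
    url: joinUrl(cfg.sttBaseUrl, "/v1/audio/transcriptions"),
    method: "POST",
    fields: { model: cfg.sttModel },
    fileField: "file",
  };
}
```

Keeping the builders pure (no fetch call inside) means the same shapes could be reused from a hub-side adapter or a test harness; the actual HTTP call is left to the caller.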

Important

Honest caveat: Speaches + Chatterbox handle audio in and out, but a full voice agent also needs an LLM for the conversational brain. ElevenLabs ConvAI bundles one. A local equivalent would need a local LLM (Ollama or similar) wired in. I haven't fully proven that end-to-end yet -- I know the audio plumbing works, the LLM integration via Pipecat is the part I'm still figuring out. Anyone who's done this before, I'd genuinely like to know what you used.

The appeal of local isn't just cost -- it's latency and offline operation on a tailnet.

Fleet command centre (the bigger idea)

This is the one I'm floating specifically to get friction on, because I've been living inside my own head about it with only an AI to sanity-check me, and that's not a reliable signal.

The use case: I've got four or five Claude/Gemini/Codex sessions running. One hits a permission request or goes blocked. Right now I have to get my phone out, find the right session, read the screen, tap approve. Fine, but it breaks flow.

What I want instead is one voice conversation that spans all active sessions -- an attention layer that routes agent interrupts to my earbuds by priority:

Permission requests (highest)
Blocked sessions
Completion summaries
Actively-working sessions (stay silent)

Completely hands-free. I'm in the garden, agents are working, the ones that need me find me.
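As a rough sketch of that priority order (not existing Hapi code): the event shape, the field names and the tie-break by timestamp are all assumptions of mine, there only to make the ordering concrete.

```typescript
// Hypothetical event model for illustration; none of these names exist in Hapi.
type AttentionKind = "permission_request" | "blocked" | "completion" | "working";
type SpokenKind = Exclude<AttentionKind, "working">;

interface AttentionEvent {
  sessionId: string;
  kind: AttentionKind;
  summary: string;
  at: number; // millisecond timestamp
}

type SpokenEvent = AttentionEvent & { kind: SpokenKind };

// Lower number = more urgent. "working" has no entry: those sessions stay silent.
const PRIORITY: Record<SpokenKind, number> = {
  permission_request: 0,
  blocked: 1,
  completion: 2,
};

function isSpoken(e: AttentionEvent): e is SpokenEvent {
  return e.kind !== "working";
}

// Drop silent sessions, then order by urgency, oldest first within a level.
function speakingOrder(events: AttentionEvent[]): SpokenEvent[] {
  return events
    .filter(isSpoken)
    .sort((a, b) => PRIORITY[a.kind] - PRIORITY[b.kind] || a.at - b.at);
}
```

Whether ties should go oldest-first or newest-first is exactly the kind of thing I'd like opinions on.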

Note

This only makes economic sense with local TTS. If five sessions are all firing cloud TTS on every completion, the bill adds up fast. With a local Chatterbox instance it's effectively free. So the local speech stack (phase above) is load-bearing for this one.

I'm going to build this regardless -- it scratches a very specific itch and I need to see it through. What I'm asking for is genuine pushback: things I haven't thought of, reasons it won't work the way I'm imagining, prior art I should look at. I've been getting a lot of "great idea" from my AI collaborator and not enough "here's why that's harder than you think."

Specific questions I don't have answers to

Tip

These are genuine unknowns -- not rhetorical. If you've solved any of these I'd love to hear what you actually did, not just what sounds right in theory.

Local stack end-to-end: Anyone running local TTS/STT in a home lab context -- what does your stack actually look like, LLM included? Speaches + Chatterbox + ??? for the conversational brain?
Attention/priority model: For multi-session routing -- does this feel like the right shape, or is there a simpler mental model I'm missing?
Provider interface shape: Since I'm going to be proposing a concrete VoiceTransportProvider as part of taking PR #401 (feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime) forward, any strong opinions on what it needs to expose before I write it up?

heavygee · 2026-05-25T18:07:08Z

heavygee
May 25, 2026
Collaborator Author

Update: #401 rebased and filed as #692

Moved faster than expected on the provider abstraction piece. PR #401 from @Overbaker had been sitting since late April, so I rebased it onto current main and filed it as #692.

The rebase resolved conflicts with 135 upstream commits (scheduling, conversation outline, the new composer enter-behaviour setting from #586, etc.). A couple of things changed from the original:

HappyComposer enter key: kept upstream's configurable setting (#586, feat(web): add configurable enter behavior setting) rather than the hard-coded Ctrl+Enter change -- the setting covers both behaviours and is better for all users
Test runner: converted the Gemini audio tests from bun:test to vitest to match the web package

Everything else is @Overbaker and @TennyDDDD's work. Left a note on the original PR so they know.

All tests passing, TypeScript clean. From here the next thing is putting a concrete shape on the VoiceTransportProvider interface -- which is the question I raised above about what it needs to expose.


heavygee · 2026-06-02T04:27:16Z

heavygee
Jun 2, 2026
Collaborator Author

Update: PRs #692 and #743 both green

Quick status thread since the last update.

PR #692 — Gemini Live + Qwen Realtime (pluggable voice backend)

The provider abstraction question from the OP now has a concrete answer. The VoiceAdapter contract in hub/src/web/voiceAdapters/base.ts exposes:

connect(config) / disconnect() — lifecycle
sendText(text) — inject a user turn mid-session
buildSessionConfig(personality, language?) — produce the backend-specific session payload
proxyWebSocket(browserWs, req) — for proxy-mode backends (Qwen, Gemini) where the hub relays the WS

ElevenLabs stays the default and the interface is backward-compatible. Adding a new backend is one file + registration in the factory.
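For readers who haven't opened the PR, a paraphrased sketch of the contract's shape follows. Only the four method groups come from the list above; the parameter and return types, the optional marker on proxyWebSocket, the registry helpers and the toy EchoAdapter are all hypothetical stand-ins, not what base.ts actually contains.

```typescript
// Placeholder types; the real ones live in the hub codebase.
type SessionConfig = Record<string, unknown>;
interface Personality {
  identity?: string;
  style?: string;
}

interface VoiceAdapter {
  readonly id: string;
  connect(config: SessionConfig): Promise<void>; // async is an assumption
  disconnect(): Promise<void>;
  sendText(text: string): void;
  buildSessionConfig(personality: Personality, language?: string): SessionConfig;
  // Assumed optional here: only proxy-mode backends would implement it.
  proxyWebSocket?(browserWs: unknown, req: unknown): void;
}

// "One file + registration in the factory", sketched as a simple registry.
const registry = new Map<string, () => VoiceAdapter>();

function registerAdapter(id: string, factory: () => VoiceAdapter): void {
  registry.set(id, factory);
}

function createAdapter(id: string): VoiceAdapter {
  const factory = registry.get(id);
  if (!factory) throw new Error(`no voice adapter registered for "${id}"`);
  return factory();
}

// Toy backend used only to show what a new adapter file would contain.
class EchoAdapter implements VoiceAdapter {
  readonly id = "echo";
  private connected = false;
  readonly sent: string[] = [];

  async connect(_config: SessionConfig): Promise<void> {
    this.connected = true;
  }
  async disconnect(): Promise<void> {
    this.connected = false;
  }
  sendText(text: string): void {
    if (!this.connected) throw new Error("not connected");
    this.sent.push(text);
  }
  buildSessionConfig(personality: Personality, language?: string): SessionConfig {
    return { instructions: personality.identity ?? "", language: language ?? "auto" };
  }
}

registerAdapter("echo", () => new EchoAdapter());
```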

What changed since the rebase:

Qwen tool shape: tools are now sent in the flat Realtime format ({type, name, description, parameters}) rather than the nested chat-completions format the original had
Protocol ordering: hub now buffers its session.update until DashScope fires session.created, then relays in the right order — browser gets session.created first, then session.updated. Qwen's protocol requires this.
Hub-owned session setup: hub sends the initial session config (voice, language, tools), browser can only update instructions. A frame filter (isQwenSafeClientFrame) closes the connection if the browser tries to override voice or tool config — fixes the proxy security finding from the bot review.
Language handling: buildVoiceLanguageBlock(language?) replaces the zh-only ternary. Three cases: no language arg → auto-detect from speech; zh → full Chinese block; any other code → "Always respond in [Language]". Applies to both Qwen and Gemini.
sendTextMessage: Qwen requires a user turn before response.create. Fixed to send conversation.item.create first.
Model/voice defaults: qwen3.5-omni-flash-realtime (was qwen3-omni-flash-realtime), Tina as default voice.
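Two of the items above are easier to see in code than in prose. This is a reconstruction from the descriptions, not a copy of the PR: the function names toRealtimeTool and languageBlockSketch, the language-name table, and the instruction wording (including the placeholder for the Chinese block) are mine.

```typescript
type JsonSchema = Record<string, unknown>;

// Nested chat-completions style: the function details sit under `function`.
interface ChatCompletionsTool {
  type: "function";
  function: { name: string; description: string; parameters: JsonSchema };
}

// Flat Realtime style: the same fields live at the top level.
interface RealtimeTool {
  type: "function";
  name: string;
  description: string;
  parameters: JsonSchema;
}

function toRealtimeTool(tool: ChatCompletionsTool): RealtimeTool {
  const { name, description, parameters } = tool.function;
  return { type: "function", name, description, parameters };
}

// Hypothetical lookup table; the real code may resolve names differently.
const LANGUAGE_NAMES: Record<string, string> = {
  en: "English",
  fr: "French",
  de: "German",
  ja: "Japanese",
};

// Three cases: no language -> auto-detect; zh -> dedicated block; else -> fixed language.
function languageBlockSketch(language?: string): string {
  if (!language) {
    return "Detect the language the user is speaking and respond in it.";
  }
  if (language === "zh") {
    return "[full Chinese-language instruction block goes here]";
  }
  const name = LANGUAGE_NAMES[language] ?? language;
  return `Always respond in ${name}.`;
}
```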

Dogfooding results:

I got a DashScope API key (signed up for Alibaba Cloud Model Studio) and ran it live. Qwen Realtime is working. Built a test harness (scripts/voice-test/) that confirmed all 6 Qwen voices (Tina, Cherry, Ethan, Chelsie, Dylan, Serena) are valid on qwen3.5-omni-flash-realtime.

Ran 11 paralinguistic scenarios (laugh, whisper, sing, cry, sigh, etc.) — all produced audio. Open question: whether bracketed natural-language directives like [laugh warmly] in user turns actually trigger audio paralinguistic behaviour on this model, or whether the model describes the action in text. qwen3.5-omni-flash-realtime is Chinese-first and there's some debate about whether DashScope-specific instruction formats or tokens are the right approach for evocative audio. Not blocking the PR, but worth a note for anyone picking this up.

Status: both CI checks green, ready for upstream review.

PR #743 — Voice settings UX redesign

While dogfooding #692 I found the existing "Advanced voice settings" disclosure was mixing too many concerns. Redesigned the settings page into five flat sections:

Connection & provider — backend chooser (if multiple configured), language dropdown, voice picker
How It Sounds — ElevenLabs sliders (stability, similarity, style) collapsed under "Tone & pace"; Gemini affective dialog toggle; Qwen shows a capability note instead of an empty accordion
How it behaves — Opening mode (Greet me / Brief me, replaces the proactive toggle); Response length (Brief / Balanced / Detailed, appended to the composed system prompt)
Persona & instructions — identity field, character & speaking style, speaking style preset
Advanced — wire budget, prompt size, context diagnostics (debug info out of the main flow)

ResponseLengthOption ('brief' | 'balanced' | 'detailed') is a new field in VoicePersonalityPreferences. Balanced is the default and adds nothing to the prompt. Brief and Detailed append explicit length instructions to the composed voice system prompt.
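Sketched in code, the behaviour is roughly as below. The type matches the one named above; the instruction wording and the helper name applyResponseLength are illustrative, not the strings or function the PR ships.

```typescript
type ResponseLengthOption = "brief" | "balanced" | "detailed";

// Illustrative wording only. `null` means "append nothing".
const LENGTH_INSTRUCTIONS: Record<ResponseLengthOption, string | null> = {
  brief: "Keep spoken responses to one or two short sentences.",
  balanced: null,
  detailed: "Give thorough spoken responses and include relevant detail.",
};

// Balanced (the default) leaves the composed prompt untouched.
function applyResponseLength(
  basePrompt: string,
  length: ResponseLengthOption = "balanced",
): string {
  const extra = LENGTH_INSTRUCTIONS[length];
  return extra === null ? basePrompt : `${basePrompt}\n\n${extra}`;
}
```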

Status: both CI checks green, ready for upstream review.

Still open from the OP

Local speech stack -- Pipecat Flow (probably) + Speaches + Chatterbox are still on the roadmap. The provider interface is now in place, so a LocalVoiceAdapter is the next logical entry point once #692 is merged. The LLM-wiring question (Ollama for the conversational brain) I still haven't fully resolved.

Fleet command centre — still dependent on local TTS being effectively free. Nothing has changed my view that this is the right shape; I just need the local stack first.


heavygee · 2026-06-02T22:59:41Z

heavygee
Jun 2, 2026
Collaborator Author

Dogfooding notes — and an architectural observation @tiann

I've now spent meaningful time with all three backends (ElevenLabs ConvAI, Gemini Live, Qwen Realtime) and wanted to share what I found before asking whether there's anything about the implementation you'd want shaped differently.

What works well across the board

All three voices respond to the identity and character layers in a way that's genuinely useful — the persona system holds up across backends. In practice most users will never touch it, but it's good that it's there for the ones who will. The Qwen voices in particular surprised me: they're capable and notably resistant to fake interruptions, which matters for a coding assistant context where you want the agent to finish a thought. The "Greet me / Brief me" split for session opening turned out to be a real preference rather than a trivial toggle — I found myself wanting one and then the other depending on context, so exposing that choice was the right call.

The ElevenLabs sliders are arguably more knobs than most people need, but that's ElevenLabs' whole product positioning and hiding them would feel wrong for users who are already invested in that ecosystem. For the other backends, what's exposed is exactly enough.

Where I've landed architecturally

Using all three in real sessions surfaced something I think is worth naming: a conversational voice agent is in the wrong place when it lives inside an individual agent session.

The current shape is: you open a coding session → you get a voice button. But I can have a full conversation with the voice agent without ever sending a message to the underlying coding agent. That's the mismatch. Worker agents (Claude, Gemini CLI, Codex, OpenCode) operate at letter cadence — they receive instructions, they work, they report back. They don't chat. Putting a real-time conversational agent inside a worker session conflates two different interaction modes.

What I'm coming to realize:

Inside a worker session: voice-to-text for composing prompts, and text-to-speech to read out a completion summary. For that use case, full ConvAI with ElevenLabs or Gemini Live is overkill — a simple TTS call is enough.
Above the sessions: a persistent conversational agent that has visibility across all running sessions — what's blocked, what completed, what needs a decision. That's where real-time bidirectional voice belongs. That's what I described in the original post as the "fleet command centre."

So the voice work in #692/#743 isn't wrong — it's right for what it is, and the provider abstraction will be load-bearing for the control-plane agent when we get there. But I wanted to name the distinction clearly: we now have more first-class voice infrastructure, and the interesting question is where the conversation actually lives.

Is there anything about the provider abstraction or the proxy design in #692 you'd want shaped differently before it lands?


tiann · 2026-06-04T10:11:07Z

tiann
Jun 4, 2026
Maintainer

Thanks for writing this up — I think the distinction you made is the right one.

For a single worker session, voice should probably stay lightweight: speech-to-text for composing prompts, and text-to-speech for reading summaries / important results. A full realtime conversational agent inside each coding session feels too heavy, and also mixes two different interaction modes.

The more interesting direction is the control-plane layer above sessions: permission requests, blocked sessions, completions, and other attention events routed through one voice interface. That matches HAPI’s architecture better.

#692 looks like the right foundation for that. I don’t have a major objection to the provider abstraction/proxy shape now that it has landed. The main things I’d want to preserve going forward:

provider-specific details stay isolated in adapters
browser clients should not be able to override hub-owned voice/tool/session config
the API should not assume every backend is a full ConvAI-style agent
local TTS/STT should be additive, not required for existing users

For the fleet command centre, I’d suggest starting smaller than a full conversational agent: first build an attention/event layer with priorities like permission request > blocked > completion. Once that model is solid, voice can become one transport for it.
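One possible reading of "voice as one transport", sketched with hypothetical names throughout (AttentionLayer, AttentionTransport and minKind are not existing HAPI APIs): the layer only knows about events and urgency, and each transport declares the least urgent kind it wants to receive.

```typescript
type AttentionKind = "permission_request" | "blocked" | "completion";

// Higher number = more urgent: permission request > blocked > completion.
const URGENCY: Record<AttentionKind, number> = {
  permission_request: 3,
  blocked: 2,
  completion: 1,
};

interface AttentionEvent {
  sessionId: string;
  kind: AttentionKind;
  summary: string;
}

interface AttentionTransport {
  name: string;
  minKind: AttentionKind; // least urgent kind this transport still wants
  deliver(event: AttentionEvent): void;
}

class AttentionLayer {
  private readonly transports: AttentionTransport[] = [];

  addTransport(transport: AttentionTransport): void {
    this.transports.push(transport);
  }

  // Returns the names of the transports that received the event.
  publish(event: AttentionEvent): string[] {
    const delivered: string[] = [];
    for (const t of this.transports) {
      if (URGENCY[event.kind] >= URGENCY[t.minKind]) {
        t.deliver(event);
        delivered.push(t.name);
      }
    }
    return delivered;
  }
}
```

With this shape, a notification list could subscribe at minKind "completion" while a voice transport subscribes at "blocked", so completions never interrupt by voice unless that threshold is lowered.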

Also noted that #743 is currently conflicting after #692 landed — if you can rebase it, I’m happy to review it in that updated context.


heavygee · 2026-06-04T20:59:24Z

heavygee
Jun 4, 2026
Collaborator Author

Thanks @tiann — glad the distinction landed. Completely agree on starting with the attention/event layer before full conversational agent; building the priority model (permission > blocked > completion) first gives a solid foundation to reason about before adding voice as a transport.

The four constraints you listed for the provider abstraction are exactly what we've been working to — the proxy security work in #692 was specifically about ensuring browser clients can't override hub-owned config.
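For anyone skimming, the idea behind that frame filter can be sketched as follows. This is a simplified illustration, not the isQwenSafeClientFrame from the PR: the allowlist of a single `instructions` key follows the description earlier in the thread, while passing every other frame type through is an assumption the real filter may well tighten.

```typescript
// Assumed policy: a browser-sent session.update may carry `instructions` only.
const ALLOWED_SESSION_KEYS = new Set(["instructions"]);

// Returns false for anything the hub should reject (and close the socket on).
function isSafeClientFrame(raw: string): boolean {
  let frame: unknown;
  try {
    frame = JSON.parse(raw);
  } catch {
    return false; // not JSON at all
  }
  if (typeof frame !== "object" || frame === null) return false;

  const { type, session } = frame as { type?: unknown; session?: unknown };
  if (typeof type !== "string") return false;

  // Frames other than session.update pass through in this sketch.
  if (type !== "session.update") return true;

  if (typeof session !== "object" || session === null || Array.isArray(session)) {
    return false;
  }
  // Any key outside the allowlist (voice, tools, ...) makes the frame unsafe.
  return Object.keys(session).every((key) => ALLOWED_SESSION_KEYS.has(key));
}
```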

On #743: Already rebased onto the post-merge main this morning (06-04 ~11:00 UTC). It's 8 commits ahead, 0 behind, CI green, 0 open threads. Should be clean to review now. The one open note is a bot finding about composed prompts in WS query strings — we left a reasoned response in the thread explaining the current truncation cap, and filed it as a follow-on consideration (issue #802, now closed in favour of tracking in the PR discussion).

