Replies: 5 comments
Update: #401 rebased and filed as #692
Moved faster than expected on the provider abstraction piece. PR #401 from @Overbaker had been sitting since late April, so I rebased it onto current
The rebase resolved conflicts with 135 upstream commits (scheduling, conversation outline, the new composer enter-behaviour setting from #586, etc.). A couple of things changed from the original:
Everything else is @Overbaker and @TennyDDDD's work. Left a note on the original PR so they know. All tests passing, TypeScript clean. From here the next thing is putting a concrete shape on the
Update: PRs #692 and #743 both green
Quick status thread since the last update.
PR #692: Gemini Live + Qwen Realtime (pluggable voice backend)
The provider abstraction question from the OP now has a concrete answer. The
ElevenLabs stays the default and the interface is backward-compatible. Adding a new backend is one file + registration in the factory. What changed since the rebase:
Dogfooding results: I got a DashScope API key (signed up for Alibaba Cloud Model Studio) and ran it live. Qwen Realtime is working. Built a test harness (
Ran 11 paralinguistic scenarios (laugh, whisper, sing, cry, sigh, etc.), and all produced audio. Open question: whether bracketed natural-language directives like
Status: both CI checks green, ready for upstream review.
PR #743: Voice settings UX redesign
While dogfooding #692 I found the existing "Advanced voice settings" disclosure was mixing too many concerns. Redesigned the settings page into five flat sections:
Status: both CI checks green, ready for upstream review.
Still open from the OP
Local speech stack: Pipecat Flow (probably) + Speaches + Chatterbox are still on the roadmap. The provider interface is now in place, so a
Fleet command centre: still dependent on local TTS being effectively free. Nothing has changed my view that this is the right shape; I just need the local stack first.
Dogfooding notes, and an architectural observation
@tiann
I've now spent meaningful time with all three backends (ElevenLabs ConvAI, Gemini Live, Qwen Realtime) and wanted to share what I found before asking whether there's anything about the implementation you'd want shaped differently.
What works well across the board
All three voices respond to the identity and character layers in a way that's genuinely useful: the persona system holds up across backends. In practice most users will never touch it, but it's good that it's there for the ones who will. The Qwen voices in particular surprised me: they're capable and notably resistant to fake interruptions, which matters for a coding assistant context where you want the agent to finish a thought.
The "Greet me / Brief me" split for session opening turned out to be a real preference rather than a trivial toggle. I found myself wanting one and then the other depending on context, so exposing that choice was the right call.
The ElevenLabs sliders are arguably more knobs than most people need, but that's ElevenLabs' whole product positioning and hiding them would feel wrong for users who are already invested in that ecosystem. For the other backends, what's exposed is exactly enough.
Where I've landed architecturally
Using all three in real sessions surfaced something I think is worth naming: a conversational voice agent is in the wrong place when it lives inside an individual agent session.
The current shape is: you open a coding session → you get a voice button. But I can have a full conversation with the voice agent without ever sending a message to the underlying coding agent. That's the mismatch.
Worker agents (Claude, Gemini CLI, Codex, OpenCode) operate at letter cadence: they receive instructions, they work, they report back. They don't chat. Putting a real-time conversational agent inside a worker session conflates two different interaction modes.
What I'm coming to realize:
So the voice work in #692/#743 isn't wrong. It's right for what it is, and the provider abstraction will be load-bearing for the control-plane agent when we get there. But I wanted to name the distinction clearly: we now have more first-class voice infrastructure, and the interesting question is where the conversation actually lives.
Is there anything about the provider abstraction or the proxy design in #692 you'd want shaped differently before it lands?
Thanks for writing this up. I think the distinction you made is the right one.
For a single worker session, voice should probably stay lightweight: speech-to-text for composing prompts, and text-to-speech for reading summaries / important results. A full realtime conversational agent inside
The more interesting direction is the control-plane layer above sessions: permission requests, blocked sessions, completions, and other attention events routed through one voice interface. That matches HAPI’s
#692 looks like the right foundation for that. I don’t have a major objection to the provider abstraction/proxy shape now that it has landed. The main things I’d want to preserve going forward:
For the fleet command centre, I’d suggest starting smaller than a full conversational agent: first build an attention/event layer with priorities like permission request > blocked > completion. Once that model is
Also noted that #743 is currently conflicting after #692 landed. If you can rebase it, I’m happy to review it in that updated context.
Thanks @tiann, glad the distinction landed. Completely agree on starting with the attention/event layer before a full conversational agent; building the priority model (permission > blocked > completion) first gives a solid foundation to reason about before adding voice as a transport. The four constraints you listed for the provider abstraction are exactly what we've been working to: the proxy security work in #692 was specifically about ensuring browser clients can't override hub-owned config.
On #743: already rebased onto the post-merge main this morning (06-04 ~11:00 UTC). It's 8 commits ahead, 0 behind, CI green, 0 open threads. Should be clean to review now. The one open note is a bot finding about composed prompts in WS query strings. We left a reasoned response in the thread explaining the current truncation cap, and filed it as a follow-on consideration (issue #802, now closed in favour of tracking in the PR discussion).
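For anyone following along, here is a rough sketch of the idea that browser clients can't override hub-owned config, plus a length cap on client-supplied prompt text. It is not the code in #692: the field names (backend, model, apiKey, voiceId, prompt), the function name, and the 2000-character cap are all assumptions made up for illustration.

```typescript
// Hypothetical shapes; the real config in #692 may look quite different.
interface HubVoiceConfig {
  backend: string;        // hub-owned
  model: string;          // hub-owned
  apiKey: string;         // hub-owned, never taken from the client
  defaultVoiceId: string; // fallback when the client sends nothing usable
}

interface ResolvedVoiceConfig {
  backend: string;
  model: string;
  apiKey: string;
  voiceId: string;
  prompt: string;
}

// Assumed cap; the thread only says a truncation cap exists, not its value.
const MAX_PROMPT_CHARS = 2000;

// Build the session config from hub settings, reading only an allowlist of
// client parameters (voiceId, prompt). Anything else the client sends is ignored.
function resolveVoiceConfig(
  hub: HubVoiceConfig,
  clientParams: Record<string, string | undefined>,
): ResolvedVoiceConfig {
  const voiceId = clientParams["voiceId"];
  const prompt = clientParams["prompt"] ?? "";
  return {
    backend: hub.backend,
    model: hub.model,
    apiKey: hub.apiKey,
    voiceId: voiceId !== undefined && voiceId !== "" ? voiceId : hub.defaultVoiceId,
    prompt: prompt.slice(0, MAX_PROMPT_CHARS),
  };
}
```

The design choice is an allowlist rather than a merge: spreading client parameters over the hub config would let any new hub-owned field be overridden by default.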
Hey all
I've been running Hapi for a little while as my daily remote for Claude Code and Gemini, and I've been doing a lot of thinking about the voice interface -- where it is, where it could go. Figured an Ideas thread was the right place rather than a stream of issues.
Where we just got to
PR #690 landed today -- a voice picker in Settings that fetches your account's voices dynamically from ElevenLabs (so your instant voice clones show up alongside pre-made ones), with a play button on each so you can hear before you choose. Small thing, but it felt wrong to hardcode a list when ElevenLabs accounts already carry personalised voices.
Issue #686 tracks it.
What I've been thinking about next, roughly in order
Provider abstraction
The current voice path is ElevenLabs ConvAI all the way down. That's fine if you have a key, but I'd like a VoiceTransportProvider interface that lets different backends plug in. ElevenLabs stays the default, nothing breaks for existing users.
Which brings me to PR #401 from @Overbaker -- Gemini Live and Qwen Realtime (both cloud APIs, different vendors to ElevenLabs) backends, runtime backend discovery via GET /api/voice/backend, lazy loading of alternative sessions. The architecture is exactly right. The problem is it's been sitting with unresolved HAPI Bot findings since late April with no activity from the author.
Note
I've declared intent to take #401 over -- rebase it onto current main, work through the open issues. Forgiveness over permission at this point. If @Overbaker reappears, great, I'll coordinate; if not, the work needs to move.
Update: rebase filed as PR #692.
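To make the provider idea concrete, here is a minimal sketch of what a pluggable interface with a registry and lazy instantiation could look like. Only the name VoiceTransportProvider comes from this thread; the method names, the registry class, and the backend ids are assumptions for illustration, not the shape that landed in #692.

```typescript
type VoiceBackendId = string;

interface VoiceSessionHandle {
  stop(): void;
}

// Hypothetical surface: a real provider would also carry audio and events.
interface VoiceTransportProvider {
  readonly id: VoiceBackendId;
  start(opts: { voiceId?: string }): VoiceSessionHandle;
}

class VoiceProviderRegistry {
  private readonly defaultId: VoiceBackendId;
  private factories = new Map<VoiceBackendId, () => VoiceTransportProvider>();
  private instances = new Map<VoiceBackendId, VoiceTransportProvider>();

  constructor(defaultId: VoiceBackendId) {
    this.defaultId = defaultId;
  }

  // Adding a backend is one factory plus one register() call.
  register(id: VoiceBackendId, factory: () => VoiceTransportProvider): void {
    this.factories.set(id, factory);
  }

  // Ids a discovery endpoint could report back to the client.
  available(): VoiceBackendId[] {
    return [...this.factories.keys()];
  }

  // Lazy: a factory runs only the first time its backend is requested.
  // Unknown or missing ids fall back to the default backend.
  resolve(id?: VoiceBackendId): VoiceTransportProvider {
    const wanted = id !== undefined && this.factories.has(id) ? id : this.defaultId;
    const cached = this.instances.get(wanted);
    if (cached) return cached;
    const factory = this.factories.get(wanted);
    if (!factory) throw new Error(`No voice backend registered as "${wanted}"`);
    const provider = factory();
    this.instances.set(wanted, provider);
    return provider;
  }
}

// Stand-in provider used only to exercise the registry.
function stubProvider(id: VoiceBackendId): VoiceTransportProvider {
  return { id, start: () => ({ stop: () => {} }) };
}
```

Usage would be along the lines of creating a registry with "elevenlabs" as the default, registering the other backends, and calling resolve() with whatever the user picked. Existing users who never pick anything keep getting the default.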
Local speech stack
Once a provider interface exists, I'd want to add a local backend. I run Speaches (STT) and Chatterbox (TTS) already on my home server -- both expose OpenAI-compatible endpoints. Pipecat is the framework I'm looking at for orchestrating the audio pipeline.
Important
Honest caveat: Speaches + Chatterbox handle audio in and out, but a full voice agent also needs an LLM for the conversational brain. ElevenLabs ConvAI bundles one. A local equivalent would need a local LLM (Ollama or similar) wired in. I haven't fully proven that end-to-end yet -- I know the audio plumbing works, the LLM integration via Pipecat is the part I'm still figuring out. Anyone who's done this before, I'd genuinely like to know what you used.
The appeal of local isn't just cost -- it's latency and offline operation on a tailnet.
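The three-stage shape I have in mind (speech in, LLM in the middle, speech out) can be sketched without committing to any particular framework. This is not Pipecat's API and makes no claims about Speaches or Chatterbox beyond what's above; the stage types and the runVoiceTurn name are assumptions, and the stages are injected as plain async functions so that an HTTP client for each service could be dropped in later.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Each stage is an injected async function, so the pipeline does not care
// whether it talks to a local server, a cloud API, or a test stub.
type SpeechToText = (audio: Uint8Array) => Promise<string>;
type ChatModel = (history: ChatMessage[]) => Promise<string>;
type TextToSpeech = (text: string) => Promise<Uint8Array>;

interface LocalVoiceStack {
  stt: SpeechToText;
  llm: ChatModel;
  tts: TextToSpeech;
}

// One conversational turn: transcribe, ask the model, synthesise the reply.
// Returns the reply audio and the updated history (the input is not mutated).
async function runVoiceTurn(
  stack: LocalVoiceStack,
  history: ChatMessage[],
  audioIn: Uint8Array,
): Promise<{ audioOut: Uint8Array; history: ChatMessage[] }> {
  const userText = await stack.stt(audioIn);
  const withUser: ChatMessage[] = [...history, { role: "user", content: userText }];
  const reply = await stack.llm(withUser);
  const audioOut = await stack.tts(reply);
  const updated: ChatMessage[] = [...withUser, { role: "assistant", content: reply }];
  return { audioOut, history: updated };
}

// Stub stack for trying the flow without any servers running.
const stubStack: LocalVoiceStack = {
  stt: async () => "what is session two doing?",
  llm: async (h) => `You asked: ${h[h.length - 1]?.content ?? ""}`,
  tts: async (text) => new Uint8Array(text.length),
};
```

This runs strictly one stage after another, which is the simplest thing that works. A real-time pipeline would stream between stages to cut latency, and that streaming is the part a framework is meant to handle.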
Fleet command centre (the bigger idea)
This is the one I'm floating specifically to get friction on, because I've been living inside my own head about it with only an AI to sanity-check me, and that's not a reliable signal.
The use case: I've got four or five Claude/Gemini/Codex sessions running. One hits a permission request or goes blocked. Right now I have to get my phone out, find the right session, read the screen, tap approve. Fine, but it breaks flow.
What I want instead is one voice conversation that spans all active sessions -- an attention layer that routes agent interrupts to my earbuds by priority:
Completely hands-free. I'm in the garden, agents are working, the ones that need me find me.
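A minimal sketch of the attention layer as a priority queue, separate from any voice transport. The ordering used here (permission request, then blocked, then completion) is the one suggested in the replies; the event fields and class name are assumptions for illustration.

```typescript
type AttentionKind = "permission_request" | "blocked" | "completion";

interface AttentionEvent {
  sessionId: string;
  kind: AttentionKind;
  summary: string;
  at: number; // arrival time, e.g. Date.now()
}

// Lower number = announced sooner.
const PRIORITY: Record<AttentionKind, number> = {
  permission_request: 0,
  blocked: 1,
  completion: 2,
};

class AttentionQueue {
  private events: AttentionEvent[] = [];

  push(event: AttentionEvent): void {
    this.events.push(event);
  }

  get size(): number {
    return this.events.length;
  }

  // Remove and return the most urgent event: highest priority first,
  // oldest first within the same priority.
  next(): AttentionEvent | undefined {
    this.events.sort(
      (a, b) => PRIORITY[a.kind] - PRIORITY[b.kind] || a.at - b.at,
    );
    return this.events.shift();
  }
}
```

Sorting on every next() is fine for a handful of sessions; a heap would only matter at a scale this use case is unlikely to reach. Keeping the queue transport-agnostic means the same events could drive a notification list before any voice output exists.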
Note
This only makes economic sense with local TTS. If five sessions are all firing cloud TTS on every completion, the bill adds up fast. With a local Chatterbox instance it's effectively free. So the local speech stack (phase above) is load-bearing for this one.
I'm going to build this regardless -- it scratches a very specific itch and I need to see it through. What I'm asking for is genuine pushback: things I haven't thought of, reasons it won't work the way I'm imagining, prior art I should look at. I've been getting a lot of "great idea" from my AI collaborator and not enough "here's why that's harder than you think."
Specific questions I don't have answers to
Tip
These are genuine unknowns -- not rhetorical. If you've solved any of these I'd love to hear what you actually did, not just what sounds right in theory.
VoiceTransportProvider as part of taking #401 (feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime) forward, any strong opinions on what it needs to expose before I write it up?