Problem or Motivation
OpenMAIC currently has a complete custom voice workflow for VoxCPM2: users can create prompt voices, clone voices from reference audio, generate auto voices from an agent persona, preview voices, delete saved voices, and reuse those voices from the shared Agent Bar voice pool.
Other TTS providers now expose similar capabilities, but OpenMAIC treats most of them only as static voice lists:
- ElevenLabs supports voice listing, Instant Voice Cloning, Voice Design previews, designed voice creation, voice settings, and deletion.
- MiniMax supports voice cloning, account voice listing, and voice deletion.
- Qwen / Alibaba Bailian supports voice cloning and voice design through dedicated VC/VD model families, but OpenMAIC currently only exposes the regular TTS models.
- GLM supports GLM-TTS-Clone, which can create a custom voice ID from short reference audio.
- Doubao, OpenAI, and Azure have custom voice capabilities, but with more account, consent, or training/deployment constraints.
This creates an inconsistent user experience: VoxCPM users get reusable managed voices, while users of other providers must either use built-in voices or manually manage provider-side voice IDs outside OpenMAIC.
Proposed Solution
Add a cross-provider TTS voice management capability layer, then implement provider-specific adapters incrementally.
1. Add a provider voice capability abstraction
Introduce a small adapter interface for providers that support voice management:
type VoiceKind = 'system' | 'library' | 'prompt' | 'clone' | 'designed';
type VoiceStatus = 'ready' | 'pending' | 'failed' | 'unsupported';

interface ProviderVoice {
  providerId: TTSProviderId;
  id: string;
  name: string;
  kind: VoiceKind;
  status?: VoiceStatus;
  targetModelId?: string;
  language?: string;
  gender?: 'male' | 'female' | 'neutral';
  previewUrl?: string;
  description?: string;
  metadata?: Record<string, unknown>;
}

interface TTSVoiceManagementAdapter {
  listVoices?(config: TTSProviderConfigInput): Promise<ProviderVoice[]>;
  createCloneVoice?(input: CloneVoiceInput): Promise<ProviderVoice>;
  createDesignedVoice?(input: DesignedVoiceInput): Promise<ProviderVoice | ProviderVoice[]>;
  deleteVoice?(voiceId: string, input: DeleteVoiceInput): Promise<void>;
  resolveVoiceForSynthesis?(voice: ProviderVoice): Promise<TTSSynthesisVoiceRef>;
}
The adapter should preserve the current VoxCPM2 behavior while allowing cloud providers to return provider-managed voices.
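Because every adapter method is optional, capability gating could be derived from which methods an adapter actually implements. The following is a minimal sketch under that assumption; `VoiceCapabilities` and `describeCapabilities` are hypothetical names, and the input types are simplified to `unknown` so the snippet compiles on its own.

```typescript
// Placeholder types so this sketch is self-contained; the real ones live in OpenMAIC.
type TTSProviderId = string;
interface ProviderVoice {
  providerId: TTSProviderId;
  id: string;
  name: string;
}

// Trimmed copy of the proposed adapter, with input types simplified to `unknown`.
interface TTSVoiceManagementAdapter {
  listVoices?(config: unknown): Promise<ProviderVoice[]>;
  createCloneVoice?(input: unknown): Promise<ProviderVoice>;
  createDesignedVoice?(input: unknown): Promise<ProviderVoice | ProviderVoice[]>;
  deleteVoice?(voiceId: string, input: unknown): Promise<void>;
}

// Hypothetical flags the UI could read instead of branching on provider IDs.
interface VoiceCapabilities {
  canList: boolean;
  canClone: boolean;
  canDesign: boolean;
  canDelete: boolean;
}

// A capability is considered available only if the adapter implements the method.
function describeCapabilities(adapter?: TTSVoiceManagementAdapter): VoiceCapabilities {
  return {
    canList: typeof adapter?.listVoices === 'function',
    canClone: typeof adapter?.createCloneVoice === 'function',
    canDesign: typeof adapter?.createDesignedVoice === 'function',
    canDelete: typeof adapter?.deleteVoice === 'function',
  };
}

// Example: an adapter that can only create clones, with no list or delete support.
const cloneOnlyAdapter: TTSVoiceManagementAdapter = {
  createCloneVoice: () =>
    Promise.resolve({ providerId: 'example-provider', id: 'voice-1', name: 'Demo clone' }),
};

const cloneOnlyCapabilities = describeCapabilities(cloneOnlyAdapter);
```

With this shape, a settings component could hide the "Clone voice" button whenever `canClone` is false, and a provider without any adapter simply reports every capability as unavailable.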
2. Build a shared voice pool UI
Generalize the current VoxCPM voice manager into a shared provider-aware voice pool:
- Show system, library, cloned, designed, prompt, and pending voices with clear labels.
- Support preview where the provider allows it.
- Support provider-side delete separately from local removal.
- Filter incompatible voices by provider/model when a voice is bound to a target model.
- Keep browser-local metadata only as a cache or UX helper for cloud voices; refresh from provider list APIs where possible.
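The filtering and refresh behavior above can be sketched as two small pure functions. This is an illustration only: `isVoiceCompatible` and `refreshFromProvider` are hypothetical names, compatibility is assumed to be an exact `targetModelId` match, and the provider and model IDs in the sample data are made up.

```typescript
type VoiceKind = 'system' | 'library' | 'prompt' | 'clone' | 'designed';

// Reduced version of the proposed ProviderVoice shape.
interface ProviderVoice {
  providerId: string;
  id: string;
  name: string;
  kind: VoiceKind;
  targetModelId?: string;
  description?: string;
}

// A voice is shown only for its own provider, and only for its bound model if it has one.
function isVoiceCompatible(voice: ProviderVoice, providerId: string, modelId: string): boolean {
  if (voice.providerId !== providerId) return false;
  return voice.targetModelId === undefined || voice.targetModelId === modelId;
}

// Treat the provider list as the source of truth. Cached entries only fill in
// fields the provider did not return, and cached voices missing remotely are dropped.
function refreshFromProvider(remote: ProviderVoice[], cached: ProviderVoice[]): ProviderVoice[] {
  const keyOf = (v: ProviderVoice): string => `${v.providerId}:${v.id}`;
  const cachedByKey = new Map<string, ProviderVoice>();
  for (const voice of cached) cachedByKey.set(keyOf(voice), voice);
  return remote.map((voice) => {
    const local = cachedByKey.get(keyOf(voice));
    return local ? { ...local, ...voice } : voice;
  });
}

// Sample data with made-up IDs.
const remoteVoices: ProviderVoice[] = [
  { providerId: 'example', id: 'v1', name: 'Narrator', kind: 'clone', targetModelId: 'model-a' },
  { providerId: 'example', id: 'v2', name: 'Guide', kind: 'system' },
];
const cachedVoices: ProviderVoice[] = [
  { providerId: 'example', id: 'v1', name: 'Old name', kind: 'clone', targetModelId: 'model-a', description: 'Local note' },
  { providerId: 'example', id: 'v3', name: 'Deleted remotely', kind: 'clone' },
];

const refreshedVoices = refreshFromProvider(remoteVoices, cachedVoices);
const visibleForModelB = refreshedVoices.filter((v) => isVoiceCompatible(v, 'example', 'model-b'));
```

In this sample, the stale cached voice `v3` disappears after the refresh, and the model-bound clone `v1` is hidden when a different model is selected. Providers that bind voices to a model family rather than a single model would need a looser match than the exact comparison assumed here.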
3. Implement providers in priority order
P0:
- ElevenLabs: list/get voices, create Instant Voice Clone, optionally support Voice Design previews, create designed voices, delete voices.
- MiniMax: upload reference audio, clone voice, list account voices, delete cloned/generated voices.
P1:
- Qwen / Alibaba Bailian: add qwen3-tts-vc-* and qwen3-tts-vd-* model families, enforce target_model compatibility, then support clone/design voice creation and listing.
- GLM: support creating a clone voice and saving the returned voice ID; postpone full cloud sync until list/delete APIs are confirmed.
- Doubao: support only after the current recommended voice clone / TTS V3 account flow is confirmed.
P2:
- OpenAI custom voices: gate behind eligibility and implement the required consent-aware workflow.
- Azure custom voice: treat as an advanced provider-specific training/deployment workflow, not a simple one-click clone flow.
Browser native TTS and generic custom OpenAI-compatible TTS should remain unsupported for voice cloning unless a provider-specific adapter is configured.
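One way to keep this rollout explicit in code is a single lookup that defaults to unsupported for any provider without an entry. The sketch below is only illustrative: the provider ID strings and the names `voiceManagementRollout` and `voiceManagementTier` are assumptions, not existing OpenMAIC identifiers.

```typescript
type RolloutTier = 'P0' | 'P1' | 'P2';

// Hypothetical provider IDs mapped to the priority tiers listed above.
// A Map is used so that lookups never hit inherited object properties.
const voiceManagementRollout = new Map<string, RolloutTier>([
  ['elevenlabs', 'P0'],
  ['minimax', 'P0'],
  ['qwen', 'P1'],
  ['glm', 'P1'],
  ['doubao', 'P1'],
  ['openai', 'P2'],
  ['azure', 'P2'],
]);

// Providers without an entry (for example browser native TTS or a generic
// OpenAI-compatible endpoint) fall through to 'unsupported'.
function voiceManagementTier(providerId: string): RolloutTier | 'unsupported' {
  return voiceManagementRollout.get(providerId) ?? 'unsupported';
}

const elevenLabsTier = voiceManagementTier('elevenlabs');
const browserNativeTier = voiceManagementTier('browser-native-tts');
```

Defaulting to `'unsupported'` means a newly added provider gets no cloning UI until someone deliberately registers it.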
Acceptance Criteria
- Existing VoxCPM2 prompt voice, clone voice, auto voice, preview, delete, and Agent Bar voice selection behavior continues to work.
- TTS providers can declare voice management capabilities without provider-specific UI branching scattered through settings components.
- ElevenLabs and MiniMax voice management can be implemented through the new abstraction without changing the TTS synthesis API shape for other providers.
- Provider-managed voices and local voices have a common ProviderVoice shape and can appear in the shared voice picker.
- The UI distinguishes provider-side deletion from local metadata removal.
- Model-bound voices are not shown for incompatible TTS models.
- Unit tests cover voice resolution, capability gating, and at least one cloud adapter with mocked API responses.
- Documentation is updated with provider capability status and official references.
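As a rough idea of what a cloud adapter test with mocked API responses could exercise, the sketch below maps a made-up list payload into the common ProviderVoice shape. The `MockRemoteVoice` fields, the category and state strings, and `toProviderVoice` are all hypothetical and do not reflect any real provider's response format.

```typescript
type VoiceKind = 'system' | 'library' | 'prompt' | 'clone' | 'designed';
type VoiceStatus = 'ready' | 'pending' | 'failed' | 'unsupported';

// Reduced version of the proposed ProviderVoice shape.
interface ProviderVoice {
  providerId: string;
  id: string;
  name: string;
  kind: VoiceKind;
  status?: VoiceStatus;
}

// Hypothetical payload shape used only for this mock.
interface MockRemoteVoice {
  voice_id: string;
  display_name: string;
  category: string;
  state?: string;
}

// Map one mocked payload entry into the shared shape.
// Unknown categories fall back to 'library'; a missing state is treated as 'ready'.
function toProviderVoice(providerId: string, raw: MockRemoteVoice): ProviderVoice {
  const kind: VoiceKind =
    raw.category === 'cloned' ? 'clone' : raw.category === 'designed' ? 'designed' : 'library';
  const status: VoiceStatus =
    raw.state === 'processing' ? 'pending' : raw.state === 'error' ? 'failed' : 'ready';
  return { providerId, id: raw.voice_id, name: raw.display_name, kind, status };
}

// Mocked list response standing in for a provider API call.
const mockedListResponse: MockRemoteVoice[] = [
  { voice_id: 'abc', display_name: 'Studio clone', category: 'cloned', state: 'processing' },
  { voice_id: 'def', display_name: 'Stock narrator', category: 'premade' },
];

const mappedVoices = mockedListResponse.map((raw) => toProviderVoice('mock-cloud', raw));
```

A real test would assert on the mapped kinds and statuses for each provider's documented payload, keeping the HTTP call itself behind a mock so the suite runs without credentials.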
Alternatives Considered
- Keep provider-specific voice managers under each TTS settings panel. This is faster for one provider, but it duplicates the VoxCPM UI and makes the Agent Bar voice pool harder to keep consistent.
- Keep accepting manual voice IDs. This is useful as a fallback, but it does not solve clone creation, account voice listing, preview, or deletion.
- Only support VoxCPM2 custom voices. This leaves strong existing provider capabilities unused, especially ElevenLabs and MiniMax.
Additional Context
Research document: docs/tts-voice-capabilities-research.md
High-priority provider references: