adds configurable model support for TTS and ASR #108

Open
ShaojieLiu wants to merge 23 commits into THU-MAIC:main from ShaojieLiu:lsj/fix-configurable-tts-asr

@ShaojieLiu
Contributor

Summary

This PR adds configurable model support for TTS and ASR.

Previously, TTS and ASR provider implementations used hardcoded server-side models, so the settings UI could not choose which model to use. This change introduces persisted ttsModelId and asrModelId settings, updates the TTS/ASR configuration pages to follow the image-generation model-management pattern, and propagates the selected model through preview, generation, and transcription flows.

Related Issues

Fixes the issue where TTS and ASR models could not be selected independently in settings.
Closes #14

Changes

  • Added ttsModelId and asrModelId to persisted settings state
  • Added TTS/ASR provider model definitions to audio provider metadata
  • Reworked TTS and ASR settings pages to match the image-generation model section pattern
  • Added bottom-positioned model lists for TTS/ASR with selectable active model
  • Added create/edit/delete support for custom TTS/ASR models
  • Updated TTS preview, scene generation, generation preview, and ASR recording/transcription flows to send selected model IDs
  • Replaced hardcoded server-side TTS/ASR model selection with configurable model IDs from settings
  • Added migration/defaulting behavior so existing users get valid default TTS/ASR models
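The migration/defaulting behavior in the last item could look roughly like the sketch below. This is only an illustration under assumptions: the `PersistedAudioSettings` shape, the provider IDs used as map keys, and the `migrateAudioSettings` helper are hypothetical and not taken from `lib/store/settings.ts`. Only the field names `ttsModelId` / `asrModelId` and the `eleven_multilingual_v2` model ID come from this PR.

```typescript
// Hypothetical persisted shape; the real store may hold more fields.
type PersistedAudioSettings = {
  ttsProviderId: string;
  asrProviderId: string;
  ttsModelId?: string; // absent in settings saved before this PR
  asrModelId?: string; // absent in settings saved before this PR
};

// Assumed per-provider defaults. The provider IDs are placeholders.
const DEFAULT_TTS_MODEL: Record<string, string> = {
  "elevenlabs-tts": "eleven_multilingual_v2",
};
const DEFAULT_ASR_MODEL: Record<string, string> = {
  "example-asr": "example-asr-model",
};

// Fill in missing model IDs so existing users end up with a valid default.
// Providers without a known default get an empty string.
function migrateAudioSettings(
  old: PersistedAudioSettings,
): Required<PersistedAudioSettings> {
  return {
    ...old,
    ttsModelId: old.ttsModelId || DEFAULT_TTS_MODEL[old.ttsProviderId] || "",
    asrModelId: old.asrModelId || DEFAULT_ASR_MODEL[old.asrProviderId] || "",
  };
}
```

A value the user already saved is kept as-is; only missing or empty IDs are replaced.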

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or build changes

Verification

Steps to reproduce / test

  1. Open TTS settings and confirm the model section appears at the bottom of the page
  2. Select a built-in TTS model, add a custom TTS model, switch selection, and verify preview requests use the selected model
  3. Open ASR settings and confirm the model section appears at the bottom of the page
  4. Select a built-in ASR model, add a custom ASR model, switch selection, and verify transcription requests use the selected model
  5. Delete a selected custom TTS/ASR model and confirm the UI falls back to another available model

What you personally verified

  • Ran targeted eslint checks on modified files

  • Ran pnpm exec tsc -p tsconfig.json --noEmit

  • Verified the TTS/ASR settings UI structure now mirrors the image-generation model section

  • Verified selected TTS/ASR model IDs are threaded through client requests into server handlers

  • Did not run full manual browser interaction testing or full CI suite in this session

  • pnpm exec eslint lib/audio/types.ts lib/audio/constants.ts lib/store/settings.ts lib/audio/tts-providers.ts lib/audio/asr-providers.ts app/api/generate/tts/route.ts app/api/transcription/route.ts components/settings/tts-settings.tsx components/settings/asr-settings.tsx components/generation/media-popover.tsx components/audio/tts-config-popover.tsx lib/hooks/use-audio-recorder.ts lib/hooks/use-scene-generator.ts app/generation-preview/page.tsx

  • pnpm exec tsc -p tsconfig.json --noEmit

Evidence

  • CI passes (pnpm check && pnpm lint && npx tsc --noEmit)
  • Manually tested locally
  • Screenshots / recordings attached (if UI changes)

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have added/updated documentation as needed
  • My changes do not introduce new warnings

* main:
  feat: whiteboard history and auto-save (THU-MAIC#40)
  fix: use browser speechSynthesis for playback when browser-native-tts is selected (THU-MAIC#28)
  chore: fix some minor issues in the comments (THU-MAIC#71)
  fix: reset ASR language when changing provider (THU-MAIC#67)
  fix: isolate settings API key autofill fields (THU-MAIC#48)

# Conflicts:
#	components/audio/tts-config-popover.tsx
#	components/generation/media-popover.tsx
#	components/settings/tts-settings.tsx
#	lib/store/settings.ts
@ShaojieLiu
Contributor Author

@wyuc Hi, could you help me review this PR? THX

Contributor

@wyuc wyuc left a comment

Thanks for this — the overall approach is solid and the model selection threads through the full stack correctly.

Two things to address:

1. Azure TTS / Browser Native don't actually use model IDs

Azure TTS uses SSML with voice selection, and Browser Native uses the Web Speech API — neither has a "model" concept. But the PR adds dummy model entries for them (azure-neural-tts, browser-native-tts, browser-native-asr), which means the UI shows a model selector that does nothing. This is confusing for users.

Suggestion: skip the model section for providers where model selection has no effect, or add a flag like supportsModelSelection to conditionally render it.
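A minimal sketch of the flag-based option, to make the suggestion concrete. The `AudioProviderMeta` shape, the provider IDs, and the `providersWithModelSection` helper are assumptions for illustration; only the `supportsModelSelection` name comes from the suggestion above.

```typescript
interface AudioModel {
  id: string;
  name: string;
}

// Hypothetical metadata shape; real provider metadata likely has more fields.
interface AudioProviderMeta {
  id: string;
  supportsModelSelection: boolean;
  models: AudioModel[];
}

// Placeholder provider entries. Providers without a model concept carry
// no dummy model entries, just the flag set to false.
const providers: AudioProviderMeta[] = [
  { id: "azure-tts", supportsModelSelection: false, models: [] },
  { id: "browser-native-tts", supportsModelSelection: false, models: [] },
  {
    id: "elevenlabs-tts",
    supportsModelSelection: true,
    models: [{ id: "eleven_multilingual_v2", name: "Eleven Multilingual v2" }],
  },
];

// The settings page would render the model section only for these providers.
function providersWithModelSection(all: AudioProviderMeta[]): string[] {
  return all.filter((p) => p.supportsModelSelection).map((p) => p.id);
}
```

In a component this would reduce to a guard along the lines of `provider.supportsModelSelection && <ModelSection />`, where `ModelSection` is likewise a placeholder name.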

2. Dead code

getTTSModels() and getASRModels() in lib/audio/constants.ts are defined but never called anywhere. Either wire them up or remove them.

Minor nits (non-blocking):

  • Some section comments were removed from tts-settings.tsx ({/* API Key & Base URL */}, etc.) — looks unintentional, might want to restore them
  • Custom model list uses key={custom-${index}} — using model ID as key would be more robust
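To illustrate the key nit: with index-based keys, deleting an item shifts every later item onto a key that used to belong to a different model, so React may reuse the wrong row state. The sketch below is framework-free and uses a hypothetical `CustomModel` shape; it only compares the two keying schemes.

```typescript
interface CustomModel {
  id: string;
  name: string;
}

// Index-based keys, as in key={`custom-${index}`}.
const indexKeys = (models: CustomModel[]): string[] =>
  models.map((_, index) => `custom-${index}`);

// ID-based keys, as in key={model.id}.
const idKeys = (models: CustomModel[]): string[] =>
  models.map((model) => model.id);

const before: CustomModel[] = [
  { id: "model-a", name: "Model A" },
  { id: "model-b", name: "Model B" },
];

// Simulate the user deleting the first custom model.
const afterDelete = before.filter((model) => model.id !== "model-a");

// Index keys: "Model B" now has the key that "Model A" had before.
const reusedKey = indexKeys(afterDelete)[0] === indexKeys(before)[0];

// ID keys: "Model B" keeps the same key across the deletion.
const stableKey = idKeys(afterDelete)[0] === idKeys(before)[1];
```

This assumes model IDs are unique within the list; if duplicates are possible, the key would need a disambiguating prefix.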

@ShaojieLiu
Contributor Author

Thanks, addressed all of these.

I split the follow-up into three commits for clarity:

  • 5c852b2 fix: hide audio model selectors for unsupported providers
  • 5193da9 chore: remove unused audio model helpers
  • bcab3e3 chore: restore settings comments and stable model keys

What changed:

  • Added supportsModelSelection to TTS/ASR provider metadata and set it to false for Azure TTS and Browser Native TTS/ASR
  • Removed the dummy model entries for those providers
  • Updated the TTS/ASR settings UIs to only render model management when model selection actually matters
  • Updated default/store fallback logic so unsupported providers keep an empty model id instead of a fake one
  • Removed unused getTTSModels() / getASRModels()
  • Restored the dropped section comments in tts-settings.tsx
  • Switched custom model list keys to use model.id
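For reference, the fallback behavior described above (empty model ID for unsupported providers, and falling back to another available model when the selected one is deleted) could be sketched as follows. The types and the `resolveModelId` helper are hypothetical and not the actual store code.

```typescript
interface AudioModel {
  id: string;
  name: string;
}

// Hypothetical metadata shape, reduced to what the fallback needs.
interface AudioProviderMeta {
  id: string;
  supportsModelSelection: boolean;
  models: AudioModel[];
}

// Decide which model ID should be stored for a provider.
function resolveModelId(
  provider: AudioProviderMeta,
  customModels: AudioModel[],
  currentId: string,
): string {
  // Providers without a model concept keep an empty ID, not a fake one.
  if (!provider.supportsModelSelection) return "";

  const available = [...provider.models, ...customModels];

  // Keep the current selection while it still exists.
  if (available.some((model) => model.id === currentId)) return currentId;

  // Otherwise fall back to the first available model, if any.
  return available[0]?.id ?? "";
}
```

Built-in models are listed before custom ones here, so deleting a selected custom model falls back to a built-in model when one exists; that ordering is a design assumption of this sketch.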

@ShaojieLiu ShaojieLiu requested a review from wyuc March 23, 2026 03:06
@wyuc
Contributor

wyuc commented Mar 27, 2026

Thanks for the follow-up commits. The supportsModelSelection flag and the cleanup all look correct.

One thing I missed in the first round: generateElevenLabsTTS() in lib/audio/tts-providers.ts still has model_id hardcoded to 'eleven_multilingual_v2' (line 353). The other providers were updated to use config.modelId || 'default', but ElevenLabs was not. Since ElevenLabs is marked supportsModelSelection: true, users can select or add a custom model in the UI, but the actual API request will always send eleven_multilingual_v2.

Fix should be straightforward:

```ts
model_id: config.modelId || 'eleven_multilingual_v2',
```
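One detail worth noting about the fix: since unsupported providers now store an empty model ID, `||` and `??` behave differently if an empty string ever reaches this code path. A small sketch, assuming a hypothetical `TTSConfig` shape and request-body builder (only `config.modelId`, `model_id`, and `eleven_multilingual_v2` come from the code under review):

```typescript
// Hypothetical config shape, reduced to the field relevant here.
interface TTSConfig {
  modelId?: string;
}

const ELEVENLABS_DEFAULT_MODEL = "eleven_multilingual_v2";

// With ||, both undefined and "" fall back to the default model.
function buildBodyWithOr(text: string, config: TTSConfig) {
  return { text, model_id: config.modelId || ELEVENLABS_DEFAULT_MODEL };
}

// With ??, only undefined/null fall back; "" would be sent to the API as-is.
function buildBodyWithNullish(text: string, config: TTSConfig) {
  return { text, model_id: config.modelId ?? ELEVENLABS_DEFAULT_MODEL };
}

const emptyId = buildBodyWithOr("hello", { modelId: "" }).model_id;
const customId = buildBodyWithOr("hello", { modelId: "my-custom-model" }).model_id;
const leakedEmpty = buildBodyWithNullish("hello", { modelId: "" }).model_id;
```

Here `emptyId` resolves to the default, `customId` to the user's choice, and `leakedEmpty` to an empty string, which is why the `||` form in the suggested fix is the safer choice.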

Development

Successfully merging this pull request may close these issues.

Feature Request: Add flexible configuration support for TTS provider Model IDs
