feat: add CosyVoice2/3 TTS engine with instruct and voice cloning #311
Integrate Alibaba's CosyVoice2-0.5B and Fun-CosyVoice3-0.5B as a new TTS engine supporting 9 languages, zero-shot voice cloning, and instruct control (emotions, speed, volume, dialects). The CosyVoice source is cloned at setup time into backend/vendors/ since no PyPI package exists. A modelscope→HuggingFace shim redirects model downloads to the public HF repos, and a lightweight pylogger shim avoids pulling in pytorch-lightning as a transitive dependency.

- Backend: cosyvoice_backend.py, __init__.py registry, models.py regex
- Frontend: engine selector, language map, Zod schema, model descriptions
- Infra: requirements.txt, justfile, release.yml, Dockerfile, PyInstaller
📝 Walkthrough

Adds CosyVoice TTS support (v2/v3): frontend options and types, backend CosyVoiceTTSBackend, model registrations and validation, new Python dependencies, and build/CI changes to clone and bundle the CosyVoice source for packaging and runtime.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Frontend
    participant Form as Generation Form
    participant API as Backend API
    participant Factory as Engine Factory
    participant CVBackend as CosyVoice Backend
    participant ModelCache as Model Cache/Loader
    User->>Form: choose cosyvoice + v2|v3
    Form->>API: POST GenerationRequest(engine: cosyvoice, model_size: v2|v3, lang, instruct?)
    API->>Factory: get_tts_backend_for_engine("cosyvoice")
    Factory->>CVBackend: instantiate/load backend
    CVBackend->>ModelCache: ensure model variant loaded (v2/v3)
    ModelCache-->>CVBackend: model ready (cached or loaded)
    API->>CVBackend: generate(text, reference?, seed?)
    CVBackend->>CVBackend: select mode (instruct2 / zero-shot), run inference, assemble chunks
    CVBackend-->>API: return audio numpy @ 24kHz
    API-->>User: stream/return audio
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (2)
Dockerfile (1)
42-42: Pin the CosyVoice checkout to ensure reproducible builds.

Cloning the default branch at build time makes the image non-reproducible and allows upstream changes to affect already-released app revisions. Define one pinned ref (commit SHA, tag, or branch) and reuse it consistently across the Dockerfile (line 42), justfile (lines 55, 89), and .github/workflows/release.yml (lines 66, 194).

♻️ One way to pin the vendor ref
```diff
+ARG COSYVOICE_REF
-RUN git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git /build/CosyVoice
+RUN git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git /build/CosyVoice \
+    && git -C /build/CosyVoice checkout "$COSYVOICE_REF" \
+    && git -C /build/CosyVoice submodule update --init --recursive
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` at line 42, Replace the unpinned git clone with a reproducible, pinned ref: introduce a build-time variable (e.g., ARG or ENV named COSYVOICE_REF) and use it in the git clone command (the RUN git clone --recursive ... line) to checkout a specific commit SHA, tag, or branch, and then reuse that same COSYVOICE_REF variable in the justfile and .github/workflows/release.yml so all places (the Dockerfile's RUN git clone, the justfile entries at the noted lines, and the release workflow steps) reference the identical pinned ref for reproducible builds.

backend/backends/cosyvoice_backend.py (1)
23-23: Consider using modern type hints.

Static analysis flags that `typing.List` and `typing.Tuple` are deprecated in favor of lowercase `list` and `tuple` (available since Python 3.9+).

♻️ Proposed refactor

```diff
-from typing import ClassVar, List, Optional, Tuple
+from typing import ClassVar, Optional
```

Then update usages:

- `List[str]` → `list[str]`
- `Tuple[dict, bool]` → `tuple[dict, bool]`
- `Tuple[np.ndarray, str]` → `tuple[np.ndarray, str]`
- `Tuple[np.ndarray, int]` → `tuple[np.ndarray, int]`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 23, Replace deprecated typing generics with native Python types: change the import line that currently brings in List and Tuple (from typing import ClassVar, List, Optional, Tuple) to drop List and Tuple and use the built-in lowercase forms in annotations; update all annotations in this module that use List[...] to list[...] and Tuple[...] to tuple[...] (e.g., List[str] -> list[str], Tuple[dict, bool] -> tuple[dict, bool], Tuple[np.ndarray, str] -> tuple[np.ndarray, str], Tuple[np.ndarray, int] -> tuple[np.ndarray, int]), and remove List/Tuple from the import list so only needed names like ClassVar and Optional remain (also clean up any now-unused imports).
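The refactor above is mechanical; as a tiny self-contained sketch, here is a signature written with the built-in generics (the function name and body are illustrative stand-ins, not taken from the actual backend):

```python
# Python 3.9+: built-in list/tuple generics replace typing.List / typing.Tuple.
def combine(paths: list[str]) -> tuple[list[float], int]:
    """Illustrative signature only: returns placeholder samples and a sample rate."""
    return [0.0] * len(paths), 24000
```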
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/release.yml:
- Line 66: The git clone command "git clone --recursive --depth 1
https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice" will
fail when its parent directory doesn't exist—add a step to create the parent
directory before cloning (e.g., run a cross-platform create like mkdir -p
backend/vendors or the equivalent Windows-aware command) and apply the same
change to the other clone occurrence of the same command; update the workflow so
the directory creation runs immediately before each git clone of
backend/vendors/CosyVoice.
In `@app/src/components/ServerSettings/ModelManagement.tsx`:
- Around line 69-72: The voiceModels filter used to build the "Voice Generation"
list excludes models that begin with "cosyvoice", so add a condition to the
existing filter predicate in the voiceModels creation (the filter around lines
397-404) to accept model IDs that startWith('cosyvoice') (or match
/^cosyvoice/); update the predicate used where voiceModels is computed so
entries like 'cosyvoice2-0.5b' and 'cosyvoice3-0.5b' are included in the
resulting list shown in the UI.
In `@backend/backends/cosyvoice_backend.py`:
- Around line 263-268: The combine_voice_prompts wrapper in cosyvoice_backend.py
currently forwards only audio_paths and reference_texts to
_combine_voice_prompts; modify the async method combine_voice_prompts to pass
sample_rate=24000 (use COSYVOICE_SAMPLE_RATE if defined) when calling
_combine_voice_prompts so the combined audio is explicitly resampled to 24 kHz;
update the call in combine_voice_prompts to include the sample_rate keyword
argument to match the TADA backend's behavior.
In `@backend/build_binary.py`:
- Around line 254-263: The _ensure_cosyvoice_on_path() function must recognize
PyInstaller frozen runs and use the extracted _MEIPASS paths or skip the
vendor-dir existence check; update _ensure_cosyvoice_on_path() to inspect
getattr(sys, "_MEIPASS", None) and, when present, construct paths to
os.path.join(sys._MEIPASS, "cosyvoice") and os.path.join(sys._MEIPASS, "matcha")
(or skip raising RuntimeError) instead of insisting on
backend/vendors/CosyVoice; ensure the same logic aligns with the
backend/server.py shim behavior so frozen binaries don't raise on startup.
---
Nitpick comments:
In `@backend/backends/cosyvoice_backend.py`:
- Line 23: Replace deprecated typing generics with native Python types: change
the import line that currently brings in List and Tuple (from typing import
ClassVar, List, Optional, Tuple) to drop List and Tuple and use the built-in
lowercase forms in annotations; update all annotations in this module that use
List[...] to list[...] and Tuple[...] to tuple[...] (e.g., List[str] ->
list[str], Tuple[dict, bool] -> tuple[dict, bool], Tuple[np.ndarray, str] ->
tuple[np.ndarray, str], Tuple[np.ndarray, int] -> tuple[np.ndarray, int]), and
remove List/Tuple from the import list so only needed names like ClassVar and
Optional remain (also clean up any now-unused imports).
In `@Dockerfile`:
- Line 42: Replace the unpinned git clone with a reproducible, pinned ref:
introduce a build-time variable (e.g., ARG or ENV named COSYVOICE_REF) and use
it in the git clone command (the RUN git clone --recursive ... line) to checkout
a specific commit SHA, tag, or branch, and then reuse that same COSYVOICE_REF
variable in the justfile and .github/workflows/release.yml so all places (the
Dockerfile's RUN git clone, the justfile entries at the noted lines, and the
release workflow steps) reference the identical pinned ref for reproducible
builds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4db7c9ed-55cf-4484-8442-2b3a3447f384
📒 Files selected for processing (15)
- .github/workflows/release.yml
- .gitignore
- Dockerfile
- app/src/components/Generation/EngineModelSelector.tsx
- app/src/components/ServerSettings/ModelManagement.tsx
- app/src/lib/api/types.ts
- app/src/lib/constants/languages.ts
- app/src/lib/hooks/useGenerationForm.ts
- backend/backends/__init__.py
- backend/backends/cosyvoice_backend.py
- backend/build_binary.py
- backend/models.py
- backend/requirements.txt
- backend/server.py
- justfile
```sh
pip install -r backend/requirements.txt
pip install --no-deps chatterbox-tts
pip install --no-deps hume-tada
git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice
```
Create backend/vendors before cloning.
Fresh checkouts do not contain backend/vendors (it is ignored by .gitignore), and git clone ... backend/vendors/CosyVoice will fail when the parent path is missing. This breaks both release jobs before the server build starts.
🛠️ Cross-platform fix for both clone sites
```diff
   pip install --no-deps chatterbox-tts
   pip install --no-deps hume-tada
+  python -c "from pathlib import Path; Path('backend/vendors').mkdir(parents=True, exist_ok=True)"
   git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice
```

Also applies to: 194-194
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/release.yml at line 66, The git clone command "git clone
--recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git
backend/vendors/CosyVoice" will fail when its parent directory doesn't exist—add
a step to create the parent directory before cloning (e.g., run a cross-platform
create like mkdir -p backend/vendors or the equivalent Windows-aware command)
and apply the same change to the other clone occurrence of the same command;
update the workflow so the directory creation runs immediately before each git
clone of backend/vendors/CosyVoice.
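The cross-platform fix above leans on `pathlib.Path.mkdir(parents=True, exist_ok=True)` being safe to run repeatedly. A minimal helper illustrating that behavior (the function name is ours, not from the workflow):

```python
from pathlib import Path

def ensure_dir(path: str) -> Path:
    """Create the directory (and any missing parents); safe to call twice."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return p
```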
```python
async def combine_voice_prompts(
    self,
    audio_paths: List[str],
    reference_texts: List[str],
) -> Tuple[np.ndarray, str]:
    return await _combine_voice_prompts(audio_paths, reference_texts)
```
Pass sample_rate=24000 to ensure consistent audio resampling.
The combine_voice_prompts call omits the sample_rate parameter. CosyVoice expects 24 kHz audio (as defined by COSYVOICE_SAMPLE_RATE). Without explicit resampling, combined reference audio may be at the wrong sample rate if source files differ.
The TADA backend (which also uses 24 kHz) explicitly passes sample_rate=24000 in its combine_voice_prompts implementation.
🔧 Proposed fix
```diff
 async def combine_voice_prompts(
     self,
     audio_paths: List[str],
     reference_texts: List[str],
 ) -> Tuple[np.ndarray, str]:
-    return await _combine_voice_prompts(audio_paths, reference_texts)
+    return await _combine_voice_prompts(audio_paths, reference_texts, sample_rate=COSYVOICE_SAMPLE_RATE)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def combine_voice_prompts(
    self,
    audio_paths: List[str],
    reference_texts: List[str],
) -> Tuple[np.ndarray, str]:
    return await _combine_voice_prompts(audio_paths, reference_texts, sample_rate=COSYVOICE_SAMPLE_RATE)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 263 - 268, The
combine_voice_prompts wrapper in cosyvoice_backend.py currently forwards only
audio_paths and reference_texts to _combine_voice_prompts; modify the async
method combine_voice_prompts to pass sample_rate=24000 (use
COSYVOICE_SAMPLE_RATE if defined) when calling _combine_voice_prompts so the
combined audio is explicitly resampled to 24 kHz; update the call in
combine_voice_prompts to include the sample_rate keyword argument to match the
TADA backend's behavior.
```python
# Bundle the vendored CosyVoice source tree for frozen builds.
# The clone lives at backend/vendors/CosyVoice/ at build time.
cosyvoice_vendor = backend_dir / "vendors" / "CosyVoice"
if cosyvoice_vendor.exists():
    args.extend([
        "--add-data",
        f"{cosyvoice_vendor / 'cosyvoice'}{os.pathsep}cosyvoice",
        "--add-data",
        f"{cosyvoice_vendor / 'third_party' / 'Matcha-TTS' / 'matcha'}{os.pathsep}matcha",
    ])
```
_ensure_cosyvoice_on_path() must detect frozen builds and use extracted paths.
When PyInstaller bundles the binary, --add-data extracts CosyVoice to _MEIPASS/cosyvoice and Matcha to _MEIPASS/matcha. However, _ensure_cosyvoice_on_path() still checks for backend/vendors/CosyVoice and raises RuntimeError if not found—this path does not exist in frozen binaries. The backend/server.py shim only adds to sys.path; it cannot prevent the error. Update _ensure_cosyvoice_on_path() to check for the frozen environment (via getattr(sys, '_MEIPASS', None)) and use the extracted paths instead, or skip the existence check when running frozen.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/build_binary.py` around lines 254 - 263, The
_ensure_cosyvoice_on_path() function must recognize PyInstaller frozen runs and
use the extracted _MEIPASS paths or skip the vendor-dir existence check; update
_ensure_cosyvoice_on_path() to inspect getattr(sys, "_MEIPASS", None) and, when
present, construct paths to os.path.join(sys._MEIPASS, "cosyvoice") and
os.path.join(sys._MEIPASS, "matcha") (or skip raising RuntimeError) instead of
insisting on backend/vendors/CosyVoice; ensure the same logic aligns with the
backend/server.py shim behavior so frozen binaries don't raise on startup.
- Model filter in ModelManagement was an allowlist missing the cosyvoice prefix; replaced with an inverse filter (everything except whisper is a voice model)
- Shim training-only modules (cosyvoice.dataset.processor, matcha.utils.*) to prevent hyperpyyaml from importing pyarrow, pyworld, lightning, etc.
- Patch torchaudio.load for 2.9+ compat (torchcodec is now required by default)
- Add matplotlib to requirements (matcha's hifigan imports it at module level)
Actionable comments posted: 2
🧹 Nitpick comments (5)
backend/backends/cosyvoice_backend.py (4)
183-185: Parameter `format` shadows a Python builtin.

The `format` parameter name shadows the Python builtin. While this is intentional for API compatibility with `torchaudio.load`'s signature, consider using `format_` or `audio_format` with a local alias if this causes linting noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` around lines 183 - 185, The parameter name format in function _sf_load shadows the Python builtin; rename the parameter to format_ (or audio_format) and update the function body to use the new name (or create a local alias format = format_ at top of _sf_load) so external API compatibility is preserved while removing the builtin shadowing; also update any internal calls to _sf_load or usages within the same file to pass the new parameter name or keep backwards-compatible kwargs mapping if needed.
98-98: Replace lambda assignment with a `def` statement.

Per static analysis E731, avoid assigning lambda expressions to variables. Use a `def` statement instead for clarity and better stack traces.

♻️ Proposed fix

```diff
-_noop = lambda *a, **kw: None
+def _noop(*a, **kw):
+    pass
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 98, The assignment "_noop = lambda *a, **kw: None" should be replaced with a real function definition to satisfy E731; define a function named _noop with the same signature (accepting *a, **kw) that returns None (or simply contains a pass) and use that in place of the lambda wherever referenced (e.g., any calls to _noop in cosyvoice_backend.py) so stack traces and static analysis are improved.
425-426: Clarify the empty-chunks fallback.

The fallback returns `np.zeros(COSYVOICE_SAMPLE_RATE, ...)`, which creates 24000 samples (1 second of silence). Using `COSYVOICE_SAMPLE_RATE` as a length is semantically confusing since it is a rate constant. Consider adding a brief comment or using a named constant for clarity.

♻️ Proposed fix

```diff
 # Concatenate all chunks
 if not audio_chunks:
-    return np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE
+    # Return 1 second of silence as fallback
+    return np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` around lines 425 - 426, The fallback that returns np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE is confusing because COSYVOICE_SAMPLE_RATE is a rate, not a length; update the empty-chunks fallback to return a clearly named silence buffer (e.g., SILENCE_ONE_SECOND = int(COSYVOICE_SAMPLE_RATE * 1) or SAMPLE_LENGTH_ONE_SEC) and/or add a concise comment above the return explaining it returns one second of silence, then use that named constant in the np.zeros call and keep returning COSYVOICE_SAMPLE_RATE for the sample rate; reference the audio_chunks check and the COSYVOICE_SAMPLE_RATE and np.zeros usage to locate the change.
23-23: Use built-in `list` and `tuple` instead of deprecated `typing` imports.

Per static analysis, `typing.List` and `typing.Tuple` are deprecated in Python 3.9+. Use the built-in generics instead.

♻️ Proposed fix

```diff
-from typing import ClassVar, List, Optional, Tuple
+from typing import ClassVar, Optional
```

Then update usages:

- `List[str]` → `list[str]`
- `Tuple[dict, bool]` → `tuple[dict, bool]`
- `Tuple[np.ndarray, str]` → `tuple[np.ndarray, str]`
- `Tuple[np.ndarray, int]` → `tuple[np.ndarray, int]`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 23, Replace the deprecated typing imports in the import line (remove List and Tuple from "from typing import ClassVar, List, Optional, Tuple") and update all type hints in this module that use List[...] and Tuple[...] to the built-in generics list[...] and tuple[...]; specifically change occurrences like List[str] → list[str], Tuple[dict, bool] → tuple[dict, bool], Tuple[np.ndarray, str] → tuple[np.ndarray, str], and Tuple[np.ndarray, int] → tuple[np.ndarray, int], ensuring any references to those type names in functions, methods or class annotations in cosyvoice_backend.py are updated accordingly.

backend/requirements.txt (1)
43-51: Consider pinning versions for unpinned dependencies.

`tiktoken`, `einops`, `inflect`, and `matplotlib` lack version constraints. While this is common for transitive dependencies, it can lead to reproducibility issues or unexpected breakages when upstream releases incompatible versions.

If strict reproducibility is desired, consider adding minimum version pins based on your tested versions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/requirements.txt` around lines 43 - 51, The requirements file lists tiktoken, einops, inflect, and matplotlib without version constraints which risks breakage; update those entries in the requirements.txt (the lines containing the package names tiktoken, einops, inflect, matplotlib) to include explicit version constraints—either minimum tested versions using >= (e.g., tiktoken>=X.Y.Z) or exact pins (==X.Y.Z) based on the versions you validated—and then run the project's test suite/CI and regenerate any lock file to ensure reproducibility; also add a short comment next to each entry indicating the tested version you pinned to.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/backends/cosyvoice_backend.py`:
- Around line 274-298: The current temporary global patch of torch.load in the
CosyVoice model-loading block is unsafe across concurrent CosyVoiceTTSBackend
instances because other threads can see the patched function during the brief
window; to fix, move the torch.load patch/restore so it is performed while
holding the same class-level _import_lock (or create a dedicated lock like
_torch_load_patch_lock) used for other global patches, or convert the patch into
a one-time persistent patch under _import_lock so it is not restored per-load;
update the code around torch.load, _orig_torch_load, and the try/finally that
surrounds CosyVoice2/CosyVoice3 construction to use that lock strategy (or
persistent patch) to eliminate the race with _model_load_lock and concurrent
instances.
- Around line 412-422: The fallback branch currently calls
self.model.inference_cross_lingual with prompt_wav=ref_audio or "" which
silently proceeds when ref_audio is missing; instead detect the missing
reference (ref_audio is None or empty) and raise a descriptive exception (e.g.,
ValueError) before calling inference_cross_lingual; update the else branch where
logger.info("[CosyVoice] cross_lingual fallback: lang=%s", language) appears to
check ref_audio and raise the error, leaving inference_cross_lingual calls only
when a valid ref_audio exists so the model method isn’t invoked with an empty
string.
---
Nitpick comments:
In `@backend/backends/cosyvoice_backend.py`:
- Around line 183-185: The parameter name format in function _sf_load shadows
the Python builtin; rename the parameter to format_ (or audio_format) and update
the function body to use the new name (or create a local alias format = format_
at top of _sf_load) so external API compatibility is preserved while removing
the builtin shadowing; also update any internal calls to _sf_load or usages
within the same file to pass the new parameter name or keep backwards-compatible
kwargs mapping if needed.
- Line 98: The assignment "_noop = lambda *a, **kw: None" should be replaced
with a real function definition to satisfy E731; define a function named _noop
with the same signature (accepting *a, **kw) that returns None (or simply
contains a pass) and use that in place of the lambda wherever referenced (e.g.,
any calls to _noop in cosyvoice_backend.py) so stack traces and static analysis
are improved.
- Around line 425-426: The fallback that returns np.zeros(COSYVOICE_SAMPLE_RATE,
dtype=np.float32), COSYVOICE_SAMPLE_RATE is confusing because
COSYVOICE_SAMPLE_RATE is a rate, not a length; update the empty-chunks fallback
to return a clearly named silence buffer (e.g., SILENCE_ONE_SECOND =
int(COSYVOICE_SAMPLE_RATE * 1) or SAMPLE_LENGTH_ONE_SEC) and/or add a concise
comment above the return explaining it returns one second of silence, then use
that named constant in the np.zeros call and keep returning
COSYVOICE_SAMPLE_RATE for the sample rate; reference the audio_chunks check and
the COSYVOICE_SAMPLE_RATE and np.zeros usage to locate the change.
- Line 23: Replace the deprecated typing imports in the import line (remove List
and Tuple from "from typing import ClassVar, List, Optional, Tuple") and update
all type hints in this module that use List[...] and Tuple[...] to the built-in
generics list[...] and tuple[...]; specifically change occurrences like
List[str] → list[str], Tuple[dict, bool] → tuple[dict, bool], Tuple[np.ndarray,
str] → tuple[np.ndarray, str], and Tuple[np.ndarray, int] → tuple[np.ndarray,
int], ensuring any references to those type names in functions, methods or class
annotations in cosyvoice_backend.py are updated accordingly.
In `@backend/requirements.txt`:
- Around line 43-51: The requirements file lists tiktoken, einops, inflect, and
matplotlib without version constraints which risks breakage; update those
entries in the requirements.txt (the lines containing the package names
tiktoken, einops, inflect, matplotlib) to include explicit version
constraints—either minimum tested versions using >= (e.g., tiktoken>=X.Y.Z) or
exact pins (==X.Y.Z) based on the versions you validated—and then run the
project's test suite/CI and regenerate any lock file to ensure reproducibility;
also add a short comment next to each entry indicating the tested version you
pinned to.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4e6acd47-425c-40be-876e-33b16dcfec95
📒 Files selected for processing (3)
- app/src/components/ServerSettings/ModelManagement.tsx
- backend/backends/cosyvoice_backend.py
- backend/requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- app/src/components/ServerSettings/ModelManagement.tsx
```python
# 3. Patch torch.load to force map_location on CPU
import torch

if device == "cpu":
    _orig_torch_load = torch.load

    def _patched_load(*args, **kwargs):
        kwargs.setdefault("map_location", "cpu")
        return _orig_torch_load(*args, **kwargs)

    torch.load = _patched_load

try:
    if variant == "v2":
        from cosyvoice.cli.cosyvoice import CosyVoice2

        model = CosyVoice2(hf_repo)
    else:
        from cosyvoice.cli.cosyvoice import CosyVoice3

        model = CosyVoice3(hf_repo)
finally:
    # Restore original torch.load
    if device == "cpu":
        torch.load = _orig_torch_load
```
Thread-safety concern with global torch.load patch.
The torch.load patch temporarily modifies a global function. While _model_load_lock serializes calls within a single instance, multiple CosyVoiceTTSBackend instances loading concurrently could race on the patch/restore sequence.
This is likely acceptable if backends are singletons, but consider moving the patch inside the class-level _import_lock block (applied once, never restored) for consistency with the other patches, or use a dedicated lock for this patch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 274 - 298, The current
temporary global patch of torch.load in the CosyVoice model-loading block is
unsafe across concurrent CosyVoiceTTSBackend instances because other threads can
see the patched function during the brief window; to fix, move the torch.load
patch/restore so it is performed while holding the same class-level _import_lock
(or create a dedicated lock like _torch_load_patch_lock) used for other global
patches, or convert the patch into a one-time persistent patch under
_import_lock so it is not restored per-load; update the code around torch.load,
_orig_torch_load, and the try/finally that surrounds CosyVoice2/CosyVoice3
construction to use that lock strategy (or persistent patch) to eliminate the
race with _model_load_lock and concurrent instances.
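The "apply once under a lock, never restore" alternative suggested above can be sketched in isolation. This is hedged: `_patch_lock` stands in for the class-level `_import_lock` named in the comment, and a dummy stand-in plays the role of the torch module so the pattern is runnable without torch installed:

```python
import threading

_patch_lock = threading.Lock()  # stand-in for the class-level _import_lock
_patched = False                # applied-once flag, guarded by the lock

class _FakeTorch:
    """Dummy stand-in for the torch module; load() echoes its arguments."""
    @staticmethod
    def load(path, **kwargs):
        return {"path": path, **kwargs}

torch = _FakeTorch()

def ensure_cpu_map_location_patch():
    """Idempotently wrap torch.load so map_location defaults to 'cpu'.

    The patch is applied at most once and never restored, so concurrent
    backend instances can never observe a half-patched/half-restored state.
    """
    global _patched
    with _patch_lock:
        if _patched:
            return
        orig_load = torch.load

        def _patched_load(*args, **kwargs):
            kwargs.setdefault("map_location", "cpu")
            return orig_load(*args, **kwargs)

        torch.load = _patched_load
        _patched = True
```

Calls that already pass `map_location` are untouched; only the default changes, which matches the behavior of the per-load patch in the diff above without the restore race.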
```python
else:
    # cross-lingual (no reference audio, shouldn't normally happen
    # in voicebox since profiles always have samples, but handle it)
    logger.info("[CosyVoice] cross_lingual fallback: lang=%s", language)
    for chunk in self.model.inference_cross_lingual(
        tts_text=text,
        prompt_wav=ref_audio or "",
        stream=False,
        speed=1.0,
    ):
        audio_chunks.append(chunk["tts_speech"])
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for inference_cross_lingual signature and implementation in CosyVoice source
ast-grep --pattern $'def inference_cross_lingual($$$) {
$$$
}'Repository: jamiepine/voicebox
Length of output: 44
🏁 Script executed:
# First, check the imports and understand CosyVoice usage
head -50 backend/backends/cosyvoice_backend.pyRepository: jamiepine/voicebox
Length of output: 1508
🏁 Script executed:
# Search for any other inference_cross_lingual calls in the codebase
rg "inference_cross_lingual" --type pyRepository: jamiepine/voicebox
Length of output: 164
🏁 Script executed:
# Look for CosyVoice library imports and source
rg "from cosyvoice|import cosyvoice|CosyVoice" --type py -A 2Repository: jamiepine/voicebox
Length of output: 10070
🏁 Script executed:
# Check if CosyVoice is in requirements or setup files
fd -e "requirements*.txt" -e "setup.py" -e "pyproject.toml" -e "poetry.lock" | xargs grep -l -i cosyvoice 2>/dev/null || echo "No CosyVoice found in dependency files"Repository: jamiepine/voicebox
Length of output: 100
🏁 Script executed:
# Get lines around the cross_lingual call to see ref_audio handling
sed -n '345,433p' backend/backends/cosyvoice_backend.py | cat -nRepository: jamiepine/voicebox
Length of output: 4210
🏁 Script executed:
# Look for where ref_audio is assigned in the generate method
sed -n '345,433p' backend/backends/cosyvoice_backend.py | grep -n "ref_audio"Repository: jamiepine/voicebox
Length of output: 568
🏁 Script executed:
# Check if CosyVoice vendor exists and find the inference_cross_lingual method
find backend/vendors -name "*.py" 2>/dev/null | head -20Repository: jamiepine/voicebox
Length of output: 44
🏁 Script executed:
# Search for inference_cross_lingual in vendor directory
rg "def inference_cross_lingual" backend/vendors/ 2>/dev/null || echo "Not found in vendors"

Repository: jamiepine/voicebox
Length of output: 82
🏁 Script executed:
# Check if there's any documentation or README about prompt_wav parameter
find backend/vendors -name "README*" -o -name "*.md" | xargs grep -l "prompt_wav\|inference_cross_lingual" 2>/dev/null || echo "No docs found"

Repository: jamiepine/voicebox
Length of output: 143
🏁 Script executed:
# Check setup/build scripts that might mention CosyVoice and how it's set up
find . -name "setup.py" -o -name "Justfile" -o -name "justfile" | xargs grep -A 5 -B 5 "cosyvoice\|CosyVoice" 2>/dev/null | head -50

Repository: jamiepine/voicebox
Length of output: 1769
🏁 Script executed:
# Search for any comments or documentation about the inference methods
rg "inference_cross_lingual|inference_zero_shot|inference_instruct2" -B 3 -A 3 backend/ --type py | head -100

Repository: jamiepine/voicebox
Length of output: 3234
🌐 Web query:
CosyVoice inference_cross_lingual prompt_wav parameter documentation
💡 Result:
The inference_cross_lingual method in CosyVoice is used for cross-lingual zero-shot voice cloning and synthesis, where the target text (tts_text) is synthesized in a potentially different language from the reference prompt audio. Unlike inference_zero_shot, it does not pass prompt text to the LLM, decoupling textual content from the reference audio to focus on speaker timbre, prosody, and paralinguistic features (e.g., laughter) from the audio alone. This design improves control over non-verbal elements.

Signature (from examples and code snippets):

    inference_cross_lingual(tts_text: str, prompt_speech_16k: torch.Tensor, zero_shot_spk_id: str = '', stream: bool = False, speed: float = 1.0, text_frontend: bool = True) -> Generator

Parameters:
- tts_text: The text to synthesize. Prefix with language tokens like <|en|>, <|zh|>, <|ja|>, etc., for multilingual support. Can include special controls like [laughter].
- prompt_wav (or prompt_speech_16k): Reference audio waveform (resampled to 16kHz, max ~30s) providing speaker voice, style, and prosody. Loaded via load_wav(prompt_wav_path, 16000).
- zero_shot_spk_id: Optional predefined speaker ID (rarely used in cross-lingual mode).
- stream: If True, yields audio chunks for streaming inference.
- speed: Speech speed multiplier (e.g., 1.0 default, 0.5-1.5 range).
- text_frontend: Enable text normalization/pronunciation (uses ttsfrd or WeTextProcessing).

Usage example:

    from cosyvoice.cli.cosyvoice import CosyVoice
    from cosyvoice.utils.file_utils import load_wav
    import torchaudio

    cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
    prompt_speech_16k = load_wav('prompt.wav', 16000)
    for i, output in enumerate(cosyvoice.inference_cross_lingual('<|en|>Sample English text.', prompt_speech_16k)):
        torchaudio.save(f'output_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)

Requires CosyVoice-300M (or CosyVoice2-0.5B+) model.
Frontend extracts speech tokens, speaker embedding, and features from prompt_wav but omits prompt text tokens for LLM.
Citations:
- 1: https://github.com/FunAudioLLM/CosyVoice
- 2: https://github.com/PasiKoodaa/CosyVoice-optimized
- 3: https://huggingface.co/FunAudioLLM/CosyVoice-300M
- 4: https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/cli/frontend.py
- 5: https://readmex.com/FunAudioLLM/CosyVoice/page-78571fbeb-6359-48df-96cd-da6e335c8dc9
- 6: https://github.com/FunAudioLLM/CosyVoice/blob/master/cosyvoice/cli/model.py
- 7: https://git.citory.tech/deepgeek/CosyVoice/src/branch/main/example.py
Raise an error when reference audio is unavailable instead of passing an empty string to `inference_cross_lingual`.
The inference_cross_lingual method requires reference audio to extract speaker timbre, prosody, and voice characteristics. Passing an empty string at line 74 will fail or produce generic speech without the intended speaker cloning. Since the code comment notes this fallback "shouldn't normally happen in voicebox since profiles always have samples," missing reference audio indicates a configuration problem that should be surfaced to the user rather than silently degraded. Consider raising a descriptive error in the else branch instead of attempting inference without audio.
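A minimal sketch of the suggested guard, with `model`, `text`, `ref_audio`, and `audio_chunks` assumed from the snippet under review (not the PR's exact code):

```python
def cross_lingual_fallback(model, text, ref_audio, audio_chunks):
    """Run CosyVoice cross-lingual inference, refusing to proceed without reference audio."""
    if not ref_audio:
        # Surface the misconfiguration instead of silently producing generic speech.
        raise ValueError(
            "CosyVoice cross-lingual inference requires reference audio; "
            "the selected voice profile has no sample"
        )
    for chunk in model.inference_cross_lingual(
        tts_text=text,
        prompt_wav=ref_audio,
        stream=False,
        speed=1.0,
    ):
        audio_chunks.append(chunk["tts_speech"])
```

With this shape, `inference_cross_lingual` is only ever called with a valid `prompt_wav`, and a missing sample fails loudly at the API boundary.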
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 412 - 422, The fallback
branch currently calls self.model.inference_cross_lingual with
prompt_wav=ref_audio or "" which silently proceeds when ref_audio is missing;
instead detect the missing reference (ref_audio is None or empty) and raise a
descriptive exception (e.g., ValueError) before calling inference_cross_lingual;
update the else branch where logger.info("[CosyVoice] cross_lingual fallback:
lang=%s", language) appears to check ref_audio and raise the error, leaving
inference_cross_lingual calls only when a valid ref_audio exists so the model
method isn’t invoked with an empty string.
In my first tests of this model, output is horrible. Not passing the quality bar for the next update. Maybe I'm configuring it wrong, but I'll choose other models to add instead for 0.4.0.
Summary

- New TTS engine (`cosyvoice`) with 9-language support, zero-shot voice cloning, and instruct control (emotions, speed, volume, dialects)
- CosyVoice source is cloned at setup time (`just setup-python`) since no PyPI package exists — a modelscope→HuggingFace shim redirects downloads to public HF repos

Changes
Backend (5 files)

- `backend/backends/cosyvoice_backend.py`: new backend using `inference_instruct2()` + `inference_zero_shot()`
- `backend/backends/__init__.py`: `TTS_ENGINES` registry, add 2 `ModelConfig` entries (v2/v3), factory branch, multi-size dispatch
- `backend/models.py`: engine value (`cosyvoice`) and model_size regex (`v2|v3`)
- `backend/build_binary.py`: `--collect-all` for hyperpyyaml/onnxruntime/whisper/tiktoken, `--add-data` for vendored source
- `backend/server.py`

Frontend (5 files)
- `app/src/lib/api/types.ts`: add `'cosyvoice'` to engine union, `'v2' | 'v3'` to model_size
- `app/src/lib/constants/languages.ts`: `cosyvoice` language list (zh, en, ja, ko, de, fr, ru, es, it)
- `app/src/components/Generation/EngineModelSelector.tsx`
- `app/src/lib/hooks/useGenerationForm.ts`
- `app/src/components/ServerSettings/ModelManagement.tsx`

Infra (4 files)
- `backend/requirements.txt`
- `justfile`: `git clone --recursive --depth 1` in Unix + Windows `setup-python`
- `.github/workflows/release.yml`
- `Dockerfile`
- `.gitignore`: ignore `backend/vendors/`

Design Decisions
- `cosyvoice:v2` (CosyVoice2) and `cosyvoice:v3` (CosyVoice3) share one backend that swaps models on demand
- A lightweight pylogger shim avoids the `lightning.pytorch` (~300MB) dependency, used only for `rank_zero_only` logging

Testing

Model download (~4.6GB) is triggered on first generation.
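The "one backend, swap on demand" design, with the ~4.6GB download deferred to the first generation for a given size, could be sketched like this. The class and loader names are illustrative, not the PR's actual code:

```python
from typing import Callable, Dict

class CosyVoiceBackendSketch:
    """Illustrative: lazily load/swap the v2 or v3 model on first use."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        self._loaders = loaders  # e.g. {"v2": load_v2, "v3": load_v3}
        self._model = None
        self._loaded_size = None

    def ensure_model(self, model_size: str):
        # Load (and, on a cold cache, download) only when the requested
        # size differs from the one currently held in memory.
        if self._loaded_size != model_size:
            self._model = self._loaders[model_size]()
            self._loaded_size = model_size
        return self._model
```

Switching `cosyvoice:v2` ↔ `cosyvoice:v3` then costs one reload, while repeated generations with the same size reuse the cached model.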