feat: add CosyVoice2/3 TTS engine with instruct and voice cloning #311
Integrate Alibaba's CosyVoice2-0.5B and Fun-CosyVoice3-0.5B as a new TTS engine supporting 9 languages, zero-shot voice cloning, and instruct control (emotions, speed, volume, dialects). The CosyVoice source is cloned at setup time into backend/vendors/ since no PyPI package exists. A modelscope→HuggingFace shim redirects model downloads to the public HF repos, and a lightweight pylogger shim avoids pulling in pytorch-lightning as a transitive dependency.

- Backend: cosyvoice_backend.py, __init__.py registry, models.py regex
- Frontend: engine selector, language map, Zod schema, model descriptions
- Infra: requirements.txt, justfile, release.yml, Dockerfile, PyInstaller
📝 Walkthrough

Adds CosyVoice TTS support (v2/v3): frontend options and types, backend CosyVoiceTTSBackend, model registrations and validation, new Python dependencies, and build/CI changes to clone and bundle the CosyVoice source for packaging and runtime.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Frontend
    participant Form as Generation Form
    participant API as Backend API
    participant Factory as Engine Factory
    participant CVBackend as CosyVoice Backend
    participant ModelCache as Model Cache/Loader
    User->>Form: choose cosyvoice + v2|v3
    Form->>API: POST GenerationRequest(engine: cosyvoice, model_size: v2|v3, lang, instruct?)
    API->>Factory: get_tts_backend_for_engine("cosyvoice")
    Factory->>CVBackend: instantiate/load backend
    CVBackend->>ModelCache: ensure model variant loaded (v2/v3)
    ModelCache-->>CVBackend: model ready (cached or loaded)
    API->>CVBackend: generate(text, reference?, seed?)
    CVBackend->>CVBackend: select mode (instruct2 / zero-shot), run inference, assemble chunks
    CVBackend-->>API: return audio numpy @ 24kHz
    API-->>User: stream/return audio
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (2)
Dockerfile (1)
42-42: Pin the CosyVoice checkout to ensure reproducible builds.

Cloning the default branch at build time makes the image non-reproducible and allows upstream changes to affect already-released app revisions. Define one pinned ref (commit SHA, tag, or branch) and reuse it consistently across the Dockerfile (line 42), justfile (lines 55, 89), and .github/workflows/release.yml (lines 66, 194).

♻️ One way to pin the vendor ref
```diff
+ARG COSYVOICE_REF
-RUN git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git /build/CosyVoice
+RUN git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git /build/CosyVoice \
+    && git -C /build/CosyVoice checkout "$COSYVOICE_REF" \
+    && git -C /build/CosyVoice submodule update --init --recursive
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` at line 42, Replace the unpinned git clone with a reproducible, pinned ref: introduce a build-time variable (e.g., ARG or ENV named COSYVOICE_REF) and use it in the git clone command (the RUN git clone --recursive ... line) to checkout a specific commit SHA, tag, or branch, and then reuse that same COSYVOICE_REF variable in the justfile and .github/workflows/release.yml so all places (the Dockerfile's RUN git clone, the justfile entries at the noted lines, and the release workflow steps) reference the identical pinned ref for reproducible builds.

backend/backends/cosyvoice_backend.py (1)
23-23: Consider using modern type hints.

Static analysis flags that `typing.List` and `typing.Tuple` are deprecated in favor of lowercase `list` and `tuple` (available since Python 3.9+).

♻️ Proposed refactor

```diff
-from typing import ClassVar, List, Optional, Tuple
+from typing import ClassVar, Optional
```

Then update usages:

- `List[str]` → `list[str]`
- `Tuple[dict, bool]` → `tuple[dict, bool]`
- `Tuple[np.ndarray, str]` → `tuple[np.ndarray, str]`
- `Tuple[np.ndarray, int]` → `tuple[np.ndarray, int]`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 23, Replace deprecated typing generics with native Python types: change the import line that currently brings in List and Tuple (from typing import ClassVar, List, Optional, Tuple) to drop List and Tuple and use the built-in lowercase forms in annotations; update all annotations in this module that use List[...] to list[...] and Tuple[...] to tuple[...] (e.g., List[str] -> list[str], Tuple[dict, bool] -> tuple[dict, bool], Tuple[np.ndarray, str] -> tuple[np.ndarray, str], Tuple[np.ndarray, int] -> tuple[np.ndarray, int]), and remove List/Tuple from the import list so only needed names like ClassVar and Optional remain (also clean up any now-unused imports).
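The refactor above is mechanical; as a tiny self-contained sketch, here is a signature written with the built-in generics (the function name and body are illustrative stand-ins, not taken from the actual backend):

```python
# Python 3.9+: built-in list/tuple generics replace typing.List / typing.Tuple.
def combine(paths: list[str]) -> tuple[list[float], int]:
    """Illustrative signature only: returns placeholder samples and a sample rate."""
    return [0.0] * len(paths), 24000
```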
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/release.yml:
- Line 66: The git clone command "git clone --recursive --depth 1
https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice" will
fail when its parent directory doesn't exist—add a step to create the parent
directory before cloning (e.g., run a cross-platform create like mkdir -p
backend/vendors or the equivalent Windows-aware command) and apply the same
change to the other clone occurrence of the same command; update the workflow so
the directory creation runs immediately before each git clone of
backend/vendors/CosyVoice.
In `@app/src/components/ServerSettings/ModelManagement.tsx`:
- Around line 69-72: The voiceModels filter used to build the "Voice Generation"
list excludes models that begin with "cosyvoice", so add a condition to the
existing filter predicate in the voiceModels creation (the filter around lines
397-404) to accept model IDs that startWith('cosyvoice') (or match
/^cosyvoice/); update the predicate used where voiceModels is computed so
entries like 'cosyvoice2-0.5b' and 'cosyvoice3-0.5b' are included in the
resulting list shown in the UI.
In `@backend/backends/cosyvoice_backend.py`:
- Around line 263-268: The combine_voice_prompts wrapper in cosyvoice_backend.py
currently forwards only audio_paths and reference_texts to
_combine_voice_prompts; modify the async method combine_voice_prompts to pass
sample_rate=24000 (use COSYVOICE_SAMPLE_RATE if defined) when calling
_combine_voice_prompts so the combined audio is explicitly resampled to 24 kHz;
update the call in combine_voice_prompts to include the sample_rate keyword
argument to match the TADA backend's behavior.
In `@backend/build_binary.py`:
- Around line 254-263: The _ensure_cosyvoice_on_path() function must recognize
PyInstaller frozen runs and use the extracted _MEIPASS paths or skip the
vendor-dir existence check; update _ensure_cosyvoice_on_path() to inspect
getattr(sys, "_MEIPASS", None) and, when present, construct paths to
os.path.join(sys._MEIPASS, "cosyvoice") and os.path.join(sys._MEIPASS, "matcha")
(or skip raising RuntimeError) instead of insisting on
backend/vendors/CosyVoice; ensure the same logic aligns with the
backend/server.py shim behavior so frozen binaries don't raise on startup.
---
Nitpick comments:
In `@backend/backends/cosyvoice_backend.py`:
- Line 23: Replace deprecated typing generics with native Python types: change
the import line that currently brings in List and Tuple (from typing import
ClassVar, List, Optional, Tuple) to drop List and Tuple and use the built-in
lowercase forms in annotations; update all annotations in this module that use
List[...] to list[...] and Tuple[...] to tuple[...] (e.g., List[str] ->
list[str], Tuple[dict, bool] -> tuple[dict, bool], Tuple[np.ndarray, str] ->
tuple[np.ndarray, str], Tuple[np.ndarray, int] -> tuple[np.ndarray, int]), and
remove List/Tuple from the import list so only needed names like ClassVar and
Optional remain (also clean up any now-unused imports).
In `@Dockerfile`:
- Line 42: Replace the unpinned git clone with a reproducible, pinned ref:
introduce a build-time variable (e.g., ARG or ENV named COSYVOICE_REF) and use
it in the git clone command (the RUN git clone --recursive ... line) to checkout
a specific commit SHA, tag, or branch, and then reuse that same COSYVOICE_REF
variable in the justfile and .github/workflows/release.yml so all places (the
Dockerfile's RUN git clone, the justfile entries at the noted lines, and the
release workflow steps) reference the identical pinned ref for reproducible
builds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4db7c9ed-55cf-4484-8442-2b3a3447f384
📒 Files selected for processing (15)
- .github/workflows/release.yml
- .gitignore
- Dockerfile
- app/src/components/Generation/EngineModelSelector.tsx
- app/src/components/ServerSettings/ModelManagement.tsx
- app/src/lib/api/types.ts
- app/src/lib/constants/languages.ts
- app/src/lib/hooks/useGenerationForm.ts
- backend/backends/__init__.py
- backend/backends/cosyvoice_backend.py
- backend/build_binary.py
- backend/models.py
- backend/requirements.txt
- backend/server.py
- justfile
```sh
pip install -r backend/requirements.txt
pip install --no-deps chatterbox-tts
pip install --no-deps hume-tada
git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice
```
Create backend/vendors before cloning.
Fresh checkouts do not contain backend/vendors (it is ignored by .gitignore), and git clone ... backend/vendors/CosyVoice will fail when the parent path is missing. This breaks both release jobs before the server build starts.
🛠️ Cross-platform fix for both clone sites
```diff
   pip install --no-deps chatterbox-tts
   pip install --no-deps hume-tada
+  python -c "from pathlib import Path; Path('backend/vendors').mkdir(parents=True, exist_ok=True)"
   git clone --recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git backend/vendors/CosyVoice
```

Also applies to: 194-194
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/release.yml at line 66, The git clone command "git clone
--recursive --depth 1 https://github.com/FunAudioLLM/CosyVoice.git
backend/vendors/CosyVoice" will fail when its parent directory doesn't exist—add
a step to create the parent directory before cloning (e.g., run a cross-platform
create like mkdir -p backend/vendors or the equivalent Windows-aware command)
and apply the same change to the other clone occurrence of the same command;
update the workflow so the directory creation runs immediately before each git
clone of backend/vendors/CosyVoice.
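The cross-platform fix above leans on `pathlib.Path.mkdir(parents=True, exist_ok=True)` being safe to run repeatedly. A minimal helper illustrating that behavior (the function name is ours, not from the workflow):

```python
from pathlib import Path

def ensure_dir(path: str) -> Path:
    """Create the directory (and any missing parents); safe to call twice."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return p
```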
```python
async def combine_voice_prompts(
    self,
    audio_paths: List[str],
    reference_texts: List[str],
) -> Tuple[np.ndarray, str]:
    return await _combine_voice_prompts(audio_paths, reference_texts)
```
Pass sample_rate=24000 to ensure consistent audio resampling.
The combine_voice_prompts call omits the sample_rate parameter. CosyVoice expects 24 kHz audio (as defined by COSYVOICE_SAMPLE_RATE). Without explicit resampling, combined reference audio may be at the wrong sample rate if source files differ.
The TADA backend (which also uses 24 kHz) explicitly passes sample_rate=24000 in its combine_voice_prompts implementation.
🔧 Proposed fix
```diff
 async def combine_voice_prompts(
     self,
     audio_paths: List[str],
     reference_texts: List[str],
 ) -> Tuple[np.ndarray, str]:
-    return await _combine_voice_prompts(audio_paths, reference_texts)
+    return await _combine_voice_prompts(audio_paths, reference_texts, sample_rate=COSYVOICE_SAMPLE_RATE)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
async def combine_voice_prompts(
    self,
    audio_paths: List[str],
    reference_texts: List[str],
) -> Tuple[np.ndarray, str]:
    return await _combine_voice_prompts(audio_paths, reference_texts, sample_rate=COSYVOICE_SAMPLE_RATE)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 263 - 268, The
combine_voice_prompts wrapper in cosyvoice_backend.py currently forwards only
audio_paths and reference_texts to _combine_voice_prompts; modify the async
method combine_voice_prompts to pass sample_rate=24000 (use
COSYVOICE_SAMPLE_RATE if defined) when calling _combine_voice_prompts so the
combined audio is explicitly resampled to 24 kHz; update the call in
combine_voice_prompts to include the sample_rate keyword argument to match the
TADA backend's behavior.
```python
# Bundle the vendored CosyVoice source tree for frozen builds.
# The clone lives at backend/vendors/CosyVoice/ at build time.
cosyvoice_vendor = backend_dir / "vendors" / "CosyVoice"
if cosyvoice_vendor.exists():
    args.extend([
        "--add-data",
        f"{cosyvoice_vendor / 'cosyvoice'}{os.pathsep}cosyvoice",
        "--add-data",
        f"{cosyvoice_vendor / 'third_party' / 'Matcha-TTS' / 'matcha'}{os.pathsep}matcha",
    ])
```
_ensure_cosyvoice_on_path() must detect frozen builds and use extracted paths.
When PyInstaller bundles the binary, --add-data extracts CosyVoice to _MEIPASS/cosyvoice and Matcha to _MEIPASS/matcha. However, _ensure_cosyvoice_on_path() still checks for backend/vendors/CosyVoice and raises RuntimeError if not found—this path does not exist in frozen binaries. The backend/server.py shim only adds to sys.path; it cannot prevent the error. Update _ensure_cosyvoice_on_path() to check for the frozen environment (via getattr(sys, '_MEIPASS', None)) and use the extracted paths instead, or skip the existence check when running frozen.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/build_binary.py` around lines 254 - 263, The
_ensure_cosyvoice_on_path() function must recognize PyInstaller frozen runs and
use the extracted _MEIPASS paths or skip the vendor-dir existence check; update
_ensure_cosyvoice_on_path() to inspect getattr(sys, "_MEIPASS", None) and, when
present, construct paths to os.path.join(sys._MEIPASS, "cosyvoice") and
os.path.join(sys._MEIPASS, "matcha") (or skip raising RuntimeError) instead of
insisting on backend/vendors/CosyVoice; ensure the same logic aligns with the
backend/server.py shim behavior so frozen binaries don't raise on startup.
- Model filter in ModelManagement was an allowlist missing the cosyvoice prefix; replaced with an inverse filter (everything except whisper is a voice model)
- Shim training-only modules (cosyvoice.dataset.processor, matcha.utils.*) to prevent hyperpyyaml from importing pyarrow, pyworld, lightning, etc.
- Patch torchaudio.load for 2.9+ compat (torchcodec is now required by default)
- Add matplotlib to requirements (matcha's hifigan imports it at module level)
Actionable comments posted: 2
🧹 Nitpick comments (5)
backend/backends/cosyvoice_backend.py (4)
183-185: Parameter `format` shadows a Python builtin.

The `format` parameter name shadows the Python builtin. While this is intentional for API compatibility with `torchaudio.load`'s signature, consider using `format_` or `audio_format` with a local alias if this causes linting noise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` around lines 183 - 185, The parameter name format in function _sf_load shadows the Python builtin; rename the parameter to format_ (or audio_format) and update the function body to use the new name (or create a local alias format = format_ at top of _sf_load) so external API compatibility is preserved while removing the builtin shadowing; also update any internal calls to _sf_load or usages within the same file to pass the new parameter name or keep backwards-compatible kwargs mapping if needed.
98-98: Replace lambda assignment with a `def` statement.

Per static analysis E731, avoid assigning lambda expressions to variables. Use a `def` statement instead for clarity and better stack traces.

♻️ Proposed fix

```diff
-_noop = lambda *a, **kw: None
+def _noop(*a, **kw):
+    pass
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 98, The assignment "_noop = lambda *a, **kw: None" should be replaced with a real function definition to satisfy E731; define a function named _noop with the same signature (accepting *a, **kw) that returns None (or simply contains a pass) and use that in place of the lambda wherever referenced (e.g., any calls to _noop in cosyvoice_backend.py) so stack traces and static analysis are improved.
425-426: Clarify the empty-chunks fallback.

The fallback returns `np.zeros(COSYVOICE_SAMPLE_RATE, ...)`, which creates 24000 samples (1 second of silence). Using `COSYVOICE_SAMPLE_RATE` as a length is semantically confusing since it is a rate constant. Consider adding a brief comment or using a named constant for clarity.

♻️ Proposed fix

```diff
 # Concatenate all chunks
 if not audio_chunks:
-    return np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE
+    # Return 1 second of silence as fallback
+    return np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` around lines 425 - 426, The fallback that returns np.zeros(COSYVOICE_SAMPLE_RATE, dtype=np.float32), COSYVOICE_SAMPLE_RATE is confusing because COSYVOICE_SAMPLE_RATE is a rate, not a length; update the empty-chunks fallback to return a clearly named silence buffer (e.g., SILENCE_ONE_SECOND = int(COSYVOICE_SAMPLE_RATE * 1) or SAMPLE_LENGTH_ONE_SEC) and/or add a concise comment above the return explaining it returns one second of silence, then use that named constant in the np.zeros call and keep returning COSYVOICE_SAMPLE_RATE for the sample rate; reference the audio_chunks check and the COSYVOICE_SAMPLE_RATE and np.zeros usage to locate the change.
23-23: Use built-in `list` and `tuple` instead of deprecated `typing` imports.

Per static analysis, `typing.List` and `typing.Tuple` are deprecated in Python 3.9+. Use the built-in generics instead.

♻️ Proposed fix

```diff
-from typing import ClassVar, List, Optional, Tuple
+from typing import ClassVar, Optional
```

Then update usages:

- `List[str]` → `list[str]`
- `Tuple[dict, bool]` → `tuple[dict, bool]`
- `Tuple[np.ndarray, str]` → `tuple[np.ndarray, str]`
- `Tuple[np.ndarray, int]` → `tuple[np.ndarray, int]`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/cosyvoice_backend.py` at line 23, Replace the deprecated typing imports in the import line (remove List and Tuple from "from typing import ClassVar, List, Optional, Tuple") and update all type hints in this module that use List[...] and Tuple[...] to the built-in generics list[...] and tuple[...]; specifically change occurrences like List[str] → list[str], Tuple[dict, bool] → tuple[dict, bool], Tuple[np.ndarray, str] → tuple[np.ndarray, str], and Tuple[np.ndarray, int] → tuple[np.ndarray, int], ensuring any references to those type names in functions, methods or class annotations in cosyvoice_backend.py are updated accordingly.

backend/requirements.txt (1)
43-51: Consider pinning versions for unpinned dependencies.

`tiktoken`, `einops`, `inflect`, and `matplotlib` lack version constraints. While this is common for transitive dependencies, it can lead to reproducibility issues or unexpected breakages when upstream releases incompatible versions.

If strict reproducibility is desired, consider adding minimum version pins based on your tested versions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/requirements.txt` around lines 43 - 51, The requirements file lists tiktoken, einops, inflect, and matplotlib without version constraints which risks breakage; update those entries in the requirements.txt (the lines containing the package names tiktoken, einops, inflect, matplotlib) to include explicit version constraints—either minimum tested versions using >= (e.g., tiktoken>=X.Y.Z) or exact pins (==X.Y.Z) based on the versions you validated—and then run the project's test suite/CI and regenerate any lock file to ensure reproducibility; also add a short comment next to each entry indicating the tested version you pinned to.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/backends/cosyvoice_backend.py`:
- Around line 274-298: The current temporary global patch of torch.load in the
CosyVoice model-loading block is unsafe across concurrent CosyVoiceTTSBackend
instances because other threads can see the patched function during the brief
window; to fix, move the torch.load patch/restore so it is performed while
holding the same class-level _import_lock (or create a dedicated lock like
_torch_load_patch_lock) used for other global patches, or convert the patch into
a one-time persistent patch under _import_lock so it is not restored per-load;
update the code around torch.load, _orig_torch_load, and the try/finally that
surrounds CosyVoice2/CosyVoice3 construction to use that lock strategy (or
persistent patch) to eliminate the race with _model_load_lock and concurrent
instances.
- Around line 412-422: The fallback branch currently calls
self.model.inference_cross_lingual with prompt_wav=ref_audio or "" which
silently proceeds when ref_audio is missing; instead detect the missing
reference (ref_audio is None or empty) and raise a descriptive exception (e.g.,
ValueError) before calling inference_cross_lingual; update the else branch where
logger.info("[CosyVoice] cross_lingual fallback: lang=%s", language) appears to
check ref_audio and raise the error, leaving inference_cross_lingual calls only
when a valid ref_audio exists so the model method isn’t invoked with an empty
string.
---
Nitpick comments:
In `@backend/backends/cosyvoice_backend.py`:
- Around line 183-185: The parameter name format in function _sf_load shadows
the Python builtin; rename the parameter to format_ (or audio_format) and update
the function body to use the new name (or create a local alias format = format_
at top of _sf_load) so external API compatibility is preserved while removing
the builtin shadowing; also update any internal calls to _sf_load or usages
within the same file to pass the new parameter name or keep backwards-compatible
kwargs mapping if needed.
- Line 98: The assignment "_noop = lambda *a, **kw: None" should be replaced
with a real function definition to satisfy E731; define a function named _noop
with the same signature (accepting *a, **kw) that returns None (or simply
contains a pass) and use that in place of the lambda wherever referenced (e.g.,
any calls to _noop in cosyvoice_backend.py) so stack traces and static analysis
are improved.
- Around line 425-426: The fallback that returns np.zeros(COSYVOICE_SAMPLE_RATE,
dtype=np.float32), COSYVOICE_SAMPLE_RATE is confusing because
COSYVOICE_SAMPLE_RATE is a rate, not a length; update the empty-chunks fallback
to return a clearly named silence buffer (e.g., SILENCE_ONE_SECOND =
int(COSYVOICE_SAMPLE_RATE * 1) or SAMPLE_LENGTH_ONE_SEC) and/or add a concise
comment above the return explaining it returns one second of silence, then use
that named constant in the np.zeros call and keep returning
COSYVOICE_SAMPLE_RATE for the sample rate; reference the audio_chunks check and
the COSYVOICE_SAMPLE_RATE and np.zeros usage to locate the change.
- Line 23: Replace the deprecated typing imports in the import line (remove List
and Tuple from "from typing import ClassVar, List, Optional, Tuple") and update
all type hints in this module that use List[...] and Tuple[...] to the built-in
generics list[...] and tuple[...]; specifically change occurrences like
List[str] → list[str], Tuple[dict, bool] → tuple[dict, bool], Tuple[np.ndarray,
str] → tuple[np.ndarray, str], and Tuple[np.ndarray, int] → tuple[np.ndarray,
int], ensuring any references to those type names in functions, methods or class
annotations in cosyvoice_backend.py are updated accordingly.
In `@backend/requirements.txt`:
- Around line 43-51: The requirements file lists tiktoken, einops, inflect, and
matplotlib without version constraints which risks breakage; update those
entries in the requirements.txt (the lines containing the package names
tiktoken, einops, inflect, matplotlib) to include explicit version
constraints—either minimum tested versions using >= (e.g., tiktoken>=X.Y.Z) or
exact pins (==X.Y.Z) based on the versions you validated—and then run the
project's test suite/CI and regenerate any lock file to ensure reproducibility;
also add a short comment next to each entry indicating the tested version you
pinned to.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4e6acd47-425c-40be-876e-33b16dcfec95
📒 Files selected for processing (3)
- app/src/components/ServerSettings/ModelManagement.tsx
- backend/backends/cosyvoice_backend.py
- backend/requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- app/src/components/ServerSettings/ModelManagement.tsx
```python
# 3. Patch torch.load to force map_location on CPU
import torch

if device == "cpu":
    _orig_torch_load = torch.load

    def _patched_load(*args, **kwargs):
        kwargs.setdefault("map_location", "cpu")
        return _orig_torch_load(*args, **kwargs)

    torch.load = _patched_load

try:
    if variant == "v2":
        from cosyvoice.cli.cosyvoice import CosyVoice2

        model = CosyVoice2(hf_repo)
    else:
        from cosyvoice.cli.cosyvoice import CosyVoice3

        model = CosyVoice3(hf_repo)
finally:
    # Restore original torch.load
    if device == "cpu":
        torch.load = _orig_torch_load
```
Thread-safety concern with global torch.load patch.
The torch.load patch temporarily modifies a global function. While _model_load_lock serializes calls within a single instance, multiple CosyVoiceTTSBackend instances loading concurrently could race on the patch/restore sequence.
This is likely acceptable if backends are singletons, but consider moving the patch inside the class-level _import_lock block (applied once, never restored) for consistency with the other patches, or use a dedicated lock for this patch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 274 - 298, The current
temporary global patch of torch.load in the CosyVoice model-loading block is
unsafe across concurrent CosyVoiceTTSBackend instances because other threads can
see the patched function during the brief window; to fix, move the torch.load
patch/restore so it is performed while holding the same class-level _import_lock
(or create a dedicated lock like _torch_load_patch_lock) used for other global
patches, or convert the patch into a one-time persistent patch under
_import_lock so it is not restored per-load; update the code around torch.load,
_orig_torch_load, and the try/finally that surrounds CosyVoice2/CosyVoice3
construction to use that lock strategy (or persistent patch) to eliminate the
race with _model_load_lock and concurrent instances.
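The "apply once under a lock, never restore" alternative suggested above can be sketched in isolation. This is hedged: `_patch_lock` stands in for the class-level `_import_lock` named in the comment, and a dummy stand-in plays the role of the torch module so the pattern is runnable without torch installed:

```python
import threading

_patch_lock = threading.Lock()  # stand-in for the class-level _import_lock
_patched = False                # applied-once flag, guarded by the lock

class _FakeTorch:
    """Dummy stand-in for the torch module; load() echoes its arguments."""
    @staticmethod
    def load(path, **kwargs):
        return {"path": path, **kwargs}

torch = _FakeTorch()

def ensure_cpu_map_location_patch():
    """Idempotently wrap torch.load so map_location defaults to 'cpu'.

    The patch is applied at most once and never restored, so concurrent
    backend instances can never observe a half-patched/half-restored state.
    """
    global _patched
    with _patch_lock:
        if _patched:
            return
        orig_load = torch.load

        def _patched_load(*args, **kwargs):
            kwargs.setdefault("map_location", "cpu")
            return orig_load(*args, **kwargs)

        torch.load = _patched_load
        _patched = True
```

Calls that already pass `map_location` are untouched; only the default changes, which matches the behavior of the per-load patch in the diff above without the restore race.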
```python
else:
    # cross-lingual (no reference audio, shouldn't normally happen
    # in voicebox since profiles always have samples, but handle it)
    logger.info("[CosyVoice] cross_lingual fallback: lang=%s", language)
    for chunk in self.model.inference_cross_lingual(
        tts_text=text,
        prompt_wav=ref_audio or "",
        stream=False,
        speed=1.0,
    ):
        audio_chunks.append(chunk["tts_speech"])
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for inference_cross_lingual signature and implementation in CosyVoice source
ast-grep --pattern $'def inference_cross_lingual($$$) {
$$$
}'Repository: jamiepine/voicebox
Length of output: 44
🏁 Script executed:
# First, check the imports and understand CosyVoice usage
head -50 backend/backends/cosyvoice_backend.pyRepository: jamiepine/voicebox
Length of output: 1508
🏁 Script executed:
# Search for any other inference_cross_lingual calls in the codebase
rg "inference_cross_lingual" --type pyRepository: jamiepine/voicebox
Length of output: 164
🏁 Script executed:
# Look for CosyVoice library imports and source
rg "from cosyvoice|import cosyvoice|CosyVoice" --type py -A 2Repository: jamiepine/voicebox
Length of output: 10070
🏁 Script executed:
# Check if CosyVoice is in requirements or setup files
fd -e "requirements*.txt" -e "setup.py" -e "pyproject.toml" -e "poetry.lock" | xargs grep -l -i cosyvoice 2>/dev/null || echo "No CosyVoice found in dependency files"Repository: jamiepine/voicebox
Length of output: 100
🏁 Script executed:
# Get lines around the cross_lingual call to see ref_audio handling
sed -n '345,433p' backend/backends/cosyvoice_backend.py | cat -nRepository: jamiepine/voicebox
Length of output: 4210
🏁 Script executed:
# Look for where ref_audio is assigned in the generate method
sed -n '345,433p' backend/backends/cosyvoice_backend.py | grep -n "ref_audio"Repository: jamiepine/voicebox
Length of output: 568
🏁 Script executed:
# Check if CosyVoice vendor exists and find the inference_cross_lingual method
find backend/vendors -name "*.py" 2>/dev/null | head -20Repository: jamiepine/voicebox
Length of output: 44
🏁 Script executed:
# Search for inference_cross_lingual in vendor directory
rg "def inference_cross_lingual" backend/vendors/ 2>/dev/null || echo "Not found in vendors"

Repository: jamiepine/voicebox
Length of output: 82
🏁 Script executed:
# Check if there's any documentation or README about prompt_wav parameter
find backend/vendors -name "README*" -o -name "*.md" | xargs grep -l "prompt_wav\|inference_cross_lingual" 2>/dev/null || echo "No docs found"

Repository: jamiepine/voicebox
Length of output: 143
🏁 Script executed:
# Check setup/build scripts that might mention CosyVoice and how it's set up
find . -name "setup.py" -o -name "Justfile" -o -name "justfile" | xargs grep -A 5 -B 5 "cosyvoice\|CosyVoice" 2>/dev/null | head -50

Repository: jamiepine/voicebox
Length of output: 1769
🏁 Script executed:
# Search for any comments or documentation about the inference methods
rg "inference_cross_lingual|inference_zero_shot|inference_instruct2" -B 3 -A 3 backend/ --type py | head -100

Repository: jamiepine/voicebox
Length of output: 3234
🌐 Web query:
CosyVoice inference_cross_lingual prompt_wav parameter documentation
💡 Result:
The inference_cross_lingual method in CosyVoice is used for cross-lingual zero-shot voice cloning and synthesis, where the target text (tts_text) is synthesized in a potentially different language from the reference prompt audio. Unlike inference_zero_shot, it does not pass prompt text to the LLM, decoupling textual content from the reference audio to focus on speaker timbre, prosody, and paralinguistic features (e.g., laughter) from the audio alone. This design improves control over non-verbal elements.

Signature (from examples and code snippets):

    inference_cross_lingual(tts_text: str, prompt_speech_16k: torch.Tensor, zero_shot_spk_id: str = '', stream: bool = False, speed: float = 1.0, text_frontend: bool = True) -> Generator

Parameters:
- tts_text: The text to synthesize. Prefix with language tokens like <|en|>, <|zh|>, <|ja|>, etc., for multilingual support. Can include special controls like [laughter].
- prompt_wav (or prompt_speech_16k): Reference audio waveform (resampled to 16kHz, max ~30s) providing speaker voice, style, and prosody. Loaded via load_wav(prompt_wav_path, 16000).
- zero_shot_spk_id: Optional predefined speaker ID (rarely used in cross-lingual mode).
- stream: If True, yields audio chunks for streaming inference.
- speed: Speech speed multiplier (e.g., 1.0 default, 0.5-1.5 range).
- text_frontend: Enable text normalization/pronunciation (uses ttsfrd or WeTextProcessing).

Usage example:

    from cosyvoice.cli.cosyvoice import CosyVoice
    from cosyvoice.utils.file_utils import load_wav
    import torchaudio

    cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
    prompt_speech_16k = load_wav('prompt.wav', 16000)
    for i, output in enumerate(cosyvoice.inference_cross_lingual('<|en|>Sample English text.', prompt_speech_16k)):
        torchaudio.save(f'output_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)

Requires CosyVoice-300M (or CosyVoice2-0.5B+) model.
Frontend extracts speech tokens, speaker embedding, and features from prompt_wav but omits prompt text tokens for LLM.
Citations:
- 1: https://github.com/FunAudioLLM/CosyVoice
- 2: https://github.com/PasiKoodaa/CosyVoice-optimized
- 3: https://huggingface.co/FunAudioLLM/CosyVoice-300M
- 4: https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/cli/frontend.py
- 5: https://readmex.com/FunAudioLLM/CosyVoice/page-78571fbeb-6359-48df-96cd-da6e335c8dc9
- 6: https://github.com/FunAudioLLM/CosyVoice/blob/master/cosyvoice/cli/model.py
- 7: https://git.citory.tech/deepgeek/CosyVoice/src/branch/main/example.py
Raise an error when reference audio is unavailable instead of passing an empty string to `inference_cross_lingual`.
The inference_cross_lingual method requires reference audio to extract speaker timbre, prosody, and voice characteristics. Passing an empty string at line 74 will fail or produce generic speech without the intended speaker cloning. Since the code comment notes this fallback "shouldn't normally happen in voicebox since profiles always have samples," missing reference audio indicates a configuration problem that should be surfaced to the user rather than silently degraded. Consider raising a descriptive error in the else branch instead of attempting inference without audio.
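A minimal sketch of the suggested guard, with `model`, `text`, `ref_audio`, and `audio_chunks` assumed from the snippet under review (not the PR's exact code):

```python
def cross_lingual_fallback(model, text, ref_audio, audio_chunks):
    """Run CosyVoice cross-lingual inference, refusing to proceed without reference audio."""
    if not ref_audio:
        # Surface the misconfiguration instead of silently producing generic speech.
        raise ValueError(
            "CosyVoice cross-lingual inference requires reference audio; "
            "the selected voice profile has no sample"
        )
    for chunk in model.inference_cross_lingual(
        tts_text=text,
        prompt_wav=ref_audio,
        stream=False,
        speed=1.0,
    ):
        audio_chunks.append(chunk["tts_speech"])
```

With this shape, `inference_cross_lingual` is only ever called with a valid `prompt_wav`, and a missing sample fails loudly at the API boundary.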
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/cosyvoice_backend.py` around lines 412 - 422, The fallback
branch currently calls self.model.inference_cross_lingual with
prompt_wav=ref_audio or "" which silently proceeds when ref_audio is missing;
instead detect the missing reference (ref_audio is None or empty) and raise a
descriptive exception (e.g., ValueError) before calling inference_cross_lingual;
update the else branch where logger.info("[CosyVoice] cross_lingual fallback:
lang=%s", language) appears to check ref_audio and raise the error, leaving
inference_cross_lingual calls only when a valid ref_audio exists so the model
method isn’t invoked with an empty string.
In my first tests of this model, output is horrible. Not passing the quality bar for the next update. Maybe I'm configuring it wrong, but I'll choose other models to add instead for 0.4.0.
Summary

- New TTS engine (`cosyvoice`) with 9-language support, zero-shot voice cloning, and instruct control (emotions, speed, volume, dialects)
- CosyVoice source is cloned at setup time (`just setup-python`) since no PyPI package exists — a modelscope→HuggingFace shim redirects downloads to public HF repos

Changes
Backend (5 files)

- `backend/backends/cosyvoice_backend.py`: new backend using `inference_instruct2()` + `inference_zero_shot()`
- `backend/backends/__init__.py`: `TTS_ENGINES` registry, add 2 `ModelConfig` entries (v2/v3), factory branch, multi-size dispatch
- `backend/models.py`: engine value (`cosyvoice`) and model_size regex (`v2|v3`)
- `backend/build_binary.py`: `--collect-all` for hyperpyyaml/onnxruntime/whisper/tiktoken, `--add-data` for vendored source
- `backend/server.py`

Frontend (5 files)
- `app/src/lib/api/types.ts`: add `'cosyvoice'` to engine union, `'v2' | 'v3'` to model_size
- `app/src/lib/constants/languages.ts`: `cosyvoice` language list (zh, en, ja, ko, de, fr, ru, es, it)
- `app/src/components/Generation/EngineModelSelector.tsx`
- `app/src/lib/hooks/useGenerationForm.ts`
- `app/src/components/ServerSettings/ModelManagement.tsx`

Infra (4 files)
- `backend/requirements.txt`
- `justfile`: `git clone --recursive --depth 1` in Unix + Windows `setup-python`
- `.github/workflows/release.yml`
- `Dockerfile`
- `.gitignore`: ignore `backend/vendors/`

Design Decisions
- `cosyvoice:v2` (CosyVoice2) and `cosyvoice:v3` (CosyVoice3) share one backend that swaps models on demand
- A lightweight pylogger shim avoids the `lightning.pytorch` (~300MB) dependency, used only for `rank_zero_only` logging

Testing

Model download (~4.6GB) is triggered on first generation.
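The "one backend, swap on demand" design, with the ~4.6GB download deferred to the first generation for a given size, could be sketched like this. The class and loader names are illustrative, not the PR's actual code:

```python
from typing import Callable, Dict

class CosyVoiceBackendSketch:
    """Illustrative: lazily load/swap the v2 or v3 model on first use."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        self._loaders = loaders  # e.g. {"v2": load_v2, "v3": load_v3}
        self._model = None
        self._loaded_size = None

    def ensure_model(self, model_size: str):
        # Load (and, on a cold cache, download) only when the requested
        # size differs from the one currently held in memory.
        if self._loaded_size != model_size:
            self._model = self._loaders[model_size]()
            self._loaded_size = model_size
        return self._model
```

Switching `cosyvoice:v2` ↔ `cosyvoice:v3` then costs one reload, while repeated generations with the same size reuse the cached model.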