On-device text-to-speech using CosyVoice 3 (Alibaba/Tongyi) running entirely on iOS via ONNX Runtime. No server, no API calls — the full inference pipeline runs locally on the device.
Status: Work in progress. The 3-stage pipeline runs end-to-end on-device and produces audio. Output quality is still being tuned.
Type Chinese text, pick a dialect voice, and generate speech — all on-device:
- Text input — Mandarin, Cantonese, or any Chinese dialect text
- Voice selection — 8 bundled voices across 5 dialects, cloned from short reference clips
- On-device generation — Full CosyVoice 3 inference via 12 ONNX models (~500MB)
- Audio output — 24kHz 16-bit WAV, playable and exportable
Zero-shot voice cloning: any voice can be added from a 2-6 second audio clip. No fine-tuning, no retraining.
CosyVoice 3 is a 3-stage neural TTS system. This project implements all 3 stages in Swift using ONNX Runtime, running the CosyVoice 3 ONNX models on iOS. The Swift code was written by translating the reference Python implementation.
Text ("你好世界") + Voice Data (speaker embedding + prompt audio)
|
v
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: LLM (Qwen2-0.5B) │
│ │
│ BPE tokenize text → embed → concat [SOS, text, TASK, │
│ prompt_speech] → autoregressive decode with KV cache → │
│ top-k sample speech tokens until EOS │
│ │
│ Input: text tokens + prompt speech tokens │
│ Output: ~N discrete speech token IDs (codebook size 6561) │
└──────────────────────┬──────────────────────────────────────┘
│ speech tokens [N]
v
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Flow Matching │
│ │
│ Token embed → pre-lookahead conv → repeat_interleave → │
│ build mu [80, melLen] + conds (prompt mel) → │
│ 10-step Euler ODE with flow estimator → │
│ strip prompt mel portion │
│ │
│ Input: prompt_tokens + speech_tokens, speaker embedding │
│ Output: mel spectrogram [80, melLen2] │
└──────────────────────┬──────────────────────────────────────┘
│ mel spectrogram
v
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: HiFT Vocoder │
│ │
│ F0 predictor → source generator → STFT(source) → │
│ HiFT decoder(mel, source_stft) → magnitude + phase → │
│ iSTFT with overlap-add → clip to [-0.99, 0.99] │
│ │
│ Input: mel spectrogram [80, melLen] │
│ Output: PCM Float32 audio @ 24kHz │
└──────────────────────┬──────────────────────────────────────┘
│
v
WAV encoder → 16-bit PCM file
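The final WAV-encoding step can be sketched in Python using only the standard library (a minimal mono sketch; the Swift `WAVEncoder.swift` implementation is separate, and `write_wav_16bit` is a hypothetical name):

```python
import struct
import wave

def write_wav_16bit(path, samples, sample_rate=24000):
    """Convert Float32 PCM in [-1, 1] to a 16-bit little-endian mono WAV file."""
    # Scale to the int16 range; the pipeline already clips audio to [-0.99, 0.99]
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate) # 24kHz output
        w.writeframes(struct.pack("<%dh" % len(ints), *ints))
```

The `<` in the pack format forces little-endian sample order, which is what the WAV format requires regardless of host byte order.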
File: App/Services/TTS/LLMInference.swift
The LLM is a Qwen2-0.5B variant fine-tuned to generate discrete speech tokens from text. It uses 5 ONNX model files:
| Model | Purpose | I/O |
|---|---|---|
| `text_embedding_fp32` | BPE token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_speech_embedding_fp16` | Speech token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_backbone_initial_fp16` | Initial forward pass (prefill) | `[1, totalSeq, 896]` → hidden + KV cache |
| `llm_backbone_decode_fp16` | Autoregressive decode (one token at a time) | `[1, 1, 896]` + KV cache → hidden + new KV cache |
| `llm_decoder_fp16` | Hidden state → logits over speech codebook | `[1, 1, 896]` → `[1, 6761]` |
Input construction:
[SOS(1×896), textEmb(combinedLen×896), TASK(1×896), promptSpeech(promptLen×896)]
Where combinedLen = BPE(promptText) + BPE(targetText). The prompt text must exactly match the reference audio transcript — misalignment causes the model to generate wrong content.
Special token IDs:
- SOS: 6561 (start of speech)
- EOS: 6562 (end of speech — stops generation)
- TASK: 6563 (zero-shot task identifier)
Decoding: Top-k sampling (k=25) with softmax. Max tokens = max(200, min(4000, targetTextTokens * 40)). Each decode step appends to the KV cache.
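The top-k sampling step described above can be sketched in pure Python (an illustrative stand-in for the Swift decode loop; `sample_top_k` is a hypothetical name):

```python
import math
import random

def sample_top_k(logits, k=25, rng=random):
    """Top-k sampling: softmax over the k highest logits, sample one index."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to those k entries
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    # Inverse-CDF sampling over the renormalized top-k distribution
    r = rng.random() * total
    acc = 0.0
    for idx, e in zip(top, exps):
        acc += e
        if r <= acc:
            return idx
    return top[-1]
```

Restricting the softmax to the top k entries (rather than softmaxing the full codebook and zeroing the rest) is numerically equivalent after renormalization and avoids exponentiating 6761 logits per step.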
FP16 KV cache handling: The backbone models output FP16 KV cache tensors. The implementation preserves the actual ONNX tensor element type and shape metadata across decode steps — blindly wrapping FP16 bytes as FP32 would corrupt every subsequent step. See KVCacheState struct.
File: App/Services/TTS/FlowVocoderInference.swift (flow section)
Conditional flow matching converts discrete speech tokens into a continuous mel spectrogram. Uses 4 ONNX models:
| Model | Purpose |
|---|---|
| `flow_token_embedding` | Speech tokens → `[1, totalSeq, 80]` embeddings |
| `flow_pre_lookahead` | Conv + repeat_interleave → `[1, melLen, 80]` |
| `flow_speaker_projection` | L2-normalized 192-dim speaker embedding → `[1, 80]` projection |
| `flow.decoder.estimator` | Velocity field estimation for the ODE solver |
Key details:
- Token-mel ratio = 2: each speech token produces 2 mel frames, so `melLen = totalTokens × 2`.
- Prompt mel conditioning: the first `melLen1 = promptTokens × 2` mel frames are filled with the prompt audio's mel spectrogram (linearly resized if frame counts differ); the rest is zero.
- Euler ODE solver: 10 steps, `dt = 0.1`. Initializes from Gaussian noise and iteratively denoises using the velocity field:

      x₀ = N(0, 1)
      for t in [0.0, 0.1, ..., 0.9]:
          v = estimator(x, mask, mu, t, spks, cond)
          x = x + v × dt

- Batch=2 constraint: the flow estimator ONNX model was exported with a hardcoded batch size of 2. All inputs are duplicated and only the first batch of the output is used.
- Prompt stripping: after flow matching, the first `melLen1` mel frames (the prompt portion) are removed; the output contains only the generated portion.
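The Euler ODE integration above can be sketched with numpy (an illustrative sketch; `estimator` stands in for the `flow.decoder.estimator` ONNX session call, and `euler_flow` is a hypothetical name):

```python
import numpy as np

def euler_flow(estimator, mu, spks, cond, steps=10, seed=0):
    """Fixed-step Euler integration of the flow-matching velocity field.

    `estimator` must return a velocity array with the same shape as x;
    in the real pipeline it is the flow.decoder.estimator ONNX model.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(mu.shape).astype(np.float32)  # x ~ N(0, 1)
    dt = 1.0 / steps                                      # dt = 0.1 for 10 steps
    for i in range(steps):
        t = i * dt                                        # t = 0.0, 0.1, ..., 0.9
        v = estimator(x, mu, t, spks, cond)               # velocity field at (x, t)
        x = x + v * dt                                    # Euler step
    return x
```

With 10 steps the integration is cheap (10 estimator calls per generation) at the cost of some discretization error, which is the standard trade-off for flow-matching TTS inference.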
File: App/Services/TTS/FlowVocoderInference.swift (vocoder section)
HiFT (Harmonic-plus-noise with iSTFT) converts mel spectrograms to audio waveforms. Uses 3 ONNX models:
| Model | Purpose |
|---|---|
| `hift_f0_predictor_fp32` | Mel `[1, 80, melLen]` → F0 `[1, melLen]` |
| `hift_source_generator_fp32` | F0 `[1, 1, melLen]` → source signal `[1, 1, timeUp]` |
| `hift_decoder_fp32` | Mel + source STFT → magnitude `[1, 9, outLen]` + phase `[1, 9, outLen]` |
Signal processing pipeline:
- Predict F0 (fundamental frequency) from the mel
- Generate the source signal from F0
- STFT of the source signal (`n_fft=16`, `hop=4`, Hann window, `center=true`)
- HiFT decoder combines mel + source STFT → magnitude + phase
- Clip magnitude to max 100 (not log-domain — the model outputs linear magnitude)
- iSTFT with overlap-add reconstruction → raw audio
- Clip audio to `[-0.99, 0.99]`
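The final iSTFT stage can be sketched with numpy (a simplified sketch, not the Accelerate implementation; the exact window and padding conventions here are assumptions, and `istft_overlap_add` is a hypothetical name):

```python
import numpy as np

def istft_overlap_add(mag, phase, n_fft=16, hop=4):
    """Reconstruct audio from linear magnitude + phase via inverse rFFT
    and overlap-add with window-sum normalization."""
    spec = mag * np.exp(1j * phase)           # [nFreqs, nFrames] complex spectrum
    n_frames = spec.shape[1]
    win = np.hanning(n_fft)
    out_len = n_fft + hop * (n_frames - 1)
    audio = np.zeros(out_len)
    norm = np.zeros(out_len)
    for f in range(n_frames):
        frame = np.fft.irfft(spec[:, f], n=n_fft) * win  # synthesis window
        start = f * hop
        audio[start:start + n_fft] += frame              # overlap-add
        norm[start:start + n_fft] += win ** 2            # window-sum for normalization
    audio = audio / np.maximum(norm, 1e-8)               # undo analysis+synthesis windowing
    return np.clip(audio, -0.99, 0.99)                   # final clipping as in the pipeline
```

Note that `n_fft=16` yields 9 rFFT bins, which matches the `[1, 9, outLen]` magnitude/phase shapes the HiFT decoder produces.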
All vocoder models use FP32 for numerical stability — the STFT/iSTFT operations are sensitive to precision.
File: App/Services/TTS/BPETokenizer.swift
A from-scratch implementation of Qwen2/GPT-2 byte-level BPE tokenization. No Python dependency, no SentencePiece, no external tokenizer libraries.
How it works:
- Regex split: split the input text using the GPT-2 pattern (`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ...`)
- Byte encoding: map each UTF-8 byte to a printable Unicode character (GPT-2's byte-to-char scheme — printable ASCII/Latin-1 bytes map to themselves, non-printable bytes map to `U+0100` and above)
- BPE merging: iteratively find and merge the highest-priority pair from `merges.txt` until no more merges apply
- Vocabulary lookup: map merged tokens to integer IDs via `vocab.json`
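The BPE merging step above can be sketched in pure Python (an illustrative sketch of the standard GPT-2 merge loop, not the Swift implementation; `bpe_merge` is a hypothetical name):

```python
def bpe_merge(word, merge_ranks):
    """Iteratively apply the highest-priority merge until none apply.

    `word` is a list of symbols (byte-encoded characters); `merge_ranks`
    maps (left, right) pairs to their priority — lower rank means the
    pair appears earlier in merges.txt and merges first.
    """
    word = list(word)
    while len(word) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merges remain
        # Merge every occurrence of the best pair in one pass
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word
```

Each resulting symbol is then looked up in `vocab.json` to produce the final integer token IDs.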
Loads `vocab.json` (~11MB, 151k+ entries) and `merges.txt` at runtime from the models directory.
File: App/Services/TTS/AudioMath.swift
All signal processing runs on-device using Apple's Accelerate framework (vDSP):
| Function | What it does |
|---|---|
| `stft()` | Short-time Fourier transform using `vDSP_DFT_zrop` — periodic Hann window, reflect padding, output as `[2×nFreqs, nFrames]` |
| `istft()` | Inverse STFT with overlap-add and window normalization |
| `melSpectrogram()` | Power mel spectrogram with Slaney-normalized filterbank (matches librosa) |
| `kaldiMel80()` | 80-mel at 16kHz with log + mean normalization — for the CAMPPlus speaker embedding |
| `whisperMel128()` | 128-mel at 16kHz with Whisper normalization — for Speech Tokenizer v3 |
| `mel80At24kHz()` | 80-mel at 24kHz — for flow conditioning |
| `gaussianNoise()` | Box-Muller transform for standard normal samples |
| `l2Normalize()` | L2 normalization using `vDSP_svesq` + `vDSP_vsdiv` |
| `linearResize()` | Linear-interpolation resize for mel frame count alignment |
File: App/Services/TTS/ONNXSessionManager.swift
- Actor-isolated: thread-safe session management via a Swift `actor`
- Lazy loading: sessions are created on first use and cached
- CoreML execution provider: enabled on real devices (not the simulator) for Neural Engine / GPU acceleration via `ORTCoreMLExecutionProviderOptions`
- Graph optimization: level `.all` for maximum ONNX graph folding
- Memory management: unloads all sessions on `UIApplication.didReceiveMemoryWarningNotification`
- Sendable conformance: `ORTSession` and `ORTValue` are marked `@retroactive @unchecked Sendable` for actor-boundary crossing
CosyVoice 3 clones a voice from a short reference clip (2-6 seconds). The reference audio provides three pieces of information:
- Speaker embedding (192-dim, from CAMPPlus) — captures speaker identity (timbre, pitch range)
- Prompt speech tokens (from Speech Tokenizer v3) — discrete representation of the reference speech content
- Prompt mel spectrogram (80-bin, 24kHz) — continuous acoustic features for flow conditioning
At inference time, the LLM generates new speech tokens conditioned on the speaker's voice characteristics, and the flow model uses the prompt mel to anchor the acoustic style.
Each voice is a directory under App/Resources/voices/{voiceId}/ containing 4 files:
| File | Format | Example size |
|---|---|---|
| `speaker_embedding.bin` | 192 × Float32, little-endian | 768 bytes |
| `prompt_tokens.bin` | N × Int64, little-endian | 344-1112 bytes (43-139 tokens) |
| `prompt_mel.bin` | Int32 frame count + frames × 80 × Float32 | 50-167 KB |
| `prompt_text.txt` | UTF-8 transcript | Short sentence |
Critical: The prompt_text.txt must exactly match what the reference audio says. The LLM uses the combined text (prompt + target) with the prompt speech tokens to align text to speech. A mismatch causes the model to generate wrong or extra content.
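A reader of these files can be sketched in Python from the layouts in the table (a sketch only — the app's actual loader is Swift, and `load_voice` is a hypothetical name):

```python
import struct
from pathlib import Path

def load_voice(voice_dir):
    """Parse the four per-voice files (all binary fields little-endian)."""
    d = Path(voice_dir)
    # 192 x Float32 = 768 bytes
    raw = (d / "speaker_embedding.bin").read_bytes()
    embedding = struct.unpack("<192f", raw)
    # N x Int64; N is implied by the file size
    raw = (d / "prompt_tokens.bin").read_bytes()
    tokens = struct.unpack("<%dq" % (len(raw) // 8), raw)
    # Int32 frame-count header, then frames x 80 x Float32
    raw = (d / "prompt_mel.bin").read_bytes()
    (n_frames,) = struct.unpack_from("<i", raw, 0)
    mel = struct.unpack_from("<%df" % (n_frames * 80), raw, 4)
    # Transcript must exactly match the reference audio
    text = (d / "prompt_text.txt").read_text(encoding="utf-8")
    return embedding, tokens, (n_frames, mel), text
```

This is also essentially what `tools/extract_voice_onnx.py` produces when adding a custom voice.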
```bash
python tools/extract_voice_onnx.py \
  --audio my_reference.wav \
  --text "Exact transcript of the audio" \
  --voice-id my_custom_voice \
  --models-dir models \
  --output-dir App/Resources/voices/
```

Requires `campplus.onnx` and `speech_tokenizer_v3.onnx` in the models directory (available from FunAudioLLM/CosyVoice2-0.5B).
| Voice ID | Language | Gender | Tokens | Duration | Source |
|---|---|---|---|---|---|
| `longjiaxin_v3` | Cantonese | F | 43 | ~1.7s | CosyVoice 3 demo |
| `longjiayi_v3` | Cantonese | F | 115 | ~4.6s | Nexdata HK corpus |
| `longanyue_v3` | Cantonese | M | 128 | ~5.1s | Nexdata HK corpus |
| `longanmin_v3` | Sichuan | F | 113 | ~4.5s | CosyVoice 3 demo |
| `longlaotie_v3` | Northeast Mandarin | M | 97 | ~3.9s | CosyVoice 3 demo |
| `longshange_v3` | Shanghainese | M | 69 | ~2.8s | CosyVoice 3 demo |
| `longanyang` | Mandarin | M | 127 | ~5.1s | CosyVoice 3 demo |
| `longanhuan` | Mandarin | F | 139 | ~5.6s | CosyVoice 3 demo |
12 ONNX model files are required (~500MB total). They are downloaded on first app launch.
| Stage | Model File | Precision | Purpose |
|---|---|---|---|
| LLM | `text_embedding_fp32.onnx` | FP32 | BPE token IDs → text embeddings |
| LLM | `llm_speech_embedding_fp16.onnx` | FP16 | Speech token IDs → speech embeddings |
| LLM | `llm_backbone_initial_fp16.onnx` | FP16 | Prefill forward pass (produces KV cache) |
| LLM | `llm_backbone_decode_fp16.onnx` | FP16 | Single-token decode step (consumes/extends KV cache) |
| LLM | `llm_decoder_fp16.onnx` | FP16 | Hidden state → logits `[6761]` |
| Flow | `flow_token_embedding_fp16.onnx` | FP16 | Speech token embedding for flow |
| Flow | `flow_pre_lookahead_fp16.onnx` | FP16 | Conv + repeat_interleave upsampling |
| Flow | `flow_speaker_projection_fp16.onnx` | FP16 | Speaker embedding → 80-dim projection |
| Flow | `flow.decoder.estimator.fp16.onnx` | FP16 | Velocity field estimator (10× per generation) |
| HiFT | `hift_f0_predictor_fp32.onnx` | FP32 | Mel → fundamental frequency |
| HiFT | `hift_source_generator_fp32.onnx` | FP32 | F0 → excitation source signal |
| HiFT | `hift_decoder_fp32.onnx` | FP32 | Mel + source → magnitude + phase |
Additionally, the BPE tokenizer needs vocab.json (~11MB) and merges.txt.
FP16 vs FP32 rationale: LLM backbone and flow models use FP16 to reduce model size and memory. HiFT vocoder and text embedding stay FP32 because STFT/iSTFT and embedding lookup are sensitive to precision.
The LLM backbone outputs FP16 KV cache tensors. The autoregressive decode loop passes KV cache back as input on every step. If the FP16 bytes are reinterpreted as FP32 (which doubles the apparent element count and reinterprets every pair of FP16 bytes as a garbage FP32 value), the entire generation collapses. The solution: preserve the ONNX tensor's actual elementType and shape metadata alongside the raw bytes, and reconstruct the ORTValue with the correct type on each decode step.
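The failure mode is easy to demonstrate with numpy (an illustrative demo, not the Swift code):

```python
import numpy as np

# A KV-cache-like FP16 tensor
kv = np.linspace(-1.0, 1.0, 8, dtype=np.float16)

# Correct: preserve the element type when reconstructing the tensor from raw bytes
restored = np.frombuffer(kv.tobytes(), dtype=np.float16)

# Wrong: reinterpret the same bytes as FP32 — half the element count, garbage values
corrupted = np.frombuffer(kv.tobytes(), dtype=np.float32)

assert np.array_equal(restored, kv)     # round-trip is lossless with the right dtype
assert corrupted.size == kv.size // 2   # 8 FP16 values become 4 bogus FP32 values
```

This is why `KVCacheState` carries the element type and shape alongside the bytes rather than assuming FP32.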
The CosyVoice ONNX export hardcodes batch dimension to 2 in the flow estimator model. Passing batch=1 causes a shape mismatch error. The workaround: duplicate all inputs to batch=2, run inference, and use only the first batch of the output. This is a known artifact of the export process.
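The workaround can be sketched with numpy (a sketch of the duplicate-and-discard pattern; `run_model` stands in for the ONNX Runtime session call and the feed names are assumptions):

```python
import numpy as np

def run_batch2_estimator(run_model, x, mask, mu, t, spks, cond):
    """Duplicate batch-1 inputs to batch 2 for the fixed-batch ONNX export,
    then keep only the first batch of the output."""
    feeds = {"x": x, "mask": mask, "mu": mu, "t": t, "spks": spks, "cond": cond}
    doubled = {
        name: np.repeat(arr, 2, axis=0)   # [1, ...] -> [2, ...]
        for name, arr in feeds.items()
    }
    out = run_model(doubled)              # output has batch dimension 2
    return out[:1]                        # discard the duplicate batch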
The LLM input is [SOS, textEmb(promptText + targetText), TASK, promptSpeech]. The model internally aligns the text embedding with the prompt speech tokens to determine where the prompt audio "covers" in the text, then generates speech for the remaining portion. If the prompt text doesn't match the actual audio content, the alignment fails and the model produces extra or wrong speech. This was the cause of our initial output quality issues — the reference audio transcripts were swapped with the generated output text from the demo page.
Three different mel spectrogram configurations are used by different parts of the pipeline:
| Config | Sample Rate | Mels | FFT | Hop | Range | Used By |
|---|---|---|---|---|---|---|
| Kaldi | 16kHz | 80 | 400 | 160 | 20-7600 Hz | CAMPPlus speaker embedding |
| Whisper | 16kHz | 128 | 400 | 160 | 0-8000 Hz | Speech Tokenizer v3 |
| Flow | 24kHz | 80 | 1024 | 256 | 0-12000 Hz | Flow conditioning mel |
Each has different normalization: Kaldi uses log + per-feature mean subtraction, Whisper uses log10 + clamp(max-8) + shift, flow uses plain log. All implemented in AudioMath.swift using Accelerate.
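The three normalization schemes can be sketched with numpy (a sketch from the descriptions above, not the Accelerate code; the epsilon floor and the Whisper `(x + 4) / 4` affine step are assumptions based on the standard Whisper preprocessing):

```python
import numpy as np

def normalize_mel(power_mel, style):
    """Apply one of the pipeline's three log/normalization schemes
    to a power mel spectrogram of shape [mels, frames]."""
    eps = 1e-10
    if style == "kaldi":        # log + per-feature mean subtraction (CAMPPlus)
        log_mel = np.log(np.maximum(power_mel, eps))
        return log_mel - log_mel.mean(axis=1, keepdims=True)
    if style == "whisper":      # log10, clamp to (max - 8), then affine shift
        log_mel = np.log10(np.maximum(power_mel, eps))
        log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
        return (log_mel + 4.0) / 4.0
    if style == "flow":         # plain natural log (flow conditioning)
        return np.log(np.maximum(power_mel, eps))
    raise ValueError("unknown style: %s" % style)
```

Mixing these up (e.g. feeding a plain-log mel to CAMPPlus) silently degrades output, so keeping the three configurations separate is important.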
```
App/
  Services/TTS/
    OnDeviceTTSEngine.swift      # Pipeline orchestrator (actor)
    LLMInference.swift           # Stage 1: Qwen2-0.5B autoregressive decode
    FlowVocoderInference.swift   # Stage 2+3: flow matching + HiFT vocoder
    BPETokenizer.swift           # Pure Swift Qwen2/GPT-2 byte-level BPE
    AudioMath.swift              # DSP: STFT, iSTFT, mel, Gaussian noise (Accelerate)
    ONNXSessionManager.swift     # ONNX Runtime sessions + CoreML EP
    WAVEncoder.swift             # PCM Float32 → 16-bit WAV
    ModelDownloadManager.swift   # Downloads ~500MB of ONNX models
    TTSEngine.swift              # Protocol definition
  Views/TTS/
    TTSView.swift                # Main TTS interface
    VoicePickerView.swift        # Voice selection sheet
    TTSPlayerView.swift          # Audio playback controls
    TTSHistoryView.swift         # Generation history
  ViewModels/
    TTSViewModel.swift           # TTS generation + playback state
    TTSHistoryViewModel.swift    # History management
  Resources/voices/              # Pre-extracted voice data (8 voices)
Packages/
  DialectCore/                   # Voice catalog, data models (pure Swift, Linux-testable)
  SRSEngine/                     # FSRS spaced repetition for dialect learning
tools/
  extract_voice_onnx.py          # Extract voice data from audio via ONNX (CPU)
  generate_voice_data.py         # Batch voice generation via DashScope API
```
Swift 6 strict concurrency throughout:
- `OnDeviceTTSEngine` — `actor` (thread-safe pipeline orchestration)
- `LLMInference` — `actor` (isolated ONNX session calls)
- `FlowVocoderInference` — `actor` (isolated ONNX session calls)
- `ONNXSessionManager` — `actor` (shared session cache)
- `BPETokenizer` — `struct: Sendable` (immutable after init)
- ViewModels — `@MainActor @Observable` (UI-bound state)
- iOS 17.0+
- Xcode 16+ with Swift 6
- ~500MB storage for ONNX models (downloaded on first launch)
- XcodeGen to generate the `.xcodeproj`
```bash
git clone https://github.com/Psypeal/DialectLearn.git
cd DialectLearn

# Generate Xcode project
xcodegen generate

# Open and build
open BabYap.xcodeproj
```

Models are downloaded automatically on first launch. Firebase configuration (`GoogleService-Info.plist`) is required for auth — see `project.yml` for dependencies.
| Dependency | Version | Purpose |
|---|---|---|
| ONNX Runtime | 1.20+ | On-device neural network inference |
| Firebase iOS SDK | 11.0+ | Auth, Firestore, Cloud Storage, Remote Config |
- CosyVoice 3 Paper: CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training (Du et al., 2025)
- CosyVoice 2 Paper: CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models (Du et al., 2024)
- Code: FunAudioLLM/CosyVoice — Official Python/PyTorch implementation
- Demo: CosyVoice 3.0 audio samples
- ONNX Models: ayousanz/cosy-voice3-onnx — Community ONNX export used by this project
- Official Weights: FunAudioLLM/CosyVoice2-0.5B — Official PyTorch weights (Apache 2.0)
MIT — see LICENSE.
The CosyVoice 3 ONNX model weights are subject to their own license from Alibaba/Tongyi SpeechTeam. Voice data from Nexdata is used under CC license.