BabYap — On-Device CosyVoice 3 TTS for iOS

On-device text-to-speech using CosyVoice 3 (Alibaba/Tongyi) running entirely on iOS via ONNX Runtime. No server, no API calls — the full inference pipeline runs locally on the device.

Status: Work in progress. The 3-stage pipeline runs end-to-end on-device and produces audio. Output quality is still being tuned.

What This Does

Type Chinese text, pick a dialect voice, and generate speech — all on-device:

  1. Text input — Mandarin, Cantonese, or any Chinese dialect text
  2. Voice selection — 8 bundled voices across 5 dialects, cloned from short reference clips
  3. On-device generation — Full CosyVoice 3 inference via 12 ONNX models (~500MB)
  4. Audio output — 24kHz 16-bit WAV, playable and exportable

Zero-shot voice cloning: any voice can be added from a 2-6 second audio clip. No fine-tuning, no retraining.


Architecture

CosyVoice 3 is a 3-stage neural TTS system. This project implements all three stages in Swift on top of ONNX Runtime, translated from the reference Python implementation, so the full pipeline runs on iOS.

  Text ("你好世界")  +  Voice Data (speaker embedding + prompt audio)
                              |
                              v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 1: LLM (Qwen2-0.5B)                                 │
  │                                                             │
  │  BPE tokenize text → embed → concat [SOS, text, TASK,      │
  │  prompt_speech] → autoregressive decode with KV cache →     │
  │  top-k sample speech tokens until EOS                       │
  │                                                             │
  │  Input:  text tokens + prompt speech tokens                 │
  │  Output: ~N discrete speech token IDs (codebook size 6561)  │
  └──────────────────────┬──────────────────────────────────────┘
                         │  speech tokens [N]
                         v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 2: Flow Matching                                     │
  │                                                             │
  │  Token embed → pre-lookahead conv → repeat_interleave →     │
  │  build mu [80, melLen] + conds (prompt mel) →               │
  │  10-step Euler ODE with flow estimator →                    │
  │  strip prompt mel portion                                   │
  │                                                             │
  │  Input:  prompt_tokens + speech_tokens, speaker embedding   │
  │  Output: mel spectrogram [80, melLen2]                      │
  └──────────────────────┬──────────────────────────────────────┘
                         │  mel spectrogram
                         v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 3: HiFT Vocoder                                      │
  │                                                             │
  │  F0 predictor → source generator → STFT(source) →          │
  │  HiFT decoder(mel, source_stft) → magnitude + phase →      │
  │  iSTFT with overlap-add → clip to [-0.99, 0.99]            │
  │                                                             │
  │  Input:  mel spectrogram [80, melLen]                       │
  │  Output: PCM Float32 audio @ 24kHz                          │
  └──────────────────────┬──────────────────────────────────────┘
                         │
                         v
                   WAV encoder → 16-bit PCM file

Technical Deep Dive

Stage 1: LLM — Autoregressive Speech Token Generation

File: App/Services/TTS/LLMInference.swift

The LLM is a Qwen2-0.5B variant fine-tuned to generate discrete speech tokens from text. It uses 5 ONNX model files:

| Model | Purpose | I/O |
| --- | --- | --- |
| `text_embedding_fp32` | BPE token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_speech_embedding_fp16` | Speech token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_backbone_initial_fp16` | Initial forward pass (prefill) | `[1, totalSeq, 896]` → hidden + KV cache |
| `llm_backbone_decode_fp16` | Autoregressive decode (1 token at a time) | `[1, 1, 896]` + KV cache → hidden + new KV cache |
| `llm_decoder_fp16` | Hidden state → logits over speech codebook | `[1, 1, 896]` → `[1, 6761]` |

Input construction:

[SOS(1×896), textEmb(combinedLen×896), TASK(1×896), promptSpeech(promptLen×896)]

Where combinedLen = len(BPE(promptText)) + len(BPE(targetText)), i.e. the token counts of the two texts combined. The prompt text must exactly match the reference audio transcript — misalignment causes the model to generate the wrong content.

Special token IDs:

  • SOS: 6561 (start of speech)
  • EOS: 6562 (end of speech — stops generation)
  • TASK: 6563 (zero-shot task identifier)

Decoding: Top-k sampling (k=25) with softmax. Max tokens = max(200, min(4000, targetTextTokens * 40)). Each decode step appends to the KV cache.
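Since the Swift code was translated from the reference Python implementation, the decoding rule above reads naturally as a NumPy sketch (illustrative only; the function names are not from the codebase):

```python
import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 25, rng=None) -> int:
    """Top-k sampling as described above: keep the k largest logits,
    softmax over them, draw one speech-token ID."""
    if rng is None:
        rng = np.random.default_rng()
    top_idx = np.argpartition(logits, -k)[-k:]            # indices of the k largest logits
    top_logits = logits[top_idx] - logits[top_idx].max()  # stabilize the softmax
    probs = np.exp(top_logits)
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))

def max_tokens(target_text_tokens: int) -> int:
    """Generation budget: max(200, min(4000, targetTextTokens * 40))."""
    return max(200, min(4000, target_text_tokens * 40))
```

Sampling repeats until `sample_top_k` returns the EOS ID (6562) or the budget is exhausted.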

FP16 KV cache handling: The backbone models output FP16 KV cache tensors. The implementation preserves the actual ONNX tensor element type and shape metadata across decode steps — blindly wrapping FP16 bytes as FP32 would corrupt every subsequent step. See KVCacheState struct.

Stage 2: Flow Matching — Tokens to Mel Spectrogram

File: App/Services/TTS/FlowVocoderInference.swift (flow section)

Conditional flow matching converts discrete speech tokens into a continuous mel spectrogram. Uses 4 ONNX models:

| Model | Purpose |
| --- | --- |
| `flow_token_embedding` | Speech tokens → `[1, totalSeq, 80]` embeddings |
| `flow_pre_lookahead` | Conv + repeat_interleave → `[1, melLen, 80]` |
| `flow_speaker_projection` | L2-normalized 192-dim speaker emb → `[1, 80]` projected |
| `flow.decoder.estimator` | Velocity field estimation for ODE solver |

Key details:

  • Token-mel ratio = 2: Each speech token produces 2 mel frames. So melLen = totalTokens × 2.
  • Prompt mel conditioning: The first melLen1 = promptTokens × 2 mel frames are filled with the prompt audio's mel spectrogram (linearly resized if frame counts differ). The rest is zero.
  • Euler ODE solver: 10 steps, dt = 0.1. Initializes from Gaussian noise, iteratively denoises using the velocity field:
    x = N(0, 1)
    for t in [0.0, 0.1, ..., 0.9]:
        v = estimator(x, mask, mu, t, spks, cond)
        x = x + v × dt
  • Batch=2 constraint: The flow estimator ONNX model was exported with hardcoded batch size 2. All inputs are duplicated and only the first batch of output is used.
  • Prompt stripping: After flow matching, the first melLen1 mel frames (prompt portion) are removed. The output contains only the generated portion.
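The Euler loop above, written out as a NumPy sketch (here `estimator` stands in for the batch-2 ONNX call; any function with the same signature works):

```python
import numpy as np

def euler_flow(estimator, mu, mask=None, spks=None, cond=None, steps=10, rng=None):
    """10-step Euler ODE solver for flow matching, per the pseudocode above.
    Starts from Gaussian noise and integrates the velocity field with dt = 1/steps."""
    if rng is None:
        rng = np.random.default_rng()
    dt = 1.0 / steps
    x = rng.standard_normal(mu.shape).astype(np.float32)  # x ~ N(0, 1)
    for i in range(steps):
        t = i * dt                                        # t = 0.0, 0.1, ..., 0.9
        v = estimator(x, mask, mu, t, spks, cond)         # velocity field
        x = x + v * dt                                    # Euler step
    return x
```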

Stage 3: HiFT Vocoder — Mel to Waveform

File: App/Services/TTS/FlowVocoderInference.swift (vocoder section)

HiFT (Harmonic-plus-noise with iSTFT) converts mel spectrograms to audio waveforms. Uses 3 ONNX models:

| Model | Purpose |
| --- | --- |
| `hift_f0_predictor_fp32` | Mel `[1, 80, melLen]` → F0 `[1, melLen]` |
| `hift_source_generator_fp32` | F0 `[1, 1, melLen]` → source signal `[1, 1, timeUp]` |
| `hift_decoder_fp32` | Mel + source STFT → magnitude `[1, 9, outLen]` + phase `[1, 9, outLen]` |

Signal processing pipeline:

  1. Predict F0 (fundamental frequency) from mel
  2. Generate source signal from F0
  3. STFT of source signal (n_fft=16, hop=4, Hann window, center=true)
  4. HiFT decoder combines mel + source STFT → magnitude + phase
  5. Clip magnitude to max 100 (not log-domain — the model outputs linear magnitude)
  6. iSTFT with overlap-add reconstruction → raw audio
  7. Clip audio to [-0.99, 0.99]

All vocoder models use FP32 for numerical stability — the STFT/iSTFT operations are sensitive to precision.
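Steps 4-7 can be sketched in NumPy (assumptions: no center padding and a periodic Hann window; the real Swift `istft()` additionally handles the reflect padding done by `stft()` via vDSP):

```python
import numpy as np

def istft_overlap_add(mag, phase, n_fft=16, hop=4):
    """iSTFT with overlap-add: rebuild each frame from magnitude + phase,
    window it, sum into the output, normalize by the summed squared window,
    then clip to [-0.99, 0.99] as in step 7."""
    spec = mag * np.exp(1j * phase)                 # complex spectrum [nFreqs, nFrames]
    window = np.hanning(n_fft + 1)[:-1]             # periodic Hann window
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = np.fft.irfft(spec[:, i], n=n_fft) * window
        out[i * hop:i * hop + n_fft] += frame       # overlap-add
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)                   # window normalization
    return np.clip(out, -0.99, 0.99)
```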

Pure Swift BPE Tokenizer

File: App/Services/TTS/BPETokenizer.swift

A from-scratch implementation of Qwen2/GPT-2 byte-level BPE tokenization. No Python dependency, no SentencePiece, no external tokenizer libraries.

How it works:

  1. Regex split: Split input text using the GPT-2 pattern `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ...`
  2. Byte encoding: Map each UTF-8 byte to a printable Unicode character (GPT-2's byte-to-char scheme — printable ASCII/Latin-1 map to themselves, non-printable bytes map to U+0100+)
  3. BPE merging: Iteratively find and merge the highest-priority pair from merges.txt until no more merges apply
  4. Vocabulary lookup: Map merged tokens to integer IDs via vocab.json

Loads vocab.json (~11MB, 151k+ entries) and merges.txt at runtime from the models directory.
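Step 3, the merge loop, is the heart of byte-level BPE. A minimal Python sketch under the same merges.txt priority rule (lower line number = higher priority; not the Swift code itself):

```python
def bpe_merge(parts: list, merge_ranks: dict) -> list:
    """Iteratively merge the highest-priority adjacent pair until no pair
    in `parts` appears in the merge table.

    merge_ranks maps (left, right) -> rank from merges.txt (lower = merge first).
    """
    parts = list(parts)
    while len(parts) > 1:
        pairs = [(parts[i], parts[i + 1]) for i in range(len(parts) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break                                   # no mergeable pair left
        merged, i = [], 0
        while i < len(parts):
            if i < len(parts) - 1 and (parts[i], parts[i + 1]) == best:
                merged.append(parts[i] + parts[i + 1])
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts
```

The resulting merged strings are then looked up in vocab.json for their integer IDs (step 4).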

DSP Utilities (Accelerate Framework)

File: App/Services/TTS/AudioMath.swift

All signal processing runs on-device using Apple's Accelerate framework (vDSP):

| Function | What it does |
| --- | --- |
| `stft()` | Short-time Fourier transform using `vDSP_DFT_zrop` — periodic Hann window, reflect padding, output as `[2×nFreqs, nFrames]` |
| `istft()` | Inverse STFT with overlap-add and window normalization |
| `melSpectrogram()` | Power mel spectrogram with Slaney-normalized filterbank (matches librosa) |
| `kaldiMel80()` | 80-mel at 16kHz with log + mean normalization — for CAMPPlus speaker embedding |
| `whisperMel128()` | 128-mel at 16kHz with Whisper normalization — for Speech Tokenizer v3 |
| `mel80At24kHz()` | 80-mel at 24kHz — for flow conditioning |
| `gaussianNoise()` | Box-Muller transform for standard normal samples |
| `l2Normalize()` | L2 normalization using `vDSP_svesq` + `vDSP_vsdiv` |
| `linearResize()` | Linear interpolation resize for mel frame-count alignment |
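Two of the simpler utilities translate directly to NumPy (illustrative equivalents, not the vDSP code):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector, as l2Normalize() does for the 192-dim speaker embedding."""
    return v / max(np.sqrt(np.sum(v * v)), 1e-12)

def linear_resize(mel: np.ndarray, target_frames: int) -> np.ndarray:
    """Linearly interpolate a [nMels, nFrames] mel to target_frames frames,
    as linearResize() does when aligning prompt mel frame counts."""
    n_mels, n_frames = mel.shape
    src = np.linspace(0.0, n_frames - 1, target_frames)  # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    frac = src - lo
    return mel[:, lo] * (1.0 - frac) + mel[:, hi] * frac
```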

ONNX Runtime Session Management

File: App/Services/TTS/ONNXSessionManager.swift

  • Actor-isolated: Thread-safe session management via Swift actor
  • Lazy loading: Sessions are created on first use and cached
  • CoreML Execution Provider: Enabled on real devices (not simulator) for Neural Engine / GPU acceleration via ORTCoreMLExecutionProviderOptions
  • Graph optimization: Level .all for maximum ONNX graph folding
  • Memory management: Unloads all sessions on UIApplication.didReceiveMemoryWarningNotification
  • Sendable conformance: ORTSession and ORTValue are marked @retroactive @unchecked Sendable for actor boundary crossing

Voice Cloning

How Zero-Shot Cloning Works

CosyVoice 3 clones a voice from a short reference clip (2-6 seconds). The reference audio provides three pieces of information:

  1. Speaker embedding (192-dim, from CAMPPlus) — captures speaker identity (timbre, pitch range)
  2. Prompt speech tokens (from Speech Tokenizer v3) — discrete representation of the reference speech content
  3. Prompt mel spectrogram (80-bin, 24kHz) — continuous acoustic features for flow conditioning

At inference time, the LLM generates new speech tokens conditioned on the speaker's voice characteristics, and the flow model uses the prompt mel to anchor the acoustic style.

Voice Data Format

Each voice is a directory under App/Resources/voices/{voiceId}/ containing 4 files:

| File | Format | Example size |
| --- | --- | --- |
| `speaker_embedding.bin` | 192 × Float32, little-endian | 768 bytes |
| `prompt_tokens.bin` | N × Int64, little-endian | 344-1112 bytes (43-139 tokens) |
| `prompt_mel.bin` | Int32 frame count + frames × 80 × Float32 | 50-167 KB |
| `prompt_text.txt` | UTF-8 transcript | Short sentence |

Critical: The prompt_text.txt must exactly match what the reference audio says. The LLM uses the combined text (prompt + target) with the prompt speech tokens to align text to speech. A mismatch causes the model to generate wrong or extra content.
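Given the byte layouts in the table above, the binary voice files can be parsed in a few lines of Python (sketch of the little-endian layouts the app bundles):

```python
import struct
import numpy as np

def read_prompt_mel(path: str) -> np.ndarray:
    """Parse prompt_mel.bin: a little-endian Int32 frame count,
    followed by frames x 80 little-endian Float32 values."""
    with open(path, "rb") as f:
        (n_frames,) = struct.unpack("<i", f.read(4))
        mel = np.frombuffer(f.read(n_frames * 80 * 4), dtype="<f4")
    return mel.reshape(n_frames, 80)

def read_speaker_embedding(path: str) -> np.ndarray:
    """Parse speaker_embedding.bin: 192 little-endian Float32 values (768 bytes)."""
    with open(path, "rb") as f:
        return np.frombuffer(f.read(192 * 4), dtype="<f4")
```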

Extracting New Voices

python tools/extract_voice_onnx.py \
  --audio my_reference.wav \
  --text "Exact transcript of the audio" \
  --voice-id my_custom_voice \
  --models-dir models \
  --output-dir App/Resources/voices/

Requires campplus.onnx and speech_tokenizer_v3.onnx in the models directory (available from FunAudioLLM/CosyVoice2-0.5B).

Bundled Voices

| Voice ID | Language | Gender | Tokens | Duration | Source |
| --- | --- | --- | --- | --- | --- |
| longjiaxin_v3 | Cantonese | F | 43 | ~1.7s | CosyVoice 3 demo |
| longjiayi_v3 | Cantonese | F | 115 | ~4.6s | Nexdata HK corpus |
| longanyue_v3 | Cantonese | M | 128 | ~5.1s | Nexdata HK corpus |
| longanmin_v3 | Sichuan | F | 113 | ~4.5s | CosyVoice 3 demo |
| longlaotie_v3 | Northeast Mandarin | M | 97 | ~3.9s | CosyVoice 3 demo |
| longshange_v3 | Shanghainese | M | 69 | ~2.8s | CosyVoice 3 demo |
| longanyang | Mandarin | M | 127 | ~5.1s | CosyVoice 3 demo |
| longanhuan | Mandarin | F | 139 | ~5.6s | CosyVoice 3 demo |

ONNX Models

12 ONNX model files are required (~500MB total). They are downloaded on first app launch.

| Stage | Model File | Precision | Purpose |
| --- | --- | --- | --- |
| LLM | `text_embedding_fp32.onnx` | FP32 | BPE token IDs → text embeddings |
| LLM | `llm_speech_embedding_fp16.onnx` | FP16 | Speech token IDs → speech embeddings |
| LLM | `llm_backbone_initial_fp16.onnx` | FP16 | Prefill forward pass (produces KV cache) |
| LLM | `llm_backbone_decode_fp16.onnx` | FP16 | Single-token decode step (consumes/extends KV cache) |
| LLM | `llm_decoder_fp16.onnx` | FP16 | Hidden state → logits `[6761]` |
| Flow | `flow_token_embedding_fp16.onnx` | FP16 | Speech token embedding for flow |
| Flow | `flow_pre_lookahead_fp16.onnx` | FP16 | Conv + repeat_interleave upsampling |
| Flow | `flow_speaker_projection_fp16.onnx` | FP16 | Speaker embedding → 80-dim projection |
| Flow | `flow.decoder.estimator.fp16.onnx` | FP16 | Velocity field estimator (10x per generation) |
| HiFT | `hift_f0_predictor_fp32.onnx` | FP32 | Mel → fundamental frequency |
| HiFT | `hift_source_generator_fp32.onnx` | FP32 | F0 → excitation source signal |
| HiFT | `hift_decoder_fp32.onnx` | FP32 | Mel + source → magnitude + phase |

Additionally, the BPE tokenizer needs vocab.json (~11MB) and merges.txt.

FP16 vs FP32 rationale: LLM backbone and flow models use FP16 to reduce model size and memory. HiFT vocoder and text embedding stay FP32 because STFT/iSTFT and embedding lookup are sensitive to precision.


Engineering Challenges

FP16 KV Cache Preservation

The LLM backbone outputs FP16 KV cache tensors, and the autoregressive decode loop passes the KV cache back as input on every step. If the FP16 bytes are reinterpreted as FP32 (which halves the apparent element count and fuses each pair of adjacent FP16 values into one garbage FP32 value), the entire generation collapses. The solution: preserve the ONNX tensor's actual elementType and shape metadata alongside the raw bytes, and reconstruct the ORTValue with the correct type on each decode step.
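The failure mode is easy to demonstrate in NumPy: the same bytes read back with the wrong dtype yield half as many elements, all garbage, while the preserved dtype round-trips exactly.

```python
import numpy as np

kv_fp16 = np.linspace(-1.0, 1.0, 8).astype(np.float16)  # stand-in KV cache tensor
raw = kv_fp16.tobytes()                                  # 8 x 2 bytes = 16 bytes

wrong = np.frombuffer(raw, dtype=np.float32)  # dtype metadata lost: 4 garbage values
right = np.frombuffer(raw, dtype=np.float16)  # dtype metadata preserved

assert wrong.size == 4                        # half the elements
assert np.array_equal(right, kv_fp16)         # exact round trip
```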

Flow Estimator Batch=2

The CosyVoice ONNX export hardcodes batch dimension to 2 in the flow estimator model. Passing batch=1 causes a shape mismatch error. The workaround: duplicate all inputs to batch=2, run inference, and use only the first batch of the output. This is a known artifact of the export process.
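The workaround in sketch form (the session call is mocked here; in the app the inputs are the ORTValues fed to flow.decoder.estimator):

```python
import numpy as np

def run_batch2(session_run, inputs: dict) -> np.ndarray:
    """Duplicate every input along batch dim 0 to satisfy the hardcoded
    batch=2, run inference, and keep only the first batch of the output."""
    doubled = {k: np.concatenate([v, v], axis=0) for k, v in inputs.items()}
    out = session_run(doubled)
    return out[:1]                              # first batch only
```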

Prompt Text Alignment

The LLM input is [SOS, textEmb(promptText + targetText), TASK, promptSpeech]. The model internally aligns the text embedding with the prompt speech tokens to determine where the prompt audio "covers" in the text, then generates speech for the remaining portion. If the prompt text doesn't match the actual audio content, the alignment fails and the model produces extra or wrong speech. This was the cause of our initial output quality issues — the reference audio transcripts were swapped with the generated output text from the demo page.

Mel Spectrogram Compatibility

Three different mel spectrogram configurations are used by different parts of the pipeline:

| Config | Sample Rate | Mels | FFT | Hop | Range | Used By |
| --- | --- | --- | --- | --- | --- | --- |
| Kaldi | 16kHz | 80 | 400 | 160 | 20-7600 Hz | CAMPPlus speaker embedding |
| Whisper | 16kHz | 128 | 400 | 160 | 0-8000 Hz | Speech Tokenizer v3 |
| Flow | 24kHz | 80 | 1024 | 256 | 0-12000 Hz | Flow conditioning mel |

Each has different normalization: Kaldi uses log + per-feature mean subtraction, Whisper uses log10 + clamp(max-8) + shift, flow uses plain log. All implemented in AudioMath.swift using Accelerate.
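The three normalization schemes side by side in NumPy (a sketch; the Whisper branch assumes Whisper's standard `(log10 + 4) / 4` scaling for the final shift):

```python
import numpy as np

def kaldi_norm(mel: np.ndarray) -> np.ndarray:
    """Kaldi-style: log, then per-feature (per-mel-bin) mean subtraction."""
    logmel = np.log(np.maximum(mel, 1e-10))
    return logmel - logmel.mean(axis=1, keepdims=True)

def whisper_norm(mel: np.ndarray) -> np.ndarray:
    """Whisper-style: log10, clamp to (global max - 8), then shift and scale."""
    logmel = np.log10(np.maximum(mel, 1e-10))
    logmel = np.maximum(logmel, logmel.max() - 8.0)
    return (logmel + 4.0) / 4.0

def flow_norm(mel: np.ndarray) -> np.ndarray:
    """Flow conditioning mel: plain log."""
    return np.log(np.maximum(mel, 1e-10))
```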


Project Structure

App/
  Services/TTS/
    OnDeviceTTSEngine.swift      # Pipeline orchestrator (actor)
    LLMInference.swift           # Stage 1: Qwen2-0.5B autoregressive decode
    FlowVocoderInference.swift   # Stage 2+3: flow matching + HiFT vocoder
    BPETokenizer.swift           # Pure Swift Qwen2/GPT-2 byte-level BPE
    AudioMath.swift              # DSP: STFT, iSTFT, mel, Gaussian noise (Accelerate)
    ONNXSessionManager.swift     # ONNX Runtime sessions + CoreML EP
    WAVEncoder.swift             # PCM Float32 → 16-bit WAV
    ModelDownloadManager.swift   # Downloads ~500MB of ONNX models
    TTSEngine.swift              # Protocol definition
  Views/TTS/
    TTSView.swift                # Main TTS interface
    VoicePickerView.swift        # Voice selection sheet
    TTSPlayerView.swift          # Audio playback controls
    TTSHistoryView.swift         # Generation history
  ViewModels/
    TTSViewModel.swift           # TTS generation + playback state
    TTSHistoryViewModel.swift    # History management
  Resources/voices/              # Pre-extracted voice data (8 voices)

Packages/
  DialectCore/                   # Voice catalog, data models (pure Swift, Linux-testable)
  SRSEngine/                     # FSRS spaced repetition for dialect learning

tools/
  extract_voice_onnx.py          # Extract voice data from audio via ONNX (CPU)
  generate_voice_data.py         # Batch voice generation via DashScope API

Concurrency Model

Swift 6 strict concurrency throughout:

  • OnDeviceTTSEngine — actor (thread-safe pipeline orchestration)
  • LLMInference — actor (isolated ONNX session calls)
  • FlowVocoderInference — actor (isolated ONNX session calls)
  • ONNXSessionManager — actor (shared session cache)
  • BPETokenizer — struct: Sendable (immutable after init)
  • ViewModels — @MainActor @Observable (UI-bound state)

Requirements

  • iOS 17.0+
  • Xcode 16+ with Swift 6
  • ~500MB storage for ONNX models (downloaded on first launch)
  • XcodeGen to generate the .xcodeproj

Setup

git clone https://github.com/Psypeal/DialectLearn.git
cd DialectLearn

# Generate Xcode project
xcodegen generate

# Open and build
open BabYap.xcodeproj

Models are downloaded automatically on first launch. Firebase configuration (GoogleService-Info.plist) is required for auth — see project.yml for dependencies.

Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| ONNX Runtime | 1.20+ | On-device neural network inference |
| Firebase iOS SDK | 11.0+ | Auth, Firestore, Cloud Storage, Remote Config |

License

MIT — see LICENSE.

The CosyVoice 3 ONNX model weights are subject to their own license from Alibaba/Tongyi SpeechTeam. Voice data from Nexdata is used under CC license.

About

Chinese dialect learning iOS app
