BabYap — On-Device CosyVoice 3 TTS for iOS

On-device text-to-speech using CosyVoice 3 (Alibaba/Tongyi) running entirely on iOS via ONNX Runtime. No server, no API calls — the full inference pipeline runs locally on the device.

Status: Work in progress. The 3-stage pipeline runs end-to-end on-device and produces audio. Output quality is still being tuned.

What This Does

Type Chinese text, pick a dialect voice, and generate speech — all on-device:

  1. Text input — Mandarin, Cantonese, or any Chinese dialect text
  2. Voice selection — 8 bundled voices across 5 dialects, cloned from short reference clips
  3. On-device generation — Full CosyVoice 3 inference via 12 ONNX models (~500MB)
  4. Audio output — 24kHz 16-bit WAV, playable and exportable

Zero-shot voice cloning: any voice can be added from a 2-6 second audio clip. No fine-tuning, no retraining.


Architecture

CosyVoice 3 is a 3-stage neural TTS system. This project implements all three stages in Swift on top of ONNX Runtime, translated from the reference Python implementation, so the full pipeline runs on iOS.

  Text ("你好世界")  +  Voice Data (speaker embedding + prompt audio)
                              |
                              v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 1: LLM (Qwen2-0.5B)                                 │
  │                                                             │
  │  BPE tokenize text → embed → concat [SOS, text, TASK,      │
  │  prompt_speech] → autoregressive decode with KV cache →     │
  │  top-k sample speech tokens until EOS                       │
  │                                                             │
  │  Input:  text tokens + prompt speech tokens                 │
  │  Output: ~N discrete speech token IDs (codebook size 6561)  │
  └──────────────────────┬──────────────────────────────────────┘
                         │  speech tokens [N]
                         v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 2: Flow Matching                                     │
  │                                                             │
  │  Token embed → pre-lookahead conv → repeat_interleave →     │
  │  build mu [80, melLen] + conds (prompt mel) →               │
  │  10-step Euler ODE with flow estimator →                    │
  │  strip prompt mel portion                                   │
  │                                                             │
  │  Input:  prompt_tokens + speech_tokens, speaker embedding   │
  │  Output: mel spectrogram [80, melLen2]                      │
  └──────────────────────┬──────────────────────────────────────┘
                         │  mel spectrogram
                         v
  ┌─────────────────────────────────────────────────────────────┐
  │  Stage 3: HiFT Vocoder                                      │
  │                                                             │
  │  F0 predictor → source generator → STFT(source) →          │
  │  HiFT decoder(mel, source_stft) → magnitude + phase →      │
  │  iSTFT with overlap-add → clip to [-0.99, 0.99]            │
  │                                                             │
  │  Input:  mel spectrogram [80, melLen]                       │
  │  Output: PCM Float32 audio @ 24kHz                          │
  └──────────────────────┬──────────────────────────────────────┘
                         │
                         v
                   WAV encoder → 16-bit PCM file

Technical Deep Dive

Stage 1: LLM — Autoregressive Speech Token Generation

File: App/Services/TTS/LLMInference.swift

The LLM is a Qwen2-0.5B variant fine-tuned to generate discrete speech tokens from text. It uses 5 ONNX model files:

| Model | Purpose | I/O |
| --- | --- | --- |
| `text_embedding_fp32` | BPE token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_speech_embedding_fp16` | Speech token IDs → hidden embeddings | `[1, seq]` int64 → `[1, seq, 896]` |
| `llm_backbone_initial_fp16` | Initial forward pass (prefill) | `[1, totalSeq, 896]` → hidden + KV cache |
| `llm_backbone_decode_fp16` | Autoregressive decode (1 token at a time) | `[1, 1, 896]` + KV cache → hidden + new KV cache |
| `llm_decoder_fp16` | Hidden state → logits over speech codebook | `[1, 1, 896]` → `[1, 6761]` |

Input construction:

[SOS(1×896), textEmb(combinedLen×896), TASK(1×896), promptSpeech(promptLen×896)]

Where combinedLen = len(BPE(promptText)) + len(BPE(targetText)), i.e. the token counts of the two texts combined. The prompt text must exactly match the reference audio transcript — misalignment causes the model to generate the wrong content.

Special token IDs:

  • SOS: 6561 (start of speech)
  • EOS: 6562 (end of speech — stops generation)
  • TASK: 6563 (zero-shot task identifier)

Decoding: Top-k sampling (k=25) with softmax. Max tokens = max(200, min(4000, targetTextTokens * 40)). Each decode step appends to the KV cache.
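Since the Swift code was translated from the reference Python implementation, the decoding rule above reads naturally as a NumPy sketch (illustrative only; the function names are not from the codebase):

```python
import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 25, rng=None) -> int:
    """Top-k sampling as described above: keep the k largest logits,
    softmax over them, draw one speech-token ID."""
    if rng is None:
        rng = np.random.default_rng()
    top_idx = np.argpartition(logits, -k)[-k:]            # indices of the k largest logits
    top_logits = logits[top_idx] - logits[top_idx].max()  # stabilize the softmax
    probs = np.exp(top_logits)
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))

def max_tokens(target_text_tokens: int) -> int:
    """Generation budget: max(200, min(4000, targetTextTokens * 40))."""
    return max(200, min(4000, target_text_tokens * 40))
```

Sampling repeats until `sample_top_k` returns the EOS ID (6562) or the budget is exhausted.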

FP16 KV cache handling: The backbone models output FP16 KV cache tensors. The implementation preserves the actual ONNX tensor element type and shape metadata across decode steps — blindly wrapping FP16 bytes as FP32 would corrupt every subsequent step. See KVCacheState struct.

Stage 2: Flow Matching — Tokens to Mel Spectrogram

File: App/Services/TTS/FlowVocoderInference.swift (flow section)

Conditional flow matching converts discrete speech tokens into a continuous mel spectrogram. Uses 4 ONNX models:

| Model | Purpose |
| --- | --- |
| `flow_token_embedding` | Speech tokens → `[1, totalSeq, 80]` embeddings |
| `flow_pre_lookahead` | Conv + repeat_interleave → `[1, melLen, 80]` |
| `flow_speaker_projection` | L2-normalized 192-dim speaker emb → `[1, 80]` projected |
| `flow.decoder.estimator` | Velocity field estimation for ODE solver |

Key details:

  • Token-mel ratio = 2: Each speech token produces 2 mel frames. So melLen = totalTokens × 2.
  • Prompt mel conditioning: The first melLen1 = promptTokens × 2 mel frames are filled with the prompt audio's mel spectrogram (linearly resized if frame counts differ). The rest is zero.
  • Euler ODE solver: 10 steps, dt = 0.1. Initializes from Gaussian noise, iteratively denoises using the velocity field:
    x = N(0, 1)
    for t in [0.0, 0.1, ..., 0.9]:
        v = estimator(x, mask, mu, t, spks, cond)
        x = x + v × dt
  • Batch=2 constraint: The flow estimator ONNX model was exported with hardcoded batch size 2. All inputs are duplicated and only the first batch of output is used.
  • Prompt stripping: After flow matching, the first melLen1 mel frames (prompt portion) are removed. The output contains only the generated portion.
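The Euler loop above, written out as a NumPy sketch (here `estimator` stands in for the batch-2 ONNX call; any function with the same signature works):

```python
import numpy as np

def euler_flow(estimator, mu, mask=None, spks=None, cond=None, steps=10, rng=None):
    """10-step Euler ODE solver for flow matching, per the pseudocode above.
    Starts from Gaussian noise and integrates the velocity field with dt = 1/steps."""
    if rng is None:
        rng = np.random.default_rng()
    dt = 1.0 / steps
    x = rng.standard_normal(mu.shape).astype(np.float32)  # x ~ N(0, 1)
    for i in range(steps):
        t = i * dt                                        # t = 0.0, 0.1, ..., 0.9
        v = estimator(x, mask, mu, t, spks, cond)         # velocity field
        x = x + v * dt                                    # Euler step
    return x
```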

Stage 3: HiFT Vocoder — Mel to Waveform

File: App/Services/TTS/FlowVocoderInference.swift (vocoder section)

HiFT (Harmonic-plus-noise with iSTFT) converts mel spectrograms to audio waveforms. Uses 3 ONNX models:

| Model | Purpose |
| --- | --- |
| `hift_f0_predictor_fp32` | Mel `[1, 80, melLen]` → F0 `[1, melLen]` |
| `hift_source_generator_fp32` | F0 `[1, 1, melLen]` → source signal `[1, 1, timeUp]` |
| `hift_decoder_fp32` | Mel + source STFT → magnitude `[1, 9, outLen]` + phase `[1, 9, outLen]` |

Signal processing pipeline:

  1. Predict F0 (fundamental frequency) from mel
  2. Generate source signal from F0
  3. STFT of source signal (n_fft=16, hop=4, Hann window, center=true)
  4. HiFT decoder combines mel + source STFT → magnitude + phase
  5. Clip magnitude to max 100 (not log-domain — the model outputs linear magnitude)
  6. iSTFT with overlap-add reconstruction → raw audio
  7. Clip audio to [-0.99, 0.99]

All vocoder models use FP32 for numerical stability — the STFT/iSTFT operations are sensitive to precision.
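Steps 4-7 can be sketched in NumPy (assumptions: no center padding and a periodic Hann window; the real Swift `istft()` additionally handles the reflect padding done by `stft()` via vDSP):

```python
import numpy as np

def istft_overlap_add(mag, phase, n_fft=16, hop=4):
    """iSTFT with overlap-add: rebuild each frame from magnitude + phase,
    window it, sum into the output, normalize by the summed squared window,
    then clip to [-0.99, 0.99] as in step 7."""
    spec = mag * np.exp(1j * phase)                 # complex spectrum [nFreqs, nFrames]
    window = np.hanning(n_fft + 1)[:-1]             # periodic Hann window
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = np.fft.irfft(spec[:, i], n=n_fft) * window
        out[i * hop:i * hop + n_fft] += frame       # overlap-add
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)                   # window normalization
    return np.clip(out, -0.99, 0.99)
```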

Pure Swift BPE Tokenizer

File: App/Services/TTS/BPETokenizer.swift

A from-scratch implementation of Qwen2/GPT-2 byte-level BPE tokenization. No Python dependency, no SentencePiece, no external tokenizer libraries.

How it works:

  1. Regex split: Split input text using the GPT-2 pattern `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ...`
  2. Byte encoding: Map each UTF-8 byte to a printable Unicode character (GPT-2's byte-to-char scheme — printable ASCII/Latin-1 map to themselves, non-printable bytes map to U+0100+)
  3. BPE merging: Iteratively find and merge the highest-priority pair from merges.txt until no more merges apply
  4. Vocabulary lookup: Map merged tokens to integer IDs via vocab.json

Loads vocab.json (~11MB, 151k+ entries) and merges.txt at runtime from the models directory.
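Step 3, the merge loop, is the heart of byte-level BPE. A minimal Python sketch under the same merges.txt priority rule (lower line number = higher priority; not the Swift code itself):

```python
def bpe_merge(parts: list, merge_ranks: dict) -> list:
    """Iteratively merge the highest-priority adjacent pair until no pair
    in `parts` appears in the merge table.

    merge_ranks maps (left, right) -> rank from merges.txt (lower = merge first).
    """
    parts = list(parts)
    while len(parts) > 1:
        pairs = [(parts[i], parts[i + 1]) for i in range(len(parts) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break                                   # no mergeable pair left
        merged, i = [], 0
        while i < len(parts):
            if i < len(parts) - 1 and (parts[i], parts[i + 1]) == best:
                merged.append(parts[i] + parts[i + 1])
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts
```

The resulting merged strings are then looked up in vocab.json for their integer IDs (step 4).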

DSP Utilities (Accelerate Framework)

File: App/Services/TTS/AudioMath.swift

All signal processing runs on-device using Apple's Accelerate framework (vDSP):

| Function | What it does |
| --- | --- |
| `stft()` | Short-time Fourier transform using `vDSP_DFT_zrop` — periodic Hann window, reflect padding, output as `[2×nFreqs, nFrames]` |
| `istft()` | Inverse STFT with overlap-add and window normalization |
| `melSpectrogram()` | Power mel spectrogram with Slaney-normalized filterbank (matches librosa) |
| `kaldiMel80()` | 80-mel at 16kHz with log + mean normalization — for CAMPPlus speaker embedding |
| `whisperMel128()` | 128-mel at 16kHz with Whisper normalization — for Speech Tokenizer v3 |
| `mel80At24kHz()` | 80-mel at 24kHz — for flow conditioning |
| `gaussianNoise()` | Box-Muller transform for standard normal samples |
| `l2Normalize()` | L2 normalization using `vDSP_svesq` + `vDSP_vsdiv` |
| `linearResize()` | Linear interpolation resize for mel frame-count alignment |
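Two of the simpler utilities translate directly to NumPy (illustrative equivalents, not the vDSP code):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector, as l2Normalize() does for the 192-dim speaker embedding."""
    return v / max(np.sqrt(np.sum(v * v)), 1e-12)

def linear_resize(mel: np.ndarray, target_frames: int) -> np.ndarray:
    """Linearly interpolate a [nMels, nFrames] mel to target_frames frames,
    as linearResize() does when aligning prompt mel frame counts."""
    n_mels, n_frames = mel.shape
    src = np.linspace(0.0, n_frames - 1, target_frames)  # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    frac = src - lo
    return mel[:, lo] * (1.0 - frac) + mel[:, hi] * frac
```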

ONNX Runtime Session Management

File: App/Services/TTS/ONNXSessionManager.swift

  • Actor-isolated: Thread-safe session management via Swift actor
  • Lazy loading: Sessions are created on first use and cached
  • CoreML Execution Provider: Enabled on real devices (not simulator) for Neural Engine / GPU acceleration via ORTCoreMLExecutionProviderOptions
  • Graph optimization: Level .all for maximum ONNX graph folding
  • Memory management: Unloads all sessions on UIApplication.didReceiveMemoryWarningNotification
  • Sendable conformance: ORTSession and ORTValue are marked @retroactive @unchecked Sendable for actor boundary crossing

Voice Cloning

How Zero-Shot Cloning Works

CosyVoice 3 clones a voice from a short reference clip (2-6 seconds). The reference audio provides three pieces of information:

  1. Speaker embedding (192-dim, from CAMPPlus) — captures speaker identity (timbre, pitch range)
  2. Prompt speech tokens (from Speech Tokenizer v3) — discrete representation of the reference speech content
  3. Prompt mel spectrogram (80-bin, 24kHz) — continuous acoustic features for flow conditioning

At inference time, the LLM generates new speech tokens conditioned on the speaker's voice characteristics, and the flow model uses the prompt mel to anchor the acoustic style.

Voice Data Format

Each voice is a directory under App/Resources/voices/{voiceId}/ containing 4 files:

| File | Format | Example size |
| --- | --- | --- |
| `speaker_embedding.bin` | 192 × Float32, little-endian | 768 bytes |
| `prompt_tokens.bin` | N × Int64, little-endian | 344-1112 bytes (43-139 tokens) |
| `prompt_mel.bin` | Int32 frame count + frames × 80 × Float32 | 50-167 KB |
| `prompt_text.txt` | UTF-8 transcript | Short sentence |

Critical: The prompt_text.txt must exactly match what the reference audio says. The LLM uses the combined text (prompt + target) with the prompt speech tokens to align text to speech. A mismatch causes the model to generate wrong or extra content.
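Given the byte layouts in the table above, the binary voice files can be parsed in a few lines of Python (sketch of the little-endian layouts the app bundles):

```python
import struct
import numpy as np

def read_prompt_mel(path: str) -> np.ndarray:
    """Parse prompt_mel.bin: a little-endian Int32 frame count,
    followed by frames x 80 little-endian Float32 values."""
    with open(path, "rb") as f:
        (n_frames,) = struct.unpack("<i", f.read(4))
        mel = np.frombuffer(f.read(n_frames * 80 * 4), dtype="<f4")
    return mel.reshape(n_frames, 80)

def read_speaker_embedding(path: str) -> np.ndarray:
    """Parse speaker_embedding.bin: 192 little-endian Float32 values (768 bytes)."""
    with open(path, "rb") as f:
        return np.frombuffer(f.read(192 * 4), dtype="<f4")
```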

Extracting New Voices

python tools/extract_voice_onnx.py \
  --audio my_reference.wav \
  --text "Exact transcript of the audio" \
  --voice-id my_custom_voice \
  --models-dir models \
  --output-dir App/Resources/voices/

Requires campplus.onnx and speech_tokenizer_v3.onnx in the models directory (available from FunAudioLLM/CosyVoice2-0.5B).

Bundled Voices

| Voice ID | Language | Gender | Tokens | Duration | Source |
| --- | --- | --- | --- | --- | --- |
| longjiaxin_v3 | Cantonese | F | 43 | ~1.7s | CosyVoice 3 demo |
| longjiayi_v3 | Cantonese | F | 115 | ~4.6s | Nexdata HK corpus |
| longanyue_v3 | Cantonese | M | 128 | ~5.1s | Nexdata HK corpus |
| longanmin_v3 | Sichuan | F | 113 | ~4.5s | CosyVoice 3 demo |
| longlaotie_v3 | Northeast Mandarin | M | 97 | ~3.9s | CosyVoice 3 demo |
| longshange_v3 | Shanghainese | M | 69 | ~2.8s | CosyVoice 3 demo |
| longanyang | Mandarin | M | 127 | ~5.1s | CosyVoice 3 demo |
| longanhuan | Mandarin | F | 139 | ~5.6s | CosyVoice 3 demo |

ONNX Models

12 ONNX model files are required (~500MB total). They are downloaded on first app launch.

| Stage | Model File | Precision | Purpose |
| --- | --- | --- | --- |
| LLM | `text_embedding_fp32.onnx` | FP32 | BPE token IDs → text embeddings |
| LLM | `llm_speech_embedding_fp16.onnx` | FP16 | Speech token IDs → speech embeddings |
| LLM | `llm_backbone_initial_fp16.onnx` | FP16 | Prefill forward pass (produces KV cache) |
| LLM | `llm_backbone_decode_fp16.onnx` | FP16 | Single-token decode step (consumes/extends KV cache) |
| LLM | `llm_decoder_fp16.onnx` | FP16 | Hidden state → logits `[6761]` |
| Flow | `flow_token_embedding_fp16.onnx` | FP16 | Speech token embedding for flow |
| Flow | `flow_pre_lookahead_fp16.onnx` | FP16 | Conv + repeat_interleave upsampling |
| Flow | `flow_speaker_projection_fp16.onnx` | FP16 | Speaker embedding → 80-dim projection |
| Flow | `flow.decoder.estimator.fp16.onnx` | FP16 | Velocity field estimator (10x per generation) |
| HiFT | `hift_f0_predictor_fp32.onnx` | FP32 | Mel → fundamental frequency |
| HiFT | `hift_source_generator_fp32.onnx` | FP32 | F0 → excitation source signal |
| HiFT | `hift_decoder_fp32.onnx` | FP32 | Mel + source → magnitude + phase |

Additionally, the BPE tokenizer needs vocab.json (~11MB) and merges.txt.

FP16 vs FP32 rationale: LLM backbone and flow models use FP16 to reduce model size and memory. HiFT vocoder and text embedding stay FP32 because STFT/iSTFT and embedding lookup are sensitive to precision.


Engineering Challenges

FP16 KV Cache Preservation

The LLM backbone outputs FP16 KV cache tensors, and the autoregressive decode loop passes the KV cache back as input on every step. If the FP16 bytes are reinterpreted as FP32 (which halves the apparent element count and fuses each pair of adjacent FP16 values into one garbage FP32 value), the entire generation collapses. The solution: preserve the ONNX tensor's actual elementType and shape metadata alongside the raw bytes, and reconstruct the ORTValue with the correct type on each decode step.
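The failure mode is easy to demonstrate in NumPy: the same bytes read back with the wrong dtype yield half as many elements, all garbage, while the preserved dtype round-trips exactly.

```python
import numpy as np

kv_fp16 = np.linspace(-1.0, 1.0, 8).astype(np.float16)  # stand-in KV cache tensor
raw = kv_fp16.tobytes()                                  # 8 x 2 bytes = 16 bytes

wrong = np.frombuffer(raw, dtype=np.float32)  # dtype metadata lost: 4 garbage values
right = np.frombuffer(raw, dtype=np.float16)  # dtype metadata preserved

assert wrong.size == 4                        # half the elements
assert np.array_equal(right, kv_fp16)         # exact round trip
```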

Flow Estimator Batch=2

The CosyVoice ONNX export hardcodes batch dimension to 2 in the flow estimator model. Passing batch=1 causes a shape mismatch error. The workaround: duplicate all inputs to batch=2, run inference, and use only the first batch of the output. This is a known artifact of the export process.
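The workaround in sketch form (the session call is mocked here; in the app the inputs are the ORTValues fed to flow.decoder.estimator):

```python
import numpy as np

def run_batch2(session_run, inputs: dict) -> np.ndarray:
    """Duplicate every input along batch dim 0 to satisfy the hardcoded
    batch=2, run inference, and keep only the first batch of the output."""
    doubled = {k: np.concatenate([v, v], axis=0) for k, v in inputs.items()}
    out = session_run(doubled)
    return out[:1]                              # first batch only
```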

Prompt Text Alignment

The LLM input is [SOS, textEmb(promptText + targetText), TASK, promptSpeech]. The model internally aligns the text embedding with the prompt speech tokens to determine where the prompt audio "covers" in the text, then generates speech for the remaining portion. If the prompt text doesn't match the actual audio content, the alignment fails and the model produces extra or wrong speech. This was the cause of our initial output quality issues — the reference audio transcripts were swapped with the generated output text from the demo page.

Mel Spectrogram Compatibility

Three different mel spectrogram configurations are used by different parts of the pipeline:

| Config | Sample Rate | Mels | FFT | Hop | Range | Used By |
| --- | --- | --- | --- | --- | --- | --- |
| Kaldi | 16kHz | 80 | 400 | 160 | 20-7600 Hz | CAMPPlus speaker embedding |
| Whisper | 16kHz | 128 | 400 | 160 | 0-8000 Hz | Speech Tokenizer v3 |
| Flow | 24kHz | 80 | 1024 | 256 | 0-12000 Hz | Flow conditioning mel |

Each has different normalization: Kaldi uses log + per-feature mean subtraction, Whisper uses log10 + clamp(max-8) + shift, flow uses plain log. All implemented in AudioMath.swift using Accelerate.
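The three normalization schemes side by side in NumPy (a sketch; the Whisper branch assumes Whisper's standard `(log10 + 4) / 4` scaling for the final shift):

```python
import numpy as np

def kaldi_norm(mel: np.ndarray) -> np.ndarray:
    """Kaldi-style: log, then per-feature (per-mel-bin) mean subtraction."""
    logmel = np.log(np.maximum(mel, 1e-10))
    return logmel - logmel.mean(axis=1, keepdims=True)

def whisper_norm(mel: np.ndarray) -> np.ndarray:
    """Whisper-style: log10, clamp to (global max - 8), then shift and scale."""
    logmel = np.log10(np.maximum(mel, 1e-10))
    logmel = np.maximum(logmel, logmel.max() - 8.0)
    return (logmel + 4.0) / 4.0

def flow_norm(mel: np.ndarray) -> np.ndarray:
    """Flow conditioning mel: plain log."""
    return np.log(np.maximum(mel, 1e-10))
```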


Project Structure

App/
  Services/TTS/
    OnDeviceTTSEngine.swift      # Pipeline orchestrator (actor)
    LLMInference.swift           # Stage 1: Qwen2-0.5B autoregressive decode
    FlowVocoderInference.swift   # Stage 2+3: flow matching + HiFT vocoder
    BPETokenizer.swift           # Pure Swift Qwen2/GPT-2 byte-level BPE
    AudioMath.swift              # DSP: STFT, iSTFT, mel, Gaussian noise (Accelerate)
    ONNXSessionManager.swift     # ONNX Runtime sessions + CoreML EP
    WAVEncoder.swift             # PCM Float32 → 16-bit WAV
    ModelDownloadManager.swift   # Downloads ~500MB of ONNX models
    TTSEngine.swift              # Protocol definition
  Views/TTS/
    TTSView.swift                # Main TTS interface
    VoicePickerView.swift        # Voice selection sheet
    TTSPlayerView.swift          # Audio playback controls
    TTSHistoryView.swift         # Generation history
  ViewModels/
    TTSViewModel.swift           # TTS generation + playback state
    TTSHistoryViewModel.swift    # History management
  Resources/voices/              # Pre-extracted voice data (8 voices)

Packages/
  DialectCore/                   # Voice catalog, data models (pure Swift, Linux-testable)
  SRSEngine/                     # FSRS spaced repetition for dialect learning

tools/
  extract_voice_onnx.py          # Extract voice data from audio via ONNX (CPU)
  generate_voice_data.py         # Batch voice generation via DashScope API

Concurrency Model

Swift 6 strict concurrency throughout:

  • OnDeviceTTSEngine — actor (thread-safe pipeline orchestration)
  • LLMInference — actor (isolated ONNX session calls)
  • FlowVocoderInference — actor (isolated ONNX session calls)
  • ONNXSessionManager — actor (shared session cache)
  • BPETokenizer — struct: Sendable (immutable after init)
  • ViewModels — @MainActor @Observable (UI-bound state)

Requirements

  • iOS 17.0+
  • Xcode 16+ with Swift 6
  • ~500MB storage for ONNX models (downloaded on first launch)
  • XcodeGen to generate the .xcodeproj

Setup

git clone https://github.com/Psypeal/DialectLearn.git
cd DialectLearn

# Generate Xcode project
xcodegen generate

# Open and build
open BabYap.xcodeproj

Models are downloaded automatically on first launch. Firebase configuration (GoogleService-Info.plist) is required for auth — see project.yml for dependencies.

Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| ONNX Runtime | 1.20+ | On-device neural network inference |
| Firebase iOS SDK | 11.0+ | Auth, Firestore, Cloud Storage, Remote Config |

License

MIT — see LICENSE.

The CosyVoice 3 ONNX model weights are subject to their own license from Alibaba/Tongyi SpeechTeam. Voice data from Nexdata is used under CC license.

About

Chinese dialect learning iOS app
