🔬 Official PyTorch Implementation of TaDiCodec
📄 Paper: TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
This repository is designed to provide comprehensive implementations for our series of diffusion-based speech tokenizer research works. Currently, it primarily features TaDiCodec, with plans to include additional in-progress works in the future. Specifically, the repository includes:
- 🧠 A simple PyTorch implementation of the TaDiCodec tokenizer
- 🎯 Token-based zero-shot TTS models based on TaDiCodec:
- 🤖 Autoregressive (AR) TTS models
- 🌊 Masked diffusion (a.k.a. Masked Generative Model, MGM)-based TTS models
- 🏋️ Training scripts for tokenizer and TTS models
- 🤗 Hugging Face and 🔮 ModelScope (to be updated) releases for easy access to pre-trained models
A short introduction to TaDiCodec (Text-aware Diffusion Speech Tokenizer for Speech Language Modeling):
We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach to speech tokenization that employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
- 🚀 [2025-08-25] We release the official implementation of TaDiCodec and the TTS models based on TaDiCodec.
- 🔥 [2025-08-25] TaDiCodec paper released! Check out our arXiv preprint
- 📦 [2025-08-25] Added auto-download functionality from Hugging Face for all models!
🔥 Current Status: Active Development 🔥
This project is under active development. Check back frequently for updates!
- 🏗️ Repository Structure Setup
- 📝 Documentation Framework
- 🧠 TaDiCodec Model Architecture
- NAR Llama-style transformers for encoder and decoder architectures
- Text-aware flow matching (diffusion) decoder (see the sketch after this list)
- Vocoder for mel-to-waveform synthesis
- ⚡ Inference Pipeline
- Basic inference pipeline
- Auto-download from Hugging Face
- Add auto-ASR for text input
- 🏋️ TaDiCodec Training Scripts
- 💾 Dataset and Dataloader
- 🤖 Autoregressive Models
- Model architecture
- Pre-trained model loading and inference
- Training scripts
- 🌊 Masked Diffusion Models
- Model architecture
- Pre-trained model loading and inference
- Training scripts
- Add evaluation scripts
- 🛸 Diffusion-based Speech Tokenizer without text conditioning
Download our pre-trained models for instant inference
| Model | 🤗 Hugging Face | 👷 Status |
|---|---|---|
| 🚀 TaDiCodec | ✅ | |
| 🚀 TaDiCodec-old | 🚧 | |
Note: TaDiCodec-old is the previous version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.
- ModelScope will be updated soon.
# 🤗 Load from Hugging Face with Auto-Download
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline
# Load TaDiCodec tokenizer (auto-downloads from HF if not found locally)
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")
# Load AR TTS model (auto-downloads from HF if not found locally)
tts_model = TTSInferencePipeline.from_pretrained(
tadicodec_path="amphion/TaDiCodec",
llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B"
)
# Load MGM TTS model (auto-downloads from HF if not found locally)
mgm_model = MGMInferencePipeline.from_pretrained(
tadicodec_path="amphion/TaDiCodec",
mgm_path="amphion/TaDiCodec-TTS-MGM-0.6B"
)
# You can also use local paths if you have models downloaded
# tts_model = TTSInferencePipeline.from_pretrained(
# tadicodec_path="./ckpt/TaDiCodec",
# llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-0.5B"
# )
Conda (Linux). Select one of the two PyTorch install lines below depending on your hardware.
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer
# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet
# pytorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio
pip install flash_attn==2.7.4.post1
pip install -r requirements.txt
Conda (Windows). This assumes you are using PowerShell. Select one of the two PyTorch install lines depending on your hardware, and one of the two flash_attn options depending on whether you want a pre-built wheel or to compile your own.
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer
# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet
# pytorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio
# flash_attn
# use a pre-built wheel
pip install https://huggingface.co/kim512/flash_attn-2.7.4.post1/resolve/main/flash_attn-2.7.4.post1-cu128-torch2.8.0-cp310-cp310-win_amd64.whl
# OR compile your own. Set MAX_JOBS to match your CPU, ideally 4 to 8; if you have limited RAM, use a smaller number.
$Env:MAX_JOBS="6"
$Env:CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
pip install -v flash-attn==2.7.4.post1 --no-build-isolation
# install requirements
pip install -r requirements.txt
uv (Linux). Select one of the two PyTorch install lines below depending on your hardware.
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer
# Install python and dependencies
uv python install 3.10
uv venv --python 3.10
uv pip install setuptools wheel psutil packaging ninja numpy hf_xet
# pytorch
# CUDA
uv pip install torch==2.8.0 torchaudio --index-strategy unsafe-best-match --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
uv pip install torch==2.8.0 torchaudio
uv pip install flash_attn==2.7.4.post1
uv pip install -r requirements.txt
uv (Windows). This assumes you are using PowerShell. Select one of the two PyTorch install lines depending on your hardware, and one of the two flash_attn options depending on whether you want a pre-built wheel or to compile your own.
# Clone the repository
git clone https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer
# Install python and dependencies
uv python install 3.10
uv venv --python 3.10
uv pip install setuptools wheel psutil packaging ninja numpy hf_xet
# pytorch
# CUDA
uv pip install torch==2.8.0 torchaudio --index-strategy unsafe-best-match --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
uv pip install torch==2.8.0 torchaudio
# flash_attn
# use a pre-built wheel
uv pip install https://huggingface.co/kim512/flash_attn-2.7.4.post1/resolve/main/flash_attn-2.7.4.post1-cu128-torch2.8.0-cp310-cp310-win_amd64.whl
# OR compile your own. Set MAX_JOBS to match your CPU, ideally 4 to 8; if you have limited RAM, use a smaller number.
$Env:MAX_JOBS="6"
$Env:CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
uv pip install -v flash-attn==2.7.4.post1 --no-build-isolation
# install requirements
uv pip install -r requirements.txt
All models support automatic download from Hugging Face! Simply use the Hugging Face model ID instead of local paths:
# Models will be automatically downloaded on first use
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
# Auto-download TaDiCodec
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")
# Auto-download TTS pipeline (downloads both TaDiCodec and LLM)
pipeline = TTSInferencePipeline.from_pretrained(
tadicodec_path="amphion/TaDiCodec",
llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B"
)
Note: Models are cached locally after first download for faster subsequent use.
Please refer to the use_examples folder for more detailed usage examples.
cd Diffusion-Speech-Tokenizer
# conda (Linux and Windows)
conda activate tadicodec
# uv (Linux)
source .venv/bin/activate
# uv (Windows PowerShell)
.\.venv\Scripts\activate.ps1
# download models and run the examples
cd use_examples
python test_auto_download.py
python test_llm_tts.py
python test_mgm_tts.py
python test_rec.py
# Example: Using TaDiCodec for speech tokenization
import torch
import soundfile as sf
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Auto-download from Hugging Face if not found locally
pipe = TaDiCodecPipline.from_pretrained(ckpt_dir="amphion/TaDiCodec", device=device)
# Or use local path if you have models downloaded
# pipe = TaDiCodecPipline.from_pretrained(ckpt_dir="./ckpt/TaDiCodec", device=device)
# Text of the prompt audio
prompt_text = "In short, we embarked on a mission to make America great again, for all Americans."
# Text of the target audio
target_text = "But to those who knew her well, it was a symbol of her unwavering determination and spirit."
# Input audio path of the prompt audio
prompt_speech_path = "./use_examples/test_audio/trump_0.wav"
# Input audio path of the target audio
speech_path = "./use_examples/test_audio/trump_1.wav"
rec_audio = pipe(
text=target_text,
speech_path=speech_path,
prompt_text=prompt_text,
prompt_speech_path=prompt_speech_path
)
sf.write("./use_examples/test_audio/trump_rec.wav", rec_audio, 24000)
import torch
import soundfile as sf
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
# from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create AR TTS pipeline with auto-download from Hugging Face
pipeline = TTSInferencePipeline.from_pretrained(
tadicodec_path="amphion/TaDiCodec",
llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
device=device,
)
# Or use local paths if you have models downloaded
# pipeline = TTSInferencePipeline.from_pretrained(
# tadicodec_path="./ckpt/TaDiCodec",
# llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
# device=device,
# )
# Generate speech with code-switching support
audio = pipeline(
text="但是 to those who 知道 her well, it was a 标志 of her unwavering 决心 and spirit.",
prompt_text="In short, we embarked on a mission to make America great again, for all Americans.",
prompt_speech_path="./use_examples/test_audio/trump_0.wav",
)
sf.write("./use_examples/test_audio/lm_tts_output.wav", audio, 24000)
- To be updated
- To be updated
If you find this repository useful, please cite our paper:
TaDiCodec:
@article{tadicodec2025,
title={TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling},
author={Yuancheng Wang and Dekun Chen and Xueyao Zhang and Junan Zhang and Jiaqi Li and Zhizheng Wu},
journal={arXiv preprint arXiv:2508.16790},
year={2025},
url={https://arxiv.org/abs/2508.16790}
}
Amphion:
@inproceedings{amphion,
author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
MaskGCT:
@inproceedings{wang2024maskgct,
author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2025}
}
TaDiCodec is licensed under the Apache License 2.0.
- MGM-based TTS is built upon MaskGCT.
- The Vocos vocoder is built upon Vocos.
- The NAR Llama-style transformers are built upon transformers.
- BSQ (Binary Spherical Quantization) is built upon vector-quantize-pytorch and bsq-vit.
- The training codebase is built upon Amphion and accelerate.