Commit cbe012d

Valle Recipe for WenetSpeech4TTS, LibriTTS, LibriTTS-R (k2-fsa#1805)
* add valle
* update readme
1 parent 57451b0 commit cbe012d

16 files changed: +4675 -15 lines changed

egs/libritts/TTS/README.md

+58 -7
@@ -1,7 +1,7 @@
 # Introduction
 
-LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members.
-The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
+LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members.
+The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
 The main differences from the LibriSpeech corpus are listed below:
 1. The audio files are at 24kHz sampling rate.
 2. The speech is split at sentence breaks.
@@ -11,16 +11,16 @@ The main differences from the LibriSpeech corpus are listed below:
 For more information, refer to the paper "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech", Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, arXiv, 2019. If you use the LibriTTS corpus in your work, please cite this paper where it was introduced.
 
 > [!CAUTION]
-> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
+> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
 > While these recipes has the potential to advance various fields such as accessibility, language education, and AI-driven solutions, it also carries certain ethical and legal responsibilities.
->
+>
 > By using this framework, you agree to the following:
 > 1. Legal and Ethical Use: You shall not use this framework, or any models derived from it, for any unlawful or unethical purposes. This includes, but is not limited to: Creating voice clones without the explicit, informed consent of the individual whose voice is being cloned. Engaging in any form of identity theft, impersonation, or fraud using cloned voices. Violating any local, national, or international laws regarding privacy, intellectual property, or personal data.
->
+>
 > 2. Responsibility of Use: The users of this framework are solely responsible for ensuring that their use of voice cloning technologies complies with all applicable laws and ethical guidelines. We explicitly disclaim any liability for misuse of the technology.
->
+>
 > 3. Attribution and Use of Open-Source Components: This project is provided under the Apache 2.0 license. Users must adhere to the terms of this license and provide appropriate attribution when required.
->
+>
 > 4. No Warranty: This framework is provided “as-is,” without warranty of any kind, either express or implied. We do not guarantee that the use of this software will comply with legal requirements or that it will not infringe the rights of third parties.
 
 
@@ -49,3 +49,54 @@ To inference, use:
   --epoch 400 \
   --tokens data/tokens.txt
 ```
+
+# [VALL-E](https://arxiv.org/abs/2301.02111)
+
+./valle contains the code for training VALL-E TTS model.
+
+Checkpoints and training logs can be found [here](https://huggingface.co/yuekai/vall-e_libritts). The demo of the model trained with libritts and [libritts-r](https://www.openslr.org/141/) is available [here](https://huggingface.co/spaces/yuekai/valle-libritts-demo).
+
+Preparation:
+
+```
+bash prepare.sh --start-stage 4
+```
+
+The training command is given below:
+
+```
+world_size=8
+exp_dir=exp/valle
+
+## Train AR model
+python3 valle/train.py --max-duration 320 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
+    --num-buckets 6 --dtype "bfloat16" --save-every-n 1000 --valid-interval 2000 \
+    --share-embedding true --norm-first true --add-prenet false \
+    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
+    --base-lr 0.03 --warmup-steps 200 --average-period 0 \
+    --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 1 \
+    --exp-dir ${exp_dir} --world-size ${world_size}
+
+## Train NAR model
+# cd ${exp_dir}
+# ln -s ${exp_dir}/best-valid-loss.pt epoch-99.pt # --start-epoch 100=99+1
+# cd -
+python3 valle/train.py --max-duration 160 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
+    --num-buckets 6 --dtype "float32" --save-every-n 1000 --valid-interval 2000 \
+    --share-embedding true --norm-first true --add-prenet false \
+    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
+    --base-lr 0.03 --warmup-steps 200 --average-period 0 \
+    --num-epochs 40 --start-epoch 100 --start-batch 0 --accumulate-grad-steps 2 \
+    --exp-dir ${exp_dir} --world-size ${world_size}
+```
+
+To inference, use:
+```
+huggingface-cli login
+huggingface-cli download --local-dir ${exp_dir} yuekai/vall-e_libritts
+top_p=1.0
+python3 valle/infer.py --output-dir demos_epoch_${epoch}_avg_${avg}_top_p_${top_p} \
+    --top-k -1 --temperature 1.0 \
+    --text ./libritts.txt \
+    --checkpoint ${exp_dir}/epoch-${epoch}-avg-${avg}.pt --top-p ${top_p}
+```
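
Note on the two training commands in this README: the AR stage uses `--max-duration 320` with `--accumulate-grad-steps 1`, while the NAR stage halves the duration to 160 and doubles the accumulation to 2. The sketch below works through that arithmetic. It assumes `--max-duration` is a per-GPU cap on seconds of audio per batch, which is the usual convention for Lhotse-based samplers but is not stated in this diff. The helper name is hypothetical and not part of the recipe.

```python
def max_audio_seconds_per_step(
    max_duration: int, world_size: int, accumulate_grad_steps: int
) -> int:
    """Upper bound on seconds of audio contributing to one optimizer step.

    Assumption: max_duration is a per-GPU, per-batch cap in seconds.
    """
    return max_duration * world_size * accumulate_grad_steps


# Values taken from the AR and NAR commands in the README.
ar_seconds = max_audio_seconds_per_step(320, world_size=8, accumulate_grad_steps=1)
nar_seconds = max_audio_seconds_per_step(160, world_size=8, accumulate_grad_steps=2)
print(ar_seconds, nar_seconds)  # 2560 2560
```

Under that assumption both stages are capped at the same amount of audio per optimizer step, so the NAR settings trade per-batch size for accumulation rather than changing the effective batch.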
@@ -0,0 +1 @@
+../../../wenetspeech4tts/TTS/local/compute_neural_codec_and_prepare_text_tokens.py

egs/libritts/TTS/prepare.sh

+35 -8
@@ -32,7 +32,7 @@ if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
     cd vits/monotonic_align
     python setup.py build_ext --inplace
     cd ../../
-  else
+  else
     log "monotonic_align lib already built"
   fi
 fi
@@ -75,11 +75,11 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   log "Stage 2: Compute Spectrogram for LibriTTS"
   mkdir -p data/spectrogram
   if [ ! -e data/spectrogram/.libritts.done ]; then
-    ./local/compute_spectrogram_libritts.py --sampling-rate $sampling_rate
+    ./local/compute_spectrogram_libritts.py --sampling-rate $sampling_rate
     touch data/spectrogram/.libritts.done
   fi
 
-  # Here we shuffle and combine the train-clean-100, train-clean-360 and
+  # Here we shuffle and combine the train-clean-100, train-clean-360 and
   # train-other-500 together to form the training set.
   if [ ! -f data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz ]; then
     cat <(gunzip -c data/spectrogram/libritts_cuts_train-clean-100.jsonl.gz) \
@@ -88,7 +88,7 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
       shuf | gzip -c > data/spectrogram/libritts_cuts_train-all-shuf.jsonl.gz
   fi
 
-  # Here we shuffle and combine the train-clean-100, train-clean-360
+  # Here we shuffle and combine the train-clean-100, train-clean-360
   # together to form the training set.
   if [ ! -f data/spectrogram/libritts_cuts_train-clean-460.jsonl.gz ]; then
     cat <(gunzip -c data/spectrogram/libritts_cuts_train-clean-100.jsonl.gz) \
@@ -108,10 +108,10 @@ if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
   log "Stage 3: Prepare phoneme tokens for LibriTTS"
   # We assume you have installed piper_phonemize and espnet_tts_frontend.
   # If not, please install them with:
-  # - piper_phonemize:
+  # - piper_phonemize:
   # refer to https://github.com/rhasspy/piper-phonemize,
   # could install the pre-built wheels from https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5
-  # - espnet_tts_frontend:
+  # - espnet_tts_frontend:
   # `pip install espnet_tts_frontend`, refer to https://github.com/espnet/espnet_tts_frontend/
   if [ ! -e data/spectrogram/.libritts_with_token.done ]; then
     ./local/prepare_tokens_libritts.py
@@ -123,12 +123,39 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
   log "Stage 4: Generate token file"
   # We assume you have installed piper_phonemize and espnet_tts_frontend.
   # If not, please install them with:
-  # - piper_phonemize:
+  # - piper_phonemize:
   # refer to https://github.com/rhasspy/piper-phonemize,
   # could install the pre-built wheels from https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5
-  # - espnet_tts_frontend:
+  # - espnet_tts_frontend:
   # `pip install espnet_tts_frontend`, refer to https://github.com/espnet/espnet_tts_frontend/
   if [ ! -e data/tokens.txt ]; then
     ./local/prepare_token_file.py --tokens data/tokens.txt
   fi
 fi
+
+audio_feats_dir=data/tokenized
+dataset_parts="--dataset-parts all" # debug "-p dev-clean -p test-clean"
+if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
+  log "Stage 5: Tokenize/Fbank LibriTTS for valle"
+  mkdir -p ${audio_feats_dir}
+  if [ ! -e ${audio_feats_dir}/.libritts.tokenize.done ]; then
+    python3 ./local/compute_neural_codec_and_prepare_text_tokens.py --dataset-parts "${dataset_parts}" \
+      --audio-extractor "Encodec" \
+      --batch-duration 400 \
+      --src-dir "data/manifests" \
+      --output-dir "${audio_feats_dir}"
+  fi
+  touch ${audio_feats_dir}/.libritts.tokenize.done
+
+  lhotse combine \
+    ${audio_feats_dir}/libritts_cuts_train-clean-100.jsonl.gz \
+    ${audio_feats_dir}/libritts_cuts_train-clean-360.jsonl.gz \
+    ${audio_feats_dir}/libritts_cuts_train-other-500.jsonl.gz \
+    ${audio_feats_dir}/cuts_train.jsonl.gz
+  lhotse copy \
+    ${audio_feats_dir}/libritts_cuts_dev-clean.jsonl.gz \
+    ${audio_feats_dir}/cuts_dev.jsonl.gz
+  lhotse copy \
+    ${audio_feats_dir}/libritts_cuts_test-clean.jsonl.gz \
+    ${audio_feats_dir}/cuts_test.jsonl.gz
+fi
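
Note on the manifest merging in `prepare.sh`: Stage 2 builds the training set with a `cat <(gunzip -c ...) | shuf | gzip -c` pipeline, and Stage 5 merges the tokenized cut manifests with `lhotse combine`. A rough standard-library equivalent of the Stage 2 pipeline is sketched below. It assumes each manifest is gzip-compressed JSON Lines with one cut per line, as the `.jsonl.gz` suffix suggests. The function name, the `{"id": ...}` records, and the file names are hypothetical, and this is not how Lhotse itself implements `combine`.

```python
import gzip
import json
import random
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def combine_jsonl_gz(
    inputs: Iterable[Path],
    output: Path,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> int:
    """Concatenate gzip-compressed JSONL files into one, optionally shuffled.

    All lines are held in memory, so this suits manifests, not raw audio.
    Returns the number of lines written.
    """
    lines = []
    for path in inputs:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    # Normalize line endings so a file without a trailing
                    # newline cannot glue two records together.
                    lines.append(line.rstrip("\n") + "\n")
    if shuffle:
        random.Random(seed).shuffle(lines)
    with gzip.open(output, "wt", encoding="utf-8") as f:
        f.writelines(lines)
    return len(lines)


# Small demonstration with two made-up manifests.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = []
    for name, ids in [("part-a", ["a1", "a2"]), ("part-b", ["b1"])]:
        path = tmp_dir / f"{name}.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for cut_id in ids:
                f.write(json.dumps({"id": cut_id}) + "\n")
        parts.append(path)
    merged = tmp_dir / "merged.jsonl.gz"
    n = combine_jsonl_gz(parts, merged, shuffle=True, seed=0)
    with gzip.open(merged, "rt", encoding="utf-8") as f:
        merged_ids = sorted(json.loads(line)["id"] for line in f)
    print(n, merged_ids)  # 3 ['a1', 'a2', 'b1']
```

Passing `shuffle=False` gives plain concatenation, which is closer to what the Stage 5 merge needs since shuffling is not requested there.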

egs/libritts/TTS/valle

+1
@@ -0,0 +1 @@
+../../wenetspeech4tts/TTS/valle/

egs/wenetspeech4tts/TTS/README.md

+72
@@ -0,0 +1,72 @@
+# Introduction
+
+[**WenetSpeech4TTS**](https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS) is a multi-domain **Mandarin** corpus derived from the open-sourced [WenetSpeech](https://arxiv.org/abs/2110.03370) dataset.
+
+> [!CAUTION]
+> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
+> While these recipes has the potential to advance various fields such as accessibility, language education, and AI-driven solutions, it also carries certain ethical and legal responsibilities.
+>
+> By using this framework, you agree to the following:
+> 1. Legal and Ethical Use: You shall not use this framework, or any models derived from it, for any unlawful or unethical purposes. This includes, but is not limited to: Creating voice clones without the explicit, informed consent of the individual whose voice is being cloned. Engaging in any form of identity theft, impersonation, or fraud using cloned voices. Violating any local, national, or international laws regarding privacy, intellectual property, or personal data.
+>
+> 2. Responsibility of Use: The users of this framework are solely responsible for ensuring that their use of voice cloning technologies complies with all applicable laws and ethical guidelines. We explicitly disclaim any liability for misuse of the technology.
+>
+> 3. Attribution and Use of Open-Source Components: This project is provided under the Apache 2.0 license. Users must adhere to the terms of this license and provide appropriate attribution when required.
+>
+> 4. No Warranty: This framework is provided “as-is,” without warranty of any kind, either express or implied. We do not guarantee that the use of this software will comply with legal requirements or that it will not infringe the rights of third parties.
+
+
+# [VALL-E](https://arxiv.org/abs/2301.02111)
+
+./valle contains the code for training VALL-E TTS model.
+
+Checkpoints and training logs can be found [here](https://huggingface.co/yuekai/vall-e_wenetspeech4tts). The demo of the model trained with Wenetspeech4TTS Premium (945 hours) is available [here](https://huggingface.co/spaces/yuekai/valle_wenetspeech4tts_demo).
+
+Preparation:
+
+```
+bash prepare.sh
+```
+
+The training command is given below:
+
+```
+world_size=8
+exp_dir=exp/valle
+
+## Train AR model
+python3 valle/train.py --max-duration 320 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
+    --num-buckets 6 --dtype "bfloat16" --save-every-n 1000 --valid-interval 2000 \
+    --share-embedding true --norm-first true --add-prenet false \
+    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
+    --base-lr 0.03 --warmup-steps 200 --average-period 0 \
+    --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 1 \
+    --exp-dir ${exp_dir} --world-size ${world_size}
+
+## Train NAR model
+# cd ${exp_dir}
+# ln -s ${exp_dir}/best-valid-loss.pt epoch-99.pt # --start-epoch 100=99+1
+# cd -
+python3 valle/train.py --max-duration 160 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
+    --num-buckets 6 --dtype "float32" --save-every-n 1000 --valid-interval 2000 \
+    --share-embedding true --norm-first true --add-prenet false \
+    --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
+    --base-lr 0.03 --warmup-steps 200 --average-period 0 \
+    --num-epochs 40 --start-epoch 100 --start-batch 0 --accumulate-grad-steps 2 \
+    --exp-dir ${exp_dir} --world-size ${world_size}
+```
+
+To inference, use:
+```
+huggingface-cli login
+huggingface-cli download --local-dir ${exp_dir} yuekai/vall-e_wenetspeech4tts
+top_p=1.0
+python3 valle/infer.py --output-dir demos_epoch_${epoch}_avg_${avg}_top_p_${top_p} \
+    --top-k -1 --temperature 1.0 \
+    --text ./aishell3.txt \
+    --checkpoint ${exp_dir}/epoch-${epoch}-avg-${avg}.pt \
+    --text-extractor pypinyin_initials_finals --top-p ${top_p}
+```
+
+# Credits
+- [vall-e](https://github.com/lifeiteng/vall-e)
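
Note on the commented-out `ln -s` lines in the NAR training command: they link the best AR checkpoint to `epoch-99.pt` so that `--start-epoch 100` picks it up (100 = 99 + 1, as the inline comment says). The sketch below shows that step in Python. It assumes `train.py` resumes from `epoch-{start_epoch - 1}.pt` inside the experiment directory, which is what the comment implies but this diff does not show. The helper name is hypothetical. It uses a bare file name as the link target because a relative symlink target is resolved from the directory containing the link. With a relative `exp_dir` such as `exp/valle`, the target written in the commented command would be looked up inside `${exp_dir}` itself.

```python
import os
import tempfile
from pathlib import Path


def link_checkpoint_for_resume(exp_dir: Path, source_name: str, start_epoch: int) -> Path:
    """Create exp_dir/epoch-{start_epoch - 1}.pt as a symlink to exp_dir/source_name.

    Raises FileExistsError if the link path already exists, so an existing
    checkpoint is never overwritten.
    """
    link = exp_dir / f"epoch-{start_epoch - 1}.pt"
    # The target is relative to the link's own directory, hence the bare name.
    os.symlink(source_name, link)
    return link


# Demonstration in a throwaway directory with a dummy checkpoint file.
with tempfile.TemporaryDirectory() as tmp:
    exp_dir = Path(tmp) / "exp" / "valle"
    exp_dir.mkdir(parents=True)
    (exp_dir / "best-valid-loss.pt").write_bytes(b"dummy")
    link = link_checkpoint_for_resume(exp_dir, "best-valid-loss.pt", start_epoch=100)
    link_name, link_target = link.name, os.readlink(link)
    print(link_name, "->", link_target)  # epoch-99.pt -> best-valid-loss.pt
```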
