We provide the implementation for speech-to-unit translation (S2UT) proposed in "Direct speech-to-speech translation with discrete units" (Lee et al. 2021), as well as the transformer-based implementation of the speech-to-spectrogram translation (S2SPECT, or transformer-based Translatotron) baseline in the paper.
Pretrained Models

Unit config | Unit size | Vocoder dataset | Model
---|---|---|---
HuBERT Base, Librispeech, layer 6 | 100 | LJSpeech | ckpt, config
Data preparation

- (optional) To prepare S2S data from a speech-to-text translation (ST) dataset, see fairseq-S^2 for pre-trained TTS models and instructions on how to train and decode TTS models.
- Prepare two folders, `$SRC_AUDIO` and `$TGT_AUDIO`, with `${SPLIT}/${SAMPLE_ID}.wav` for source and target speech under each folder, respectively. Note that for S2UT experiments the target audio sampling rate should be 16,000 Hz, and for S2SPECT experiments a target audio sampling rate of 22,050 Hz is recommended.
- To prepare target discrete units for S2UT model training, see Generative Spoken Language Modeling (speech2unit) for pre-trained k-means models, checkpoints, and instructions on how to decode units from speech. Set the output target unit files (`--out_quantized_file_path`) as `${TGT_AUDIO}/${SPLIT}.txt`. In Lee et al. 2021, we use 100 units from the sixth layer (`--layer 6`) of the HuBERT Base model.
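To make the expected layout concrete, here is a minimal shell sketch for staging target audio at the required sampling rate. It assumes `sox` is installed for resampling and uses a hypothetical `$RAW_TGT_AUDIO` folder holding the original target recordings; adapt it to however your audio is actually stored.

```bash
# Minimal sketch: place ${SPLIT}/${SAMPLE_ID}.wav under $TGT_AUDIO at 16,000 Hz for S2UT
# (use 22,050 Hz for S2SPECT). $RAW_TGT_AUDIO is a hypothetical staging folder; sox is assumed.
for SPLIT in train dev test; do
  mkdir -p ${TGT_AUDIO}/${SPLIT}
  for f in ${RAW_TGT_AUDIO}/${SPLIT}/*.wav; do
    SAMPLE_ID=$(basename $f .wav)
    sox $f -r 16000 ${TGT_AUDIO}/${SPLIT}/${SAMPLE_ID}.wav
  done
done
```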
Speech-to-speech data
S2UT
- Set `--reduce-unit` for training an S2UT reduced model.
- The pre-trained vocoder and config (`$VOCODER_CKPT`, `$VOCODER_CFG`) can be downloaded from the Pretrained Models section. They are not required if `--eval-inference` is not going to be set during model training.
# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
--source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
--output-root $DATA_ROOT --reduce-unit \
--vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG
S2SPECT
# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.
python examples/speech_to_speech/preprocessing/prep_s2spect_data.py \
--source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
--output-root $DATA_ROOT
Multitask data
- For each multitask `$TASK_NAME`, prepare `${DATA_ROOT}/${TASK_NAME}/${SPLIT}.tsv` files for each split following the format below (two tab-separated columns; the sample_ids should match the sample_ids of the speech-to-speech data in `${DATA_ROOT}/${SPLIT}.tsv`):
id tgt_text
sample_id_0 token1 token2 token3 ...
sample_id_1 token1 token2 token3 ...
...
- For each multitask `$TASK_NAME`, prepare `${DATA_ROOT}/${TASK_NAME}/dict.txt`, a dictionary in fairseq format with all tokens of the targets for `$TASK_NAME` (see the sketch after the example config below for one way to build it).
- Create `config_multitask.yaml`. Below is an example of the config used for S2UT reduced with Fisher experiments, including two encoder multitasks (`source_letter`, `target_letter`) and one decoder CTC task (`decoder_target_ctc`).
source_letter:  # $TASK_NAME
  decoder_type: transformer
  dict: ${DATA_ROOT}/source_letter/dict.txt
  data: ${DATA_ROOT}/source_letter
  encoder_layer: 6
  loss_weight: 8.0
target_letter:
  decoder_type: transformer
  dict: ${DATA_ROOT}/target_letter/dict.txt
  data: ${DATA_ROOT}/target_letter
  encoder_layer: 8
  loss_weight: 8.0
decoder_target_ctc:
  decoder_type: ctc
  dict: ${DATA_ROOT}/decoder_target_ctc/dict.txt
  data: ${DATA_ROOT}/decoder_target_ctc
  decoder_layer: 3
  loss_weight: 1.6
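As a rough illustration of the dictionary format mentioned above, the sketch below builds `dict.txt` from the multitask tsv targets. It assumes whitespace-tokenized target text and a header row in each tsv; `target_letter` is used only as an example task name, and in practice you may want to build the dictionary from the training split alone.

```bash
# Minimal sketch: fairseq dictionaries are plain text with one "<token> <count>" pair per line.
TASK_NAME=target_letter  # example task name
tail -q -n +2 ${DATA_ROOT}/${TASK_NAME}/*.tsv | cut -f2 | tr ' ' '\n' | grep -v '^$' | \
  sort | uniq -c | sort -rn | awk '{print $2" "$1}' > ${DATA_ROOT}/${TASK_NAME}/dict.txt
```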
Training

Speech-to-unit translation (S2UT)
Here's an example for training Fisher S2UT models with 100 discrete units as target:
fairseq-train $DATA_ROOT \
--config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
--task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
--criterion speech_to_unit --label-smoothing 0.2 \
--arch s2ut_transformer_fisher --share-decoder-input-output-embed \
--dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--train-subset train --valid-subset dev \
--save-dir ${MODEL_DIR} \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
--optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
--max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
--seed 1 --fp16 --num-workers 8
- Adjust `--update-freq` accordingly for different numbers of GPUs. In the above we set `--update-freq 4` to simulate training with 4 GPUs.
- Set `--n-frames-per-step 5` to train an S2UT stacked system with reduction ratio r=5. (Use `$DATA_ROOT` prepared without `--reduce-unit`.)
- (optional) One can turn on tracking MCD loss during training for checkpoint selection by setting `--eval-inference --eval-args '{"beam": 1, "max_len_a": 1}' --best-checkpoint-metric mcd_loss`. It is recommended to use a smaller sampled subset as the validation set, as MCD loss computation is time-consuming (see the sketch below).
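For the MCD-based checkpoint selection above, one way to obtain a smaller validation set is to subsample the dev split. A minimal sketch, assuming the tsv files produced by the preprocessing scripts start with a header row; `dev_small` and `N_VALID` are hypothetical names:

```bash
# Minimal sketch: subsample the dev tsv and point --valid-subset at the smaller split.
N_VALID=500  # hypothetical subset size
head -n 1 ${DATA_ROOT}/dev.tsv > ${DATA_ROOT}/dev_small.tsv
tail -n +2 ${DATA_ROOT}/dev.tsv | shuf -n ${N_VALID} >> ${DATA_ROOT}/dev_small.tsv
# remember to create matching ${DATA_ROOT}/${TASK_NAME}/dev_small.tsv files for any multitasks,
# then train with --valid-subset dev_small
```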
Speech-to-spectrogram translation (S2SPECT)
Here's an example for training Fisher S2SPECT models with reduction ratio r=5:
fairseq-train $DATA_ROOT \
--config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
--task speech_to_speech --n-frames-per-step 5 \
--criterion speech_to_spectrogram \
--arch s2spect_transformer_fisher --decoder-normalize-before \
--dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--train-subset train --valid-subset dev \
--save-dir ${MODEL_DIR} \
--eval-inference --best-checkpoint-metric mcd_loss \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
--optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 --weight-decay 1e-6 \
--max-update 400000 --max-tokens 80000 --max-tokens-valid 30000 --required-batch-size-multiple 1 \
--max-target-positions 3000 --update-freq 16 \
--seed 1 --fp16 --num-workers 8
- Adjust `--update-freq` accordingly for different numbers of GPUs. In the above we set `--update-freq 16` to simulate training with 16 GPUs (see the sketch below).
- We recommend turning on MCD loss during training for the best checkpoint selection.
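As a rough guide for the `--update-freq` adjustment mentioned above, the effective batch size scales with the number of GPUs times the update frequency, so their product should stay constant. A minimal sketch, assuming `nvidia-smi` is available; the variable names are only illustrative:

```bash
# Minimal sketch: keep NUM_GPUS * UPDATE_FREQ constant (16 here, matching the example above).
NUM_GPUS=$(nvidia-smi -L | wc -l)   # or set manually
UPDATE_FREQ=$((16 / NUM_GPUS))
echo "pass --update-freq ${UPDATE_FREQ} when training on ${NUM_GPUS} GPU(s)"
```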
Unit-based HiFi-GAN vocoder
The vocoder is trained with the speech-resynthesis repo. See here for instructions on how to train the unit-based HiFi-GAN vocoder with duration prediction. The same vocoder can support waveform generation for both reduced unit sequences (with `--dur-prediction` set during inference) and original unit sequences.
Inference

Speech-to-unit translation (S2UT)

- Follow the same inference process as in fairseq-S2T to generate unit sequences (`${RESULTS_PATH}/generate-${GEN_SUBSET}.txt`).
fairseq-generate $DATA_ROOT \
--config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
--task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
--path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
--max-tokens 50000 \
--beam 10 --max-len-a 1 \
--results-path ${RESULTS_PATH}
- Set `--beam 1 --n-frames-per-step $r` for decoding with S2UT stacked models.
- Convert unit sequences to waveform.
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
sed 's/^D-//ig' | sort -nk1 | cut -f3 \
> ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit
python examples/speech_to_speech/generate_waveform_from_code.py \
--in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
--vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
--results-path ${RESULTS_PATH} --dur-prediction
- Set `--dur-prediction` for generating audio for S2UT reduced models.
Speech-to-spectrogram translation (S2SPECT)
Follow the same inference process as in fairseq-S^2 to generate waveform.
# assume using a default Griffin-Lim vocoder
python examples/speech_synthesis/generate_waveform.py $DATA_ROOT \
--config-yaml config.yaml --multitask-config-yaml config_multitask.yaml \
--task speech_to_speech --n-frames-per-step 5 \
--path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
--max-tokens 50000 \
--results-path ${RESULTS_PATH} --dump-waveforms --output-sample-rate 16000
In addition to using the default Griffin-Lim vocoder, one can also finetune a HiFi-GAN vocoder for the S2SPECT model by following the instructions in the HiFi-GAN repo.
Multitask decoding
Coming soon.
Evaluation

To evaluate speech translation output, we first apply ASR on the speech output and then compute the BLEU score between the ASR-decoded text and the references using sacreBLEU.
En
- ASR: We use the "Wav2Vec 2.0 Large (LV-60) + Self Training / 960 hours / Libri-Light + Librispeech" En ASR model open-sourced by the wav2vec project. See instructions on how to run inference with a wav2vec-based ASR model. The model is also available on Hugging Face.
- Text normalization: We use the text cleaner at https://github.com/keithito/tacotron for pre-processing reference English text for ASR BLEU evaluation.
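For the BLEU computation itself, a minimal sacreBLEU invocation could look like the following. The file names are hypothetical; the ASR transcripts and references are assumed to be plain text, one line per sample in the same order, with the references pre-processed by the text cleaner mentioned above.

```bash
# Minimal sketch: compute BLEU between ASR-decoded translations and cleaned references.
pip install sacrebleu
sacrebleu ref.cleaned.txt -i asr_transcripts.txt -m bleu
```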