# Results

## zipformer (zipformer + pruned stateless transducer)

See k2-fsa#1261 for more details.

### zipformer

#### Non-streaming

##### Training on normalized text, i.e., upper case without punctuation

normal-scaled model, number of model parameters: 65805511, i.e., 65.81 M

You can find a pretrained model and training logs at: https://www.modelscope.cn/models/pkufool/icefall-asr-zipformer-libriheavy-20230926/summary

Note: The repository above contains three models trained on different subsets of libriheavy: exp (large set), exp_medium_subset (medium set), and exp_small_subset (small set).
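
If you want the checkpoints locally, something like the following should work; the `.git` clone URL is inferred from the ModelScope page above and the git-lfs handling is an assumption, so adjust as needed:

```bash
# Hedged download sketch: the clone URL is inferred from the ModelScope
# page above; large checkpoint files typically require git-lfs.
git lfs install
git clone https://www.modelscope.cn/pkufool/icefall-asr-zipformer-libriheavy-20230926.git
ls icefall-asr-zipformer-libriheavy-20230926
# Expected subdirectories: exp (large set), exp_medium_subset (medium set),
# exp_small_subset (small set).
```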

Results of models (WER in %):

| training set | decoding method      | librispeech clean | librispeech other | libriheavy clean | libriheavy other | comment             |
|--------------|----------------------|-------------------|-------------------|------------------|------------------|---------------------|
| small        | greedy search        | 4.19              | 9.99              | 4.75             | 10.25            | --epoch 90 --avg 20 |
| small        | modified beam search | 4.05              | 9.89              | 4.68             | 10.01            | --epoch 90 --avg 20 |
| medium       | greedy search        | 2.39              | 4.85              | 2.90             | 6.60             | --epoch 60 --avg 20 |
| medium       | modified beam search | 2.35              | 4.82              | 2.90             | 6.57             | --epoch 60 --avg 20 |
| large        | greedy search        | 1.67              | 3.32              | 2.24             | 5.61             | --epoch 16 --avg 3  |
| large        | modified beam search | 1.62              | 3.36              | 2.20             | 5.57             | --epoch 16 --avg 3  |

The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

# For the large subset use --num-epochs 16 and --lr-hours 20000;
# for the small subset use --num-epochs 90 and --lr-hours 5000.
python ./zipformer/train.py \
    --world-size 4 \
    --master-port 12365 \
    --exp-dir zipformer/exp \
    --num-epochs 60 \
    --lr-hours 15000 \
    --use-fp16 1 \
    --start-epoch 1 \
    --bpe-model data/lang_bpe_500/bpe.model \
    --max-duration 1000 \
    --subset medium
```
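
To monitor training, icefall recipes normally write TensorBoard logs under the experiment directory; a minimal sketch, assuming the default log location under --exp-dir:

```bash
# Assumes icefall's default TensorBoard log directory under --exp-dir.
tensorboard --logdir zipformer/exp/tensorboard --port 6006
```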

The decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

# --epoch 16 --avg 3 corresponds to the large-subset model;
# set --epoch and --avg per the table above for the other subsets.
for m in greedy_search modified_beam_search; do
  ./zipformer/decode.py \
      --epoch 16 \
      --avg 3 \
      --exp-dir zipformer/exp \
      --max-duration 1000 \
      --causal 0 \
      --decoding-method $m
done
```
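
For the other two models, substitute the checkpoint settings from the results table. For example, a sketch for the small-subset model, assuming its checkpoints sit in zipformer/exp_small_subset as in the ModelScope repository:

```bash
# Sketch: decode the small-subset model with the table's settings
# (--epoch 90 --avg 20).
./zipformer/decode.py \
    --epoch 90 \
    --avg 20 \
    --exp-dir zipformer/exp_small_subset \
    --max-duration 1000 \
    --causal 0 \
    --decoding-method modified_beam_search
```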

##### Training on full formatted text, i.e., with casing and punctuation

normal-scaled model, number of model parameters: 66074067, i.e., 66.07 M

You can find a pretrained model and training logs at: https://www.modelscope.cn/models/pkufool/icefall-asr-zipformer-libriheavy-punc-20230830/summary

Note: The repository above contains three models trained on different subsets of libriheavy: exp (large set), exp_medium_subset (medium set), and exp_small_subset (small set).

Results of models:

| training set | decoding method      | libriheavy clean (WER) | libriheavy other (WER) | libriheavy clean (CER) | libriheavy other (CER) | comment             |
|--------------|----------------------|------------------------|------------------------|------------------------|------------------------|---------------------|
| small        | modified beam search | 13.04                  | 19.54                  | 4.51                   | 7.90                   | --epoch 88 --avg 41 |
| medium       | modified beam search | 9.84                   | 13.39                  | 3.02                   | 5.10                   | --epoch 50 --avg 15 |
| large        | modified beam search | 7.76                   | 11.32                  | 2.41                   | 4.22                   | --epoch 16 --avg 2  |

The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

# For the large subset use --num-epochs 16 and --lr-hours 20000;
# for the small subset use --num-epochs 90 and --lr-hours 10000.
python ./zipformer/train.py \
    --world-size 4 \
    --master-port 12365 \
    --exp-dir zipformer/exp \
    --num-epochs 60 \
    --lr-hours 15000 \
    --use-fp16 1 \
    --train-with-punctuation 1 \
    --start-epoch 1 \
    --bpe-model data/lang_punc_bpe_756/bpe.model \
    --max-duration 1000 \
    --subset medium
```

The decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

# Set --epoch and --avg according to the table above.
for m in greedy_search modified_beam_search; do
  ./zipformer/decode.py \
      --epoch 16 \
      --avg 3 \
      --exp-dir zipformer/exp \
      --max-duration 1000 \
      --causal 0 \
      --decoding-method $m
done
```
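
As a concrete example, a sketch for decoding the medium-subset punctuation model with the settings from the table above; the exp-dir name and the explicit --bpe-model flag are assumptions (the directory layout follows the ModelScope repository, and the BPE model matches the one used in training):

```bash
# Sketch: decode the medium-subset punctuation model
# (--epoch 50 --avg 15 per the table above).
./zipformer/decode.py \
    --epoch 50 \
    --avg 15 \
    --exp-dir zipformer/exp_medium_subset \
    --bpe-model data/lang_punc_bpe_756/bpe.model \
    --max-duration 1000 \
    --causal 0 \
    --decoding-method modified_beam_search
```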

## Zipformer PromptASR (zipformer + PromptASR + BERT text encoder)

See k2-fsa#1250 for commit history and our paper https://arxiv.org/abs/2309.07414 for more details.

### Training on the medium subset, with content & style prompt, no context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10
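
To fetch the checkpoints locally, a sketch (assumes git-lfs is installed):

```bash
# Download the pre-trained PromptASR model from Hugging Face.
git lfs install
git clone https://huggingface.co/marcoyang/icefall-promptasr-libriheavy-zipformer-BERT-2023-10-10
```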

The training command is:

```bash
causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT
# top_k is only relevant when --use-context-list is 1; the value here is
# an assumption, mirroring the context-list recipe below.
top_k=10000

python ./zipformer_prompt_asr/train_bert_encoder.py \
    --world-size 4 \
    --start-epoch 1 \
    --num-epochs 60 \
    --exp-dir ./zipformer_prompt_asr/exp \
    --use-fp16 True \
    --memory-dropout-rate $memory_dropout_rate \
    --causal $causal \
    --subset $subset \
    --manifest-dir data/fbank \
    --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
    --max-duration 1000 \
    --text-encoder-type $text_encoder_type \
    --text-encoder-dim 768 \
    --use-context-list 0 \
    --top-k $top_k \
    --use-style-prompt 1
```

The decoding results (WER in %) using utterance-level context (epoch 60, avg 10):

| decoding method      | lh-test-clean | lh-test-other | comment                                                                  |
|----------------------|---------------|---------------|--------------------------------------------------------------------------|
| modified_beam_search | 3.13          | 6.78          | --use-pre-text False --use-style-prompt False                            |
| modified_beam_search | 2.86          | 5.93          | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc  |
| modified_beam_search | 2.60          | 5.50          | --pre-text-transform mixed-punc --style-text-transform mixed-punc        |

The decoding command is:

```bash
# This loop covers the last two rows of the table above; for the first
# row (no prompts), see the sketch after this block.
for style in mixed-punc upper-no-punc; do
    python ./zipformer_prompt_asr/decode_bert.py \
        --epoch 60 \
        --avg 10 \
        --use-averaged-model True \
        --post-normalization True \
        --causal False \
        --exp-dir ./zipformer_prompt_asr/exp \
        --manifest-dir data/fbank \
        --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
        --max-duration 1000 \
        --decoding-method modified_beam_search \
        --beam-size 4 \
        --text-encoder-type BERT \
        --text-encoder-dim 768 \
        --memory-layer 0 \
        --use-ls-test-set False \
        --use-ls-context-list False \
        --max-prompt-lens 1000 \
        --use-pre-text True \
        --use-style-prompt True \
        --style-text-transform $style \
        --pre-text-transform $style \
        --compute-CER 0
done
```
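
For the first row of the table (no content or style prompt), a minimal sketch that mirrors the loop body but drops the prompt-related flags, per the comment column of the table:

```bash
# Sketch: decode without any prompt (first row of the table above).
python ./zipformer_prompt_asr/decode_bert.py \
    --epoch 60 \
    --avg 10 \
    --use-averaged-model True \
    --causal False \
    --exp-dir ./zipformer_prompt_asr/exp \
    --manifest-dir data/fbank \
    --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
    --max-duration 1000 \
    --decoding-method modified_beam_search \
    --beam-size 4 \
    --text-encoder-type BERT \
    --text-encoder-dim 768 \
    --memory-layer 0 \
    --use-ls-test-set False \
    --use-ls-context-list False \
    --use-pre-text False \
    --use-style-prompt False \
    --compute-CER 0
```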

### Training on the medium subset, with content & style prompt, with context list

You can find a pre-trained model, training logs, decoding logs, and decoding results at: https://huggingface.co/marcoyang/icefall-promptasr-with-context-libriheavy-zipformer-BERT-2023-10-10

This model is trained with an extra type of content prompt (context words), so it performs better at word-level context biasing. Note that to train this model, you should first run prepare_prompt_asr.sh to prepare a manifest containing context words (the invocation is included in the command below).

The training command is:

```bash
causal=0
subset=medium
memory_dropout_rate=0.05
text_encoder_type=BERT
use_context_list=True

# prepare the required data for context biasing
./prepare_prompt_asr.sh --stage 0 --stop_stage 1

python ./zipformer_prompt_asr/train_bert_encoder.py \
    --world-size 4 \
    --start-epoch 1 \
    --num-epochs 50 \
    --exp-dir ./zipformer_prompt_asr/exp \
    --use-fp16 True \
    --memory-dropout-rate $memory_dropout_rate \
    --causal $causal \
    --subset $subset \
    --manifest-dir data/fbank \
    --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
    --max-duration 1000 \
    --text-encoder-type $text_encoder_type \
    --text-encoder-dim 768 \
    --use-context-list $use_context_list \
    --top-k 10000 \
    --use-style-prompt 1
```

Utterance-level biasing (WER in %):

| decoding method      | lh-test-clean | lh-test-other | comment                                                                  |
|----------------------|---------------|---------------|--------------------------------------------------------------------------|
| modified_beam_search | 3.17          | 6.72          | --use-pre-text 0 --use-style-prompt 0                                    |
| modified_beam_search | 2.91          | 6.24          | --pre-text-transform upper-no-punc --style-text-transform upper-no-punc  |
| modified_beam_search | 2.72          | 5.72          | --pre-text-transform mixed-punc --style-text-transform mixed-punc        |

The decoding command for the table above is:

```bash
for style in mixed-punc upper-no-punc; do
    python ./zipformer_prompt_asr/decode_bert.py \
        --epoch 50 \
        --avg 10 \
        --use-averaged-model True \
        --post-normalization True \
        --causal False \
        --exp-dir ./zipformer_prompt_asr/exp \
        --manifest-dir data/fbank \
        --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
        --max-duration 1000 \
        --decoding-method modified_beam_search \
        --beam-size 4 \
        --text-encoder-type BERT \
        --text-encoder-dim 768 \
        --memory-layer 0 \
        --use-ls-test-set False \
        --use-ls-context-list False \
        --max-prompt-lens 1000 \
        --use-pre-text True \
        --use-style-prompt True \
        --style-text-transform $style \
        --pre-text-transform $style \
        --compute-CER 0
done
```

Word-level biasing (WER in %):

The results are reported on the LibriSpeech test sets, using the biasing list provided in https://arxiv.org/abs/2104.02194. You need to set --use-ls-test-set True so that the LibriSpeech test sets are used.

| decoding method      | ls-test-clean | ls-test-other | comment                                                                                                         |
|----------------------|---------------|---------------|-----------------------------------------------------------------------------------------------------------------|
| modified_beam_search | 2.40          | 5.08          | --use-pre-text 0 --use-style-prompt 0                                                                           |
| modified_beam_search | 2.14          | 4.62          | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 0    |
| modified_beam_search | 2.14          | 4.64          | --use-ls-context-list 1 --pre-text-transform mixed-punc --style-text-transform mixed-punc --ls-distractors 100  |

The decoding command for the table above is:

```bash
use_ls_test_set=1
use_ls_context_list=1

for ls_distractors in 0 100; do
    python ./zipformer_prompt_asr/decode_bert.py \
        --epoch 50 \
        --avg 10 \
        --use-averaged-model True \
        --post-normalization True \
        --causal False \
        --exp-dir ./zipformer_prompt_asr/exp \
        --manifest-dir data/fbank \
        --bpe-model data/lang_bpe_500_fallback_coverage_0.99/bpe.model \
        --max-duration 1000 \
        --decoding-method modified_beam_search \
        --beam-size 4 \
        --text-encoder-type BERT \
        --text-encoder-dim 768 \
        --memory-layer 0 \
        --use-ls-test-set $use_ls_test_set \
        --use-ls-context-list $use_ls_context_list \
        --ls-distractors $ls_distractors \
        --max-prompt-lens 1000 \
        --use-pre-text True \
        --use-style-prompt True \
        --style-text-transform mixed-punc \
        --pre-text-transform mixed-punc \
        --compute-CER 0
done
```