# CTC-Assisted LLM-Based Contextual ASR


CTC-Assisted LLM-Based Contextual ASR is an LLM-based contextual ASR model that first uses CTC decoding results to filter potentially relevant hotwords from a pre-defined hotword list, and then incorporates them into the LLM prompt to improve recognition of those hotwords.
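As a rough illustration of this idea, the sketch below filters a hotword list against a CTC hypothesis with a simple string-similarity heuristic and splices the survivors into the prompt. The helper name `filter_hotwords`, the `SequenceMatcher`-based matching, and the 0.8 threshold are illustrative assumptions, not the repo's actual filtering code; only the prompt template is taken from this recipe.

```python
# A minimal sketch, assuming a string-similarity filter over the CTC hypothesis;
# the recipe's actual filtering logic may differ.
from difflib import SequenceMatcher

def filter_hotwords(ctc_hypothesis: str, hotword_list: list[str], threshold: float = 0.8) -> list[str]:
    """Keep hotwords whose best match against any word in the CTC hypothesis exceeds the threshold."""
    ctc_words = ctc_hypothesis.upper().split()
    selected = []
    for hotword in hotword_list:
        best = max(
            (SequenceMatcher(None, hotword.upper(), w).ratio() for w in ctc_words),
            default=0.0,
        )
        if best >= threshold:
            selected.append(hotword)
    return selected

# The selected hotwords are then spliced into the LLM prompt (template from this recipe).
prompt_template = "Transcribe speech to text. Some hotwords might help. The hotwords are {}."
hotwords = filter_hotwords("HE SAT BY THE KITCHIN FIRE", ["KITCHEN", "CHIMNEY", "PARLOUR"])
prompt = prompt_template.format(" ".join(hotwords))  # "... The hotwords are KITCHEN."
```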

## Model Architecture

We use the WavLM-Large model, pre-trained on 94,000 hours of data and fine-tuned on 960 hours of LibriSpeech data with CTC loss, as our speech encoder. We use the public Vicuna 7B as our large language model decoder, and a simple linear projector, consisting of a 1-D convolution layer and two linear layers, as our adapter. Refer to our paper for more details.
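A rough PyTorch sketch of such an adapter follows. The dimensions (WavLM-Large hidden size 1024, Vicuna-7B hidden size 4096), the hidden width, and the kernel/stride of the 1-D convolution are assumptions for illustration and may differ from the recipe's actual projector.

```python
# A minimal sketch of a conv + two-linear-layer projector; dimensions are assumptions.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, hidden_dim=2048, k=5):
        super().__init__()
        # 1-D convolution downsamples the frame sequence before projection.
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=k, stride=k)
        self.fc1 = nn.Linear(enc_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, llm_dim)

    def forward(self, x):
        # x: (batch, frames, enc_dim) outputs of the speech encoder
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.fc2(torch.relu(self.fc1(x)))

speech_embeds = LinearProjector()(torch.randn(2, 100, 1024))  # -> (2, 20, 4096)
```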

## Checkpoints

We only train the linear projector in this recipe.

| Encoder | Projector | LLM |
|---|---|---|
| CTC fine-tuned WavLM-Large (~315.45M) | Linear (~15.74M) | vicuna-7b-v1.5 (~6.7B) |

## Performance

## Data preparation

The artificial biasing list constructed in *Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion* is used for contextual ASR testing; refer to the official repo.
The 5,000 most frequent words in the LibriSpeech training corpus are categorized as common words, with the remainder classified as rare words. The biasing list generated for each test-set utterance consists of two parts: the rare words in the transcription, and distractors sampled from a 209.2K rare-word vocabulary. Biasing lists of varying lengths are generated by adding N = {100, 500, 1000, 2000} distractors to each list.
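The construction can be summarized with a small sketch; the function and variable names are hypothetical and only mirror the description above, not the official repo's code.

```python
# A simplified sketch of per-utterance biasing list construction (hypothetical names).
import random

def make_biasing_list(reference, common_words, rare_vocab, n_distractors=100):
    # Rare words = words in the transcription outside the 5,000 most frequent training words.
    rare_in_ref = [w for w in reference.upper().split() if w not in common_words]
    # Distractors are drawn from the rare-word vocabulary (~209.2K entries),
    # with N in {100, 500, 1000, 2000} controlling the biasing list size.
    pool = [w for w in rare_vocab if w not in set(rare_in_ref)]
    return rare_in_ref + random.sample(pool, n_distractors)
```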

The Viterbi decoding results of our CTC fine-tuned WavLM-Large on test-clean and test-other are referenced as `ctc_file` in `contextual_asr_config.py`.
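For context, Viterbi decoding here corresponds to best-path CTC decoding; a generic sketch is shown below. It is illustrative only, not the recipe's actual decoding script.

```python
# Best-path (greedy) CTC decoding: argmax per frame, collapse repeats, drop blanks.
import torch

def ctc_greedy_decode(logits, id_to_char, blank_id=0):
    # logits: (frames, vocab) per-frame CTC scores for one utterance
    ids = logits.argmax(dim=-1).tolist()
    output, prev = [], blank_id
    for i in ids:
        if i != blank_id and i != prev:
            output.append(id_to_char[i])
        prev = i
    return "".join(output)
```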

## Decoding with checkpoints

LLM-based ASR inference script:

```bash
bash decode_wavlm_libri960_ft_char.sh
```

LLM-based contextual ASR inference script, with different biasing list sizes:

```bash
bash decode_wavlm_libri960_ft_char_hotwords.sh
```

## Training the model

LLM-based ASR training script, using the CTC fine-tuned WavLM as the encoder and "Transcribe speech to text." as the prompt:

```bash
bash finetune_wavlm_libri960_ft_char.sh
```

LLM-based contextual ASR training script, using the CTC fine-tuned WavLM as the encoder and "Transcribe speech to text. Some hotwords might help. The hotwords are {}." as the prompt:

```bash
bash finetune_wavlm_libri960_ft_char_hotwords.sh
```

## Citation

You can refer to the paper for more results.

```
@article{yang2024ctc,
  title={CTC-Assisted LLM-Based Contextual ASR},
  author={Yang, Guanrou and Ma, Ziyang and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
  journal={Proc. SLT},
  year={2024}
}
```