Incomplete tokenizing in multilingual model evaluation #1842
Comments
sherpa-onnx does not support byte BPE models yet. CC @pkufool
Thank you for your response. I see. How is the export of byte BPE models such as multi_zh_en done?
I think there is nothing wrong with your export step.
Are you going to make your pre-trained models public? We can support it in sherpa-onnx.
I see. So is the issue I'm experiencing caused by sherpa-onnx not yet supporting byte BPE models? Yes, to the best of my knowledge, we plan to make them public. I'm confirming now and will let you know if there is a different answer.
Yes, you are right. It is not yet supported in sherpa-onnx, but it is doable.
Sorry for the delay, and thanks for your help. Yes, we plan to make the models public. How can you support the model in sherpa-onnx?
@baileyeet Support for byte-level BPE models has just been added to sherpa-onnx. Please have a look at k2-fsa/sherpa-onnx#1633. I hope that your model will soon be available.
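For context on why an unsupported byte-level BPE model produces raw tokens instead of text: in SentencePiece-style byte BPE, characters outside the vocabulary are split into byte tokens written as `<0xNN>`, so a plain join of token strings yields those markers rather than Japanese text. The sketch below (illustrative only, not sherpa-onnx's actual implementation) shows the extra step required, namely converting byte tokens back into raw bytes and decoding the result as UTF-8:

```python
def bbpe_tokens_to_text(tokens):
    """Reassemble byte-level BPE tokens into text (illustrative sketch)."""
    buf = bytearray()
    for tok in tokens:
        if tok.startswith("<0x") and tok.endswith(">"):
            # byte token such as "<0xE6>" -> a single raw byte
            buf.append(int(tok[3:-1], 16))
        else:
            # ordinary subword; "\u2581" marks a word boundary in SentencePiece
            buf.extend(tok.replace("\u2581", " ").encode("utf-8"))
    return buf.decode("utf-8", errors="replace").strip()

# "日本" encoded as six UTF-8 byte tokens
print(bbpe_tokens_to_text(
    ["\u2581", "<0xE6>", "<0x97>", "<0xA5>", "<0xE6>", "<0x9C>", "<0xAC>"]
))  # → 日本
```

Without this byte-reassembly step, a decoder that only looks tokens up in tokens.txt will emit the `<0xNN>` markers verbatim, which matches the garbled output described in this issue.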
Hi Next-gen Kaldi team,
I'm currently working on a Japanese-English bilingual model while referencing multi_zh_en, reazonspeech, and librispeech. After training the model on reazonspeech-all (Japanese, 35,000 hours) and librispeech (English, 1,000 hours), I've been able to achieve fairly good performance. With greedy search, the WER for both English and Japanese ranges from 3.46 to 8.35, and with modified beam search, from 3.28 to 8.07. I am now trying to evaluate the CER, so I exported my models with the following command:
./zipformer/export-onnx.py \
  --tokens data/lang_bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 35 \
  --avg 1 \
  --exp-dir zipformer/exp \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal False \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --fp16 True
And wrote a script to test the CER:
I have used a very similar script for evaluating multiple other Japanese ASR models, including the Japanese k2 model, successfully. However, the results using the bilingual onnx models are not meaningful:
It appears that the output is only returning tokens.
(This is what data/lang_bbpe_2000/tokens.txt looks like, for reference:)
I tested the evaluation with 4 datasets in Japanese and all produce very similar results, as in no significant change in WER or predicted output across datasets.
I am wondering if there is an issue with the way I am exporting the ONNX models, such as missing an intermediary step or using an incorrect command, which would affect the tokenization of the ASR model.
To test, I decoded with ONNX using onnx_decode.py, similar to what multi_zh_en has in its directory, and the WER matches what I observed when decoding earlier (although I was only able to check English; similar to multi_zh_en, I symlinked the file from librispeech).
Would appreciate any thoughts or insights on this issue. Thank you!