Incomplete tokenizing in multilingual model evaluation #1842

Closed
baileyeet opened this issue Dec 19, 2024 · 8 comments · Fixed by k2-fsa/sherpa-onnx#1633

@baileyeet

Hi Next-gen Kaldi team,

I'm currently working on a Japanese-English bilingual model, referencing the multi_zh_en, reazonspeech, and librispeech recipes. After training the model on reazonspeech-all (Japanese, 35,000 hours) and librispeech (English, 1,000 hours), I've been able to achieve fairly good performance: across the English and Japanese test sets, the WER ranges from 3.46 to 8.35 with greedy search and from 3.28 to 8.07 with modified beam search. I am now trying to evaluate the CER, so I exported my models with the following command:

./zipformer/export-onnx.py \
  --tokens data/lang_bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 35 \
  --avg 1 \
  --exp-dir zipformer/exp \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal False \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --fp16 True
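
As a quick sanity check that the export produced loadable models, I load each file with the onnx package and print whatever metadata the export attached (a minimal sketch; the file names are the defaults written into zipformer/exp for epoch 35 / avg 1):

import onnx

for name in ("encoder", "decoder", "joiner"):
    path = "zipformer/exp/%s-epoch-35-avg-1.onnx" % name
    model = onnx.load(path)
    onnx.checker.check_model(model)
    # Any key/value metadata the export script attached shows up here
    # (it may be empty for some of the three files).
    print(path, {p.key: p.value for p in model.metadata_props})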

I then wrote a script to test the CER:

#!/usr/bin/env python3

import onnx
import sherpa_onnx
import numpy as np
import os
import re
import sys
import json
import librosa
import num2words
import editdistance
from reazonspeech.k2.asr import transcribe, audio_from_path, TranscribeConfig
import time

BASEDIR = "k2-multi-ja-en"
PAD_SECONDS = 0.9

PUNCTUATIONS = {ord(x): "" for x in "、。「」『』，,？?！!"}
ZENKAKU = "ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ０１２３４５６７８９"
HANKAKU = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
ZEN2HAN = str.maketrans(ZENKAKU, HANKAKU)

HF_REPO_FILES = {
    "fp32": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.onnx",
        "decoder": "decoder-epoch-35-avg-1.onnx",
        "joiner": "joiner-epoch-35-avg-1.onnx",
    },
    "int8": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.int8.onnx",
        "decoder": "decoder-epoch-35-avg-1.int8.onnx",
        "joiner": "joiner-epoch-35-avg-1.int8.onnx",
    },
    "int8-fp32": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.int8.onnx",
        "decoder": "decoder-epoch-35-avg-1.onnx",
        "joiner": "joiner-epoch-35-avg-1.int8.onnx",
    }
}

def normalize(s):
    # Strip punctuation, fold zenkaku characters to hankaku, and spell
    # out numbers in Japanese so formatting differences don't inflate CER.
    s = s.translate(PUNCTUATIONS).translate(ZEN2HAN)
    conv = lambda m: num2words.num2words(m.group(0), lang='ja')
    return re.sub(r'\d+\.?\d*', conv, s)

def load_model(device="cpu", precision="fp32"):
    if precision not in HF_REPO_FILES:
        raise ValueError("Unknown precision: '%s'" % precision)

    files = HF_REPO_FILES[precision]

    return sherpa_onnx.OfflineRecognizer.from_transducer(
        tokens=os.path.join(BASEDIR, files["tokens"]),
        encoder=os.path.join(BASEDIR, files['encoder']),
        decoder=os.path.join(BASEDIR, files['decoder']),
        joiner=os.path.join(BASEDIR, files['joiner']),
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method="greedy_search",
        provider=device,
    )

def main():
    if sys.stdin.isatty():
        return

    model = load_model(device='cpu')
    config = TranscribeConfig(verbose=False)

    # Each stdin line is a JSON object with 'audio_filepath' and 'text' keys.
    items = []
    for line in sys.stdin:
        items.append(json.loads(line))

    errors = 0
    chars = 0
    start = time.time()
    print('start')
    for idx, item in enumerate(items):
        audio, samplerate = librosa.load(item['audio_filepath'], sr=None)

        # Apply padding
        audio = np.pad(audio,
                       pad_width=int(PAD_SECONDS * samplerate),
                       mode='constant')

        stream = model.create_stream()
        stream.accept_waveform(samplerate, audio)
        model.decode_stream(stream)
        asr = normalize(stream.result.text)
        text = normalize(item["text"])
        chars += len(text)
        dist = editdistance.eval(asr, text)
        errors += dist

        print("%05i\t%i\t%.2f%%\t%s\t%s" % (idx, dist, errors / chars * 100, item['text'], stream.result.text))
    print("CER: %.2f%%" % (errors / chars * 100))
    end = time.time()
    print("Time: %8.3fs" % (end - start))

if __name__ == '__main__':
    main()
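
For reference, the script reads newline-delimited JSON from stdin, one utterance per line with the audio_filepath and text keys used above (the paths here are just placeholders):

{"audio_filepath": "data/eval/0001.wav", "text": "ではピンクの方のひもも半分のところを持ちます"}
{"audio_filepath": "data/eval/0002.wav", "text": "エレンが父親から託された生家の地下室の鍵"}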

I have successfully used a very similar script to evaluate several other Japanese ASR models, including the Japanese k2 model. However, the results from the bilingual ONNX models are not meaningful:

ret.text:  ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
Predicted:  ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
Expected: ではピンクの方のひもも半分のところを持ちます
Distance: 88
00011	88	378.96%	ではピンクの方のひもも半分のところを持ちます	 ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
ret.text:  ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
Predicted:  ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
Expected: このパンチカードの情報がタテ糸とヨコ糸の織り方を指示します
Distance: 116
00012	116	380.59%	このパンチカードの情報がタテ糸とヨコ糸の織り方を指示します	 ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
ret.text:  ƊģŎ ƊĤŒ ƊĤř ƊĢĭ ƎĩŜ ƏŌŐ ƊĢĬ ƊģĪ ƏŎĺ ƊĢĸ Ɗģĭ ƊĢń Ǝķń ƌŔŜ ƊĢŔ ƌŁŖ ƋŞĬ ƌŔŊ ƊĢŔ ƐĮś
Predicted:  ƊģŎ ƊĤŒ ƊĤř ƊĢĭ ƎĩŜ ƏŌŐ ƊĢĬ ƊģĪ ƏŎĺ ƊĢĸ Ɗģĭ ƊĢń Ǝķń ƌŔŜ ƊĢŔ ƌŁŖ ƋŞĬ ƌŔŊ ƊĢŔ ƐĮś
Expected: エレンが父親から託された生家の地下室の鍵

It appears that the output consists only of raw token strings rather than decoded text.
(This is what data/lang_bbpe_2000/tokens.txt looks like, for reference:)

▁ƋŠŠ 48
▁ƊģŊ 49
▁ƊĢŊ 50
▁ƊģĢ 51
▁Ɗģİ 52
▁ƊĤő 53
▁ƊģĤ 54

I tested the evaluation with four Japanese datasets, and all of them produce very similar results: the predicted output is garbled in the same way, and there is no significant change in the error rate across datasets.

I am wondering if there is an issue with the way I am exporting the ONNX models, such as a missing intermediate step or an incorrect command, that would affect the tokenization of the ASR model.

To test this, I decoded with ONNX using onnx_decode.py, similar to the one in the multi_zh_en directory, and the WER matches what I observed when decoding earlier (although I was only able to check English: as in multi_zh_en, I symlinked the file from librispeech).

Would appreciate any thoughts or insights on this issue. Thank you!

@csukuangfj
Collaborator

sherpa-onnx does not support byte-level BPE models yet.

CC @pkufool

@baileyeet
Author

Thank you for your response. I see. How is exporting bbpe models such as multi_zh_en done?

@csukuangfj
Collaborator

> Thank you for your response. I see. How is exporting bbpe models such as multi_zh_en done?

I think there is nothing wrong with your export step.

@csukuangfj
Collaborator

Are you going to make your pre-trained models public? We can support it in sherpa-onnx.

@baileyeet
Author

I see. So the cause of the issue I'm experiencing is that sherpa-onnx does not support byte-level BPE models yet?

Yes, to the best of my knowledge, we plan to. I'm confirming now and will let you know if there is a different answer.

@csukuangfj
Collaborator

> So the cause of the issue I'm experiencing is that sherpa-onnx does not support byte-level BPE models yet?

Yes, you are right. We have not supported it in sherpa-onnx yet, but it is doable.
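
For context: a byte-level BPE model maps every byte of the UTF-8 text to a distinct printable character before learning BPE pieces, so the recognizer's raw token strings (the gibberish above) have to be mapped back to bytes and UTF-8-decoded to recover the text; plain concatenation of the surface forms is what you are seeing. Here is a self-contained sketch of that round trip; the mapping is illustrative only, not the actual table icefall and sherpa-onnx use:

def byte_to_char_table():
    # Map each byte 0-255 to a unique printable character: printable,
    # non-space ASCII stays itself; everything else is shifted to U+0100+.
    chars = {}
    offset = 0x100
    for b in range(256):
        if 33 <= b <= 126:
            chars[b] = chr(b)
        else:
            chars[b] = chr(offset)
            offset += 1
    return chars

B2C = byte_to_char_table()
C2B = {c: b for b, c in B2C.items()}

def encode(text):
    # Text -> printable stand-in characters, one per UTF-8 byte.
    return "".join(B2C[b] for b in text.encode("utf-8"))

def decode(symbols):
    # Stand-in characters -> bytes -> text. This is the step that is
    # missing when token surface forms are concatenated directly.
    return bytes(C2B[c] for c in symbols).decode("utf-8", errors="replace")

s = "ではピンクの方のひもも半分のところを持ちます"
assert decode(encode(s)) == s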

@baileyeet
Author

Sorry for the delay and thanks for your help. Yes, we plan to make the models public. How can you support the model in sherpa-onnx?

@csukuangfj
Collaborator

csukuangfj commented Dec 20, 2024

> Sorry for the delay and thanks for your help. Yes, we plan to make the models public. How can you support the model in sherpa-onnx?

@baileyeet I just added support for byte-level BPE models in sherpa-onnx.

Please have a look at k2-fsa/sherpa-onnx#1633

I hope that your model will soon be available.
