Incomplete tokenizing in multilingual model evaluation #1842

Closed
baileyeet opened this issue Dec 19, 2024 · 8 comments · Fixed by k2-fsa/sherpa-onnx#1633

@baileyeet

Hi Next-gen Kaldi team,

I'm currently working on a Japanese-English bilingual model, referencing the multi_zh_en, reazonspeech, and librispeech recipes. After training the model on reazonspeech-all (Japanese, 35,000 hours) and librispeech (English, 1,000 hours), I've been able to achieve fairly good performance: across the English and Japanese test sets, the WER ranges from 3.46 to 8.35 with greedy search and from 3.28 to 8.07 with modified beam search. I am now trying to evaluate the CER, so I exported my models with the following command:

./zipformer/export-onnx.py \
  --tokens data/lang_bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 35 \
  --avg 1 \
  --exp-dir zipformer/exp \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal False \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --fp16 True
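
As a quick sanity check that the export produced loadable models, I load each file with the onnx package and print whatever metadata the export attached (a minimal sketch; the file names are the defaults written into zipformer/exp for epoch 35 / avg 1):

import onnx

for name in ("encoder", "decoder", "joiner"):
    path = "zipformer/exp/%s-epoch-35-avg-1.onnx" % name
    model = onnx.load(path)
    onnx.checker.check_model(model)
    # Any key/value metadata the export script attached shows up here
    # (it may be empty for some of the three files).
    print(path, {p.key: p.value for p in model.metadata_props})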

I then wrote a script to test the CER:

#!/usr/bin/env python3

import onnx
import sherpa_onnx
import numpy as np
import os
import re
import sys
import json
import librosa
import num2words
import editdistance
from reazonspeech.k2.asr import transcribe, audio_from_path, TranscribeConfig
import time

BASEDIR = "k2-multi-ja-en"
PAD_SECONDS = 0.9

PUNCTUATIONS = {ord(x): "" for x in "、。「」『』，,？?！!"}
ZENKAKU = "ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ０１２３４５６７８９"
HANKAKU = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
ZEN2HAN = str.maketrans(ZENKAKU, HANKAKU)

HF_REPO_FILES = {
    "fp32": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.onnx",
        "decoder": "decoder-epoch-35-avg-1.onnx",
        "joiner": "joiner-epoch-35-avg-1.onnx",
    },
    "int8": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.int8.onnx",
        "decoder": "decoder-epoch-35-avg-1.int8.onnx",
        "joiner": "joiner-epoch-35-avg-1.int8.onnx",
    },
    "int8-fp32": {
        "tokens": "tokens.txt",
        "encoder": "encoder-epoch-35-avg-1.int8.onnx",
        "decoder": "decoder-epoch-35-avg-1.onnx",
        "joiner": "joiner-epoch-35-avg-1.int8.onnx",
    }
}

def normalize(s):
    # Strip punctuation, fold zenkaku characters to hankaku, and spell
    # out numbers in Japanese so formatting differences don't inflate CER.
    s = s.translate(PUNCTUATIONS).translate(ZEN2HAN)
    conv = lambda m: num2words.num2words(m.group(0), lang='ja')
    return re.sub(r'\d+\.?\d*', conv, s)

def load_model(device="cpu", precision="fp32"):
    if precision not in HF_REPO_FILES:
        raise ValueError("Unknown precision: '%s'" % precision)

    files = HF_REPO_FILES[precision]

    return sherpa_onnx.OfflineRecognizer.from_transducer(
        tokens=os.path.join(BASEDIR, files["tokens"]),
        encoder=os.path.join(BASEDIR, files['encoder']),
        decoder=os.path.join(BASEDIR, files['decoder']),
        joiner=os.path.join(BASEDIR, files['joiner']),
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method="greedy_search",
        provider=device,
    )

def main():
    if sys.stdin.isatty():
        return

    model = load_model(device='cpu')
    config = TranscribeConfig(verbose=False)

    # Each stdin line is a JSON object with 'audio_filepath' and 'text' keys.
    items = []
    for line in sys.stdin:
        items.append(json.loads(line))

    errors = 0
    chars = 0
    start = time.time()
    print('start')
    for idx, item in enumerate(items):
        audio, samplerate = librosa.load(item['audio_filepath'], sr=None)

        # Apply padding
        audio = np.pad(audio,
                       pad_width=int(PAD_SECONDS * samplerate),
                       mode='constant')

        stream = model.create_stream()
        stream.accept_waveform(samplerate, audio)
        model.decode_stream(stream)
        asr = normalize(stream.result.text)
        text = normalize(item["text"])
        chars += len(text)
        dist = editdistance.eval(asr, text)
        errors += dist

        print("%05i\t%i\t%.2f%%\t%s\t%s" % (idx, dist, errors / chars * 100, item['text'], stream.result.text))
    print("CER: %.2f%%" % (errors / chars * 100))
    end = time.time()
    print("Time: %8.3fs" % (end - start))

if __name__ == '__main__':
    main()
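
For reference, the script reads newline-delimited JSON from stdin, one utterance per line with the audio_filepath and text keys used above (the paths here are just placeholders):

{"audio_filepath": "data/eval/0001.wav", "text": "ではピンクの方のひもも半分のところを持ちます"}
{"audio_filepath": "data/eval/0002.wav", "text": "エレンが父親から託された生家の地下室の鍵"}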

I have successfully used a very similar script to evaluate several other Japanese ASR models, including the Japanese k2 model. However, the results from the bilingual ONNX models are not meaningful:

ret.text:  ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
Predicted:  ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
Expected: ではピンクの方のひもも半分のところを持ちます
Distance: 88
00011	88	378.96%	ではピンクの方のひもも半分のところを持ちます	 ƊĢō ƊĢŕ ƊĤķ ƊĤř Ɗģŕ ƊĢŔ ƍĹş ƊĢŔ ƊĢŘ Ɗģģ Ɗģģ ƌĮī ƌĩħ ƊĢŔ ƊĢŎ ƊĢĶ ƊģĮ Ɗģĵ ƍĭĢ ƊĢņ ƊĢŤ ƊĢļ
ret.text:  ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
Predicted:  ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
Expected: このパンチカードの情報がタテ糸とヨコ糸の織り方を指示します
Distance: 116
00012	116	380.59%	このパンチカードの情報がタテ糸とヨコ糸の織り方を指示します	 ƊĢĶ ƊĢŔ ƊĤĴ ƊĤř ƊĤĢ Ɗģő ƊĤŢ ƊĤĪ ƊĢŔ ƍĤĦ ƌŅŗ ƊĢĭ Ɗģť ƊĤħ ƎřŞ ƊĢŎ ƊĤŎ Ɗģř ƎřŞ ƊĢŔ Ǝşķ Ɗģī ƍĹş Ɗģĵ ƍĭĨ ƎŊŠ ƊĢĺ ƊĢŤ ƊĢļ
ret.text:  ƊģŎ ƊĤŒ ƊĤř ƊĢĭ ƎĩŜ ƏŌŐ ƊĢĬ ƊģĪ ƏŎĺ ƊĢĸ Ɗģĭ ƊĢń Ǝķń ƌŔŜ ƊĢŔ ƌŁŖ ƋŞĬ ƌŔŊ ƊĢŔ ƐĮś
Predicted:  ƊģŎ ƊĤŒ ƊĤř ƊĢĭ ƎĩŜ ƏŌŐ ƊĢĬ ƊģĪ ƏŎĺ ƊĢĸ Ɗģĭ ƊĢń Ǝķń ƌŔŜ ƊĢŔ ƌŁŖ ƋŞĬ ƌŔŊ ƊĢŔ ƐĮś
Expected: エレンが父親から託された生家の地下室の鍵

It appears that the output consists only of raw token strings rather than decoded text.
(This is what data/lang_bbpe_2000/tokens.txt looks like, for reference:)

▁ƋŠŠ 48
▁ƊģŊ 49
▁ƊĢŊ 50
▁ƊģĢ 51
▁Ɗģİ 52
▁ƊĤő 53
▁ƊģĤ 54

I tested the evaluation with four Japanese datasets, and all of them produce very similar results: the predicted output is garbled in the same way, and there is no significant change in the error rate across datasets.

I am wondering if there is an issue with the way I am exporting the ONNX models, such as a missing intermediate step or an incorrect command, that would affect the tokenization of the ASR model.

To test this, I decoded with ONNX using onnx_decode.py, similar to the one in the multi_zh_en directory, and the WER matches what I observed when decoding earlier (although I was only able to check English: as in multi_zh_en, I symlinked the file from librispeech).

Would appreciate any thoughts or insights on this issue. Thank you!

@csukuangfj
Collaborator

sherpa-onnx does not support byte-level BPE models yet.

CC @pkufool

@baileyeet
Author

Thank you for your response. I see. How is exporting bbpe models such as multi_zh_en done?

@csukuangfj
Collaborator

> Thank you for your response. I see. How is exporting bbpe models such as multi_zh_en done?

I think there is nothing wrong with your export step.

@csukuangfj
Collaborator

Are you going to make your pre-trained models public? We can support it in sherpa-onnx.

@baileyeet
Author

I see. So the cause of the issue I'm experiencing is that sherpa-onnx does not support byte-level BPE models yet?

Yes, to the best of my knowledge, we plan to. I'm confirming now and will let you know if there is a different answer.

@csukuangfj
Collaborator

> So the cause of the issue I'm experiencing is that sherpa-onnx does not support byte-level BPE models yet?

Yes, you are right. We have not supported it in sherpa-onnx yet, but it is doable.
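
For context: a byte-level BPE model maps every byte of the UTF-8 text to a distinct printable character before learning BPE pieces, so the recognizer's raw token strings (the gibberish above) have to be mapped back to bytes and UTF-8-decoded to recover the text; plain concatenation of the surface forms is what you are seeing. Here is a self-contained sketch of that round trip; the mapping is illustrative only, not the actual table icefall and sherpa-onnx use:

def byte_to_char_table():
    # Map each byte 0-255 to a unique printable character: printable,
    # non-space ASCII stays itself; everything else is shifted to U+0100+.
    chars = {}
    offset = 0x100
    for b in range(256):
        if 33 <= b <= 126:
            chars[b] = chr(b)
        else:
            chars[b] = chr(offset)
            offset += 1
    return chars

B2C = byte_to_char_table()
C2B = {c: b for b, c in B2C.items()}

def encode(text):
    # Text -> printable stand-in characters, one per UTF-8 byte.
    return "".join(B2C[b] for b in text.encode("utf-8"))

def decode(symbols):
    # Stand-in characters -> bytes -> text. This is the step that is
    # missing when token surface forms are concatenated directly.
    return bytes(C2B[c] for c in symbols).decode("utf-8", errors="replace")

s = "ではピンクの方のひもも半分のところを持ちます"
assert decode(encode(s)) == s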

@baileyeet
Author

Sorry for the delay and thanks for your help. Yes, we plan to make the models public. How can you support the model in sherpa-onnx?

@csukuangfj
Collaborator

csukuangfj commented Dec 20, 2024

> Sorry for the delay and thanks for your help. Yes, we plan to make the models public. How can you support the model in sherpa-onnx?

@baileyeet I just added support for byte-level BPE models in sherpa-onnx.

Please have a look at k2-fsa/sherpa-onnx#1633

I hope that your model will soon be available.
