Adding scoring scripts. #24
As the Kaldi recipe development is converging, it's time to think about how we organize this text normalization as post-processing before WER calculation. The processing is pretty simple, containing:
Results with the Google API have shown that this processing can result in a WER difference of up to 1-2% absolute or even more, so it is necessary for consistent/fair comparison on spontaneous speech. Now the problem is how we organize the post-processing.
Options that I can think of for now:
Which one is better? Any preferences or better suggestions? I personally vote for the first solution. @wangyongqing0731 @chenguoguo @sw005320
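The concrete normalization steps were elided above, but a hedged sketch of the kind of pre-scoring text processing this thread discusses might look like the following. The tag and filler lists here are purely illustrative assumptions, not the project's actual lists:

```python
# Illustrative pre-scoring text normalization (NOT the project's actual rules).
# CONVERSATIONAL_FILLERS and PUNCTUATION_TAGS are hypothetical example sets.
CONVERSATIONAL_FILLERS = {"UH", "UM", "ERR"}
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}


def normalize(text: str) -> str:
    """Uppercase, drop punctuation tags and conversational fillers,
    and squeeze whitespace, so REF and HYP are compared on words only."""
    words = text.upper().split()
    kept = [w for w in words
            if w not in PUNCTUATION_TAGS and w not in CONVERSATIONAL_FILLERS]
    return " ".join(kept)
```

Applying (or not applying) a step like this to both REF and HYP before alignment is exactly the kind of difference that can move WER by 1-2% absolute.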
I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind:
One thing that I haven't decided yet is whether or not we should provide the scoring tool. If we do, we can make sure that everyone is using the same tool for scoring and the results will be comparable. But it definitely involves more work.
Sounds good to me.
The reason I stick to one scoring tool is comparability. If we use different toolkits, I recommend at least outputting the total number of words/sentences in the reference, and possibly the sub/del/ins breakdown, e.g.,
As long as the total number of words/sentences is the same, the results are comparable.
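The suggested sub/del/ins breakdown comes directly out of the standard Levenshtein alignment that WER is defined on. A minimal self-contained sketch (not the project's scoring code):

```python
def wer_breakdown(ref: str, hyp: str) -> dict:
    """Align hyp against ref with edit distance and report the
    substitution/deletion/insertion counts plus reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (edit_cost, subs, dels, inss) for r[:i] vs h[:j]
    dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)   # hyp empty: all deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)   # ref empty: all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match, no cost
            else:
                cs, s, d, ins = dp[i - 1][j - 1]
                cd = dp[i - 1][j]
                ci = dp[i][j - 1]
                # take the cheapest of substitution / deletion / insertion
                dp[i][j] = min(
                    (cs + 1, s + 1, d, ins),
                    (cd[0] + 1, cd[1], cd[2] + 1, cd[3]),
                    (ci[0] + 1, ci[1], ci[2], ci[3] + 1),
                )
    cost, s, d, ins = dp[len(r)][len(h)]
    return {"sub": s, "del": d, "ins": ins,
            "ref_words": len(r), "wer": cost / max(len(r), 1)}
```

Reporting `ref_words` alongside the breakdown is what makes the sanity check in this comment possible: if two systems disagree on the reference word count, their WERs were computed on different normalizations and are not comparable.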
I just added a simple scoring tool via #35; it uses sclite to evaluate REF and HYP. Before evaluation, the tool applies the very simple text processing that we discussed in this issue, and I think we should keep this processing simple and stable after release. Besides, the tool is organized in
We finally provide a recommended scoring script based on sclite: https://github.com/SpeechColab/GigaSpeech/blob/main/utils/gigaspeech_scoring.py . Researchers may use this tool if they want consistent comparison across different systems. We'd better leave this issue open for a while, so people can read the discussion above.
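For readers unfamiliar with sclite: it consumes "trn"-format transcripts, one utterance per line with the utterance id in parentheses at the end. A small helper for producing that format (the function name, file names, and exact sclite flags below are illustrative, not taken from the project's script):

```python
def to_trn(utts: dict) -> str:
    """Render {utt_id: transcript} as sclite 'trn' lines: 'TEXT (utt_id)'.
    Sorting by id keeps REF and HYP files in the same order."""
    return "\n".join(f"{text} ({utt_id})"
                     for utt_id, text in sorted(utts.items()))


# With ref.trn and hyp.trn written this way, sclite is invoked roughly as:
#   sclite -r ref.trn trn -h hyp.trn trn -i rm -o all
# which prints the WER together with the sub/del/ins breakdown.
```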
We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.