
Adding scoring scripts. #24

Open
chenguoguo opened this issue Mar 19, 2021 · 5 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@chenguoguo
Collaborator

We should provide scoring scripts (e.g., for normalization) so that results from different toolkits are comparable.

@dophist
Collaborator

dophist commented Mar 28, 2021

As the Kaldi recipe development is converging, it's time to think about how we organize this text normalization as a post-processing step before WER calculation.

The processing is pretty simple. It consists of:

  1. removal of "AH UM UH ER ERR HM ..." etc. from both REF and HYP, to avoid counting errors like

    REF: AH THIS ...
    HYP: UH THIS ...


    These words are simply meaningless for WER computation, and it is really HARD to keep them consistent between human transcribers and models (say, AH, UH and ER may sound identical, so human annotators and models may produce them differently from time to time).

  2. removal of "-" hyphens from both REF and HYP, to avoid counting errors like:

    REF: T-SHIRT
    HYP: T SHIRT


    Hyphens are somewhat more frequent than I expected. Our training-text normalization kept hyphens because T-SHIRT is indeed a meaningful word, rather than T SHIRT. But in testing and evaluation, removing them gives more robust and reasonable WER numbers.

Results on the Google API have shown that this processing can make a WER difference of up to 1-2% absolute, or even more, so it is necessary for a consistent and fair comparison on spontaneous speech.
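For illustration, here is a minimal Python sketch of the two normalization steps above. The function name and the filler list are only illustrative (the list above ends in "..."), and the released scoring script is the reference, not this snippet:

```python
# Illustrative sketch only; the released GigaSpeech scoring script is the
# reference implementation and its filler list may be longer than this one.

# Conversational fillers to drop from both REF and HYP (only the ones named
# above; the real list may contain more).
FILLERS = {"AH", "UM", "UH", "ER", "ERR", "HM"}

def normalize_transcript(text: str) -> str:
    """Apply the simple post-processing discussed above before WER scoring."""
    # 2. Split hyphenated tokens: T-SHIRT -> T SHIRT.
    text = text.upper().replace("-", " ")
    # 1. Drop filler words from the token sequence.
    return " ".join(w for w in text.split() if w not in FILLERS)

if __name__ == "__main__":
    print(normalize_transcript("AH this is a T-SHIRT"))  # -> THIS IS A T SHIRT
```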

Now the problem is how we organize the post-processing:

  • the above normalization is just simple text processing, and it is not tied to the specific ASR result formats of different frameworks (such as Kaldi hyp/ref text scp/ark). How do we release this processing so that it becomes "standard" across GigaSpeech downstream toolkits?

Options that I can think of for now:

  1. We can provide a separate text-processing script for each downstream framework, e.g. GIGASPEECH_REPO/toolkits/{kaldi,espnet,...}/asr-text-post-processing.py. In this case, we KNOW the detailed format of each toolkit. With agreement, these can also go into the downstream recipe code directly.
  2. Or we can have a single exemplar Python function/awk command in GIGASPEECH_REPO/util/ that deals with pure text rather than with specifically formatted files, and let downstream recipe developers decide to import or refer to it in their code when they feel it is appropriate.

Which one is better? Any preferences or better suggestions? I personally vote for the first solution. @wangyongqing0731 @chenguoguo @sw005320

@chenguoguo
Collaborator Author

I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind:

  1. Under each toolkit, we have a script to handle the post-processing, which takes care of the toolkit-specific stuff, e.g., toolkits/kaldi/gigaspeech_asr_post_processing.sh
  2. The toolkit-specific script internally calls a common script, e.g., utils/asr_post_processing.sh, which does the actual work. This way, if we have to update the post-processing, we only have to update one place.
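To make the delegation concrete: the proposal above uses shell scripts, but a Python rendering of the same pattern might look like the sketch below. All names here (the asr_post_processing module and its normalize_transcript helper) are hypothetical, not the actual repository layout:

```python
# Hypothetical sketch of a toolkit-specific wrapper: it only knows the Kaldi
# "utt-id word word ..." text format and delegates the actual normalization to
# one shared helper, so an update to the normalization happens in one place.
import sys

# Hypothetical shared util, e.g. the normalize_transcript() sketched earlier.
from asr_post_processing import normalize_transcript

def post_process_kaldi_text(in_path: str, out_path: str) -> None:
    """Read a Kaldi-style text file, normalize each transcript, write it back."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            utt_id, _, transcript = line.rstrip("\n").partition(" ")
            fout.write(f"{utt_id} {normalize_transcript(transcript)}\n")

if __name__ == "__main__":
    post_process_kaldi_text(sys.argv[1], sys.argv[2])
```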

One thing I haven't decided yet is whether or not we should provide the scoring tool. If we do, we can make sure that everyone is using the same tool for scoring and the results will be comparable, but it definitely involves more work.

@sw005320
Contributor

> I prefer option one, that is, to provide working scripts for each downstream toolkit. Here is what I have in mind:
>
>   1. Under each toolkit, we have a script to handle the post-processing, which takes care of the toolkit-specific stuff, e.g., toolkits/kaldi/gigaspeech_asr_post_processing.sh
>   2. The toolkit-specific script internally calls a common script, e.g., utils/asr_post_processing.sh, which does the actual work. This way, if we have to update the post-processing, we only have to update one place.

Sounds good to me.

> One thing I haven't decided yet is whether or not we should provide the scoring tool. If we do, we can make sure that everyone is using the same tool for scoring and the results will be comparable, but it definitely involves more work.

The reason I stick with sclite is that it is made by NIST and has been used in various official ASR benchmarks. sclite also comes with various analysis tools. As far as I know, Kaldi scoring produces the same result, so it should be no problem.

If we use different toolkits, I recommend at least outputting the total number of words/sentences in the reference, and possibly the sub/del/ins breakdown, e.g.,

    dataset                                    Snt    Wrd     Corr  Sub  Del  Ins  Err   S.Err
    decode_asr_asr_model_valid.acc.ave/dev     2043   51075   92.9  4.5  2.6  2.1  9.2   65.6
    decode_asr_asr_model_valid.acc.ave/test    9627   175116  90.5  7.0  2.5  6.1  15.6  69.3

As long as the total number of words/sentences is the same, the results are comparable
(and we can also easily detect whether something is wrong in the data preparation or normalization when we check it).
The sub/del/ins error breakdown can be used to detect DP matching issues in the edit distance computation, as well as some format errors (e.g., in the case above there is a significantly large number of insertion errors in the test set; we may have some alignment or reference issues, and I actually found them based on this number and already reported them to you).
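For reference, the Corr/Sub/Del/Ins columns come from the Levenshtein alignment between REF and HYP. A minimal sketch of that dynamic program (this is not sclite, just the textbook computation with illustrative names):

```python
# Minimal sketch of the edit-distance DP behind the Corr/Sub/Del/Ins columns;
# sclite does this (plus alignment reports and analysis) per utterance.
def wer_breakdown(ref: list[str], hyp: list[str]) -> dict[str, int]:
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrack through the DP table to count correct/sub/del/ins.
    counts = {"corr": 0, "sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["corr" if ref[i - 1] == hyp[j - 1] else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts

if __name__ == "__main__":
    print(wer_breakdown("uh this is a t shirt".split(), "this is t shirt".split()))
    # -> {'corr': 4, 'sub': 0, 'del': 2, 'ins': 0}
```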

@dophist
Collaborator

dophist commented Mar 30, 2021

I just added a simple scoring tool via #35; it uses sclite to evaluate REF against HYP.

Before evaluation, the tool applies the very simple text processing that we discussed in this issue, and I think we should keep this processing simple and stable after release.

Besides, the tool lives in utils/ instead of toolkits/xxx/, because sclite is framework-independent. Recipe developers and researchers can use it when they really want an apples-to-apples evaluation comparison.
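For readers who have not used sclite before, the usual workflow is to write REF and HYP in trn format ("word word ... (utterance-id)") and then run sclite on both files. A rough Python sketch of that idea follows; the exact options and file handling in the released tool may differ:

```python
# Rough sketch of driving sclite from Python; requires NIST SCTK's `sclite`
# binary on PATH. The released GigaSpeech tool may use different options.
import subprocess

def write_trn(utts: dict, path: str) -> None:
    """Write a {utt_id: transcript} mapping in sclite trn format."""
    with open(path, "w", encoding="utf-8") as f:
        for utt_id, text in sorted(utts.items()):
            f.write(f"{text} ({utt_id})\n")

def run_sclite(ref_trn: str, hyp_trn: str) -> None:
    # "-i rm" selects the utterance-id convention commonly used with trn files;
    # "-o all" asks sclite to write its full set of report files.
    subprocess.run(
        ["sclite", "-r", ref_trn, "trn", "-h", hyp_trn, "trn", "-i", "rm", "-o", "all"],
        check=True,
    )
```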

@dophist
Collaborator

dophist commented Mar 31, 2021

We finally provide a recommended scoring script based on sclite: https://github.com/SpeechColab/GigaSpeech/blob/main/utils/gigaspeech_scoring.py. Researchers may use this tool if they want a consistent comparison across different systems.

We'd better leave this issue open for a while, so people can read the discussion above.

@dophist dophist added the documentation Improvements or additions to documentation label Jul 29, 2021