
How to get the decoding result scores from #42

Open
pengcheng-tech opened this issue Jul 22, 2021 · 4 comments

Comments

pengcheng-tech commented Jul 22, 2021

Hi,

Thanks for the work. I am trying to use the pre-trained model, but I don't know how to get the decoding score for the corresponding decoding results.

nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)

The code above only prints the text. I would like to get the decoding confidence as well.

I checked the Speech2Text class:

for hyp in nbest_hyps:
    assert isinstance(hyp, Hypothesis), type(hyp)

    # remove sos/eos and get results
    token_int = hyp.yseq[1:-1].tolist()

    # remove blank symbol id, which is assumed to be 0
    token_int = list(filter(lambda x: x != 0, token_int))

    # Change integer-ids to tokens
    token = self.converter.ids2tokens(token_int)

    if self.tokenizer is not None:
        text = self.tokenizer.tokens2text(token)
    else:
        text = None
    results.append((text, token, token_int, hyp))

assert check_return_type(results)
return results

From the code above I conjecture that the confidence should be obtained from the "hyp", but it is not clear to me how to parse "hyp" to get the score.

kamo-naoyuki (Contributor) commented
Hypothesis is a NamedTuple object. You can refer to its attributes.

https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search.py#L19-L33
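
For reference, the definition at that link is roughly the following (a sketch based on that file, not copied verbatim; check the link for the exact source):

from typing import Any, Dict, NamedTuple, Union

import torch


class Hypothesis(NamedTuple):
    """Hypothesis data type produced by the beam search."""

    yseq: torch.Tensor  # token-id sequence, including sos/eos
    score: Union[float, torch.Tensor] = 0  # total accumulated (weighted) score
    scores: Dict[str, Union[float, torch.Tensor]] = dict()  # per-scorer breakdown
    states: Dict[str, Any] = dict()  # scorer states

So for each hypothesis returned by Speech2Text you can read hyp.score and hyp.scores directly.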

pengcheng-tech (Author) commented

Hi, thanks for your response.

By referring to the link, I modified the code as follows:

nbests = speech2text(speech)
text, *_, score_bundle = nbests[0]

By executing the following:

print(score_bundle.score)
print(score_bundle.scores)

I got:
tensor(-57.1623, device='cuda:0')
{'decoder': tensor(-2.6879, device='cuda:0'), 'lm': tensor(-55.0374, device='cuda:0'), 'ctc': tensor(-0.8112, device='cuda:0')}

I think the number "-57.1623" is the result of log P_encdec(y|x) + log P_ctc(y|x) + log P_lm(y), where log P_encdec(y|x) is -2.6879, log P_ctc(y|x) is -0.8112 and log P_lm(y) is -55.0374, though the sum does not quite match...

If I denote -57.1623 as nbests[0].score, can I just take nbests[0] through nbests[100] and use nbests[0].score / (nbests[0].score + nbests[1].score + ... + nbests[100].score) to roughly obtain a decoding confidence score?

Thanks a lot

kamo-naoyuki (Contributor) commented
score is the weighted sum of scores. You need to decide the weights when instantiating the Speech2Text class.

You can get an arbitrary number of n-best scores by giving the nbest argument to Speech2Text, but I think it is not trivial to regard them as confidence scores.
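
As a sketch (not from this issue: the config/model paths and weight values below are placeholders, and the mapping of weights to the keys of "scores" is my assumption based on the usual espnet2 defaults), the weights are fixed when constructing Speech2Text, and "score" is then the weighted combination of the entries in "scores":

from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text(
    asr_train_config="exp/asr_train/config.yaml",       # placeholder path
    asr_model_file="exp/asr_train/valid.acc.ave.pth",    # placeholder path
    lm_train_config="exp/lm_train/config.yaml",          # placeholder path
    lm_file="exp/lm_train/valid.loss.ave.pth",           # placeholder path
    ctc_weight=0.5,   # decoder weight is then 1 - ctc_weight
    lm_weight=1.0,
    nbest=10,         # return 10 hypotheses instead of only the best one
)

# speech: the input waveform (e.g., a numpy array), as in the snippets above
nbests = speech2text(speech)
text, token, token_int, hyp = nbests[0]

# hyp.score should roughly equal the weighted sum of hyp.scores
# under the weights chosen above (assumed mapping).
weights = {"decoder": 1.0 - 0.5, "ctc": 0.5, "lm": 1.0}
weighted_sum = sum(weights.get(k, 0.0) * float(v) for k, v in hyp.scores.items())
print(float(hyp.score), weighted_sum)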

pengcheng-tech (Author) commented
Thanks for the comment.

I currently treat the "score" (i.e., -57.1623) as a rough confidence score indicating how confident the model is about its prediction for the audio. From my observation, the score of nbests[0] is higher than that of nbests[1], so I guess it is adequate for my purpose.
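
If a normalized confidence is wanted rather than a raw log-domain score, one rough heuristic (not an espnet-provided API, and only as meaningful as the caveat above allows) is a softmax over the n-best scores:

import torch

nbests = speech2text(speech)  # assumes Speech2Text was built with nbest > 1

# Each entry is (text, token, token_int, hyp); hyp.score is log-domain,
# so softmax is a more natural normalization than dividing the raw values.
scores = torch.tensor([float(hyp.score) for (_, _, _, hyp) in nbests])
confidences = torch.softmax(scores, dim=0)

for (text, _, _, _), conf in zip(nbests, confidences):
    print(f"{conf.item():.3f}  {text}")

Dividing the raw negative scores as proposed earlier would not give a probability-like quantity, precisely because the scores are in the log domain.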
