Confusion about the ABX error rate #9

Open
zhixhan opened this issue Sep 26, 2024 · 4 comments

Comments

@zhixhan

zhixhan commented Sep 26, 2024

Thanks for your amazing work.

I evaluated the released xcodec model on the LibriSpeech test-clean set using the ABX error rate metric. I ran the evaluation on the continuous representations before RVQ and after RVQ, but got 9.9% and 13.2% for within ABX and cross ABX respectively, which are much higher than the values reported in the paper. However, I do get consistent results, 3.6 and 4.7, for SpeechTokenizer when it is evaluated in the same way.

Could you please give me some suggestions? Thank you so much!
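
For readers unfamiliar with the metric, here is a minimal sketch of how an ABX error rate can be computed from frame-level representations. It is an illustration only and not the evaluation tool used in this thread: the function names are hypothetical, the distance (angular distance averaged along a DTW path) is an assumed choice, frames are assumed to be non-zero vectors, and the construction of within/cross triples from a real dataset is left out.

```python
import math

def angular_distance(u, v):
    """Angle in radians between two non-zero frame vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos)

def dtw_distance(seq_x, seq_y):
    """DTW cost between two frame sequences, divided by the path length."""
    n, m = len(seq_x), len(seq_y)
    inf = float("inf")
    # cost[i][j] = (accumulated cost, path length) of the best path to (i, j)
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = angular_distance(seq_x[i - 1], seq_y[j - 1])
            prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (prev[0] + d, prev[1] + 1)
    total, length = cost[n][m]
    return total / length

def abx_error_rate(triples):
    """triples: list of (a, b, x) where x belongs to the same category as a.

    A triple counts as an error when x is closer to b than to a;
    ties count as half an error.
    """
    errors = 0.0
    for a, b, x in triples:
        d_ax = dtw_distance(a, x)
        d_bx = dtw_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)

# Toy example with 2-dimensional frames.
a = [[1.0, 0.0], [1.0, 0.1]]
b = [[0.0, 1.0], [0.1, 1.0]]
x = [[1.0, 0.05]]
rate = abx_error_rate([(a, b, x)])
```

In the usual setup, the "within" condition takes A, B and X from the same speaker, while the "cross" condition takes X from a different speaker than A and B. Because the score depends directly on which layer the frames are taken from, switching the extraction point can change the numbers considerably.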

@zhenye234
Owner

Could you please specify the version of the xcodec model?

@zhixhan
Author

zhixhan commented Sep 29, 2024

Could you please specify the version of the xcodec model?

Thank you for your reply. I tested with the model named xcodec_hubert_librispeech.

@zhenye234
Owner

Maybe you can try the continuous representation here

o_semantic = self.decoder_semantic(quantized_semantic)

@zhixhan
Author

zhixhan commented Sep 30, 2024

Maybe you can try the continuous representation here

o_semantic = self.decoder_semantic(quantized_semantic)

Thank you for your reply! I tested the XCodec model with the o_semantic representation and got ABX error rates of 4.4% and 5.5%, which still differ slightly from the results reported in your paper (3.3% and 4.3%).

When I extracted the o_semantic representation with the SoundStream.forward method, I got the error "e_acoustic and e_semantic have different shape in dim2" at https://github.com/zhenye234/xcodec/blob/main/models/soundstream_semantic.py#L102, so I added the same pad operation that the encode method uses. I don't think this causes the inconsistent results, and I made no other changes to the source code. Do you have any other suggestions? Thanks again for your reply.
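
The shape mismatch described above can be illustrated with a small PyTorch sketch. This is not the padding used in the repository's encode method: the helper name `match_time_dim` is hypothetical, the tensors are assumed to be laid out as (batch, channels, time), and zero-padding on the right is an assumed choice made only to show how two feature streams can be brought to the same length in dim 2 before they are concatenated.

```python
import torch
import torch.nn.functional as F

def match_time_dim(e_acoustic, e_semantic):
    """Pad or trim e_acoustic along dim 2 so its length matches e_semantic.

    Hypothetical helper: zero-pads on the right when e_acoustic is shorter,
    and drops trailing frames when it is longer.
    """
    diff = e_semantic.shape[2] - e_acoustic.shape[2]
    if diff > 0:
        # F.pad takes (left, right) amounts for the last dimension
        e_acoustic = F.pad(e_acoustic, (0, diff))
    elif diff < 0:
        e_acoustic = e_acoustic[:, :, : e_semantic.shape[2]]
    return e_acoustic

# Assumed example shapes: the two encoders disagree by one frame.
e_acoustic = torch.randn(1, 256, 49)
e_semantic = torch.randn(1, 768, 50)

aligned = match_time_dim(e_acoustic, e_semantic)
# Concatenation along the channel axis now succeeds.
fused = torch.cat([aligned, e_semantic], dim=1)
```

Where the padding is applied (to the waveform before encoding, or to the features afterwards) determines which frames line up, so it is worth confirming that the evaluation path matches the one used for the reported numbers.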
