Does visqol use gpu? Best settings for evaluating noise suppression? #80

Open
opooladz opened this issue Dec 10, 2022 · 2 comments

Comments

@opooladz

Hi, thanks for the repo.

Quick question: when I run visqol I am not seeing any GPU usage. Should I be? Perhaps my bazel did not install correctly, or the version of TF being used is not utilizing the GPU. I am running over thousands of files and it's taking quite some time...

Also, I wanted to check what the best settings are for evaluating noise suppression using visqol. I see the two flags --use_speech_mode --use_unscaled_speech_mos_mapping. If I use these, might it ignore some bands of noise that may be present in the file (I see it's sensitive up to 8 kHz)? Should I run visqol in audio mode and speech mode and average the two (perhaps a weighted average)?

Thanks for your guidance in advance.

@mchinen
Collaborator

mchinen commented Dec 12, 2022

Hi, thanks for the question! ViSQOL does have a TFLite model, but it runs on CPU and is not the main bottleneck. Even in batch mode, it evaluates the list of files serially. This could be improved.

I don't recommend averaging the two modes, because they are quite different in scale. We don't yet have support for greater than wideband speech, and it's a limitation. For noise suppression, ViSQOL will require the clean reference, which isn't always available. If you're looking for a no-reference model specifically for noise suppression, I'd recommend DNSMOS.
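
Since batch mode evaluates files serially, one possible workaround for large file lists is to launch several ViSQOL processes at once from a small driver script. The following is a minimal Python sketch, not part of ViSQOL: the binary path `./bazel-bin/visqol` and the helper names (`build_command`, `run_visqol`, `score_pairs`) are assumptions for illustration, and the raw stdout is returned unparsed because the output format is not covered in this thread. The `--reference_file` and `--degraded_file` flags follow the project README.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumed location of the built binary; adjust to your checkout.
VISQOL_BIN = "./bazel-bin/visqol"


def build_command(reference, degraded, speech_mode=True):
    """Build the command line for one reference/degraded pair."""
    cmd = [VISQOL_BIN, "--reference_file", reference, "--degraded_file", degraded]
    if speech_mode:
        cmd.append("--use_speech_mode")
    return cmd


def run_visqol(pair):
    """Run one comparison and return the raw stdout (left unparsed here)."""
    reference, degraded = pair
    result = subprocess.run(
        build_command(reference, degraded),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def score_pairs(pairs, score_fn=run_visqol, workers=4):
    """Apply score_fn to every pair concurrently, preserving input order."""
    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_fn, pairs))
```

Each invocation is a separate process that loads the model on its own, so the speed-up will likely be smaller than the worker count. Passing a stub as `score_fn` makes it possible to test the driver without the binary.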

@opooladz
Author

Hi, thanks for the quick response. I actually have access to the clean speech as well as the noisy speech, so I can use a reference metric. I will look into DNSMOS as well. Right now, I am using PESQ (sample referential), as well as Fréchet Audio Distance (reference-free or dataset referential).

Assume a model $X = S + N$, where $S$ is speech and $N$ is noise.

I feed $X$ into a noise suppressor and get $\hat{S}$. So we have $X$ and $\hat{S}$. If we compute ViSQOL( $S,X$ ) under speech settings, might it actually ignore certain frequencies where noise occurs in $X$ (since it's only sensitive up to 8 kHz)? The same question applies to ViSQOL( $S,\hat{S}$ ).
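
To get a feel for how much this matters for a given noise type, one could estimate what fraction of the noise energy lies above 8 kHz. The sketch below is only an illustration under stated assumptions: it uses a naive DFT from the Python standard library, a hypothetical helper `energy_fraction_above`, and a synthetic two-tone "noise" signal sampled at 48 kHz rather than real data.

```python
import cmath
import math


def energy_fraction_above(samples, sample_rate, cutoff_hz):
    """Rough fraction of spectral energy above cutoff_hz (naive O(n^2) DFT)."""
    n = len(samples)
    total = 0.0
    above = 0.0
    # Skip the DC bin and walk the positive-frequency bins up to Nyquist.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n > cutoff_hz:
            above += energy
    return above / total if total else 0.0


# Synthetic example: a 1 kHz tone plus a weaker 12 kHz tone at 48 kHz.
fs = 48000
n = 480  # 100 Hz bin spacing, so both tones fall exactly on a bin
noise = [
    math.sin(2 * math.pi * 1000 * i / fs)
    + 0.5 * math.sin(2 * math.pi * 12000 * i / fs)
    for i in range(n)
]
fraction = energy_fraction_above(noise, fs, 8000.0)
print(round(fraction, 3))  # 0.2
```

Here a fifth of the energy sits above 8 kHz, which is the portion a wideband-only comparison could not take into account.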

Right now, I am getting the following results averaged over 10k samples.

Audio Settings:
ViSQOL( $S,X$ ) = 3.1
ViSQOL( $S,\hat{S}$ ) = 3.7

Speech Settings:
ViSQOL( $S,X$ ) = 1.2
ViSQOL( $S,\hat{S}$ ) = 1.9

Just wondering what the recommended settings are for using ViSQOL in my task. Perhaps they are both insightful in different ways. If so, maybe you can help me understand the intuition/meaning of the results.
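
One way to read the averages above without mixing the two scales is to look at the change within each mode separately, i.e. ViSQOL( $S,\hat{S}$ ) minus ViSQOL( $S,X$ ). The snippet below is a small arithmetic sketch using the numbers reported in this comment; the helper name `improvement_by_mode` and the dictionary layout are assumptions for illustration.

```python
def improvement_by_mode(scores):
    """Per-mode gain of the enhanced signal over the noisy input."""
    return {
        mode: round(values["enhanced"] - values["noisy"], 3)
        for mode, values in scores.items()
    }


# Averages reported above: ViSQOL(S, X) is "noisy", ViSQOL(S, S_hat) is "enhanced".
scores = {
    "audio": {"noisy": 3.1, "enhanced": 3.7},
    "speech": {"noisy": 1.2, "enhanced": 1.9},
}
gains = improvement_by_mode(scores)
print(gains)  # {'audio': 0.6, 'speech': 0.7}
```

Both modes show the suppressor improving the score, but since the modes differ in scale, the two gains are probably best reported side by side rather than compared or averaged.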
