
FYI: FUTO did an ACFT finetune of whisper that works with <30s of audio #1006

Open
thiswillbeyourgithub opened this issue Sep 12, 2024 · 0 comments

thiswillbeyourgithub commented Sep 12, 2024

Hi,

I just wanted to point out a model that I found interesting and that deserves to be better known, IMO.

Here's the relevant part of the README.md:

The Whisper model is composed of two parts: the encoder which takes in 30 seconds of audio, and the decoder which outputs text.

The main source of latency between the model receiving audio and starting to output text is running the encoder. On resource-constrained devices such as phones, this latency can be significant, and it's important to minimize it in applications such as voice input.

One reason the encoder can be so slow is that its input must always be 30 seconds long. Even if the speech is only 5 seconds long, it's necessary to append 25 seconds of silence, and the encoder must "waste" processing time on those 25 seconds of nothing.
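To make the cost concrete, here is a minimal sketch of that padding step, assuming Whisper's usual 16 kHz mono input (the helper name is my own, not from the README):

```python
SAMPLE_RATE = 16_000                 # Whisper expects 16 kHz mono audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE     # encoder always sees 30 s = 480,000 samples

def pad_to_30s(samples):
    """Right-pad (or truncate) a waveform to exactly 30 s of samples."""
    if len(samples) >= CHUNK_SAMPLES:
        return samples[:CHUNK_SAMPLES]
    # Append silence (zeros) to fill the fixed-size encoder window.
    return samples + [0.0] * (CHUNK_SAMPLES - len(samples))

# A 5 s clip gets 25 s of appended silence:
clip = [0.1] * (5 * SAMPLE_RATE)
padded = pad_to_30s(clip)
```

For a 5-second clip, 5/6 of the encoder's work goes into processing silence.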

It'd be great if we could skip adding silence and just get the encoder to process whatever length of audio we have. In fact, we can, and this is what the audio_ctx parameter in whisper.cpp does, introduced in ggerganov/whisper.cpp#137.
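As a rough sketch of how one might pick a value for this parameter: Whisper's encoder produces 1500 output positions for 30 s of audio (3000 mel frames at 10 ms each, halved by a stride-2 convolution), i.e. about 50 positions per second. The helper and its safety margin below are my own illustration, not something from whisper.cpp or the FUTO README:

```python
# 1500 encoder positions per 30 s of audio => 50 positions per second.
POSITIONS_PER_SECOND = 1500 // 30  # = 50

def audio_ctx_for(duration_s, margin_s=1.0):
    """Estimate an audio_ctx value for a clip, with a small safety margin."""
    ctx = int((duration_s + margin_s) * POSITIONS_PER_SECOND)
    return min(ctx, 1500)  # never exceed the full 30 s context

# e.g. a 5 s clip needs only ~300 of the 1500 encoder positions
```

In whisper.cpp this value is passed via the `--audio-ctx` option (if I read the project correctly), which is where the decoder-repetition problems described below come from when it is set too low.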

Unfortunately, the model gets surprised by this and freaks out if you mess with this parameter too much. If you set it too low, usually the decoder doesn't know when to stop, and it'll repeat itself forever.

However, this issue can be mitigated by finetuning the model to tolerate dynamic audio context. The next section proposes a way to do this.

Link: https://github.com/futo-org/whisper-acft

This is primarily meant to be used on mobile phones via their keyboard and voice apps. If I understood faster-whisper correctly, then maybe both approaches could be combined in the future for even faster inference.

Feel free to close this of course!
