
FYI: FUTO did an ACFT finetune of whisper that works with <30s of audio #1006

Open
thiswillbeyourgithub opened this issue Sep 12, 2024 · 0 comments

thiswillbeyourgithub commented Sep 12, 2024

Hi,

I just wanted to point out a model that I found interesting and that deserves to be better known, IMO.

Here's the relevant part of the README.md:

The Whisper model is composed of two parts: the encoder which takes in 30 seconds of audio, and the decoder which outputs text.

The main source of latency between the model receiving audio and starting to output text is running the encoder. On resource-constrained devices such as phones, this latency can be significant, and it's important to minimize it in applications such as voice input.

One reason the encoder can be so slow is that its input must always be 30 seconds long. Even if the speech is only 5 seconds long, it's necessary to append 25 seconds of silence, and the encoder must "waste" processing time on those 25 seconds of nothing.
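To make the cost concrete, here is a minimal sketch of that padding step, assuming Whisper's usual 16 kHz mono input (the helper name is my own, not from the README):

```python
SAMPLE_RATE = 16_000                 # Whisper expects 16 kHz mono audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE     # encoder always sees 30 s = 480,000 samples

def pad_to_30s(samples):
    """Right-pad (or truncate) a waveform to exactly 30 s of samples."""
    if len(samples) >= CHUNK_SAMPLES:
        return samples[:CHUNK_SAMPLES]
    # Append silence (zeros) to fill the fixed-size encoder window.
    return samples + [0.0] * (CHUNK_SAMPLES - len(samples))

# A 5 s clip gets 25 s of appended silence:
clip = [0.1] * (5 * SAMPLE_RATE)
padded = pad_to_30s(clip)
```

For a 5-second clip, 5/6 of the encoder's work goes into processing silence.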

It'd be great if we could skip adding silence and just get the encoder to process whatever length of audio we have. In fact, we can, and this is what the audio_ctx parameter in whisper.cpp does, introduced in ggerganov/whisper.cpp#137.
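As a rough sketch of how one might pick a value for this parameter: Whisper's encoder produces 1500 output positions for 30 s of audio (3000 mel frames at 10 ms each, halved by a stride-2 convolution), i.e. about 50 positions per second. The helper and its safety margin below are my own illustration, not something from whisper.cpp or the FUTO README:

```python
# 1500 encoder positions per 30 s of audio => 50 positions per second.
POSITIONS_PER_SECOND = 1500 // 30  # = 50

def audio_ctx_for(duration_s, margin_s=1.0):
    """Estimate an audio_ctx value for a clip, with a small safety margin."""
    ctx = int((duration_s + margin_s) * POSITIONS_PER_SECOND)
    return min(ctx, 1500)  # never exceed the full 30 s context

# e.g. a 5 s clip needs only ~300 of the 1500 encoder positions
```

In whisper.cpp this value is passed via the `--audio-ctx` option (if I read the project correctly), which is where the decoder-repetition problems described below come from when it is set too low.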

Unfortunately, the model gets surprised by this and freaks out if you mess with this parameter too much. If you set it too low, usually the decoder doesn't know when to stop, and it'll repeat itself forever.

However, this issue can be mitigated by finetuning the model to tolerate dynamic audio context. The next section proposes a way to do this.

Link: https://github.com/futo-org/whisper-acft

This is primarily meant to be used on mobile phones via their keyboard and voice apps. If I understood faster-whisper correctly, then maybe both approaches could be combined in the future for even faster inference.

Feel free to close this of course!
