Hi,

I just wanted to point out a model here that I found interesting and that deserves to be better known, IMO.

Link: https://github.com/futo-org/whisper-acft

Here's the relevant part of its README.md:
The Whisper model is composed of two parts: the encoder, which takes in 30 seconds of audio, and the decoder, which outputs text.
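For concreteness, here is a minimal sketch of that split using the openai-whisper Python package (this is my illustration, not code from the linked README; the model name and file path are placeholders):

```python
import whisper

model = whisper.load_model("tiny")               # one checkpoint bundles both parts
audio = whisper.load_audio("speech.wav")         # any length, resampled to 16 kHz
audio = whisper.pad_or_trim(audio)               # padded/trimmed to exactly 30 s
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Encoder: turns the 30 s mel spectrogram into audio embeddings.
audio_features = model.embed_audio(mel.unsqueeze(0))

# Decoder: autoregressively emits text conditioned on those embeddings.
# (decode() runs both stages internally; embed_audio above just isolates the encoder.)
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```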
The main source of latency between the model receiving audio and starting to output text is running the encoder. On resource-constrained devices such as phones, this latency can be significant, and minimizing it is important in applications such as voice input.
One reason the encoder can be so slow is that its input must always be 30 seconds long. Even if the speech is only 5 seconds long, 25 seconds of silence must be appended, and the encoder "wastes" processing time on those 25 seconds of nothing.
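To make that cost concrete, here is a small sketch (again using openai-whisper; the 5 s clip is synthetic) of how a short clip gets zero-padded to the full 30 s window:

```python
import numpy as np
import whisper

SAMPLE_RATE = 16000                                 # Whisper's fixed input rate
clip = np.zeros(5 * SAMPLE_RATE, dtype=np.float32)  # stand-in for 5 s of speech

padded = whisper.pad_or_trim(clip)                  # zero-pads to 30 s (480000 samples)
print(len(padded) / SAMPLE_RATE)                    # 30.0
# The encoder still processes all 30 s: 25 s of its work here is spent on silence.
```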
It'd be great if we could skip adding silence and just have the encoder process whatever length of audio we have. In fact, we can, and this is what the audio_ctx parameter in whisper.cpp does, which was introduced in ggerganov/whisper.cpp#137.
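The encoder produces a fixed number of output positions per second (1500 positions for 30 s, i.e. 50 per second), so a reasonable audio_ctx value can be derived from the clip length. A sketch under that assumption (the helper name is mine, not from whisper.cpp):

```python
import math

def audio_ctx_for(seconds: float, full_ctx: int = 1500, window_secs: float = 30.0) -> int:
    """Hypothetical helper: map a clip length to an encoder context size.

    Whisper's encoder emits full_ctx positions per window_secs of audio,
    so a shorter clip only needs a proportional slice of that context.
    """
    return min(full_ctx, math.ceil(seconds * full_ctx / window_secs))

print(audio_ctx_for(5.0))  # 250 — e.g. passed to whisper.cpp via its --audio-ctx flag
```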
Unfortunately, the model gets surprised by this and freaks out if you mess with the parameter too much. If you set it too low, the decoder usually doesn't know when to stop, and it'll repeat itself forever.
However, this issue can be mitigated by finetuning the model to tolerate dynamic audio context. The next section proposes a way to do this.
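To illustrate what "dynamic audio context" means mechanically, here is a rough PyTorch sketch against the openai-whisper encoder internals (my own illustration, not code from the linked repo): the positional embedding is sliced to the actual frame count instead of padding the audio to 30 s. A stock checkpoint will often mis-decode features produced this way; the finetuning is what makes the model tolerate it.

```python
import torch
import torch.nn.functional as F
import whisper

model = whisper.load_model("tiny")
enc = model.encoder

audio = whisper.load_audio("speech.wav")      # e.g. a ~5 s clip; path is a placeholder
mel = whisper.log_mel_spectrogram(audio)      # unpadded: (n_mels, ~100 frames per s)
x = mel.unsqueeze(0).to(model.device)

with torch.no_grad():
    x = F.gelu(enc.conv1(x))
    x = F.gelu(enc.conv2(x))                  # stride-2 conv halves the frame count
    x = x.permute(0, 2, 1)
    ctx = x.shape[1]                          # dynamic audio context (~250 for 5 s)
    x = x + enc.positional_embedding[:ctx]    # slice instead of using all 1500
    for block in enc.blocks:
        x = block(x)
    features = enc.ln_post(x)                 # (1, ctx, n_state) audio features
```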
This is primarily meant to be used on mobile phones via their keyboard and voice apps. If I understood faster-whisper correctly, then maybe both approaches could be combined in the future for even faster inference.
Feel free to close this of course!