Improving Timestamp Accuracy #435
Replies: 19 comments 37 replies
-
It's interesting that in weather.mp4, it is not clear which version is better. Sometimes the result is more accurate above the line, other times it's more accurate below the line. E.g. below the line, at around the 10 second mark, it gets stuck on "We went into" for too long, and then rushes the following words to catch up. Above the line, it gets stuck on the word "and" around the 40 second mark and then rushes to catch up. One observation I've made that might be interesting is that when I edit the audio and actually cut out the silent parts (using VAD) with zero padding whatsoever, Whisper starts returning much shorter segments with very accurate timestamps. This could be useful in getting Whisper to give more fine-grained timestamps as anchor points. I see in your TODO that you eventually planned to do multiple inferences to combine the best from each, and for this it would be interesting to know ways to influence Whisper to lean toward shorter segments. Have you found other ways to influence Whisper in this direction?
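The silence-cutting experiment described above can be sketched with a plain RMS energy gate: split the audio into short frames, keep only spans whose energy exceeds a threshold, and remember where each kept span came from so timestamps on the condensed audio can be mapped back. This is a toy sketch, not a real VAD model; `nonsilent_regions`, `frame_len`, and `threshold` are illustrative names and values.

```python
import numpy as np

def nonsilent_regions(samples, sr, frame_len=0.02, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds `threshold`."""
    hop = max(1, int(sr * frame_len))
    spans = []
    active_start = None  # sample index where the current nonsilent run began
    for i in range(0, len(samples), hop):
        frame = samples[i:i + hop]
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if rms > threshold:
            if active_start is None:
                active_start = i
        elif active_start is not None:
            spans.append((active_start / sr, i / sr))
            active_start = None
    if active_start is not None:  # audio ended while still nonsilent
        spans.append((active_start / sr, len(samples) / sr))
    return spans
```

Concatenating only these spans (zero padding, as described above) and transcribing the result is the experiment in question; the spans double as the offset table for mapping Whisper's timestamps back to the original audio.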
-
This is awesome! I'm working on a Whisper product and it's been driving me nuts how incorrect some timestamps are getting!
-
This is very, very, very, very, very helpful! Thank you!
-
Hi - nice job! Here is an example of what I have in mind: mm0.wav.mp4
-
I'm very new to Python and Git, and I can't get stable_whisper from https://github.com/jianfch/stable-ts to work. When trying to execute the new version, I'm getting an error. Also, in which folder should I put audio.mp3?
-
Thank you, this is awesome! One problem I am having is when I set the task to "translate" with the modified model:
I get the following error:
Do you have an idea of what might be going wrong?
-
Thanks, I will check. I was depending on YouTube auto timing until now :D Also, have you found a way to improve punctuation accuracy?
-
@jianfch Hi, how do you make a video demo like "weather.mp4"? What tool do you use?
-
This is a good enhancement, but does anyone know of someone using it with a CLI tool like the one the main Whisper package provides?
-
Hi, btw I made an enhancement to Whisper here using forced alignment with wav2vec; it seems to refine timestamps quite well -- although it needs more testing
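For readers curious how wav2vec-style forced alignment refines timestamps: a CTC acoustic model emits per-frame log-probabilities for each token, and a dynamic-programming trellis finds the best monotonic path through the transcript; the frame where the path advances to a token becomes that token's start time. Below is a toy sketch of that trellis search on synthetic log-probabilities, not a real wav2vec model; all names and shapes are illustrative.

```python
import numpy as np

def align(emissions, tokens):
    """Monotonic forced alignment over per-frame log-probabilities.

    emissions: (num_frames, vocab_size) log-probs; tokens: transcript token ids.
    Requires num_frames >= len(tokens). Returns the frame index where each
    token starts on the best path.
    """
    T, _ = emissions.shape
    M = len(tokens)
    trellis = np.full((T, M), -np.inf)
    trellis[0, 0] = emissions[0, tokens[0]]  # path must start on token 0
    for t in range(1, T):
        for m in range(M):
            stay = trellis[t - 1, m]                          # repeat same token
            move = trellis[t - 1, m - 1] if m > 0 else -np.inf  # advance to next token
            trellis[t, m] = emissions[t, tokens[m]] + max(stay, move)
    # Backtrack from (T-1, M-1): at each step, take the better predecessor.
    starts = [0] * M
    m = M - 1
    for t in range(T - 1, 0, -1):
        if m > 0 and trellis[t - 1, m - 1] >= trellis[t - 1, m]:
            starts[m] = t  # path advanced into token m at frame t
            m -= 1
    return starts
```

Multiplying a start frame by the model's frame stride (20 ms for wav2vec 2.0 at 16 kHz) converts it to seconds.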
-
I implemented this in a script and I hope it works. Thanks. I also hope you add more options to select the language to be used, and an option to export .txt as well
-
Does this allow for auto-language detection after the silence has occurred on the
-
Just want to let you know that my project does a pretty good job of getting accurate timestamps, as it breaks the audio into nonsilent audio clips before constructing the transcript. There can still be discrepancies if a clip has audio that can't be transcribed, but it's generally pretty accurate. Check out my Show and Tell!
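For anyone wiring up a similar pipeline: once the audio is split into nonsilent clips and each clip is transcribed separately, the per-clip timestamps only need the clip's start offset added back to land on the original timeline. A minimal sketch (the segment dicts mirror Whisper's start/end/text fields, but the function and variable names are illustrative):

```python
def map_to_original(clip_spans, clip_results):
    """Offset each clip's local timestamps by that clip's start in the source audio.

    clip_spans: list of (start_sec, end_sec) for each clip in the original audio.
    clip_results: list of per-clip segment lists, each segment a dict with
    "start", "end", and "text" keys (times local to the clip).
    """
    merged = []
    for (clip_start, _), segments in zip(clip_spans, clip_results):
        for seg in segments:
            merged.append({"start": seg["start"] + clip_start,
                           "end": seg["end"] + clip_start,
                           "text": seg["text"]})
    return merged
```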
-
Any ideas why I'm getting the below error? I also get it with Are they not available in the latest version?
I used
-
Hi! I'm more a caption expert than a programmer, so please bear with me. I've been running stable-ts with success. However, I would like to improve the time stamping and would appreciate some help applying these changes to it. Can you let me know how to apply this code?
-
Hi @jianfch, can you share how you created the mp4 video with the aligned subtitles highlighted in green?
-
Hi. I was able to run Stable-ts on Google Colab without issues for a while, but had to put a pause on my project. I'm picking it up again now, and when I run my script (basically to look for .mp3 and .wav files in a specific directory and save the output to a predetermined directory), I get this error: "AssertionError: libcuda.so cannot found!" Any ideas on how to fix it? Thanks!
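That AssertionError usually means the runtime has no NVIDIA driver (e.g. a CPU-only Colab instance), so anything that tries to load libcuda.so fails. One defensive sketch is to pick the device at runtime instead of assuming CUDA; the `load_model` usage in the comments below reflects how stable-ts is commonly loaded, but check the project README for the current API.

```python
def pick_device(cuda_available: bool) -> str:
    """Fall back to CPU when no CUDA driver is present."""
    return "cuda" if cuda_available else "cpu"

# Usage (assumes torch and stable_whisper are installed):
#   import torch
#   import stable_whisper
#   device = pick_device(torch.cuda.is_available())
#   model = stable_whisper.load_model("base", device=device)
```

On Colab, switching the runtime type to GPU also restores libcuda.so, but the check above keeps the script from crashing either way.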
-
Check out this Whisper finetune, it might solve your problems:
-
This should definitely be the default behaviour in the main Whisper CLI; I can't imagine why you would want subtitles appearing 10 seconds before anyone speaks.
-
As some discussions have pointed out (e.g. #26, #237, #375), predicted timestamps tend to be integers, especially 0.0 for the initial timestamp. As a result, a phrase/word tends to start before the word is actually spoken. Even setting
max_initial_timestamp=None
does not appear to have much of an effect. So I added a timestamp filtering heuristic to combat this issue and improve timestamp accuracy as part of stable-ts, which relies on accurate segment timestamps. An example of the results:
dot.mp4
And the respective settings:
stable-ts with and without timestamp suppression:
weather.mp4
And the respective settings:
How it works:
upper_quantile, lower_quantile, lower_threshold.
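Going by the parameter names above, one plausible reading is that a silence threshold is derived from amplitude quantiles of the segment's audio, and a timestamp is nudged past leading audio that falls below that threshold. The sketch below is my own rough reconstruction under that assumption, not the actual stable-ts implementation; all names and defaults are illustrative.

```python
import numpy as np

def refine_start(samples, sr, seg_start, seg_end,
                 upper_quantile=0.85, lower_quantile=0.15, lower_threshold=0.15):
    """Move seg_start forward past leading audio that looks like silence.

    The threshold sits `lower_threshold` of the way between the lower and
    upper amplitude quantiles of the segment's samples.
    """
    chunk = np.abs(samples[int(seg_start * sr):int(seg_end * sr)])
    if chunk.size == 0:
        return seg_start
    lo = np.quantile(chunk, lower_quantile)
    hi = np.quantile(chunk, upper_quantile)
    thresh = lo + lower_threshold * (hi - lo)
    above = np.nonzero(chunk > thresh)[0]  # sample indices louder than the threshold
    if above.size == 0:
        return seg_start  # nothing above threshold; leave the timestamp alone
    return seg_start + above[0] / sr
```

Deriving the threshold from quantiles rather than a fixed level makes the gate adapt to each segment's own loudness range, which matters when recordings vary in gain.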
Edit:
This post was made for version 1.X of Stable-ts. Most of the content in this post no longer applies to version 2.X.