Improving Timestamp Accuracy #435
Replies: 19 comments 37 replies
-
It's interesting that in weather.mp4, it is not clear which version is better. Sometimes the result is more accurate above the line, other times it's more accurate below the line. E.g. below the line, at around the 10 second mark, it gets stuck on "We went into" for too long, and then rushes the following words to catch up. Above the line, it gets stuck on the word "and" around the 40 second mark and then rushes to catch up. One observation I've made that might be interesting is that when I edit the audio and actually cut out the silent parts (using VAD) with zero padding whatsoever, Whisper starts returning much shorter segments with very accurate timestamps. This could be useful in getting Whisper to give more fine-grained timestamps as anchor points. I see in your TODO that you eventually planned to do multiple inferences to combine the best from each, and for this it would be interesting to know ways to influence Whisper to lean toward shorter segments. Have you found other ways to influence Whisper in this direction?
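The silence-cutting experiment described above can be sketched with a plain RMS energy gate: split the audio into short frames, keep only spans whose energy exceeds a threshold, and remember where each kept span came from so timestamps on the condensed audio can be mapped back. This is a toy sketch, not a real VAD model; `nonsilent_regions`, `frame_len`, and `threshold` are illustrative names and values.

```python
import numpy as np

def nonsilent_regions(samples, sr, frame_len=0.02, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame RMS exceeds `threshold`."""
    hop = max(1, int(sr * frame_len))
    spans = []
    active_start = None  # sample index where the current nonsilent run began
    for i in range(0, len(samples), hop):
        frame = samples[i:i + hop]
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if rms > threshold:
            if active_start is None:
                active_start = i
        elif active_start is not None:
            spans.append((active_start / sr, i / sr))
            active_start = None
    if active_start is not None:  # audio ended while still nonsilent
        spans.append((active_start / sr, len(samples) / sr))
    return spans
```

Concatenating only these spans (zero padding, as described above) and transcribing the result is the experiment in question; the spans double as the offset table for mapping Whisper's timestamps back to the original audio.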
-
This is awesome! I'm working on a Whisper product and it's been driving me nuts how incorrect some timestamps are getting!
-
This is very, very, very, very, very helpful! Thank you!
-
Hi - nice job! Here is an example of what I have in mind: mm0.wav.mp4
-
I'm very new to Python and Git, and I can't get stable_whisper from https://github.com/jianfch/stable-ts to work. When trying to execute the new version, I'm getting an error. Also, in which folder should I put audio.mp3?
-
Thank you, this is awesome! One problem I am having is when I set the task to "translate" with the modified model:
I get the following error:
Do you have an idea of what might be going wrong?
-
Thanks, I will check. I was depending on YouTube auto timing until now :D Also, have you found a way to improve punctuation accuracy?
-
@jianfch Hi, how do you make a video demo like "weather.mp4"? What tool do you use?
-
This is a good enhancement, but does anyone know of someone using it with a CLI tool like the one the main Whisper package provides?
-
Hi, btw I made an enhancement to Whisper here using forced alignment with wav2vec; it seems to refine timestamps quite well -- although it needs more testing
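For readers curious how wav2vec-style forced alignment refines timestamps: a CTC acoustic model emits per-frame log-probabilities for each token, and a dynamic-programming trellis finds the best monotonic path through the transcript; the frame where the path advances to a token becomes that token's start time. Below is a toy sketch of that trellis search on synthetic log-probabilities, not a real wav2vec model; all names and shapes are illustrative.

```python
import numpy as np

def align(emissions, tokens):
    """Monotonic forced alignment over per-frame log-probabilities.

    emissions: (num_frames, vocab_size) log-probs; tokens: transcript token ids.
    Requires num_frames >= len(tokens). Returns the frame index where each
    token starts on the best path.
    """
    T, _ = emissions.shape
    M = len(tokens)
    trellis = np.full((T, M), -np.inf)
    trellis[0, 0] = emissions[0, tokens[0]]  # path must start on token 0
    for t in range(1, T):
        for m in range(M):
            stay = trellis[t - 1, m]                          # repeat same token
            move = trellis[t - 1, m - 1] if m > 0 else -np.inf  # advance to next token
            trellis[t, m] = emissions[t, tokens[m]] + max(stay, move)
    # Backtrack from (T-1, M-1): at each step, take the better predecessor.
    starts = [0] * M
    m = M - 1
    for t in range(T - 1, 0, -1):
        if m > 0 and trellis[t - 1, m - 1] >= trellis[t - 1, m]:
            starts[m] = t  # path advanced into token m at frame t
            m -= 1
    return starts
```

Multiplying a start frame by the model's frame stride (20 ms for wav2vec 2.0 at 16 kHz) converts it to seconds.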
-
I implemented this in a script and I hope it works. Thanks. I also hope you add more options to select the language to be used, and an option to export .txt as well
-
Does this allow for auto-language detection after the silence has occurred on the
-
Just want to let you know that my project does a pretty good job of getting accurate timestamps, as it breaks the audio into nonsilent audio clips before constructing the transcript. There can still be discrepancies if a clip has audio that can't be transcribed, but it's generally pretty accurate. Check out my Show and Tell!
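For anyone wiring up a similar pipeline: once the audio is split into nonsilent clips and each clip is transcribed separately, the per-clip timestamps only need the clip's start offset added back to land on the original timeline. A minimal sketch (the segment dicts mirror Whisper's start/end/text fields, but the function and variable names are illustrative):

```python
def map_to_original(clip_spans, clip_results):
    """Offset each clip's local timestamps by that clip's start in the source audio.

    clip_spans: list of (start_sec, end_sec) for each clip in the original audio.
    clip_results: list of per-clip segment lists, each segment a dict with
    "start", "end", and "text" keys (times local to the clip).
    """
    merged = []
    for (clip_start, _), segments in zip(clip_spans, clip_results):
        for seg in segments:
            merged.append({"start": seg["start"] + clip_start,
                           "end": seg["end"] + clip_start,
                           "text": seg["text"]})
    return merged
```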
-
Any ideas why I'm getting the below error? I also get it with Are they not available in the latest version?
I used
-
Hi! I'm more a caption expert than a programmer, so please bear with me. I've been running stable-ts with success. However, I would like to improve the time stamping and would appreciate some help applying these changes to it. Can you let me know how to apply this code?
-
Hi @jianfch, can you share how you created the mp4 video with the aligned subtitles highlighted in green?
-
Hi. I was able to run Stable-ts on Google Colab without issues for a while, but had to put a pause on my project. I'm picking it up again now, and when I run my script (basically to look for .mp3 and .wav files in a specific directory and save the output to a predetermined directory), I get this error: "AssertionError: libcuda.so cannot found!" Any ideas on how to fix it? Thanks!
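That AssertionError usually means the runtime has no NVIDIA driver (e.g. a CPU-only Colab instance), so anything that tries to load libcuda.so fails. One defensive sketch is to pick the device at runtime instead of assuming CUDA; the `load_model` usage in the comments below reflects how stable-ts is commonly loaded, but check the project README for the current API.

```python
def pick_device(cuda_available: bool) -> str:
    """Fall back to CPU when no CUDA driver is present."""
    return "cuda" if cuda_available else "cpu"

# Usage (assumes torch and stable_whisper are installed):
#   import torch
#   import stable_whisper
#   device = pick_device(torch.cuda.is_available())
#   model = stable_whisper.load_model("base", device=device)
```

On Colab, switching the runtime type to GPU also restores libcuda.so, but the check above keeps the script from crashing either way.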
-
Check out this Whisper finetune, it might solve your problems:
-
This should definitely be the default behaviour in the main Whisper CLI; I can't imagine why you would want subtitles appearing 10 seconds before anyone speaks.
-
As some discussions have pointed out (e.g. #26, #237, #375), predicted timestamps tend to be integers, especially 0.0 for the initial timestamp. As a result, a phrase/word tends to start before the word is actually spoken. Even setting
max_initial_timestamp=None
does not appear to have much of an effect. So I added a timestamp filtering heuristic to combat this issue and improve timestamp accuracy as part of stable-ts, which relies on accurate segment timestamps. An example of the results:
dot.mp4
And the respective settings:
stable-ts with and without timestamp suppression:
weather.mp4
And the respective settings:
How it works:
upper_quantile, lower_quantile, lower_threshold.
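Going by the parameter names above, one plausible reading is that a silence threshold is derived from amplitude quantiles of the segment's audio, and a timestamp is nudged past leading audio that falls below that threshold. The sketch below is my own rough reconstruction under that assumption, not the actual stable-ts implementation; all names and defaults are illustrative.

```python
import numpy as np

def refine_start(samples, sr, seg_start, seg_end,
                 upper_quantile=0.85, lower_quantile=0.15, lower_threshold=0.15):
    """Move seg_start forward past leading audio that looks like silence.

    The threshold sits `lower_threshold` of the way between the lower and
    upper amplitude quantiles of the segment's samples.
    """
    chunk = np.abs(samples[int(seg_start * sr):int(seg_end * sr)])
    if chunk.size == 0:
        return seg_start
    lo = np.quantile(chunk, lower_quantile)
    hi = np.quantile(chunk, upper_quantile)
    thresh = lo + lower_threshold * (hi - lo)
    above = np.nonzero(chunk > thresh)[0]  # sample indices louder than the threshold
    if above.size == 0:
        return seg_start  # nothing above threshold; leave the timestamp alone
    return seg_start + above[0] / sr
```

Deriving the threshold from quantiles rather than a fixed level makes the gate adapt to each segment's own loudness range, which matters when recordings vary in gain.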
Edit:
This post was made for version 1.X of Stable-ts. Most of the content in this post no longer applies to version 2.X.