Get segments based on VAD #528

Ar770 · 2023-10-25T14:49:42Z

I'm currently using a fine-tuned model that was converted to CT2 format. However, the word-level timestamps and segments produced by this model perform very poorly.

As an alternative, I'm considering using a forced-alignment model, which performs optimally with audio chunks of approximately 8 seconds in length.

Instead of obtaining the original Whisper segments, is it possible to get segments based on VAD?
Since the "max_speech_duration_s" parameter is available, I'm wondering whether it can be used to produce this VAD-based segmentation.
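
For illustration, here is a minimal sketch of what VAD-based segmentation could look like. It assumes that `faster_whisper.vad` exposes `VadOptions` and `get_speech_timestamps`, that `faster_whisper.audio` exposes `decode_audio`, and that the VAD returns spans as dicts with `start` and `end` in samples at 16 kHz; these details may differ between faster-whisper versions. The helper names `to_seconds`, `pack_spans`, and `vad_segments` are hypothetical and not part of the library.

```python
from typing import Dict, List, Tuple

SAMPLING_RATE = 16000  # assumption: audio is decoded to 16 kHz mono

Span = Tuple[float, float]


def to_seconds(timestamps: List[Dict[str, int]],
               sampling_rate: int = SAMPLING_RATE) -> List[Span]:
    """Convert VAD spans given in samples to (start, end) pairs in seconds."""
    return [(t["start"] / sampling_rate, t["end"] / sampling_rate)
            for t in timestamps]


def pack_spans(spans: List[Span], max_len_s: float = 8.0) -> List[Span]:
    """Greedily merge consecutive speech spans into chunks of at most max_len_s.

    The silence between merged spans is included in the chunk. A single span
    that is already longer than max_len_s is kept as its own chunk, not split.
    """
    chunks: List[Span] = []
    for start, end in spans:
        if chunks and end - chunks[-1][0] <= max_len_s:
            # Extending the current chunk keeps it within the length limit.
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def vad_segments(path: str, max_len_s: float = 8.0) -> List[Span]:
    """Hypothetical helper: run the VAD on a file and return packed chunks."""
    # Imported lazily so the helpers above work without faster-whisper installed.
    from faster_whisper.audio import decode_audio
    from faster_whisper.vad import VadOptions, get_speech_timestamps

    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    # max_speech_duration_s asks the VAD to cap the length of each speech span.
    options = VadOptions(max_speech_duration_s=max_len_s)
    timestamps = get_speech_timestamps(audio, options)
    return pack_spans(to_seconds(timestamps), max_len_s)
```

Each resulting `(start, end)` pair could then be used to slice the audio and pass the chunk to a forced-alignment model. Merging short spans up to the limit is a design choice made here to get chunks close to 8 seconds; using the raw VAD spans directly would also work.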

toanhuynhnguyen · 2024-09-15T03:04:38Z

If you want more accurate word and sentence timestamps, you can use Batched Faster-Whisper; I have tested it.
Alternatively, you can use https://github.com/m-bain/whisperX, but it will be slower than Faster Whisper.
