
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Open
wants to merge 84 commits into base: master

Conversation

@Jiltseb Jiltseb commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a 3x speed increase. This implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed, up to 12.5x faster on average than the OpenAI implementation! (A hedged usage sketch follows the example below.)

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline
#load faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16") 

#apply batched pipeline
batched_model = BatchedInferencePipeline(model=model)

#predict using the batched_model
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
	print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality improvements:

  1. Consistency across runs: By setting the model seed, consistency across runs is improved.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering highly confident and random segments, breaking ties to determine the major language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected switch can be off by up to one 30-second segment.

Language detection usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
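
The structure of the returned language_info is not spelled out in this description, so the snippet below only sketches one plausible way to reuse it; the "language_code" key is an assumption, and printing the object first is the safe way to inspect it.

#sketch only: the "language_code" key is an assumption; print language_info to
#inspect the real structure returned by detect_language_multi_segment
print(language_info)
if isinstance(language_info, dict) and "language_code" in language_info:
    segments, info = model.transcribe("audio.mp3", language=language_info["language_code"])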

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with average durations generally under 10 seconds. Hence, we tested more complex, long-form use cases on a subset of the YouTube-Commons dataset. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

| System | Speed (GPU) | Speed (CPU) |
| --- | --- | --- |
| OpenAI Whisper | 8.2x | 4.5x |
| faster-whisper | 20.1x | 5.6x |
| HF Whisper (batched) | 59.3x | 8.4x |
| Batched Faster-Whisper | 104x | 14.6x |

WER:

| System | WER |
| --- | --- |
| OpenAI Whisper | 15.1 |
| faster-whisper | 14.6 |
| HF Whisper (batched) | 16.8 |
| Batched Faster-Whisper | 13.1 |

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. The test set contains 9 audio files ranging from 3 to 13 minutes, covering various audio types.

| System | WER | Speed |
| --- | --- | --- |
| OpenAI Whisper | 6.8 | 9.1x |
| faster-whisper | 6.1 | 17.4x |
| HF Whisper (batched) | 8.2 | 42.8x |
| Batched Faster-Whisper | 6.5 | 86.6x |

Batched processing speeds up long-form audio without increasing the WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their requirements, as sketched below.
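
A minimal sketch of that switch, reusing the model and batched_model objects from the usage example above:

#sequential, as in upstream faster-whisper
segments, info = model.transcribe("audio.mp3", beam_size=5)

#batched, via the pipeline added in this PR
result = batched_model.transcribe("audio.mp3", batch_size=16)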

Thank you in advance!

Acknowledgements

This is the work done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
@Jiltseb
Author

Jiltseb commented Jun 17, 2024

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction in inference speed on my 5-second clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be deactivated by default? Also, I have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to that on the benchmarking dataset. There is no need for batching for a 5-second audio clip anyway (you can combine several clips with silence in between if you want to run them at once with batching). Deactivating the VAD model would mean you have to provide VAD segments yourself for it to segment the audio; setting use_vad_model to False means you will provide external VAD segments instead. What is your intention in setting use_vad_model to False?

I think if vad is not used and vad timestamps aren't provided, it should default to regular 30s chunking without any bells and whistles

Do you mean uniform chunking? This can abruptly cut in the middle of a word, causing transcription issues. It is possible to implement an LCS-based solution (as in transformers) around the boundary, but that would affect the WER. I can easily add a case to bypass the VAD model if the audio duration is less than 30 seconds, though.
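
For illustration, "uniform chunking" here would mean cutting a 16 kHz waveform into fixed 30-second windows with no regard for word boundaries, roughly like the hypothetical helper below (not part of this PR):

import numpy as np

def uniform_chunks(audio: np.ndarray, sample_rate: int = 16000, chunk_s: int = 30):
    #fixed-size windows; the last chunk may be shorter
    step = sample_rate * chunk_s
    return [audio[i : i + step] for i in range(0, len(audio), step)]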

@Jobus0

Jobus0 commented Jun 17, 2024

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction in inference speed on my 5-second clip with CUDA.

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset.

I just now set up two fresh projects. On the first, I ran pip install faster-whisper. On the second, I ran pip install git+https://github.com/mobiusml/faster-whisper.git. Other than those dependencies (and sub-dependencies), they are identical.

I then ran this simple non-batched script on both, first with a 5-second clip and then with a 10-minute clip:

import faster_whisper
import time

model = faster_whisper.WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

# warm up
segments, info = model.transcribe("benchmark.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    segments, info = model.transcribe("benchmark.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")

Results for the 5-second clip

| repository | clip length | average elapsed time | relative % |
| --- | --- | --- | --- |
| SYSTRAN (original) | 5 sec | 0.1837 sec | 100% |
| mobiusml (fork) | 5 sec | 0.2924 sec | 159% |

Note: “relative %” compares inference times, with SYSTRAN as the baseline. In this case, the fork takes 59% more time, which could be translated to it being ~38% slower.

Results for the 10-minute clip

| repository | clip length | average elapsed time | relative % |
| --- | --- | --- | --- |
| SYSTRAN (original) | 10 min | 0.8062 sec | 100% |
| mobiusml (fork) | 10 min | 0.9063 sec | 112% |

I've rerun the script many times and get consistent results.

Screenshot of the process. Left side is the original repo, right side is the fork.

[screenshot: benchmark]

OS: Windows 11
GPU: RTX 4070
Python: 3.12

@Jiltseb
Author

Jiltseb commented Jun 17, 2024

(Quoting Jobus0's benchmark comment above.)

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

@Jobus0

Jobus0 commented Jun 18, 2024

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

Tested repos:

SYSTRAN (1.0.2): pip install faster-whisper
SYSTRAN (master): pip install git+https://github.com/SYSTRAN/faster-whisper.git
mobiusml (master): pip install git+https://github.com/mobiusml/faster-whisper.git

Results for the 5-second clip

| repository | clip length | average elapsed time | relative % | re-run variance % |
| --- | --- | --- | --- | --- |
| SYSTRAN (1.0.2) | 5 sec | 0.1737 sec | 100.0% | +/- 1.8% |
| SYSTRAN (master) | 5 sec | 0.1733 sec | 99.7% | +/- 1.8% |
| mobiusml (master) | 5 sec | 0.2773 sec | 159.6% | +/- 1.6% |

Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random.

@Jiltseb
Author

Jiltseb commented Jun 18, 2024

(Quoting the 5-second clip comparison above.)

Thanks for pointing this out and confirming the issue.

I could reproduce the error and found that it happens because the garbage collector tries to clear all objects when loading the audio. Setting the resampler to None and then deleting it ensures the object is properly removed, so we can drop the manual gc.collect() call, which was causing the delay.

After removing the manual garbage collection, this version works fine with a similar runtime.

@trungkienbkhn Setting the resampler to None seems to solve the memory-leak problem just as gc.collect() did.
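
A minimal sketch of the change described above (illustrative only; the actual diff is in this PR and is also shown later in this thread):

#inside the audio decoding path, after resampling
resampler = None   #drop the reference so it can be released
del resampler
#gc.collect()      #removed: the explicit collection caused the slowdown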

… automatically processed without VAD, removed manual garbage collector, etc
@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

(Quoting the preceding exchange.)

Hello. I confirm that after removing gc.collect(), mobiusml (master) works fine with a similar runtime to SYSTRAN (original).
However, it seems that replacing gc.collect() with resampler = None doesn't solve the memory leak problem.
I tried this example again:

Baseline
5350.244352
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

Decode audio once
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

After changing with resampler=None
5366.464512
5387.788288
5397.89312
5410.004992
5424.2304
5387.927552
5397.909504
5410.021376
5424.2304
5439.737856
5388.161024
5398.142976
5410.254848
5424.463872
5439.91808
5387.620352
5398.765568
5412.974592
5426.147328
5439.32416
5452.496896
5465.669632
5478.846464
5492.0192
5506.564096
5517.398016
5530.570752
5543.747584
5556.92032
5481.639936
5481.762816
5481.885696
5482.008576
5482.131456
5482.254336
5482.377216
5482.500096
5491.99872
5504.380928
5517.312

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

I got better values when replacing the garbage collector call with setting the resampler to None.
With the same settings as "Decode audio once":

1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128

If you run decode_audio multiple times (Baseline):

Baseline
1278.783488
1279.455232
1279.496192
1281.31072
1281.31072
1281.425408
1281.437696
1281.437696
1281.437696
1281.437696
1281.437696
1281.441792
1281.441792
1281.441792
1281.441792
1281.441792
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888

I tried the same with SYSTRAN master, with multiple decode_audio calls:

Baseline
762.056704
763.51488
763.51488
763.650048
763.65824
764.60032
764.604416
764.624896
764.694528
764.694528

@trungkienbkhn Can you check SYSTRAN master as well from your end?

Contributor

@MahmoudAshraf97 MahmoudAshraf97 left a comment

This removes the need for an extra dependency without losing any functionality.

from inspect import signature
from typing import BinaryIO, Iterable, List, NamedTuple, Optional, Tuple, Union

import ctranslate2
import jsons

Suggested change
import jsons

**preprocess_params,
"enable_ta_fe": enable_ta_fe,
}
options_dict = jsons.dump(options)

Suggested change
options_dict = jsons.dump(options)
options_dict = options._asdict()
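
For context on this suggestion: NamedTuple instances already provide _asdict() in the standard library, so the jsons dependency is not needed for this conversion. A minimal, self-contained illustration (the TranscriptionOptions fields here are hypothetical and simplified):

from typing import NamedTuple, Optional

class TranscriptionOptions(NamedTuple):  #hypothetical, simplified
    beam_size: int
    language: Optional[str]

options = TranscriptionOptions(beam_size=5, language="en")
print(options._asdict())  #-> {'beam_size': 5, 'language': 'en'} on Python 3.8+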

pyannote-audio>=3.1.1
torch>=2.1.1
torchaudio>=2.1.2
jsons>=1.6.3

Suggested change
jsons>=1.6.3

@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

I switched to an H100 GPU with the large-v3 model and checked out the mobiusml/master branch; below is my result for 100 runs:

After changing
2306.183168
2314.395648
2322.096128
2306.6624
2314.36288
2322.333696
2330.034176
2338.004992
2306.711552
2314.412032
2322.382848
2330.083328
2338.054144
2306.748416
2314.448896
2322.419712
2330.120192
2338.091008
2345.734144
2346.840064
2346.840064
2346.840064
2346.840064
2346.840064
2347.106304
2361.294848
2369.265664
2376.966144
2384.93696
2392.63744
2400.608256
2408.308736
2416.279552
2424.180736
2431.881216
2439.852032
2447.552512
2455.252992
2347.106304
2347.106304
2347.106304
2347.106304
2347.106304
2361.561088
2369.261568
2377.232384
2384.932864
2392.90368
2400.60416
2408.574976
2416.275456
2424.17664
2431.87712
2439.847936
2447.548416
2455.519232
2455.67488
2455.67488
2455.67488
2455.67488
2455.67488
2455.94112
2455.94112
2455.94112
2455.94112
2455.94112
2456.211456
2456.211456
2456.211456
2456.211456
2456.481792
2456.481792
2456.481792
2456.481792
2463.100928
2462.26944
2462.26944
2462.26944
2462.26944
2462.53568
2462.53568
2462.53568
2462.53568
2462.806016
2462.806016
2462.806016
2462.806016
2462.806016
2463.076352
2463.076352
2463.076352
2463.076352
2463.346688
2463.514624
2463.514624
2463.514624
2463.514624
2463.514624
2463.780864
2463.780864

Decode audio once: 2371.44064

My code logic:

import psutil
import gc
import sys
import faster_whisper

model = faster_whisper.WhisperModel("large-v3", device="cuda")
audio_path = "tests/data/jfk.flac"
process = psutil.Process()


def monitor_memory(audio, n=100):
    for _ in range(n):
        segments, _ = model.transcribe(audio)
        text = "".join(segment.text for segment in segments)
        print(process.memory_info().rss / 1000000)

    print("")
    gc.collect()


print("After changing")
monitor_memory(audio_path)

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

@trungkienbkhn Okay, thanks. Could you also report the results with the SYSTRAN/master branch as well (same code)?

@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

@trungkienbkhn Okay, thanks. Could you also report the results with the SYSTRAN/master branch as well (same code)?

Okay.
Same device and same code for SYSTRAN/master branch:

Baseline
1848.13568
1848.373248
1848.389632
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192

If I modify the code logic (replace gc.collect() with resampler = None) in the SYSTRAN/master branch:

Modify in Systran/master
1513.590784
1527.56224
1527.734272
1530.974208
1543.811072
1556.709376
1573.085184
1541.271552
1543.430144
1557.74976
1569.97632
1586.507776
1531.318272
1542.402048
1559.01952
1571.045376
1584.283648
1546.40384
1546.40384
1558.155264
1571.262464
1584.300032
1597.407232
1610.514432
1623.621632
1636.88448
1650.368512
1663.40608
1675.362304
1688.670208
1703.870464
1716.977664
1730.015232
1743.433728
1756.295168
1755.324416
1755.324416
1755.324416
1755.324416
1755.324416
1755.590656
1755.590656
1755.590656
1755.590656
1755.590656
1755.860992
1755.860992
1755.860992
1755.860992
1756.131328
1756.131328
1756.131328
1759.576064
1774.096384
1773.121536
1773.121536
1773.121536
1773.121536
1773.387776
1773.387776
1773.387776
1773.387776
1773.387776
1773.658112
1773.658112
1773.658112
1773.658112
1773.928448
1773.928448
1773.928448
1773.928448
1774.198784
1773.182976
1773.182976
1773.182976
1773.182976
1773.182976
1773.449216
1773.449216
1773.449216
1773.449216
1773.719552
1773.719552
1773.719552
1773.719552
1773.719552
1773.989888
1773.989888
1773.989888
1776.422912
1788.592128
1787.670528
1787.670528
1787.670528
1787.670528
1787.936768
1787.936768
1787.936768
1787.936768
1788.207104

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

@hargunmujral

hargunmujral commented Jun 19, 2024

Could you include the added pyannote_vad_model.bin in the MANIFEST.in file? Otherwise the file isn't bundled in the wheel and is instead downloaded from HF. mobiusml#17

@trungkienbkhn
Collaborator

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

I modified the logic as below:

resampler = None
del resampler
# gc.collect()

Update MANIFEST.in to include pyannote asset
@hargunmujral

hargunmujral commented Jun 20, 2024

I also noticed that this PR broke the prepend_punctuation and append_punctuation options in word-level timestamping. Can this be fixed?

Additionally, I occasionally see the following error when doing transcription: 'NoneType' object has no attribute 'split_to_word_tokens'. Do you know what that could be related to? It seems to come from the tokenizer not being initialized, but I'm not sure why.

@Jiltseb
Author

Jiltseb commented Jun 20, 2024

The options prepend_punctuation and append_punctuation are not yet added in BatchedInference. I will check that.

I could not reproduce the error you mentioned. split_to_word_tokens cannot be called on a None tokenizer, but we are setting it in the code. Can you check the result of

self.tokenizer = Tokenizer(
    self.model.hf_tokenizer,
    self.model.model.is_multilingual,
    task=task,
    language=language,
)

in the get_language_and_tokenizer function when this happens?

@amdrozdov

Hello, I reproduced the same issue with 'NoneType' object has no attribute 'split_to_word_tokens'. It happens if I set num_workers > 1 (for the transcribe function). Does it happen because of multi-threading?

@MahmoudAshraf97
Contributor

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

Since torchaudio is already included in the requirements and has a resampling algorithm that supports GPU without needing ffmpeg or any external libraries, I suggest we use it and remove the PyAV resampling.

@Jiltseb
Author

Jiltseb commented Jun 22, 2024

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

Since torchaudio is already included in the requirements and has a resampling algorithm that supports GPU without needing ffmpeg or any external libraries, I suggest we use it and remove the PyAV resampling.

Good idea. Can you check and confirm whether the loaded audio is exactly the same (the torchaudio resampler can be combined with PyAV decoding)?
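
A hedged sketch of one way to run that check: decode with PyAV via faster_whisper.decode_audio, resample with torchaudio, and compare against the PyAV-resampled output. The 44100 Hz native rate and the file name are assumptions; substitute the real values.

import numpy as np
import torch
import torchaudio
from faster_whisper.audio import decode_audio

native_sr = 44100  #assumption: the file's native sampling rate
native = decode_audio("audio.mp3", sampling_rate=native_sr)  #decode at the (assumed) native rate

#resample with torchaudio
ta_16k = torchaudio.functional.resample(
    torch.from_numpy(native), orig_freq=native_sr, new_freq=16000
).numpy()

#reference: PyAV decode + resample in one step
pyav_16k = decode_audio("audio.mp3", sampling_rate=16000)

n = min(len(ta_16k), len(pyav_16k))
print("max abs difference:", np.abs(ta_16k[:n] - pyav_16k[:n]).max())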

@MahmoudAshraf97
Contributor

I implemented it and other changes here
Everyone's reviews and comments are appreciated

@trungkienbkhn
Collaborator

I implemented it and other changes here Everyone's reviews and comments are appreciated

That is a good idea to avoid the memory leak, but when I tested your changes, I found that the WER increased significantly.

Evaluating...: 499it [02:51,  2.91it/s]
WER: 19.771

Compare with the original FW (3.097) and the original multi-batch FW (1.773) from here.

@MahmoudAshraf97
Contributor

MahmoudAshraf97 commented Jun 24, 2024

I implemented it and other changes here Everyone's reviews and comments are appreciated

That is a good idea to avoid memory leak, but when I tested your changes, I found that the WER increased significantly.

Evaluating...: 499it [02:51,  2.91it/s]
WER: 19.771

Compare with the original FW (3.097) and the original multi-batch FW (1.773) from here.

It was a bug in pad_or_trim; it should be fixed now, and the WER dropped by another 0.85% on non-batched inference.

Evaluating...: 499it [04:59,  1.67it/s]
WER: 2.242
