New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856
base: master
Conversation
PR: Changes to the faster-whisper project for ASR v2.1, based on the latest faster_whisper (0.9.0).
SDK v3.0 does not work with the latest numpy version (1.26.0), and faster-whisper won't work if numpy <1.21.6.
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
Do you mean uniform chunking? That can abruptly cut in the middle of a word as well, causing transcription issues. It is possible to implement an LCS-based merging solution around the boundary (such as in transformers), but that will affect the WER. I can easily add a case to bypass the VAD model if the audio duration is less than 30 sec, though (a rough sketch follows).
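A minimal sketch of what that bypass could look like; get_speech_segments and the callable vad_model are hypothetical names for illustration, not faster-whisper internals:

# Hypothetical sketch: skip the VAD model entirely for short clips.
# MAX_SINGLE_CHUNK_SEC mirrors Whisper's 30-second input window.
MAX_SINGLE_CHUNK_SEC = 30.0

def get_speech_segments(audio, sampling_rate, vad_model):
    duration = len(audio) / sampling_rate
    if duration <= MAX_SINGLE_CHUNK_SEC:
        # The whole clip fits in one Whisper window, so chunking
        # (and hence VAD) adds nothing; return it as a single segment.
        return [(0.0, duration)]
    return vad_model(audio, sampling_rate)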
I just now set up two fresh projects, one with the original repo and one with the fork. I then ran this simple non-batched script on both, first with a 5-second clip and then with a 10-minute clip:

import faster_whisper
import time

model = faster_whisper.WhisperModel(
    "distil-large-v3",
    device="cuda",
    compute_type="float16")

# warm up
segments, info = model.transcribe("benchmark.wav", beam_size=5)

total_start_time = time.time()
repeats = 10
for i in range(repeats):
    start_time = time.time()
    # note: transcribe() returns segments lazily; these timings cover the eager
    # part of the call (audio decoding, feature extraction, language detection)
    segments, info = model.transcribe("benchmark.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")

Results for 5-second clip:
Note: “relative %” compares inference times, with SYSTRAN as the baseline. In this case the fork takes 59% more time; in throughput terms that is 1/1.59 ≈ 0.63 of the baseline speed, i.e. roughly 37% slower.

Results for 10-minute clip:
I've rerun the script many times and get consistent results. Screenshot of the process: the left side is the original repo, the right side is the fork. OS: Windows 11.
I see. Can you please compare against the dev branch of faster-whisper and report the results?
Tested repos: SYSTRAN (1.0.2), SYSTRAN (master), mobiusml (master).

Results for 5-second clip:
Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random. |
Thanks for pointing this out and confirming the issue. I could reproduce the error and found that this is because the garbage collector tries to clear all objects when loading the audio. Setting the resampler to None instead avoids this. After removing the manual garbage collector, this version works fine with a similar run time. @trungkienbkhn The solution to the memory-leak problem with gc.collect() may need revisiting.
… automatically processed without VAD, removed manual garbage collector, etc
Bug fixes, adding no VAD transcription, tests
Hello. I confirm that after removing gc.collect(), mobiusml (master) works fine, with a similar runtime to SYSTRAN (original).
I got better values when replacing the garbage collector with setting the resampler to None.
@trungkienbkhn Can you check SYSTRAN master as well from your end?
This removes the need for an extra dependency without losing any functionality.
from inspect import signature
from typing import BinaryIO, Iterable, List, NamedTuple, Optional, Tuple, Union

import ctranslate2
import jsons
Suggested change:
- import jsons
    **preprocess_params,
    "enable_ta_fe": enable_ta_fe,
}
options_dict = jsons.dump(options)
Suggested change:
- options_dict = jsons.dump(options)
+ options_dict = options._asdict()
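This works because the options object is a NamedTuple (see the typing import in the diff above), and NamedTuple instances already provide a built-in _asdict() method. A minimal illustration, using a hypothetical stand-in for the real options type:

from typing import NamedTuple

class TranscriptionOptions(NamedTuple):
    # hypothetical stand-in for faster-whisper's real options NamedTuple
    beam_size: int
    language: str

opts = TranscriptionOptions(beam_size=5, language="en")
print(opts._asdict())  # {'beam_size': 5, 'language': 'en'} -- no third-party jsons needed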
pyannote-audio>=3.1.1
torch>=2.1.1
torchaudio>=2.1.2
jsons>=1.6.3
Suggested change:
- jsons>=1.6.3
I switched to a GPU H100 with the large-v3 model and checked out the mobius/master branch; below are my results for 100 runs:

Decode audio once: 2371.44064

My code logic:

import psutil
import gc
import sys
import faster_whisper

model = faster_whisper.WhisperModel("large-v3", device="cuda")
audio_path = "tests/data/jfk.flac"
process = psutil.Process()

def monitor_memory(audio, n=100):
    # transcribe the same clip n times, printing resident memory (MB) after each run
    for _ in range(n):
        segments, _ = model.transcribe(audio)
        text = "".join(segment.text for segment in segments)
        print(process.memory_info().rss / 1000000)
    print("")
    gc.collect()

print("After changing")
monitor_memory(audio_path)
@trungkienbkhn Okay, thanks. Could you also report the results with the SYSTRAN/master branch (same code)?
Okay.
If I modify the code logic (replacing gc.collect() with resampler = None) in the SYSTRAN/master branch:
I assume you have first set resampler = None?
Could you include the added …
I modified the logic as below:

resampler = None
del resampler
# gc.collect()
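For context on why dropping the forced collection helps, here is an illustrative micro-benchmark (not from the PR) showing the latency a per-call gc.collect() adds to a hot path:

# Illustrative only: measure the overhead of forcing a full GC pass per call.
import gc
import time

def work():
    # stand-in for per-transcription allocation churn
    return [i * i for i in range(10_000)]

start = time.time()
for _ in range(100):
    work()
    gc.collect()  # forced full collection on each call, as the old code path did
print(f"with gc.collect():    {time.time() - start:.3f}s")

start = time.time()
for _ in range(100):
    work()
print(f"without gc.collect(): {time.time() - start:.3f}s")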
Update MANIFEST.in to include pyannote asset
I also noticed that this PR broke the … Additionally, I occasionally see the following error: …
The options … I could not reproduce the error you mentioned. The tokenizer is set as:

self.tokenizer = Tokenizer(
    self.model.hf_tokenizer,
    self.model.model.is_multilingual,
    task=task,
    language=language,
)

in the …
Hello, I reproduced the same issue with …
Since …
Good idea. Can you check and confirm whether the loaded audio is exactly the same (the torchaudio resampler can be combined with pyAV decoding)? A sketch of such a check follows.
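A minimal sketch of such a comparison, using faster-whisper's decode_audio (pyAV-based) as the reference; the torchaudio path shown here is an assumption about the proposed change, not the PR's actual code:

# Hedged sketch: compare pyAV-decoded audio against a torchaudio load+resample path.
import numpy as np
import torchaudio
from faster_whisper import decode_audio  # pyAV-based decoder

target_sr = 16000
pyav_audio = decode_audio("audio.wav", sampling_rate=target_sr)

waveform, sr = torchaudio.load("audio.wav")
ta_audio = torchaudio.functional.resample(waveform, sr, target_sr).mean(dim=0).numpy()

n = min(len(pyav_audio), len(ta_audio))
print("max abs diff:", np.abs(pyav_audio[:n] - ta_audio[:n]).max())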
I implemented it and the other changes here.
That is a good idea to avoid the memory leak, but when I tested your changes, I found that the WER increased significantly.
Compare with the original FW (3.097) and the original multi-batch FW (1.773) from here.
It was a bug in …
Hello everyone,
This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!
Speed improvements:
Batching support: Inspired by whisper-x, this update introduces batching support allowing for a 3x speed increase. This implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.
Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed. This is up to 12.5x faster on average compared to the OpenAI implementation!

Using the batched version is straightforward; a sketch is below:
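The PR's original snippet is not preserved in this thread; the following is a minimal sketch assuming the BatchedInferencePipeline wrapper name used in current faster-whisper releases:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many 30-second chunks are decoded in parallel
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")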
Quality Improvements
Language detection usage (a sketch follows):
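The exact snippet from the PR is not preserved here; this sketch assumes the detect_language_multi_segment method the PR introduces, so treat the method name and return shape as assumptions:

# Assumed API: multi-segment language detection added by this PR.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
# Samples several segments across the file instead of only the first 30 seconds.
result = model.detect_language_multi_segment("audio.mp3")
print(result)  # e.g. {"language_code": ..., "language_confidence": ...}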
Benchmarking:
A. Open source benchmarking:
Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper-medium model is used (batch size = 8 for the batched versions) in these experiments. The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.
Speed (x real-time):
WER:
B. Internal dataset:
Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors, so additional internal benchmarking ensures robustness across scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. It contains nine audio files ranging from 3 to 13 minutes, covering various audio types.
Batched processing speeds up long-form audio without causing an increase in WER. Users can easily switch between sequential and batched Faster Whisper versions based on specific requirements.
Thank you in advance!
Acknowledgements
This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.