
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Open
wants to merge 84 commits into base: master

Conversation

@Jiltseb Jiltseb commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a 3x speed increase. This implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed, up to 12.5x faster on average than the OpenAI implementation! (A hedged usage sketch follows the example below.)

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline
#load faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16") 

#apply batched pipeline
batched_model = BatchedInferencePipeline(model=model)

#predict using the batched_model
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
	print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality improvements:

  1. Consistency across runs: By setting the model seed, consistency across runs is improved.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering highly confident and random segments, breaking ties to determine the major language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected switch can be off by up to one 30-second segment.

Language detection usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
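
The structure of the returned language_info is not spelled out in this description, so the snippet below only sketches one plausible way to reuse it; the "language_code" key is an assumption, and printing the object first is the safe way to inspect it.

#sketch only: the "language_code" key is an assumption; print language_info to
#inspect the real structure returned by detect_language_multi_segment
print(language_info)
if isinstance(language_info, dict) and "language_code" in language_info:
    segments, info = model.transcribe("audio.mp3", language=language_info["language_code"])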

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with average durations generally under 10 seconds. Hence, we tested more complex, long-form use cases on a subset of the YouTube-Commons dataset. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

| System | Speed (GPU) | Speed (CPU) |
| --- | --- | --- |
| OpenAI Whisper | 8.2x | 4.5x |
| faster-whisper | 20.1x | 5.6x |
| HF Whisper (batched) | 59.3x | 8.4x |
| Batched Faster-Whisper | 104x | 14.6x |

WER:

| System | WER |
| --- | --- |
| OpenAI Whisper | 15.1 |
| faster-whisper | 14.6 |
| HF Whisper (batched) | 16.8 |
| Batched Faster-Whisper | 13.1 |

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. The test set contains 9 audio files ranging from 3 to 13 minutes, covering various audio types.

| System | WER | Speed |
| --- | --- | --- |
| OpenAI Whisper | 6.8 | 9.1x |
| faster-whisper | 6.1 | 17.4x |
| HF Whisper (batched) | 8.2 | 42.8x |
| Batched Faster-Whisper | 6.5 | 86.6x |

Batched processing speeds up long-form audio without increasing the WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their requirements, as sketched below.
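
A minimal sketch of that switch, reusing the model and batched_model objects from the usage example above:

#sequential, as in upstream faster-whisper
segments, info = model.transcribe("audio.mp3", beam_size=5)

#batched, via the pipeline added in this PR
result = batched_model.transcribe("audio.mp3", batch_size=16)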

Thank you in advance!

Acknowledgements

This is the work done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
@Jiltseb
Author

Jiltseb commented Jun 17, 2024

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction in inference speed on my 5-second clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be deactivated by default? Also, I have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to that on the benchmarking dataset. There is no need for batching for a 5-second audio clip anyway (you can combine several clips with silence in between if you want to run them at once with batching). Deactivating the VAD model would mean you have to provide VAD segments yourself for it to segment the audio; setting use_vad_model to False means you will provide external VAD segments instead. What is your intention in setting use_vad_model to False?

I think if vad is not used and vad timestamps aren't provided, it should default to regular 30s chunking without any bells and whistles

Do you mean uniform chunking? This can abruptly cut in the middle of a word, causing transcription issues. It is possible to implement an LCS-based solution (as in transformers) around the boundary, but that would affect the WER. I can easily add a case to bypass the VAD model if the audio duration is less than 30 seconds, though.
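
For illustration, "uniform chunking" here would mean cutting a 16 kHz waveform into fixed 30-second windows with no regard for word boundaries, roughly like the hypothetical helper below (not part of this PR):

import numpy as np

def uniform_chunks(audio: np.ndarray, sample_rate: int = 16000, chunk_s: int = 30):
    #fixed-size windows; the last chunk may be shorter
    step = sample_rate * chunk_s
    return [audio[i : i + step] for i in range(0, len(audio), step)]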

@Jobus0

Jobus0 commented Jun 17, 2024

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction in inference speed on my 5-second clip with CUDA.

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset.

I just now set up two fresh projects. On the first, I ran pip install faster-whisper. On the second, I ran pip install git+https://github.com/mobiusml/faster-whisper.git. Other than those dependencies (and sub-dependencies), they are identical.

I then ran this simple non-batched script on both, first with a 5-second clip and then with a 10-minute clip:

import faster_whisper
import time

model = faster_whisper.WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

# warm up
segments, info = model.transcribe("benchmark.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    segments, info = model.transcribe("benchmark.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")

Results for the 5-second clip

| repository | clip length | average elapsed time | relative % |
| --- | --- | --- | --- |
| SYSTRAN (original) | 5 sec | 0.1837 sec | 100% |
| mobiusml (fork) | 5 sec | 0.2924 sec | 159% |

Note: “relative %” compares inference times, with SYSTRAN as the baseline. In this case, the fork takes 59% more time, which could be translated to it being ~38% slower.

Results for the 10-minute clip

| repository | clip length | average elapsed time | relative % |
| --- | --- | --- | --- |
| SYSTRAN (original) | 10 min | 0.8062 sec | 100% |
| mobiusml (fork) | 10 min | 0.9063 sec | 112% |

I've rerun the script many times and get consistent results.

Screenshot of the process. Left side is the original repo, right side is the fork.

[screenshot: benchmark]

OS: Windows 11
GPU: RTX 4070
Python: 3.12

@Jiltseb
Author

Jiltseb commented Jun 17, 2024

(Quoting Jobus0's benchmark comment above.)

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

@Jobus0

Jobus0 commented Jun 18, 2024

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

Tested repos:

SYSTRAN (1.0.2): pip install faster-whisper
SYSTRAN (master): pip install git+https://github.com/SYSTRAN/faster-whisper.git
mobiusml (master): pip install git+https://github.com/mobiusml/faster-whisper.git

Results for the 5-second clip

| repository | clip length | average elapsed time | relative % | re-run variance % |
| --- | --- | --- | --- | --- |
| SYSTRAN (1.0.2) | 5 sec | 0.1737 sec | 100.0% | +/- 1.8% |
| SYSTRAN (master) | 5 sec | 0.1733 sec | 99.7% | +/- 1.8% |
| mobiusml (master) | 5 sec | 0.2773 sec | 159.6% | +/- 1.6% |

Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random.

@Jiltseb
Author

Jiltseb commented Jun 18, 2024

(Quoting the 5-second clip comparison above.)

Thanks for pointing this out and confirming the issue.

I could reproduce the error and found that it happens because the garbage collector tries to clear all objects when loading the audio. Setting the resampler to None and then deleting it ensures the object is properly removed, so we can drop the manual gc.collect() call, which was causing the delay.

After removing the manual garbage collection, this version works fine with a similar runtime.

@trungkienbkhn Setting the resampler to None seems to solve the memory-leak problem just as gc.collect() did.
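
A minimal sketch of the change described above (illustrative only; the actual diff is in this PR and is also shown later in this thread):

#inside the audio decoding path, after resampling
resampler = None   #drop the reference so it can be released
del resampler
#gc.collect()      #removed: the explicit collection caused the slowdown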

… automatically processed without VAD, removed manual garbage collector, etc
@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

(Quoting the preceding exchange.)

Hello. I confirm that after removing gc.collect(), mobiusml (master) works fine with a similar runtime to SYSTRAN (original).
However, it seems that replacing gc.collect() with resampler = None doesn't solve the memory leak problem.
I tried this example again:

Baseline
5350.244352
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

Decode audio once
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

After changing with resampler=None
5366.464512
5387.788288
5397.89312
5410.004992
5424.2304
5387.927552
5397.909504
5410.021376
5424.2304
5439.737856
5388.161024
5398.142976
5410.254848
5424.463872
5439.91808
5387.620352
5398.765568
5412.974592
5426.147328
5439.32416
5452.496896
5465.669632
5478.846464
5492.0192
5506.564096
5517.398016
5530.570752
5543.747584
5556.92032
5481.639936
5481.762816
5481.885696
5482.008576
5482.131456
5482.254336
5482.377216
5482.500096
5491.99872
5504.380928
5517.312

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

I got better values when replacing the garbage collector call with setting the resampler to None.
With the same settings as "Decode audio once":

1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128

If you run decode_audio multiple times (Baseline):

Baseline
1278.783488
1279.455232
1279.496192
1281.31072
1281.31072
1281.425408
1281.437696
1281.437696
1281.437696
1281.437696
1281.437696
1281.441792
1281.441792
1281.441792
1281.441792
1281.441792
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888

I tried the same with SYSTRAN master, with multiple decode_audio calls:

Baseline
762.056704
763.51488
763.51488
763.650048
763.65824
764.60032
764.604416
764.624896
764.694528
764.694528

@trungkienbkhn Can you check SYSTRAN master as well from your end?

Contributor

@MahmoudAshraf97 MahmoudAshraf97 left a comment

This removes the need for an extra dependency without losing any functionality.

from inspect import signature
from typing import BinaryIO, Iterable, List, NamedTuple, Optional, Tuple, Union

import ctranslate2
import jsons

Suggested change
import jsons

**preprocess_params,
"enable_ta_fe": enable_ta_fe,
}
options_dict = jsons.dump(options)

Suggested change
options_dict = jsons.dump(options)
options_dict = options._asdict()
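
For context on this suggestion: NamedTuple instances already provide _asdict() in the standard library, so the jsons dependency is not needed for this conversion. A minimal, self-contained illustration (the TranscriptionOptions fields here are hypothetical and simplified):

from typing import NamedTuple, Optional

class TranscriptionOptions(NamedTuple):  #hypothetical, simplified
    beam_size: int
    language: Optional[str]

options = TranscriptionOptions(beam_size=5, language="en")
print(options._asdict())  #-> {'beam_size': 5, 'language': 'en'} on Python 3.8+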

pyannote-audio>=3.1.1
torch>=2.1.1
torchaudio>=2.1.2
jsons>=1.6.3

Suggested change
jsons>=1.6.3

@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

I switched to an H100 GPU with the large-v3 model and checked out the mobiusml/master branch; below is my result for 100 runs:

After changing
2306.183168
2314.395648
2322.096128
2306.6624
2314.36288
2322.333696
2330.034176
2338.004992
2306.711552
2314.412032
2322.382848
2330.083328
2338.054144
2306.748416
2314.448896
2322.419712
2330.120192
2338.091008
2345.734144
2346.840064
2346.840064
2346.840064
2346.840064
2346.840064
2347.106304
2361.294848
2369.265664
2376.966144
2384.93696
2392.63744
2400.608256
2408.308736
2416.279552
2424.180736
2431.881216
2439.852032
2447.552512
2455.252992
2347.106304
2347.106304
2347.106304
2347.106304
2347.106304
2361.561088
2369.261568
2377.232384
2384.932864
2392.90368
2400.60416
2408.574976
2416.275456
2424.17664
2431.87712
2439.847936
2447.548416
2455.519232
2455.67488
2455.67488
2455.67488
2455.67488
2455.67488
2455.94112
2455.94112
2455.94112
2455.94112
2455.94112
2456.211456
2456.211456
2456.211456
2456.211456
2456.481792
2456.481792
2456.481792
2456.481792
2463.100928
2462.26944
2462.26944
2462.26944
2462.26944
2462.53568
2462.53568
2462.53568
2462.53568
2462.806016
2462.806016
2462.806016
2462.806016
2462.806016
2463.076352
2463.076352
2463.076352
2463.076352
2463.346688
2463.514624
2463.514624
2463.514624
2463.514624
2463.514624
2463.780864
2463.780864

Decode audio once: 2371.44064

My code logic:

import psutil
import gc
import sys
import faster_whisper

model = faster_whisper.WhisperModel("large-v3", device="cuda")
audio_path = "tests/data/jfk.flac"
process = psutil.Process()


def monitor_memory(audio, n=100):
    for _ in range(n):
        segments, _ = model.transcribe(audio)
        text = "".join(segment.text for segment in segments)
        print(process.memory_info().rss / 1000000)

    print("")
    gc.collect()


print("After changing")
monitor_memory(audio_path)

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

@trungkienbkhn Okay, thanks. Could you also report the results with the SYSTRAN/master branch as well (same code)?

@trungkienbkhn
Collaborator

trungkienbkhn commented Jun 19, 2024

@trungkienbkhn Okay, thanks. Could you also report the results with the SYSTRAN/master branch as well (same code)?

Okay.
Same device and same code for SYSTRAN/master branch:

Baseline
1848.13568
1848.373248
1848.389632
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192
1848.40192

If I modify the code logic (replace gc.collect() with resampler = None) in the SYSTRAN/master branch:

Modify in Systran/master
1513.590784
1527.56224
1527.734272
1530.974208
1543.811072
1556.709376
1573.085184
1541.271552
1543.430144
1557.74976
1569.97632
1586.507776
1531.318272
1542.402048
1559.01952
1571.045376
1584.283648
1546.40384
1546.40384
1558.155264
1571.262464
1584.300032
1597.407232
1610.514432
1623.621632
1636.88448
1650.368512
1663.40608
1675.362304
1688.670208
1703.870464
1716.977664
1730.015232
1743.433728
1756.295168
1755.324416
1755.324416
1755.324416
1755.324416
1755.324416
1755.590656
1755.590656
1755.590656
1755.590656
1755.590656
1755.860992
1755.860992
1755.860992
1755.860992
1756.131328
1756.131328
1756.131328
1759.576064
1774.096384
1773.121536
1773.121536
1773.121536
1773.121536
1773.387776
1773.387776
1773.387776
1773.387776
1773.387776
1773.658112
1773.658112
1773.658112
1773.658112
1773.928448
1773.928448
1773.928448
1773.928448
1774.198784
1773.182976
1773.182976
1773.182976
1773.182976
1773.182976
1773.449216
1773.449216
1773.449216
1773.449216
1773.719552
1773.719552
1773.719552
1773.719552
1773.719552
1773.989888
1773.989888
1773.989888
1776.422912
1788.592128
1787.670528
1787.670528
1787.670528
1787.670528
1787.936768
1787.936768
1787.936768
1787.936768
1788.207104

@Jiltseb
Author

Jiltseb commented Jun 19, 2024

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

@hargunmujral

hargunmujral commented Jun 19, 2024

Could you include the added pyannote_vad_model.bin in the MANIFEST.in file? Otherwise the file isn't bundled in the wheel and is instead downloaded from HF. mobiusml#17

@trungkienbkhn
Collaborator

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

I modified the logic as below:

resampler = None
del resampler
# gc.collect()

Update MANIFEST.in to include pyannote asset
@hargunmujral

hargunmujral commented Jun 20, 2024

I also noticed that this PR broke the prepend_punctuation and append_punctuation options in word-level timestamping. Can this be fixed?

Additionally, I occasionally see the following error when doing transcription: 'NoneType' object has no attribute 'split_to_word_tokens'. Do you know what that could be related to? It seems to come from the tokenizer not being initialized, but I'm not sure why.

@Jiltseb
Author

Jiltseb commented Jun 20, 2024

The options prepend_punctuation and append_punctuation are not yet added in BatchedInference. I will check that.

I could not reproduce the error you mentioned. split_to_word_tokens cannot be called on a None tokenizer, but we are setting it in the code. Can you check the result of

self.tokenizer = Tokenizer(
    self.model.hf_tokenizer,
    self.model.model.is_multilingual,
    task=task,
    language=language,
)

in the get_language_and_tokenizer function when this happens?

@amdrozdov

Hello, I reproduced the same issue with 'NoneType' object has no attribute 'split_to_word_tokens'. It happens if I set num_workers > 1 (for the transcribe function). Does it happen because of multi-threading?

@MahmoudAshraf97
Contributor

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

Since torchaudio is already included in the requirements and has a resampling algorithm that supports GPU without needing ffmpeg or any external libraries, I suggest we use it and remove the PyAV resampling.

@Jiltseb
Author

Jiltseb commented Jun 22, 2024

I assume you first set resampler = None and then del resampler, rather than replacing the gc.collect() line with resampler = None. Can you confirm? If so, let's hope PyAV fixes it (PyAV-Org/PyAV#1429).

Since torchaudio is already included in the requirements and has a resampling algorithm that supports GPU without needing ffmpeg or any external libraries, I suggest we use it and remove the PyAV resampling.

Good idea. Can you check and confirm whether the loaded audio is exactly the same (the torchaudio resampler can be combined with PyAV decoding)?
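
A hedged sketch of one way to run that check: decode with PyAV via faster_whisper.decode_audio, resample with torchaudio, and compare against the PyAV-resampled output. The 44100 Hz native rate and the file name are assumptions; substitute the real values.

import numpy as np
import torch
import torchaudio
from faster_whisper.audio import decode_audio

native_sr = 44100  #assumption: the file's native sampling rate
native = decode_audio("audio.mp3", sampling_rate=native_sr)  #decode at the (assumed) native rate

#resample with torchaudio
ta_16k = torchaudio.functional.resample(
    torch.from_numpy(native), orig_freq=native_sr, new_freq=16000
).numpy()

#reference: PyAV decode + resample in one step
pyav_16k = decode_audio("audio.mp3", sampling_rate=16000)

n = min(len(ta_16k), len(pyav_16k))
print("max abs difference:", np.abs(ta_16k[:n] - pyav_16k[:n]).max())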

@MahmoudAshraf97
Contributor

I implemented it and other changes here
Everyone's reviews and comments are appreciated

@trungkienbkhn
Collaborator

I implemented it and other changes here Everyone's reviews and comments are appreciated

That is a good idea to avoid the memory leak, but when I tested your changes, I found that the WER increased significantly.

Evaluating...: 499it [02:51,  2.91it/s]
WER: 19.771

Compare with the original FW (3.097) and the original multi-batch FW (1.773) from here.

@MahmoudAshraf97
Contributor

MahmoudAshraf97 commented Jun 24, 2024

I implemented it and other changes here Everyone's reviews and comments are appreciated

That is a good idea to avoid memory leak, but when I tested your changes, I found that the WER increased significantly.

Evaluating...: 499it [02:51,  2.91it/s]
WER: 19.771

Compare with the original FW (3.097) and the original multi-batch FW (1.773) from here.

It was a bug in pad_or_trim; it should be fixed now, and the WER dropped by another 0.85% on non-batched inference.

Evaluating...: 499it [04:59,  1.67it/s]
WER: 2.242
