Replies: 1 comment
Short answer: not directly. OpenAI Whisper's segment boundaries depend on where the model predicts timestamps.

Why WhisperX looks more sentence-like

WhisperX typically adds processing stages, notably VAD and alignment. Because of these stages, the resulting chunks often happen to resemble complete sentences more closely than raw Whisper segments. However, as you've noticed, aggressive VAD can sometimes cause portions of the audio to go untranscribed, which is why many users still prefer Whisper's transcription quality.

Can Whisper be configured to output sentences?

There is no built-in option such as `--segment_by_sentence` or `sentence_segmentation=True` in OpenAI Whisper. The segment boundaries are determined internally by timestamp token predictions.

Best approach: use word timestamps + sentence regrouping

If you're using `word_timestamps=True`, you can post-process the output into sentence-like segments. Example:

```python
# model is a loaded Whisper model; audio is the input to transcribe
result = model.transcribe(
    audio,
    word_timestamps=True
)
```

Then rebuild segments from the word-level output, using punctuation and pause lengths.
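The regrouping step can be sketched roughly as follows. This is a minimal illustration rather than anything built into Whisper: `flatten_words`, `regroup_into_sentences`, and the `max_pause` threshold of 0.8 seconds are hypothetical names and values chosen for the example. It assumes the result of `transcribe(..., word_timestamps=True)` exposes `result["segments"][i]["words"]` as a list of dicts with `word`, `start`, and `end` keys.

```python
from typing import Dict, List

# Characters treated as sentence terminators (an assumption; extend per language).
SENTENCE_END = (".", "?", "!")


def flatten_words(result: dict) -> List[dict]:
    """Collect the word dicts from every segment into one flat list."""
    return [w for seg in result.get("segments", []) for w in seg.get("words", [])]


def _close(words: List[dict]) -> Dict[str, object]:
    """Build one sentence-like segment from a run of consecutive words."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": "".join(w["word"] for w in words).strip(),
    }


def regroup_into_sentences(words: List[dict], max_pause: float = 0.8) -> List[dict]:
    """Regroup words into segments split on punctuation and long pauses."""
    sentences: List[dict] = []
    current: List[dict] = []
    for w in words:
        # A long silence before this word closes the sentence in progress.
        if current and w["start"] - current[-1]["end"] > max_pause:
            sentences.append(_close(current))
            current = []
        current.append(w)
        # Terminal punctuation on this word also closes the sentence.
        if w["word"].strip().endswith(SENTENCE_END):
            sentences.append(_close(current))
            current = []
    if current:
        sentences.append(_close(current))
    return sentences
```

With a real transcription, `regroup_into_sentences(flatten_words(result))` would return one dict per sentence-like chunk. Splitting on punctuation alone will also break after abbreviations such as "Dr.", so the pause threshold and the terminator set likely need tuning per language and recording.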
This usually produces much cleaner sentence boundaries.

[Example table: Raw Whisper vs. Post-processed]

Alternative: NLP sentence splitting

After transcription, take `text = result["text"]` and run it through a sentence tokenizer such as spaCy.
Example:

```python
import spacy

# text is the full transcript, e.g. result["text"]
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
```

This gives sentence boundaries, though it won't provide accurate timestamps by itself.

Hybrid approach (recommended)

For subtitle generation and transcript processing, combine the two approaches above: Whisper's word timestamps supply the timing, and sentence splitting supplies the boundaries. This often produces results very similar to WhisperX while retaining Whisper's transcription quality.

Why Whisper doesn't do this natively

Whisper is trained to predict text and timestamp tokens, not linguistic sentence boundaries. A pause in speech does not necessarily mean the end of a sentence, and a sentence can continue across multiple timestamp regions. Therefore, Whisper's segments should be viewed as timestamp-driven chunks rather than sentences.

Recommendation

If you like Whisper's transcription quality but prefer WhisperX-style sentence chunks, use `word_timestamps=True` and then regroup words into sentences based on punctuation and pause lengths. This is generally the most reliable way to obtain sentence-level segments without introducing the VAD-related omissions that some users encounter with WhisperX.
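For the tokenizer-based route, the missing timing can be recovered by mapping each sentence back onto Whisper's word timestamps. The sketch below is one possible way to do that, not a Whisper or spaCy feature: `attach_timestamps` is a hypothetical helper, and it assumes the sentences and the word list cover the same text in the same order, differing only in whitespace.

```python
from typing import List


def _strip_ws(s: str) -> str:
    """Remove all whitespace so sentences and words can be compared by length."""
    return "".join(s.split())


def attach_timestamps(sentences: List[str], words: List[dict]) -> List[dict]:
    """Give each sentence the start of its first word and the end of its last.

    Words are consumed in order until their combined non-whitespace length
    covers the sentence's non-whitespace length.
    """
    timed: List[dict] = []
    i = 0
    for sent in sentences:
        target = len(_strip_ws(sent))
        if target == 0:
            continue  # skip empty sentences
        first = i
        consumed = 0
        while i < len(words) and consumed < target:
            consumed += len(_strip_ws(words[i]["word"]))
            i += 1
        if i == first:
            break  # ran out of words before covering every sentence
        timed.append({
            "start": words[first]["start"],
            "end": words[i - 1]["end"],
            "text": sent.strip(),
        })
    return timed
```

The `sentences` argument could be `[s.text for s in doc.sents]` from a spaCy pass over `result["text"]`. If the tokenizer alters the text, or the word list does not match the transcript exactly, the character counts drift and the resulting boundaries should be checked.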
Hello, I tried Whisper and compared it to WhisperX. One thing I liked about WhisperX is that it breaks the input audio into sentences, whereas Whisper seems to transcribe in blocks of X seconds, so the segments it generates are not sentences. Is there any setting I can use to get Whisper to output segments similar to WhisperX's? WhisperX is not without its issues: I found it fails to transcribe portions of the audio (seemingly due to VAD), and its quality was not as good as Whisper's even with the same model.