Break audio into sentences instead of in-between sentences #2510

siddhsql · 2025-01-22T00:35:36Z

Hello, I tried Whisper and compared it to WhisperX. One thing I liked about WhisperX is that it breaks the input audio into sentences, whereas Whisper seems to transcribe in blocks of X seconds, so the segments it generates are not sentences. Is there any setting I can use to get Whisper to output segments similar to WhisperX? WhisperX is not without its issues: I found it fails to transcribe portions of the audio (seemingly due to VAD), and the quality was not as good as Whisper's even with the same model.

Advait251206 · 2026-06-24T18:33:50Z

Short answer: not directly.

OpenAI Whisper's segments are primarily based on the model's timestamp predictions and decoding process, not grammatical sentence boundaries. As a result, depending on where the model predicts timestamps, a segment may contain half a sentence, contain multiple sentences, start mid-sentence, or end mid-sentence.
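To see this on your own output, a small check like the one below can flag which segments start or end mid-sentence. It's only a sketch: it assumes the usual `result["segments"]` list of dicts with a `"text"` key, treats `.`, `?` and `!` as the only sentence endings, and `boundary_report` is a made-up helper name, not part of Whisper.

```python
def boundary_report(segments):
    """Label each segment by whether it starts/ends on a sentence boundary.

    Heuristic: a segment "ends a sentence" if its text ends with . ? or !
    """
    report = []
    prev_ended_sentence = True  # nothing precedes the first segment
    for seg in segments:
        text = seg["text"].strip()
        ends_sentence = text.endswith((".", "?", "!"))
        report.append({
            "text": text,
            "starts_mid_sentence": not prev_ended_sentence,
            "ends_mid_sentence": not ends_sentence,
        })
        prev_ended_sentence = ends_sentence
    return report


if __name__ == "__main__":
    # Hand-written sample segments (not real Whisper output).
    sample = [
        {"text": " Hello everyone, today we"},
        {"text": " are discussing the project. The"},
        {"text": " deadline is next week."},
    ]
    for row in boundary_report(sample):
        print(row)
```

Run against real output, this gives a quick sense of how often the timing segments cut through sentences.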

Why WhisperX looks more sentence-like

WhisperX typically uses additional processing:

Audio
↓
VAD
↓
Whisper
↓
Forced alignment
↓
Segment refinement

Because of the VAD and alignment stages (the alignment step also re-splits each segment's text at sentence boundaries), the resulting chunks resemble complete sentences more closely than raw Whisper segments.

However, as you've noticed, aggressive VAD can sometimes:

drop low-volume speech
miss words
skip short utterances
reduce recall

which is why many users still prefer Whisper's transcription quality.

Can Whisper be configured to output sentences?

There is no built-in option such as --segment_by_sentence or sentence_segmentation=True in OpenAI Whisper. The segment boundaries are determined internally by timestamp token predictions.

Best approach: use word timestamps + sentence regrouping

If you transcribe with word_timestamps=True, you can post-process the output into sentence-like segments.

Example:

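import whisper

model = whisper.load_model("base")  # example model size; use whichever you prefer
audio = "audio.mp3"  # placeholder path to your audio file
result = model.transcribe(
    audio,
    word_timestamps=True  # adds a "words" list to every segment
)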

Then rebuild segments using:

punctuation (., ?, !)
pause duration
capitalization rules

This usually produces much cleaner sentence boundaries.

Example:

Raw Whisper

Segment 1:
"Hello everyone, today we"

Segment 2:
"are discussing the project. The"

Segment 3:
"deadline is next week."

Post-processed

Sentence 1:
"Hello everyone, today we are discussing the project."

Sentence 2:
"The deadline is next week."
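The regrouping step could look roughly like this. It's a minimal sketch, assuming the result dict from `transcribe(..., word_timestamps=True)`, where each segment carries a `"words"` list of dicts with `"word"`, `"start"` and `"end"`. The function name `regroup_into_sentences` and the 1.0 s `max_pause` threshold are illustrative choices, not part of Whisper, and capitalization rules are left out.

```python
SENTENCE_END = (".", "?", "!")


def regroup_into_sentences(result, max_pause=1.0):
    """Regroup word-level timestamps into sentence-like segments.

    A sentence is closed when a word ends with . ? or !, or when the
    silence before the next word exceeds max_pause seconds.
    """
    # Flatten the per-segment word lists into one stream of words.
    words = [w for seg in result["segments"] for w in seg.get("words", [])]

    sentences, current = [], []
    for i, w in enumerate(words):
        current.append(w)
        is_last = i == len(words) - 1
        long_pause = (not is_last
                      and words[i + 1]["start"] - w["end"] > max_pause)
        if w["word"].strip().endswith(SENTENCE_END) or long_pause or is_last:
            sentences.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                # Whisper's word strings carry their own leading space,
                # so plain concatenation restores the spacing.
                "text": "".join(x["word"] for x in current).strip(),
            })
            current = []
    return sentences
```

Abbreviations such as "Dr." or "e.g." will trigger false splits with a rule this simple, so you may want to special-case them or tune `max_pause` for your material.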

Alternative: NLP sentence splitting

After transcription:

text = result["text"]

use a sentence tokenizer such as spaCy.

Example:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)

This gives sentence boundaries, though it won't provide accurate timestamps by itself.

Hybrid approach (recommended)

For subtitle generation and transcript processing:

Whisper
↓
word_timestamps=True
↓
sentence detection
↓
group words into sentences
↓
assign sentence start/end times

This often produces results very similar to WhisperX while retaining Whisper's transcription quality.
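One way to wire that pipeline together is to run the sentence detector on text rebuilt from the word list, then map each sentence's character span back to the words inside it. The sketch below assumes the same `"segments"` / `"words"` layout that `word_timestamps=True` produces. `sentences_with_times` and `regex_sentence_spans` are made-up names, and the regex splitter is only a dependency-free stand-in for a real sentence tokenizer.

```python
import re


def regex_sentence_spans(text):
    """Stand-in sentence detector: returns (start_char, end_char) spans."""
    pattern = re.compile(r"\S.*?(?:[.?!]+(?=\s|$)|$)", flags=re.S)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def sentences_with_times(result, sentence_spans=regex_sentence_spans):
    """Attach start/end times to sentences found by sentence_spans."""
    words = [w for seg in result["segments"] for w in seg.get("words", [])]

    # Rebuild the text from the words so character offsets line up exactly,
    # recording where each word's first non-space character lands.
    text, offsets = "", []
    for w in words:
        token = w["word"]
        offsets.append(len(text) + len(token) - len(token.lstrip()))
        text += token

    sentences = []
    for s_start, s_end in sentence_spans(text):
        members = [w for w, off in zip(words, offsets)
                   if s_start <= off < s_end]
        if members:  # skip spans that contain no word
            sentences.append({
                "start": members[0]["start"],
                "end": members[-1]["end"],
                "text": text[s_start:s_end],
            })
    return sentences
```

To use spaCy instead of the regex, you could pass something like `sentence_spans=lambda t: [(s.start_char, s.end_char) for s in nlp(t).sents]`, with `nlp` loaded as in the example earlier in this answer.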

Why Whisper doesn't do this natively

Whisper is trained to predict text plus timestamp tokens, not linguistic sentence boundaries.

A pause in speech does not necessarily mean the sentence has ended, and a sentence can continue across multiple timestamp regions.

Therefore, Whisper's segments should be viewed as timing segments rather than grammar/sentence segments.

Recommendation

If you like Whisper's transcription quality but prefer WhisperX-style sentence chunks, use:

word_timestamps=True

and then regroup words into sentences based on punctuation and pause lengths. This is generally the most reliable way to obtain sentence-level segments without introducing the VAD-related omissions that some users encounter with WhisperX.
