Even though Whisper transcribes in chunks of 30 s, are the vector embeddings and attention available for the further chunks? #2326
Unanswered
agandhinit
asked this question in Q&A
Replies: 1 comment 2 replies
-
The model input includes the 30-second chunk of audio, but it also includes the prompt, and context is carried via the prompt. If you are going to do the chunking yourself, I suggest you do what Whisper does: take the output of the previous chunk(s) and use it as the prompt for the next chunk to provide context. If you chunk the same way Whisper does, there will be no difference in quality.
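A minimal sketch of that chunk-plus-prompt loop. The helper names here are my own, and `transcribe` is a stand-in for the real model call:

```python
# Sketch of Whisper-style chunked transcription where context is carried
# forward via the prompt rather than via cross-chunk attention.
SAMPLE_RATE = 16000               # Whisper operates on 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30-second windows

def transcribe_with_context(audio, transcribe):
    """Split `audio` (a 1-D sample sequence) into 30 s chunks and feed
    each chunk's transcript in as the prompt for the next chunk."""
    prompt = ""
    pieces = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        text = transcribe(chunk, prompt)  # model call; prompt = prior text
        pieces.append(text)
        prompt = text                     # carry context to the next chunk
    return " ".join(pieces)
```

With the actual openai-whisper Python API, the `transcribe` callable could be backed by `model.transcribe(...)`, whose `initial_prompt` argument serves exactly this role.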
-
I don't understand this concept fully, hence asking for clarification: even though Whisper transcribes in chunks of 30 s, are the vector embeddings and attention available to the further chunks?
Take an example:
Chunk 1: "The bank manager told me to sign the papers at the branch. Later, when I returned..."
Chunk 2: "...to the branch, I noticed that the teller was gone."
Chunk 1 - Clearly sets the context: the embedding for "branch" is grounded by the earlier mention of "bank".
Chunk 2 - May not know whether "branch" refers to a tree, a bank, or a river unless attention from the earlier chunk is still active here.
The reason I ask: will the quality differ between transcribing 30 s chunks of audio (chunked externally, say for a stream) and passing the full audio so that Whisper itself chunks it into 30 s windows? The first case, as per my understanding, will reset embeddings and attention at every chunk boundary. Even if I pass audio with some overlap, say 5 s, it would only carry over the common part, not context from a chunk 5 minutes earlier.
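To make the overlap point concrete, here is a small sketch (the helper name is my own) of externally computed 30 s windows with 5 s overlap. Adjacent windows share only their 5 s of overlapping audio, so nothing from minutes earlier reaches a later window:

```python
def chunk_bounds(total_seconds, window=30, overlap=5):
    """Start/end times (in seconds) of overlapping windows over a stream.

    Consecutive windows share only `overlap` seconds of audio, so any
    context older than that is lost to the later window.
    """
    step = window - overlap  # how far each window advances
    bounds = []
    start = 0
    while start < total_seconds:
        bounds.append((start, min(start + window, total_seconds)))
        if start + window >= total_seconds:
            break  # last window already covers the end of the stream
        start += step
    return bounds

# e.g. a 60 s stream -> windows (0, 30), (25, 55), (50, 60):
# the (25, 55) window sees only 5 s of the first window's audio.
```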