Even though Whisper transcribes in chunks of 30 s, are the vector embeddings and attention available for the further chunks? #2326
Unanswered
agandhinit
asked this question in Q&A
Replies: 1 comment 2 replies
-
The model input includes the 30-second chunk of audio, but it also includes the prompt, and context is carried via the prompt. If you are going to do the chunking yourself, I suggest you do what Whisper does: take the output of the previous chunk(s) and use it as the prompt for the next chunk to provide context. If you chunk the same way Whisper does, there will be no difference in quality.
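A minimal sketch of that chunk-plus-prompt loop. The helper names here are my own, and `transcribe` is a stand-in for the real model call:

```python
# Sketch of Whisper-style chunked transcription where context is carried
# forward via the prompt rather than via cross-chunk attention.
SAMPLE_RATE = 16000               # Whisper operates on 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30-second windows

def transcribe_with_context(audio, transcribe):
    """Split `audio` (a 1-D sample sequence) into 30 s chunks and feed
    each chunk's transcript in as the prompt for the next chunk."""
    prompt = ""
    pieces = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        text = transcribe(chunk, prompt)  # model call; prompt = prior text
        pieces.append(text)
        prompt = text                     # carry context to the next chunk
    return " ".join(pieces)
```

With the actual openai-whisper Python API, the `transcribe` callable could be backed by `model.transcribe(...)`, whose `initial_prompt` argument serves exactly this role.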
-
I don't understand this concept fully, hence asking for clarification: even though Whisper transcribes in chunks of 30 s, are the vector embeddings and attention available to the further chunks?
Take an example:
Chunk 1: "The bank manager told me to sign the papers at the branch. Later, when I returned..."
Chunk 2: "...to the branch, I noticed that the teller was gone."
Chunk 1 - Clearly sets the context: the embedding for "branch" is grounded by the earlier mention of "bank".
Chunk 2 - May not know whether "branch" refers to a tree, a bank, or a river unless attention from the earlier chunk is still active here.
The reason I ask: will the quality differ between transcribing 30 s chunks of audio (chunked externally, say for a stream) and passing the full audio so that Whisper itself chunks it into 30 s windows? The first case, as per my understanding, will reset embeddings and attention at every chunk boundary. Even if I pass audio with some overlap, say 5 s, it would only carry over the common part, not context from a chunk 5 minutes earlier.
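To make the overlap point concrete, here is a small sketch (the helper name is my own) of externally computed 30 s windows with 5 s overlap. Adjacent windows share only their 5 s of overlapping audio, so nothing from minutes earlier reaches a later window:

```python
def chunk_bounds(total_seconds, window=30, overlap=5):
    """Start/end times (in seconds) of overlapping windows over a stream.

    Consecutive windows share only `overlap` seconds of audio, so any
    context older than that is lost to the later window.
    """
    step = window - overlap  # how far each window advances
    bounds = []
    start = 0
    while start < total_seconds:
        bounds.append((start, min(start + window, total_seconds)))
        if start + window >= total_seconds:
            break  # last window already covers the end of the stream
        start += step
    return bounds

# e.g. a 60 s stream -> windows (0, 30), (25, 55), (50, 60):
# the (25, 55) window sees only 5 s of the first window's audio.
```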