
Whisper without translation #652

Open
Plemeur opened this issue Oct 4, 2024 · 3 comments

Comments


Plemeur commented Oct 4, 2024

Hello,

I have been trying to use the Whisper Triton server to transcribe English and Japanese, but when I set multiple languages in the text prefix (`<|en|><|ja|>`), it always translates into the second language.

I have seen other people reporting the same issue in other repos related to large-v3, with various tips and tricks to make it "work".

Is this a limitation of large-v3? Has anyone gotten good results using this Triton server on multilingual speech?

@yuekaizhang (Collaborator)

@Plemeur Hi, you need to detect the language first, then set the text prefix to the detected language. You can't do this by setting the prompt alone.
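
For illustration, here is a minimal sketch of that two-pass idea using the reference openai/whisper package (not the Triton client): run language detection on a mel spectrogram first, then build the text prefix from the result. The audio path is a placeholder, and how you pass the prefix to your server depends on your deployment.

```python
import whisper

model = whisper.load_model("large-v3")

# "sample.wav" is a placeholder path.
audio = whisper.load_audio("sample.wav")
audio = whisper.pad_or_trim(audio)  # detection looks at the first 30 s
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# First pass: language identification only.
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)  # e.g. "en" or "ja"

# Second pass: put the detected code into the text prefix sent to the server.
text_prefix = f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"
print(text_prefix)
```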

@hjaved202

I was hoping to piggyback on this thread in case you, @Plemeur, were able to find a workaround for your stated use case.

I am also having issues using Whisper Triton Server in multilingual contexts. Specifically, my use case is transcribing English speech to English text, but Arabic speech is occasionally expected, and we would want that translated and transcribed to English text as well.

Having played around with the text prompt, I was able to get it to work for either one of the two use cases, but not both. Since I do not know in advance which scenario will occur, the text prompt cannot be preset.

Strangely, using Hugging Face Whisper, I found that setting it to transcribe to English was sufficient for it to translate and transcribe speech of any language into English. In contrast, Whisper Triton returns output such as '[Arabic]' or '[Speaking foreign language]'.

@yuekaizhang could you please expand on your comment about 'detect language first'? Is there a way to detect this on the fly?

@yuekaizhang (Collaborator)

@hjaved202
The Whisper TRT-LLM solution only provides the forward pass of the Whisper encoder and decoder, plus beam search. During decoding, users need to set the prompt themselves; it is not an end-to-end solution. So when I see different results between it and Hugging Face's Whisper, I generally expect that the prompts fed to the two models are different. Since Hugging Face wraps the model in multiple layers of code, you need to print from their source code to identify the prompt they are actually using.

In general, for a sentence that mixes English and Arabic, if you want the output to be entirely in English, you can try using `<|startoftranscript|><|en|><|transcribe|><|notimestamps|>` or `<|startoftranscript|><|en|><|translate|><|notimestamps|>`. If you want the output to include both languages, you can try `<|startoftranscript|><|en|><|ar|><|transcribe|><|notimestamps|>`.
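
If your decoder needs token ids for these prefixes rather than raw strings, a sketch like the following shows how the special tokens resolve with openai/whisper's tokenizer (the `num_languages=100` argument matching large-v3 is available in recent whisper versions; what your server actually accepts is an assumption):

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer; num_languages=100 matches large-v3's vocabulary.
tok = get_tokenizer(multilingual=True, num_languages=100)

for prefix in (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
    "<|startoftranscript|><|en|><|translate|><|notimestamps|>",
    "<|startoftranscript|><|en|><|ar|><|transcribe|><|notimestamps|>",
):
    # Encode via the underlying tiktoken encoding so special tokens are allowed.
    ids = tok.encoding.encode(prefix, allowed_special="all")
    print(prefix, "->", ids)
```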

The detect_language feature refers to first running the Whisper model to obtain the language code, and then incorporating it into the prompt for a second pass. For more details, you can check the official Whisper repository.
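
As a side note, the reference openai/whisper `transcribe()` API already does this two-pass detection automatically when no language is specified; a minimal sketch (the audio path is a placeholder):

```python
import whisper

model = whisper.load_model("large-v3")

# With language=None, transcribe() runs language detection on the first
# 30 seconds of audio, then decodes with the detected language token.
result = model.transcribe("mixed_speech.wav", task="transcribe", language=None)
print(result["language"], result["text"])
```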
