
Whisper without translation #652

Open
Plemeur opened this issue Oct 4, 2024 · 3 comments

Comments


Plemeur commented Oct 4, 2024

Hello,

I have been trying to use the Whisper Triton server to transcribe English and Japanese, but when I set multiple languages in the text prefix (`<|en|><|ja|>`), it always translates into the second language.

I have seen other people reporting the same issue in other repos related to large-v3, with various tips and tricks to make it "work".

Is this a limitation of large-v3? Has anyone gotten good results using this Triton server on multilingual speech?

@yuekaizhang (Collaborator)

@Plemeur Hi, you need to detect the language first, then set the text prefix to the detected language. You can't do this by setting the prompt alone.
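
For illustration, here is a minimal sketch of that two-pass idea using the reference openai/whisper package (not the Triton client): run language detection on a mel spectrogram first, then build the text prefix from the result. The audio path is a placeholder, and how you pass the prefix to your server depends on your deployment.

```python
import whisper

model = whisper.load_model("large-v3")

# "sample.wav" is a placeholder path.
audio = whisper.load_audio("sample.wav")
audio = whisper.pad_or_trim(audio)  # detection looks at the first 30 s
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# First pass: language identification only.
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)  # e.g. "en" or "ja"

# Second pass: put the detected code into the text prefix sent to the server.
text_prefix = f"<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>"
print(text_prefix)
```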

@hjaved202

I was hoping to piggyback on this thread in case you, @Plemeur, were able to find a workaround for your stated use case.

I am also having issues using Whisper Triton Server in multilingual contexts. Specifically, my use case is transcribing English speech to English text, but Arabic speech is occasionally expected, and we would want that translated and transcribed to English text as well.

Having played around with the text prompt, I was able to get it to work for either one of the two use cases, but not both. Since I do not know in advance which scenario will occur, the text prompt cannot be preset.

Strangely, using Hugging Face Whisper, I found that setting it to transcribe to English was sufficient for it to translate and transcribe speech of any language into English. In contrast, Whisper Triton returns output such as '[Arabic]' or '[Speaking foreign language]'.

@yuekaizhang could you please expand on your comment about 'detect language first'? Is there a way to detect this on the fly?

@yuekaizhang (Collaborator)

@hjaved202
The Whisper TRT-LLM solution only provides the forward pass of the Whisper encoder and decoder, plus beam search. During decoding, users need to set the prompt themselves; it is not an end-to-end solution. So when I see different results between it and Hugging Face's Whisper, I generally expect that the prompts fed to the two models are different. Since Hugging Face wraps the model in multiple layers of code, you need to print from their source code to identify the prompt they are actually using.

In general, for a sentence that mixes English and Arabic, if you want the output to be entirely in English, you can try using `<|startoftranscript|><|en|><|transcribe|><|notimestamps|>` or `<|startoftranscript|><|en|><|translate|><|notimestamps|>`. If you want the output to include both languages, you can try `<|startoftranscript|><|en|><|ar|><|transcribe|><|notimestamps|>`.
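
If your decoder needs token ids for these prefixes rather than raw strings, a sketch like the following shows how the special tokens resolve with openai/whisper's tokenizer (the `num_languages=100` argument matching large-v3 is available in recent whisper versions; what your server actually accepts is an assumption):

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer; num_languages=100 matches large-v3's vocabulary.
tok = get_tokenizer(multilingual=True, num_languages=100)

for prefix in (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
    "<|startoftranscript|><|en|><|translate|><|notimestamps|>",
    "<|startoftranscript|><|en|><|ar|><|transcribe|><|notimestamps|>",
):
    # Encode via the underlying tiktoken encoding so special tokens are allowed.
    ids = tok.encoding.encode(prefix, allowed_special="all")
    print(prefix, "->", ids)
```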

The detect_language feature refers to first running the Whisper model to obtain the language code, and then incorporating it into the prompt for a second pass. For more details, you can check the official Whisper repository.
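
As a side note, the reference openai/whisper `transcribe()` API already does this two-pass detection automatically when no language is specified; a minimal sketch (the audio path is a placeholder):

```python
import whisper

model = whisper.load_model("large-v3")

# With language=None, transcribe() runs language detection on the first
# 30 seconds of audio, then decodes with the detected language token.
result = model.transcribe("mixed_speech.wav", task="transcribe", language=None)
print(result["language"], result["text"])
```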
