Whisper Tokenizer support #7353

MithrilMan · 2024-12-23T20:34:03Z

Is your feature request related to a problem? Please describe.
Whisper tokenizer support needed

Describe the solution you'd like
Would be nice to have support for the Whisper tokenizer.

Describe alternatives you've considered
I'm new to tokenizers, so I'm not sure whether what I'm doing is correct. Right now I'm trying to use a BpeTokenizer, passing it the vocab and merges files plus the special tokens. This is not straightforward: for example, I'm reading https://huggingface.co/onnx-community/whisper-large-v3-turbo/blob/main/special_tokens_map.json, and I also need to read the vocab file to get the max id, so that I know where to start when mapping each special token to an id number.

The linked repository also has a tokenizer.json that I suppose already contains everything, without the need to pass vocab and merges separately, but I don't see a way to use it out of the box (I haven't found a constructor that accepts a tokenizer.json file).
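To illustrate what "contains everything" could mean in practice, here is a minimal, language-neutral sketch in Python (standard library only) that splits a parsed tokenizer.json into a vocab, a merge list, and a special-token map. It assumes the usual Hugging Face layout, with `model.vocab`, `model.merges`, and an `added_tokens` list whose entries carry explicit `id` values. The function name `load_tokenizer_json` and the tiny inline sample are hypothetical, and the ids in the sample are made up, not Whisper's real ones. It is not the library's API, only a picture of the data that a tokenizer.json loader would need to extract.

```python
import json


def load_tokenizer_json(data):
    """Split a parsed tokenizer.json dict into (vocab, merges, special_tokens).

    Assumes the Hugging Face layout: data["model"]["vocab"],
    data["model"]["merges"], and data["added_tokens"].
    """
    model = data["model"]
    vocab = dict(model["vocab"])  # token string -> id

    # Merges are stored either as "left right" strings or as [left, right]
    # pairs, depending on which version of the format wrote the file.
    merges = [
        tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m)
        for m in model.get("merges", [])
    ]

    # added_tokens entries carry their own ids, so there is no need to
    # compute "max vocab id + 1" by hand.
    special_tokens = {
        t["content"]: t["id"]
        for t in data.get("added_tokens", [])
        if t.get("special")
    }
    return vocab, merges, special_tokens


# Toy stand-in for a real file; with a real one you would instead call
# json.load() on the opened tokenizer.json.
sample = json.loads("""
{
  "added_tokens": [
    {"id": 4, "content": "<|endoftext|>", "special": true},
    {"id": 5, "content": "<|startoftranscript|>", "special": true}
  ],
  "model": {
    "type": "BPE",
    "vocab": {"h": 0, "i": 1, "hi": 2, "!": 3},
    "merges": ["h i"]
  }
}
""")

vocab, merges, special_tokens = load_tokenizer_json(sample)
print(special_tokens)
```

If the real file follows this layout, the special-token map read this way could replace the manual "max id" bookkeeping described above, since each special token's id comes straight from the file.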

MithrilMan added the enhancement label (New feature or request) on Dec 23, 2024

dotnet-policy-service bot added the untriaged label (New issue has not been triaged) on Dec 23, 2024
