Whisper Tokenizer support #7353

Open
MithrilMan opened this issue Dec 23, 2024 · 0 comments
Labels
enhancement New feature or request untriaged New issue has not been triaged

Comments

@MithrilMan

Is your feature request related to a problem? Please describe.
Whisper tokenizer support is needed.

Describe the solution you'd like
It would be nice to have support for the Whisper tokenizer.

Describe alternatives you've considered
I'm new to tokenizers, so I'm not sure whether what I'm doing right now is correct. I'm trying to use a BpeTokenizer, passing it the vocab and merges files plus the special tokens. This is not straightforward: for example, I'm reading https://huggingface.co/onnx-community/whisper-large-v3-turbo/blob/main/special_tokens_map.json, and I also need to read the vocab file to find the max ID, so that I know where to start when mapping each special token to an ID number.
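
To make the workaround concrete, here is a minimal Python sketch of the ID-assignment step described above. It is an illustration only, not the ML.NET API: the helper names `collect_special_tokens` and `assign_special_token_ids` are hypothetical, the tiny in-memory dictionaries stand in for the real vocab and special_tokens_map.json files, and it assumes that each entry in the map is either a plain string or an object with a `content` field. The IDs the model actually uses may be assigned differently, so the result should be checked against the model's own files.

```python
def collect_special_tokens(special_tokens_map):
    """Flatten a special_tokens_map.json-style dict into an ordered list of
    unique token strings (hypothetical helper, assumed file layout)."""
    tokens = []
    for value in special_tokens_map.values():
        # A value may be a single entry or a list of entries.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # An entry may be a plain string or a dict with a "content" field.
            content = entry["content"] if isinstance(entry, dict) else entry
            if content not in tokens:
                tokens.append(content)
    return tokens


def assign_special_token_ids(vocab, special_tokens):
    """Give each special token an ID, starting right after the max vocab ID.
    Tokens that are already in the vocab keep their existing ID."""
    next_id = max(vocab.values()) + 1
    mapping = {}
    for token in special_tokens:
        if token in vocab:
            mapping[token] = vocab[token]
        else:
            mapping[token] = next_id
            next_id += 1
    return mapping


# Toy stand-ins for the real vocab and special_tokens_map.json contents.
vocab = {"a": 0, "b": 1, "ab": 2, "<|endoftext|>": 3}
special_tokens_map = {
    "additional_special_tokens": ["<|startoftranscript|>", "<|en|>"],
    "bos_token": {"content": "<|endoftext|>"},
    "eos_token": "<|endoftext|>",
}

special_ids = assign_special_token_ids(
    vocab, collect_special_tokens(special_tokens_map)
)
print(special_ids)
```

Note that this numbering depends entirely on the order in which the special tokens are listed, which is one reason the approach feels fragile.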

The linked repository also has a tokenizer.json that I suppose already contains everything, without the need to pass vocab and merges separately, but I don't see a way to use it out of the box (I haven't found a constructor that accepts a tokenizer.json file).
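
For reference, here is a rough Python sketch of what pulling the pieces out of a tokenizer.json could look like. It assumes the usual Hugging Face layout, with `model.vocab`, `model.merges`, and a top-level `added_tokens` list whose entries carry `id` and `content`; the function name `load_tokenizer_json` is hypothetical, and the inline JSON is a toy stand-in rather than the real Whisper file. Merges are handled both as `"a b"` strings and as two-element lists, since both forms appear to exist depending on the file version.

```python
import json


def load_tokenizer_json(text):
    """Extract vocab, merges and added (special) tokens from the text of a
    tokenizer.json file (hypothetical helper, assumed file layout)."""
    data = json.loads(text)
    model = data["model"]
    vocab = model["vocab"]  # token string -> ID
    merges = [
        # Older files store "left right" strings, newer ones ["left", "right"].
        tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m)
        for m in model["merges"]
    ]
    # Added tokens carry their own explicit IDs, so nothing has to be derived.
    added = {t["content"]: t["id"] for t in data.get("added_tokens", [])}
    return vocab, merges, added


# Toy stand-in for the contents of a real tokenizer.json.
sample = """
{
  "added_tokens": [
    {"id": 3, "content": "<|endoftext|>", "special": true},
    {"id": 4, "content": "<|startoftranscript|>", "special": true}
  ],
  "model": {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, "ab": 2},
    "merges": ["a b"]
  }
}
"""

vocab, merges, added = load_tokenizer_json(sample)
print(vocab, merges, added)
```

If this layout holds for the Whisper file, the special token IDs could be read directly instead of being computed from the max vocab ID.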

@MithrilMan MithrilMan added the enhancement New feature or request label Dec 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Dec 23, 2024