
[Feature] Support model vocab size being less than tokenizer #237

Merged: 2 commits into mlc-ai:main on Mar 12, 2025

Conversation

Ubospica
Copy link
Collaborator

@Ubospica Ubospica commented Mar 12, 2025

Some multimodal models have added tokens, such as <|image|>, that are not included in lm_head and are never generated by the model. In this case, the model's vocab size (i.e. the dimension of lm_head) is smaller than the tokenizer's vocab size. This PR adds support for this case.

Note that after this PR, the vocab_size parameter of TokenizerInfo should always be the model's vocab size (i.e. the size of lm_head), not the tokenizer's.
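For illustration, here is a minimal sketch of how a caller might pass the model's vocab size after this change. It assumes xgrammar's `TokenizerInfo.from_huggingface` helper accepts a `vocab_size` keyword; the model id is a placeholder, and the exact signature may differ:

```python
from transformers import AutoConfig, AutoTokenizer
import xgrammar as xgr

# Placeholder: any multimodal model whose tokenizer has added special tokens.
model_id = "llava-hf/llava-1.5-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# The tokenizer may carry added tokens (e.g. an image placeholder) that
# lm_head never produces, so len(tokenizer) can exceed config.vocab_size.
tokenizer_info = xgr.TokenizerInfo.from_huggingface(
    tokenizer,
    vocab_size=config.vocab_size,  # the model's vocab size (lm_head dimension)
)
```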

This PR also refactors the TokenizerInfo class to make the logic clearer. The VocabType enum is now serialized as an integer rather than a string.
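To illustrate the serialization change, a hedged before/after sketch of the dumped metadata; the field layout, the `BYTE_FALLBACK` variant name, and its integer value are assumptions for illustration, not confirmed by this PR:

```python
# Hypothetical serialized TokenizerInfo metadata, before vs. after this PR.
# Before: the VocabType enum was written as a string.
before = '{"vocab_type": "BYTE_FALLBACK", "vocab_size": 128256}'
# After: the same enum is written as its integer value.
after = '{"vocab_type": 1, "vocab_size": 128256}'
```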

@Ubospica merged commit 70c959f into mlc-ai:main on Mar 12, 2025
8 checks passed