[Feature] Support model vocab size being less than tokenizer #237
Some multimodal models have added tokens, such as `<|image|>`, that are not represented in `lm_head` and are never generated by the model. In this case, the model's vocab size (i.e., the output dimension of `lm_head`) is smaller than the tokenizer's vocab size. This PR adds support for this case.
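As a minimal sketch of the mismatch using the Hugging Face `transformers` API (the checkpoint name is a placeholder, not a model this PR references):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "some/multimodal-model" is a hypothetical checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("some/multimodal-model")
model = AutoModelForCausalLM.from_pretrained("some/multimodal-model")

# len(tokenizer) counts every token the tokenizer knows, including
# added special tokens such as <|image|>.
tokenizer_vocab_size = len(tokenizer)

# The model's vocab size is the output dimension of lm_head; added
# tokens like <|image|> may not appear here.
model_vocab_size = model.get_output_embeddings().weight.shape[0]

# For the models targeted by this PR:
#   model_vocab_size < tokenizer_vocab_size
```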
Note that after this PR, the `vocab_size` parameter of `TokenizerInfo` should always be the model's vocab size (i.e., the size of `lm_head`). This PR also refactors the `TokenizerInfo` class to make the logic clearer. The enum `VocabType` is now serialized as an integer rather than a string.
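A sketch of how this might look in the Python API, assuming `TokenizerInfo.from_huggingface` accepts the `vocab_size` override described above (the exact signature is an assumption, not confirmed by this description):

```python
import xgrammar as xgr

# Sketch only: pass the model's vocab size (the lm_head dimension)
# explicitly, since it may be smaller than len(tokenizer).
tokenizer_info = xgr.TokenizerInfo.from_huggingface(
    tokenizer,
    vocab_size=model_vocab_size,
)
```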