[Feature] Support model vocab size being less than tokenizer #237
Some multimodal models have added tokens, such as `<|image|>`, that are not represented in `lm_head` and are never generated by the model. In this case, the model's vocab size (i.e., the output dimension of `lm_head`) is smaller than the tokenizer's vocab size. This PR adds support for this case.
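As a minimal sketch of the mismatch using the Hugging Face `transformers` API (the checkpoint name is a placeholder, not a model this PR references):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "some/multimodal-model" is a hypothetical checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("some/multimodal-model")
model = AutoModelForCausalLM.from_pretrained("some/multimodal-model")

# len(tokenizer) counts every token the tokenizer knows, including
# added special tokens such as <|image|>.
tokenizer_vocab_size = len(tokenizer)

# The model's vocab size is the output dimension of lm_head; added
# tokens like <|image|> may not appear here.
model_vocab_size = model.get_output_embeddings().weight.shape[0]

# For the models targeted by this PR:
#   model_vocab_size < tokenizer_vocab_size
```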
Note that after this PR, the `vocab_size` parameter of `TokenizerInfo` should always be the model's vocab size (i.e., the size of `lm_head`). This PR also refactors the `TokenizerInfo` class to make the logic clearer. The enum `VocabType` is now serialized as an integer rather than a string.
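A sketch of how this might look in the Python API, assuming `TokenizerInfo.from_huggingface` accepts the `vocab_size` override described above (the exact signature is an assumption, not confirmed by this description):

```python
import xgrammar as xgr

# Sketch only: pass the model's vocab size (the lm_head dimension)
# explicitly, since it may be smaller than len(tokenizer).
tokenizer_info = xgr.TokenizerInfo.from_huggingface(
    tokenizer,
    vocab_size=model_vocab_size,
)
```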