
Refactor byte strings to tokens #135

Merged
samuki merged 16 commits into main from samuki/refactor-token-bytes
Apr 7, 2026

Conversation

Member

@samuki samuki commented Jan 25, 2026

Refactor: Switch from byte strings to token objects to handle duplicate byte strings. Fixes issue where multiple token IDs can decode to the same byte string. Builds on the new backend that returns token objects instead of bytes: genlm/genlm-backend#57
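
The collision described above can be illustrated with a minimal sketch. The `Token` dataclass, its `token_id` and `byte_string` fields, and the toy vocabulary are hypothetical stand-ins for illustration, not the actual genlm-backend API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token object: an ID paired with its decoded bytes."""
    token_id: int
    byte_string: bytes


# Toy vocabulary in which two distinct IDs decode to the same bytes.
decoded = [b"a", b"b", b"a"]

# Keying by byte string: the later ID silently overwrites the earlier one.
by_bytes = {bs: i for i, bs in enumerate(decoded)}

# Keying by token object: each ID keeps its own entry.
tokens = [Token(i, bs) for i, bs in enumerate(decoded)]
by_token = {tok: tok.token_id for tok in tokens}

print(len(by_bytes))  # IDs 0 and 2 share the key b"a"
print(len(by_token))  # one entry per token ID
```

Because the frozen dataclass compares and hashes on both fields, two tokens with equal bytes but different IDs remain distinct dictionary keys.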

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 99.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| genlm/control/potential/built_in/llm.py | 98.27% | 2 Missing ⚠️ |


@samuki samuki marked this pull request as ready for review January 25, 2026 22:44

@shepardxia shepardxia left a comment

lgtm! Comments are mostly stylistic nitpicking.

pyproject.toml Outdated
- "genlm-backend>=0.1.8",
- "genlm-bytes>=0.1.2",
+ "genlm-backend @ git+https://github.com/genlm/genlm-backend.git@samuki/refactor-token-bytes",
+ "genlm-bytes @ git+https://github.com/genlm/genlm-bytes.git@yahya/refactor",

Marking for change before merging!

# Fallback: if token is plain bytes (not Token), search by byte_string content.
# This supports old code that indexes by bytes; returns the first match.
if Token.is_plain_bytes(token):
    for vocab_token in self.decode:
Contributor

The fallback seems to be very slow (O(n)). Could we use a dictionary to look this up?
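
A minimal sketch of the dictionary idea, under stated assumptions: the `Token` dataclass, the `Vocab` container, and the `lookup_linear` / `lookup_indexed` method names are hypothetical and only mirror the `self.decode` and `byte_string` names visible in the hunk above. It is not necessarily how the change was implemented in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the real Token class."""
    token_id: int
    byte_string: bytes


class Vocab:
    """Hypothetical container; `decode` lists tokens in ID order."""

    def __init__(self, decode):
        self.decode = list(decode)
        # Built once: maps each byte string to the first token decoding to it.
        self._first_by_bytes = {}
        for tok in self.decode:
            self._first_by_bytes.setdefault(tok.byte_string, tok)

    def lookup_linear(self, raw: bytes) -> Token:
        # O(n) scan over the vocabulary; returns the first match.
        for tok in self.decode:
            if tok.byte_string == raw:
                return tok
        raise KeyError(raw)

    def lookup_indexed(self, raw: bytes) -> Token:
        # Average-case O(1) lookup with the same first-match result.
        return self._first_by_bytes[raw]


vocab = Vocab([Token(0, b"a"), Token(1, b"b"), Token(2, b"a")])
assert vocab.lookup_indexed(b"a") == vocab.lookup_linear(b"a")
```

Using `setdefault` keeps the first token seen for each byte string, so the indexed lookup preserves the "returns the first match" behavior of the linear scan while paying the O(n) cost only once at construction.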

Member Author

Very good point. Updated in a0fcce0

@shepardxia shepardxia left a comment

lgtm!

@samuki samuki merged commit 1c6dd7c into main Apr 7, 2026
5 checks passed

4 participants