Refactor to support duplicate tokens #14

Merged
samuki merged 12 commits into main from yahya/refactor
Apr 7, 2026

@yahya010
Contributor

Adds support for duplicate tokens (using the Token class from the genlm-backend branch in genlm/genlm-backend#57). Also supports multiple extends when multiple EOTs are found.
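
A minimal sketch of why duplicate tokens need an explicit token object, assuming a toy vocabulary. The `Token` dataclass below is a hypothetical stand-in, not the genlm-backend class; only the attribute names `token_id` and `byte_string` are taken from the docstring in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical stand-in for the Token class in genlm-backend. Only the two
# attribute names (token_id, byte_string) come from this PR's docstring.
@dataclass(frozen=True)
class Token:
    token_id: int
    byte_string: bytes


# Toy vocabulary in which ids 1 and 2 decode to the same bytes.
vocab = [Token(0, b"a"), Token(1, b"the"), Token(2, b"the")]

# Keying a lookup by bytes silently collapses the duplicates:
# the entry for id 1 is overwritten by id 2.
by_bytes = {t.byte_string: t.token_id for t in vocab}

# Keying by token_id keeps every vocabulary entry.
by_id = {t.token_id: t.byte_string for t in vocab}

# Grouping ids by byte string makes the collisions explicit.
ids_by_bytes = defaultdict(list)
for t in vocab:
    ids_by_bytes[t.byte_string].append(t.token_id)

print(len(by_bytes), len(by_id), dict(ids_by_bytes))
```

Carrying the id alongside the bytes is what lets two vocabulary entries with identical bytes remain distinct.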

@yahya010
Contributor Author

The tests work! They currently fail on importing the Token class from genlm-backend, since the backend isn't updated yet. Once that is merged on the backend, both will pass.

Member

@samuki samuki left a comment

Could we change the target backend in pyproject.toml to check that all the tests are passing? Otherwise, it looks good, we just need to benchmark the speed.

@codecov

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

pyproject.toml Outdated
authors = [ { name = "Ben LeBrun", email = "benlebrun1@gmail.com" }, { name = "Tim Vieira"} ]
dependencies = [
- "genlm-backend>=0.1.1",
+ "genlm-backend @ git+https://github.com/genlm/genlm-backend.git@samuki/refactor-token-bytes",

This is fine for now, but once we push the refactoring, we’ll need to change this in the new version.

yeah

decode (list[Token]): List of Token objects representing the token vocabulary.
Each Token must have both token_id and byte_string attributes.
device (str, optional): Device to use for weight sum and max computations ('cpu' or 'cuda').
atomic_tokens (list[bytes], optional): List of tokens that should be treated as atomic units rather than being split into bytes.

atomic_tokens is a list[bytes], not a list of tokens. With this refactoring, that may be a little confusing. Can we rename it to something like atomic_byte_str? In general, I think it is better in this repo to say token when we have token_ids and byte_str when we have bytes. The same applies to other parameters, for example eos_tokens and eot_token; it would be good to rename them too.
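
To illustrate the naming convention suggested above, here is a rough sketch that keeps byte strings and tokens visibly separate. The `Token` dataclass, the helper `atomic_token_ids`, and the parameter name `atomic_byte_strs` are all hypothetical and are not part of this repo or of genlm-backend.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the backend Token class (assumption).
@dataclass(frozen=True)
class Token:
    token_id: int
    byte_string: bytes


def atomic_token_ids(decode: list[Token], atomic_byte_strs: list[bytes]) -> list[int]:
    """Return the id of every Token whose bytes appear in atomic_byte_strs.

    The argument holds raw bytes, so its name says byte_str; the result holds
    token ids, so its name says token.
    """
    wanted = set(atomic_byte_strs)
    return [t.token_id for t in decode if t.byte_string in wanted]


# With duplicate tokens, one byte string can resolve to several token ids.
decode = [Token(0, b"a"), Token(1, b"<eos>"), Token(2, b"<eos>")]
ids = atomic_token_ids(decode, [b"<eos>"])
print(ids)
```

With duplicates in the vocabulary, a single entry in the bytes list can map to more than one token id, which is one reason the bytes/token distinction in parameter names matters.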

@shepardxia shepardxia left a comment

Looks good to me! The only change needed before pushing is the pyproject.toml deps.

@shepardxia shepardxia left a comment

lgtm!

@samuki samuki merged commit bce88da into main Apr 7, 2026
3 checks passed

5 participants