Refactor to support duplicate tokens by yahya010 · Pull Request #14 · genlm/genlm-bytes

yahya010 · 2025-12-21T02:12:35Z

This PR adds support for duplicate tokens, using the Token class from genlm-backend (on the branch in genlm/genlm-backend#57). It also supports multiple extends when multiple EOTs are found.
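To illustrate why duplicate tokens need an explicit token object, here is a minimal sketch. The `Token` dataclass below is a hypothetical stand-in, not the genlm-backend class; it only assumes the two attributes named in the docstring under review (`token_id` and `byte_string`). The helper `group_by_bytes` is likewise made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the backend Token class (not its real definition)."""

    token_id: int
    byte_string: bytes


def group_by_bytes(vocab):
    """Map each byte string to every token_id that decodes to it."""
    groups = defaultdict(list)
    for tok in vocab:
        groups[tok.byte_string].append(tok.token_id)
    return dict(groups)


# Token ids 0 and 2 share the same byte string.
vocab = [Token(0, b"a"), Token(1, b"\n"), Token(2, b"a")]

# Keying by bytes alone silently keeps only the last duplicate.
lossy = {tok.byte_string: tok.token_id for tok in vocab}

# Grouping by bytes keeps every token_id.
groups = group_by_bytes(vocab)
```

With a vocabulary like this, a mapping keyed only on bytes loses token id 0, while carrying the id alongside the bytes keeps both entries distinguishable.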

yahya010 · 2025-12-21T02:19:09Z

The tests work! They currently fail only on importing the Token class from genlm-backend, since the backend hasn't been updated yet. Once that is merged on the backend, both will pass.

samuki

Could we change the target backend in pyproject.toml to check that all the tests are passing? Otherwise it looks good; we just need to benchmark the speed.

test_gemma.py

genlm/bytes/byte_lm/heal.py

tests/test_eos_logic.py

genlm/bytes/byte_lm/trie_state.py

codecov · 2026-01-08T14:31:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

vicky-xef · 2026-03-23T11:02:12Z

pyproject.toml

 authors = [ { name = "Ben LeBrun", email = "benlebrun1@gmail.com" }, { name = "Tim Vieira"} ]
 dependencies = [
-    "genlm-backend>=0.1.1",
+    "genlm-backend @ git+https://github.com/genlm/genlm-backend.git@samuki/refactor-token-bytes",


This is fine for now, but once we push the refactoring, we’ll need to change this in the new version.

vicky-xef · 2026-03-23T13:30:15Z

genlm/bytes/trie.py

+            decode (list[Token]): List of Token objects representing the token vocabulary.
+                Each Token must have both token_id and byte_string attributes.
            device (str, optional): Device to use for weight sum and max computations ('cpu' or 'cuda').
            atomic_tokens (list[bytes], optional): List of tokens that should be treated as atomic units rather than being split into bytes.


atomic_tokens is a list[bytes], not a list of tokens, which may be a little confusing now that we are doing this refactoring. Can we rename it to something like atomic_byte_str? In general, I think it is better in this repo to say tokens when we have token_ids and byte_str when we have bytes. The same applies to other parameters, for example eos_tokens and eot_token; it would be good to rename them too.
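As a rough sketch of the distinction the docstring draws (atomic byte strings are kept as a single unit, everything else is split into bytes), consider the helper below. The name `trie_path` and the parameter name `atomic_byte_strs` are hypothetical and follow the naming suggested in this comment; how genlm-bytes actually represents trie edges is not shown here.

```python
def trie_path(byte_string, atomic_byte_strs):
    """Return the edge labels used to insert byte_string into a byte trie.

    Atomic byte strings become a single edge; all others are split
    into one edge per byte. (Illustrative only, not the repo's code.)
    """
    if byte_string in atomic_byte_strs:
        return [byte_string]
    return [bytes([b]) for b in byte_string]


atomic_byte_strs = {b"<eos>"}
atomic_path = trie_path(b"<eos>", atomic_byte_strs)
split_path = trie_path(b"hi", atomic_byte_strs)
```

Here the parameter holds raw bytes rather than Token objects, which is why a name mentioning byte strings reads more accurately than one mentioning tokens.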

ClementeP · 2026-03-26T19:13:15Z

pyproject.toml

genlm/bytes/trie.py

genlm/bytes/byte_lm/heal.py

genlm/bytes/trie.py

shepardxia

Looks good to me! The only change needed before pushing is the pyproject.toml deps.

shepardxia

lgtm!

refactor and add multiple extends

2a3ccdf

yahya010 requested a review from samuki December 21, 2025 02:19

yahya010 mentioned this pull request Dec 22, 2025

Unable to generate byte probabilities for Gemma 2 2B IT #13

Closed

samuki reviewed Jan 8, 2026

test_gemma.py (outdated)

genlm/bytes/byte_lm/heal.py (outdated)

tests/test_eos_logic.py

genlm/bytes/byte_lm/trie_state.py (outdated)

yahya010 added 3 commits January 8, 2026 08:33

update tests & docstring

dc5a1b0

skip

44aaa5d

remove

5fbd796

yahya010 added 3 commits January 8, 2026 09:34

cov

3befcb0

cov

dc62b87

clean

54f5edb

samuki requested review from ClementeP, shepardxia and vicky-xef March 19, 2026 14:47

vicky-xef reviewed Mar 23, 2026

samuki added 2 commits March 23, 2026 21:41

Rename variables

e39596e

Code coverage

c22afdc

ClementeP requested changes Mar 26, 2026

vicky-xef reviewed Mar 27, 2026

genlm/bytes/trie.py (outdated)

Integrate feedback

9d7b25a

shepardxia reviewed Apr 2, 2026

Fallback

325b684

shepardxia approved these changes Apr 2, 2026

vicky-xef approved these changes Apr 2, 2026

Update dependencies

0f9704f

samuki merged commit bce88da into main Apr 7, 2026
3 checks passed

5 participants
