
Conversation

@gonzalobenegas

Loss downweighting on DNA repetitive elements

Fixes #1805

  • Adds a new DNA preprocessing step that not only tokenizes but also defines special loss weights for repetitive elements (marked lowercase in the input sequence).
  • Adds a dataloader that reads both input_ids and loss_weight from disk (could also be useful for other domains)
  • Adds 3 experiments:
    • experiments/dna/standard.py: standard training run on DNA
    • experiments/dna/repeat_weight_1.0.py: repeat weight of 1.0 (as a control; I verified the training loss matches the standard training run exactly)
    • experiments/dna/repeat_weight_0.01.py: repeat weight of 0.01 (downweighting repeats; I verified it yields better downstream task performance)

Additionally, edits to experiments/defaults.py allow passing an additional window_size_bytes argument to increase tokenization parallelism.
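As context for the weighting scheme above: repetitive elements are conventionally soft-masked as lowercase in genome FASTA files, so per-base weights can be derived directly from case. This is a hypothetical illustration of that idea, not the PR's actual preprocessing code, and the function name repeat_loss_weights is invented:

```python
import numpy as np

def repeat_loss_weights(seq: str, repeat_weight: float = 0.01) -> np.ndarray:
    """Per-base loss weights: lowercase bases (soft-masked repetitive
    elements) get repeat_weight; uppercase bases keep full weight 1.0."""
    return np.array(
        [repeat_weight if base.islower() else 1.0 for base in seq],
        dtype=np.float32,
    )

# Uppercase unique sequence followed by a soft-masked repeat:
weights = repeat_loss_weights("ACGTacgt", repeat_weight=0.01)
# -> 1.0 for A, C, G, T; 0.01 for a, c, g, t
```

Setting repeat_weight=1.0 recovers uniform weighting, which is why that experiment serves as a control.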

@gonzalobenegas
Author

Update: the model with repeat downweighting seems to have worse performance at later step counts.
[attached plot: performance across training step counts]

The advantages of this approach require further study, but I would still like to merge the current infra and experiments.

Contributor

@eric-czech left a comment


Looks reasonable to me. There is certainly a lot of boilerplate that could be reduced, but I don't see anything else potentially problematic. Are there any aspects of the implementation you're particularly concerned about?

return length


class WeightedTokenSeqDataset(AsyncDataset[dict]):
Contributor


Why not subclass TokenSeqDataset for this instead and override get_batch? There's a lot of boilerplate here otherwise.

Author

@gonzalobenegas Jan 15, 2026


@eric-czech I started playing with this, but I can't subclass TokenSeqDataset: my get_batch override returns Sequence[dict], which is incompatible with the base class's Sequence[np.ndarray]. The approaches I've been exploring require introducing a base class shared by TokenSeqDataset and WeightedTokenSeqDataset, and that's where I get a bit hesitant about touching the main Marin NLP code. Do you think that's a good idea, vs. just duplicating code?

Contributor


Ah, the biggest downside to duplicating and modifying the code is that changes to TokenSeqDataset would then be easy to miss when rebasing dna onto main. Ideally such changes would either surface as a conflict or be picked up without extra work. This feels like a more ideal structure for it all:

from typing import Generic, Sequence, TypeVar
import numpy as np

T_co = TypeVar("T_co", covariant=True)

class GenericTokenSeqDataset(AsyncDataset[T_co], Generic[T_co]):
    def __init__(self, doc_cache: TreeCache[dict], seq_len: int): ...
    async def _await_cache(self, key: str) -> JaggedArrayStore: ...
    # All other shared methods too

class TokenSeqDataset(GenericTokenSeqDataset[np.ndarray]):
    async def get_batch(self, indices: Sequence[int]) -> Sequence[np.ndarray]: ...


class WeightedTokenSeqDataset(GenericTokenSeqDataset[dict[str, np.ndarray]]):
    async def get_batch(self, indices: Sequence[int]) -> Sequence[dict[str, np.ndarray]]: ...
    ...

> Here's where I get a bit hesitant about touching the main Marin NLP code

I'd say hack away! I think we'll need to find a way to do that w/ some confidence for more ambitious experiments.

return await self.dataset.async_len()


class WeightedCausalLmDataset(MappedAsyncDataset[dict, LmExample]):
Contributor


Similarly, looks like this could be a subclass of CausalLmDataset that passes a constructor arg to switch on what implementation of _create_lm_example gets used.
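The constructor-arg pattern being suggested might look roughly like this. The classes below are stubs for illustration only (the real CausalLmDataset wraps a MappedAsyncDataset and builds LmExample objects, neither of which is reproduced here):

```python
import numpy as np

class CausalLmDataset:
    """Stub of the existing dataset class, reduced to the part under discussion."""

    def __init__(self, weighted: bool = False):
        # Constructor arg switching which example-construction path is used
        self.weighted = weighted

    def _create_lm_example(self, item) -> dict:
        if self.weighted:
            # item carries explicit per-token weights read from disk
            tokens = np.asarray(item["input_ids"])
            weights = np.asarray(item["loss_weight"], dtype=np.float32)
        else:
            # plain token sequence: every position gets full weight
            tokens = np.asarray(item)
            weights = np.ones(len(tokens), dtype=np.float32)
        return {"input_ids": tokens, "loss_weight": weights}

class WeightedCausalLmDataset(CausalLmDataset):
    """Thin subclass: only flips the constructor switch."""

    def __init__(self):
        super().__init__(weighted=True)
```

With this shape, any future change to the shared example-construction logic lands in one place instead of two.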

if tokenizer.eos_token_id is None:
    enforce_eos = False

if enforce_eos:
Contributor


To my knowledge, tokenizers should never do this if you tokenize with add_special_tokens=False (e.g. that's how I've avoided EOS w/ the Hyena tokenizer in the past). Do you know of any tokenizers that don't respect that flag? Otherwise, it would be cleaner to use that in __call__ and then leave this kind of validation up to users/clients.
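The contract being described, that passing add_special_tokens=False suppresses appended special tokens entirely, can be sketched with a toy stand-in (a hypothetical minimal tokenizer mimicking the HuggingFace keyword, not the real transformers API):

```python
from dataclasses import dataclass, field

@dataclass
class StubTokenizer:
    """Toy tokenizer mimicking the add_special_tokens contract."""
    eos_token_id: int = 0
    vocab: dict = field(default_factory=lambda: {"A": 1, "C": 2, "G": 3, "T": 4})

    def __call__(self, text: str, add_special_tokens: bool = True) -> list:
        ids = [self.vocab[base] for base in text.upper()]
        if add_special_tokens:
            ids.append(self.eos_token_id)  # EOS appended only when enabled
        return ids

tok = StubTokenizer()
assert tok.eos_token_id in tok("ACGT")                                # default: EOS present
assert tok.eos_token_id not in tok("ACGT", add_special_tokens=False)  # flag suppresses it
```

Under this contract, passing the flag at the __call__ site makes the downstream eos_token_id check redundant, which is the cleanup being suggested.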

}

@property
def num_cpus(self) -> int:
Contributor


Extending BatchTokenizer to avoid needing to redefine these would make sense to me.

@gonzalobenegas
Author

Thank you for the feedback! I'm not concerned about anything in particular; I mainly wanted to get a sense of the expectations for our DNA workstream. I'll make sure to be specific next time I request feedback!
