Logprob evals with Levanter #2193
Conversation
Pull request overview
This PR adds log probability evaluation capabilities to Levanter by implementing OLMo-3 model support, enhancing the Qwen model implementation, improving evaluation harness robustness, and adding comprehensive evaluation tasks for multilingual benchmarks.
Key Changes:
- Implements OLMo-3 model architecture with sliding window attention support (see the illustrative sketch after this list)
- Fixes Qwen model bias handling for compatibility with newer transformers versions
- Improves eval harness error handling for missing task results
- Adds multilingual evaluation task configurations (Belebele, MGSM, MMMLU, etc.)
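For readers unfamiliar with the mechanism, sliding window attention restricts each token to attending only over the most recent W positions instead of the full causal prefix. The snippet below is a generic editorial illustration of such a mask, not Levanter's or OLMo-3's actual implementation:

```python
import numpy as np


def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: position i may attend to j iff j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)


# With window=3, token 5 attends only to tokens 3, 4, and 5.
print(sliding_window_causal_mask(6, 3).astype(int))
```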
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/gpu_eval/pt_lm_eval_harness.sh | New bash script for evaluating models on GPU using vLLM and lm_eval |
| lib/levanter/src/levanter/models/qwen.py | Fixes bias handling for newer transformers, adds explicit RMSNorm config with use_bias=False |
| lib/levanter/src/levanter/models/olmo.py | Implements complete OLMo-3 architecture with sliding window attention, BlockSeq layers, and per-layer attention patterns |
| lib/levanter/src/levanter/eval_harness.py | Refactors average computation to handle missing task results gracefully (see the sketch after this table) |
| experiments/multilingual/exp1457_multilingual_cpt_eval.py | New experiment for evaluating multilingual CPT model on LM Eval Harness tasks |
| experiments/models.py | Adds model configurations for Llama 3 70B, OLMo-3 7B/32B, and Marin 32B base |
| experiments/evals/task_configs.py | Adds extensive multilingual task configurations (122 Belebele languages, 40+ few-shot tasks, MGSM, etc.) |
| experiments/evals/exp1602b_lm_eval_selected.py | New experiment to run selected LM Eval tasks across multiple model families |
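The eval_harness.py change summarized above is about not crashing when some tasks return no results. The following is only an editorial sketch of that idea; the function name, task names, and metric key are hypothetical and do not come from the Levanter code:

```python
from typing import Mapping, Optional


def safe_macro_average(results: Mapping[str, Optional[dict]], metric: str = "acc") -> Optional[float]:
    """Average `metric` across tasks, skipping tasks that produced no results.

    `results` maps task name -> metrics dict, or None when the task failed or
    was skipped. Returns None instead of raising when nothing is averageable.
    """
    values = [
        metrics[metric]
        for metrics in results.values()
        if metrics is not None and metric in metrics
    ]
    return sum(values) / len(values) if values else None


# "mgsm_en" returned nothing, so it simply drops out of the average.
print(safe_macro_average({"belebele_eng": {"acc": 0.71}, "mgsm_en": None}))  # 0.71
```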
💡 Codex Review
Here are some automated review suggestions for this pull request.
config.Mlp,
config.activation_function,
key=k_mlp,
Propagate HF bias flags to Qwen MLP
QwenConfig.from_hf_config still derives use_bias from the HF attention_bias/no_bias fields, but in QwenDecoderLayer.init the flag is no longer passed to LlamaMlp.init, so the MLP layers are always created without biases. For HF Qwen checkpoints where no_bias=False (the default), the saved MLP weights include bias tensors; initializing the model without biases drops those parameters and makes checkpoint conversion and evaluation inconsistent with the HF architecture. This will prevent correct loading of any biased Qwen checkpoint.
Useful? React with 👍 / 👎.
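In code terms, the fix Codex is pointing at would thread the bias flag derived in QwenConfig.from_hf_config through to the MLP constructor. This is only a sketch extending the quoted call above; whether LlamaMlp.init actually accepts a use_bias keyword is an assumption, not a claim about the current signature:

```python
# Hypothetical sketch of the QwenDecoderLayer.init call, not the code in this PR.
# Assumes LlamaMlp.init accepts a `use_bias` keyword mirroring the attention path.
mlp = LlamaMlp.init(
    config.Embed,
    config.Mlp,
    config.activation_function,
    key=k_mlp,
    use_bias=config.use_bias,  # propagate the flag derived from the HF config
)
```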
Force-push: 862af69 to 25e048b
The other evals with vLLM are currently hard to run due to infra issues. We have incorporated those into the GPU scripts as well.
Helw150 left a comment
OLMo 3 support still needs changes. I'm pretty sure this won't actually lead to valid OLMo 3 results right now, and stylistically it should not rewrite modules that are unchanged.
attn_backend=self.attn_backend,
flash_attention_block_size=self.flash_attention_block_size,
rope=self.rope,
qk_norm=None,  # HF OLMo-3 checkpoints do not use qk_norm on Q/K
Overall, we should be really careful on the details here. This one confuses me a bit because the model shouldn't load correctly at all, but we should have a round-trip test similar to the one below.
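For context, a round-trip test of the kind mentioned here generally runs the same token ids through the HF checkpoint and the converted Levanter model and asserts that the logits match. The sketch below is schematic only; `levanter_forward` is a hypothetical stand-in for whatever conversion and test utilities the repo actually provides:

```python
import numpy as np
import torch


def roundtrip_logit_check(hf_model, levanter_forward, tokenizer, text="The quick brown fox"):
    """Schematic round-trip check: the converted model should reproduce HF logits.

    hf_model: a transformers causal LM. levanter_forward: assumed callable taking
    a numpy array of token ids and returning logits of the same shape (a
    hypothetical stand-in for the real converter / test harness interface).
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hf_logits = hf_model(ids).logits.float().cpu().numpy()

    lev_logits = np.asarray(levanter_forward(ids.cpu().numpy()))
    np.testing.assert_allclose(lev_logits, hf_logits, rtol=1e-4, atol=1e-4)
```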
def model_type(self) -> Type["QwenLMHeadModel"]:
    return QwenLMHeadModel

@property  # type: ignore[override]
I don't think this override is needed? Qwen2 doesn't use Norm bias, but also it has no use_bias at all at the config level and so will default to false in the above code.
rope_config = RotaryEmbeddingsConfig.from_hf_config(rope_theta, hf_config.rope_scaling)
use_bias = getattr(hf_config, "attention_bias", None)
if use_bias is None:
    # Qwen2Config in newer transformers drops no_bias; assume bias by default
Is this true?
Qwen 2.5 doesn't specify any bias: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json
but the model doesn't have bias.

Pretty sure default False is correct (e.g. Qwen 2.5 doesn't work when I run this branch on a new eval because this is misconfigured)
KeyError: 'model.layers.mlp.gate_proj.bias' [repeated 39x across cluster]
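Spelled out, the behaviour being argued for is: if the HF config does not set a bias field, fall back to False rather than True. A minimal sketch of that logic (field names follow the quoted diff; treating no_bias as a legacy fallback is an assumption, not the actual Levanter code):

```python
def resolve_use_bias(hf_config) -> bool:
    """Derive use_bias from an HF config, defaulting to False when unspecified."""
    attention_bias = getattr(hf_config, "attention_bias", None)
    if attention_bias is not None:
        return bool(attention_bias)
    no_bias = getattr(hf_config, "no_bias", None)  # legacy field on older configs
    if no_bias is not None:
        return not no_bias
    # Configs like Qwen 2.5 specify neither field and ship no MLP bias tensors,
    # so assuming bias here produces KeyErrors like the one quoted above.
    return False
```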
Did it work on the main branch? It failed for me on main and I made it work with this change.
might be a version issue?
Can you point me to the WandB logs for the logprob evals which were run on our cluster? I'm a bit confused about how some of these changes are loading model weights successfully and want to look at the logs.
They are separate for each model and each task, for example:
This pull request has been inactive for 23 days and is marked as stale.

This pull request has been automatically closed due to inactivity.



Description
This PR adds the log_prob evals from #1602, including Qwen2 and OLMo 3 support.