Logprob evals with Levanter #2193
Conversation
Pull request overview
This PR adds log probability evaluation capabilities to Levanter by implementing OLMo-3 model support, enhancing the Qwen model implementation, improving evaluation harness robustness, and adding comprehensive evaluation tasks for multilingual benchmarks.
Key Changes:
- Implements OLMo-3 model architecture with sliding window attention support (see the illustrative sketch after this list)
- Fixes Qwen model bias handling for compatibility with newer transformers versions
- Improves eval harness error handling for missing task results
- Adds multilingual evaluation task configurations (Belebele, MGSM, MMMLU, etc.)
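For readers unfamiliar with the mechanism, sliding window attention restricts each token to attending only over the most recent W positions instead of the full causal prefix. The snippet below is a generic editorial illustration of such a mask, not Levanter's or OLMo-3's actual implementation:

```python
import numpy as np


def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: position i may attend to j iff j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)


# With window=3, token 5 attends only to tokens 3, 4, and 5.
print(sliding_window_causal_mask(6, 3).astype(int))
```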
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/gpu_eval/pt_lm_eval_harness.sh | New bash script for evaluating models on GPU using vLLM and lm_eval |
| lib/levanter/src/levanter/models/qwen.py | Fixes bias handling for newer transformers, adds explicit RMSNorm config with use_bias=False |
| lib/levanter/src/levanter/models/olmo.py | Implements complete OLMo-3 architecture with sliding window attention, BlockSeq layers, and per-layer attention patterns |
| lib/levanter/src/levanter/eval_harness.py | Refactors average computation to handle missing task results gracefully (see the sketch after this table) |
| experiments/multilingual/exp1457_multilingual_cpt_eval.py | New experiment for evaluating multilingual CPT model on LM Eval Harness tasks |
| experiments/models.py | Adds model configurations for Llama 3 70B, OLMo-3 7B/32B, and Marin 32B base |
| experiments/evals/task_configs.py | Adds extensive multilingual task configurations (122 Belebele languages, 40+ few-shot tasks, MGSM, etc.) |
| experiments/evals/exp1602b_lm_eval_selected.py | New experiment to run selected LM Eval tasks across multiple model families |
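The eval_harness.py change summarized above is about not crashing when some tasks return no results. The following is only an editorial sketch of that idea; the function name, task names, and metric key are hypothetical and do not come from the Levanter code:

```python
from typing import Mapping, Optional


def safe_macro_average(results: Mapping[str, Optional[dict]], metric: str = "acc") -> Optional[float]:
    """Average `metric` across tasks, skipping tasks that produced no results.

    `results` maps task name -> metrics dict, or None when the task failed or
    was skipped. Returns None instead of raising when nothing is averageable.
    """
    values = [
        metrics[metric]
        for metrics in results.values()
        if metrics is not None and metric in metrics
    ]
    return sum(values) / len(values) if values else None


# "mgsm_en" returned nothing, so it simply drops out of the average.
print(safe_macro_average({"belebele_eng": {"acc": 0.71}, "mgsm_en": None}))  # 0.71
```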
💡 Codex Review
Here are some automated review suggestions for this pull request.
config.Mlp,
config.activation_function,
key=k_mlp,
Propagate HF bias flags to Qwen MLP
QwenConfig.from_hf_config still derives use_bias from the HF attention_bias/no_bias fields, but in QwenDecoderLayer.init the flag is no longer passed to LlamaMlp.init, so the MLP layers are always created without biases. For HF Qwen checkpoints where no_bias=False (the default), the saved MLP weights include bias tensors; initializing the model without biases drops those parameters and makes checkpoint conversion and evaluation inconsistent with the HF architecture. This will prevent correct loading of any biased Qwen checkpoint.
Useful? React with 👍 / 👎.
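In code terms, the fix Codex is pointing at would thread the bias flag derived in QwenConfig.from_hf_config through to the MLP constructor. This is only a sketch extending the quoted call above; whether LlamaMlp.init actually accepts a use_bias keyword is an assumption, not a claim about the current signature:

```python
# Hypothetical sketch of the QwenDecoderLayer.init call, not the code in this PR.
# Assumes LlamaMlp.init accepts a `use_bias` keyword mirroring the attention path.
mlp = LlamaMlp.init(
    config.Embed,
    config.Mlp,
    config.activation_function,
    key=k_mlp,
    use_bias=config.use_bias,  # propagate the flag derived from the HF config
)
```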
Force-push: 862af69 to 25e048b
The other evals with vLLM are currently hard to run due to infra issues. We have incorporated those into the GPU scripts as well.
Helw150 left a comment
OLMo 3 support still needs changes. I'm pretty sure this won't actually lead to valid OLMo 3 results right now, and stylistically it should not rewrite modules that are unchanged.
attn_backend=self.attn_backend,
flash_attention_block_size=self.flash_attention_block_size,
rope=self.rope,
qk_norm=None,  # HF OLMo-3 checkpoints do not use qk_norm on Q/K
Overall, we should be really careful on the details here. This one confuses me a bit because the model shouldn't load correctly at all, but we should have a round-trip test similar to the one below.
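For context, a round-trip test of the kind mentioned here generally runs the same token ids through the HF checkpoint and the converted Levanter model and asserts that the logits match. The sketch below is schematic only; `levanter_forward` is a hypothetical stand-in for whatever conversion and test utilities the repo actually provides:

```python
import numpy as np
import torch


def roundtrip_logit_check(hf_model, levanter_forward, tokenizer, text="The quick brown fox"):
    """Schematic round-trip check: the converted model should reproduce HF logits.

    hf_model: a transformers causal LM. levanter_forward: assumed callable taking
    a numpy array of token ids and returning logits of the same shape (a
    hypothetical stand-in for the real converter / test harness interface).
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hf_logits = hf_model(ids).logits.float().cpu().numpy()

    lev_logits = np.asarray(levanter_forward(ids.cpu().numpy()))
    np.testing.assert_allclose(lev_logits, hf_logits, rtol=1e-4, atol=1e-4)
```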
def model_type(self) -> Type["QwenLMHeadModel"]:
    return QwenLMHeadModel

@property  # type: ignore[override]
I don't think this override is needed? Qwen2 doesn't use Norm bias, but also it has no use_bias at all at the config level and so will default to false in the above code.
rope_config = RotaryEmbeddingsConfig.from_hf_config(rope_theta, hf_config.rope_scaling)
use_bias = getattr(hf_config, "attention_bias", None)
if use_bias is None:
    # Qwen2Config in newer transformers drops no_bias; assume bias by default
Is this true?
Qwen 2.5 doesn't specify any bias: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json
but the model doesn't have bias.

Pretty sure default False is correct (e.g. Qwen 2.5 doesn't work when I run this branch on a new eval because this is misconfigured)
KeyError: 'model.layers.mlp.gate_proj.bias' [repeated 39x across cluster]
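Spelled out, the behaviour being argued for is: if the HF config does not set a bias field, fall back to False rather than True. A minimal sketch of that logic (field names follow the quoted diff; treating no_bias as a legacy fallback is an assumption, not the actual Levanter code):

```python
def resolve_use_bias(hf_config) -> bool:
    """Derive use_bias from an HF config, defaulting to False when unspecified."""
    attention_bias = getattr(hf_config, "attention_bias", None)
    if attention_bias is not None:
        return bool(attention_bias)
    no_bias = getattr(hf_config, "no_bias", None)  # legacy field on older configs
    if no_bias is not None:
        return not no_bias
    # Configs like Qwen 2.5 specify neither field and ship no MLP bias tensors,
    # so assuming bias here produces KeyErrors like the one quoted above.
    return False
```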
Did it work on the main branch? It failed for me on main and I made it work with this change.
might be a version issue?
Can you point me to the WandB logs for the logprob evals which were run on our cluster? I'm a bit confused about how some of these changes are loading model weights successfully and want to look at the logs.
They are separate for each model and each task, for example:
This pull request has been inactive for 23 days and is marked as stale.

This pull request has been automatically closed due to inactivity.



Description
This PR adds the log_prob evals from #1602, including Qwen2 and OLMo 3 support.