Conversation

@ziqing-huang (Contributor) commented Dec 8, 2025

Description

This PR adds the log_prob evals from #1602, including Qwen2 and OLMo 3 support.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds log probability evaluation capabilities to Levanter by implementing OLMo-3 model support, enhancing the Qwen model implementation, improving evaluation harness robustness, and adding comprehensive evaluation tasks for multilingual benchmarks.

Key Changes:

  • Implements OLMo-3 model architecture with sliding window attention support
  • Fixes Qwen model bias handling for compatibility with newer transformers versions
  • Improves eval harness error handling for missing task results
  • Adds multilingual evaluation task configurations (Belebele, MGSM, MMMLU, etc.)

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:

  • scripts/gpu_eval/pt_lm_eval_harness.sh: New bash script for evaluating models on GPU using vLLM and lm_eval
  • lib/levanter/src/levanter/models/qwen.py: Fixes bias handling for newer transformers versions, adds explicit RMSNorm config with use_bias=False
  • lib/levanter/src/levanter/models/olmo.py: Implements complete OLMo-3 architecture with sliding window attention, BlockSeq layers, and per-layer attention patterns
  • lib/levanter/src/levanter/eval_harness.py: Refactors average computation to handle missing task results gracefully (see the sketch after this list)
  • experiments/multilingual/exp1457_multilingual_cpt_eval.py: New experiment for evaluating the multilingual CPT model on LM Eval Harness tasks
  • experiments/models.py: Adds model configurations for Llama 3 70B, OLMo-3 7B/32B, and Marin 32B base
  • experiments/evals/task_configs.py: Adds extensive multilingual task configurations (122 Belebele languages, 40+ few-shot tasks, MGSM, etc.)
  • experiments/evals/exp1602b_lm_eval_selected.py: New experiment to run selected LM Eval tasks across multiple model families
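
To make the eval_harness change above concrete, here is a rough sketch of what averaging that tolerates missing task results could look like. The function name and result layout are illustrative assumptions, not the actual code in lib/levanter/src/levanter/eval_harness.py.

    # Illustrative sketch only: names and result layout are assumptions, not the
    # actual eval_harness.py code.
    from typing import Mapping, Optional


    def macro_average(task_results: Mapping[str, Optional[Mapping[str, float]]],
                      metric: str = "acc") -> Optional[float]:
        """Average `metric` across tasks, skipping tasks that produced no result."""
        values = [
            res[metric]
            for res in task_results.values()
            if res is not None and metric in res
        ]
        if not values:
            return None  # nothing to average; caller can skip logging instead of raising
        return sum(values) / len(values)


    # Example: one task produced no result, so it is skipped rather than crashing.
    results = {"mmlu": {"acc": 0.62}, "belebele_eng": None, "mgsm": {"acc": 0.41}}
    print(macro_average(results))  # -> 0.515 (up to float rounding)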

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 170 to 172:

    config.Mlp,
    config.activation_function,
    key=k_mlp,

P1: Propagate HF bias flags to Qwen MLP

QwenConfig.from_hf_config still derives use_bias from the HF attention_bias/no_bias fields, but in QwenDecoderLayer.init the flag is no longer passed to LlamaMlp.init, so the MLP layers are always created without biases. For HF Qwen checkpoints where no_bias=False (the default), the saved MLP weights include bias tensors; initializing the model biasless drops those parameters and makes checkpoint conversion/evaluation inconsistent with the HF architecture. This will prevent correct loading of any biased Qwen checkpoint.

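A minimal sketch of the fix Codex is suggesting, written against the quoted call above. It assumes LlamaMlp.init still accepts a use_bias keyword (as it does for the Llama layers) and that config.use_bias carries the HF-derived flag; adjust to the actual signatures in qwen.py.

    # Sketch of the suggested fix in QwenDecoderLayer.init (context from the quoted
    # snippet above); assumes LlamaMlp.init accepts use_bias and that config.use_bias
    # holds the flag derived in QwenConfig.from_hf_config.
    mlp = LlamaMlp.init(
        config.Embed,
        config.Mlp,
        config.activation_function,
        key=k_mlp,
        use_bias=config.use_bias,  # propagate the HF bias flag instead of dropping it
    )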

@marin-community marin-community deleted a comment from Copilot AI Dec 9, 2025
@marin-community marin-community deleted a comment from Copilot AI Dec 9, 2025
@ziqing-huang (Contributor, Author) commented:

The other evals with vLLM are currently hard to run due to infra issues. We have incorporated those into the GPU scripts as well.

@Helw150 (Member) left a comment

OLMo 3 support still needs changes. I'm pretty sure this won't actually lead to valid OLMo 3 results right now, and stylistically we should not rewrite modules that are unchanged.

Quoted diff context:

    attn_backend=self.attn_backend,
    flash_attention_block_size=self.flash_attention_block_size,
    rope=self.rope,
    qk_norm=None,  # HF OLMo-3 checkpoints do not use qk_norm on Q/K

This is not correct; the OLMo models are using QK norm.

Overall, we should be really careful about the details here. This one confuses me a bit, because the model shouldn't load correctly at all. We should have a round-trip test similar to the one below.

https://github.com/marin-community/levanter/blob/982cef7f1d8d1a642b825fcd30ab1b44a912f478/tests/test_llama3.py#L71-L126
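
For concreteness, here is a rough sketch of the shape such a round-trip test could take, following the linked test_llama3.py. All OLMo-specific names (OlmoConfig, OlmoLMHeadModel, their constructor fields) and the converter helpers are assumptions about this PR's code and current Levanter/Haliax APIs; the real test should reuse whatever helpers test_llama3.py uses.

    # Sketch of a logits round-trip test for the OLMo-3 port. Names marked "assumed"
    # are guesses at this PR's API and should be replaced with the actual ones.
    import tempfile

    import jax.random as jrandom
    import numpy as np
    import torch
    from transformers import AutoModelForCausalLM

    import haliax as hax
    from levanter.models.olmo import OlmoConfig, OlmoLMHeadModel  # assumed names


    def test_olmo3_roundtrip():
        config = OlmoConfig(seq_len=128, hidden_dim=64, num_layers=2, num_heads=4)  # assumed fields
        converter = config.hf_checkpoint_converter()  # assumed helper, mirroring LlamaConfig
        Vocab = hax.Axis("vocab", 1000)
        hf_config = config.to_hf_config(Vocab.size)  # assumed helper, mirroring LlamaConfig

        input_ids = hax.random.randint(jrandom.PRNGKey(0), config.Pos, 0, Vocab.size)
        input_torch = torch.from_numpy(np.array(input_ids.array)).to(torch.int64).unsqueeze(0)

        # Reference logits from the HF implementation.
        torch.random.manual_seed(0)
        hf_model = AutoModelForCausalLM.from_config(hf_config)
        hf_model.eval()
        hf_logits = hf_model(input_torch).logits[0].detach().cpu().numpy()

        # Round-trip the weights through the converter and compare logits.
        with tempfile.TemporaryDirectory() as tmpdir:
            hf_model.save_pretrained(f"{tmpdir}/hf")
            model = converter.load_pretrained(OlmoLMHeadModel, ref=f"{tmpdir}/hf")  # assumed signature
            lev_logits = np.array(model(input_ids).array)  # causal masking may need to be passed explicitly

        np.testing.assert_allclose(lev_logits, hf_logits, rtol=1e-2, atol=1e-2)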

Quoted diff context:

    def model_type(self) -> Type["QwenLMHeadModel"]:
        return QwenLMHeadModel

    @property  # type: ignore[override]

I don't think this override is needed. Qwen2 doesn't use norm bias, but it also has no use_bias at the config level at all, so it will default to false in the code above.

Quoted diff context:

    rope_config = RotaryEmbeddingsConfig.from_hf_config(rope_theta, hf_config.rope_scaling)
    use_bias = getattr(hf_config, "attention_bias", None)
    if use_bias is None:
        # Qwen2Config in newer transformers drops no_bias; assume bias by default
@Helw150 (Member) commented Dec 16, 2025

Is this true?

Qwen 2.5 doesn't specify any bias (https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json), but the model doesn't have bias.

Pretty sure defaulting to False is correct (e.g. Qwen 2.5 doesn't work when I run this branch on a new eval because this is misconfigured):

    KeyError: 'model.layers.mlp.gate_proj.bias' [repeated 39x across cluster]
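
A minimal sketch of the defaulting being argued for here, written against the quoted snippet above: fall back to the legacy no_bias field if present, otherwise assume no bias. This is an illustration of the suggestion, not the final code.

    # Sketch of the suggested defaulting (illustrative, not the final code):
    use_bias = getattr(hf_config, "attention_bias", None)
    if use_bias is None:
        legacy_no_bias = getattr(hf_config, "no_bias", None)
        # Older configs expose no_bias; when neither field is present, default to False,
        # matching checkpoints like Qwen 2.5 whose weights carry no bias tensors.
        use_bias = (not legacy_no_bias) if legacy_no_bias is not None else False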

@ziqing-huang (Contributor, Author) replied:

Did it work on the main branch? It failed for me on main, and I made it work with this change.

@ziqing-huang (Contributor, Author) added:

Might be a version issue?

@Helw150 (Member) commented Dec 16, 2025

Can you point me to the WandB logs for the logprob evals which were run on our cluster? I'm a bit confused at how some of these changes are loading model weights successfully and want to look at the logs.

@ziqing-huang (Contributor, Author) replied:

> Can you point me to the WandB logs for the logprob evals which were run on our cluster? I'm a bit confused at how some of these changes are loading model weights successfully and want to look at the logs.

They are separate for each model and each task, for example:
https://wandb.ai/marin-community/marin/runs/e105yqwh
https://wandb.ai/marin-community/marin/runs/wx00lh62

@Helw150 (Member) commented Dec 16, 2025

Yeah, the OLMo 3 implementation is definitely not correct right now. The 32B model is getting worse than random chance on MMLU.

[screenshot: MMLU results for the 32B model]

github-actions bot commented Jan 9, 2026

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions bot added the stale label Jan 9, 2026
github-actions bot left a comment

This pull request has been automatically closed due to inactivity.
If you would like to continue working on this, please reopen it or create a new PR.

@github-actions github-actions bot closed this Jan 16, 2026