Conversation

kothasuhas

The no-budget-forcing evaluation is missing a "max_tokens_thinking" flag. Since the model released on Hugging Face does not naturally emit a thinking token at the start of its response, this results in degraded performance (around 20% lower than the original paper on aime24_nofigures). Adding max_tokens_thinking=auto recovers the performance reported in the original paper for AIME 24.
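To illustrate what I mean, here is a rough Python sketch of the difference (not the repo's actual eval code; the delimiter string is my assumption):

# Illustrative only: the flag controls whether the harness opens the thinking
# block for the model, or leaves the model to emit a thinking token itself.
THINK_START = "<|im_start|>think"  # assumed delimiter, check the harness

def build_prompt(question: str, force_thinking: bool) -> str:
    prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    if force_thinking:
        # With the flag set, the response starts inside the thinking block.
        # Without it, s1-32B often never opens one, hence the degraded scores.
        prompt += THINK_START + "\n"
    return prompt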

@Muennighoff
Contributor

Thanks a lot for the PR!!

When I ran it without the max_tokens_thinking flag I got 50% on aime24_nofigures; the file is here:
https://cdn-lfs-us-1.hf.co/repos/21/71/2171f52d368b44aa97c53f2f421b6b67be90dab479e70d7e53142e2156ad793f/bc4c5ad202a3f169cd6207b28d907969a8de9df61d75c91048720a74264ed7f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27results_2025-01-20T18-02-29.481982.json%3B+filename%3D%22results_2025-01-20T18-02-29.481982.json%22%3B&response-content-type=application%2Fjson&Expires=1742248938&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0MjI0ODkzOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzIxLzcxLzIxNzFmNTJkMzY4YjQ0YWE5N2M1M2YyZjQyMWI2YjY3YmU5MGRhYjQ3OWU3MGQ3ZTUzMTQyZTIxNTZhZDc5M2YvYmM0YzVhZDIwMmEzZjE2OWNkNjIwN2IyOGQ5MDc5NjlhOGRlOWRmNjFkNzVjOTEwNDg3MjBhNzQyNjRlZDdmMz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=hVG7WF5PeZZuqzfxQXFMIoUUtKtAImNCWsKxYBqvPf4-Y-cpiXXNhziKOzXGlIDa5VUHfJXlHaJGmTHVS3Vl1fqHQhspKQu1Sm7rlP8O6p2MbbxPO0HDC21RU-j5pgweT2uSMA5KnY%7ENNYJtf5dL%7EdIu6Eb7XaXIhNI%7EREp0r7OfJHk8t3CQyNbnBiVI-62--m5in9CslXsI23bVEO6l7q%7Eabs8v%7E2oJUvCpprOCD38YB56idxu12ietxzgMVMWC0yjZnimk4TSmeXDcHtwRLLhCWlNeTjmxkdMmkTK9e9%7EzMONIo%7EZImrfTbii4HSNpyQvyKpkzKg4q8ALstGkbpQ__&Key-Pair-Id=K24J24Z295AEI9

max_tokens_thinking=auto budget forces to ~30K, while this command is supposed to not use any budget forcing. The auto one corresponds to this line:

OPENAI_API_KEY=YOUR_OPENAI_KEY PROCESSOR=gpt-4o-mini lm_eval --model vllm --model_args pretrained=simplescaling/s1-32B,tokenizer=Qwen/Qwen2.5-32B-Instruct,dtype=float32,tensor_parallel_size=8 --tasks aime24_figures,aime24_nofigures,openai_math,gpqa_diamond_openai --batch_size auto --apply_chat_template --output_path forcingauto --log_samples --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto"
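In case "auto" is unclear, here is roughly how I'd describe the budget resolution (a sketch based on this thread, not the harness's actual code):

def resolve_thinking_budget(value, prompt_len, context_limit=32768):
    # "auto" fills whatever room the context leaves after the prompt,
    # which is why it budget-forces around ~30K thinking tokens here.
    if value == "auto":
        return context_limit - prompt_len
    return int(value)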

So I guess to get the thinking token to still be appended but not budget-force, you could do something like:

OPENAI_API_KEY=YOUR_OPENAI_KEY PROCESSOR=gpt-4o-mini lm_eval --model vllm --model_args pretrained=simplescaling/s1-32B,tokenizer=Qwen/Qwen2.5-32B-Instruct,dtype=float32,tensor_parallel_size=8 --tasks aime24_figures,aime24_nofigures,openai_math,gpqa_diamond_openai --batch_size auto --apply_chat_template --output_path nottc --log_samples --gen_kwargs max_gen_toks=32768,max_tokens_thinking=33000

though I haven't tested this.
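The reason 33000 should behave like no budget forcing (my reading, untested like the command above): generation is already capped at max_gen_toks, so the thinking budget can never be reached:

max_gen_toks = 32768        # hard cap on generated tokens from the command
max_tokens_thinking = 33000
# The budget exceeds what generation can ever produce, so the forced cutoff
# never fires; the thinking delimiter is still appended at the start.
assert max_tokens_thinking > max_gen_toks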

@kothasuhas
Author

Makes sense. I tested locally, and making the thinking budget longer works as well; I committed that change.
