Conversation

kothasuhas

The no-budget-forcing evaluation is missing a "max_tokens_thinking" flag. Since the model released on Hugging Face does not naturally emit a thinking token at the start of its response, this results in degraded performance (around 20% lower than the original paper on aime24_nofigures). Adding max_tokens_thinking=auto recovers the performance reported in the original paper for AIME 24.
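To illustrate what I mean, here is a rough Python sketch of the difference (not the repo's actual eval code; the delimiter string is my assumption):

# Illustrative only: the flag controls whether the harness opens the thinking
# block for the model, or leaves the model to emit a thinking token itself.
THINK_START = "<|im_start|>think"  # assumed delimiter, check the harness

def build_prompt(question: str, force_thinking: bool) -> str:
    prompt = f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
    if force_thinking:
        # With the flag set, the response starts inside the thinking block.
        # Without it, s1-32B often never opens one, hence the degraded scores.
        prompt += THINK_START + "\n"
    return prompt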

@Muennighoff
Contributor

Thanks a lot for the PR!!

When I ran it without the max_tokens_thinking flag I got 50% on aime24_nofigures; the file is here:
https://cdn-lfs-us-1.hf.co/repos/21/71/2171f52d368b44aa97c53f2f421b6b67be90dab479e70d7e53142e2156ad793f/bc4c5ad202a3f169cd6207b28d907969a8de9df61d75c91048720a74264ed7f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27results_2025-01-20T18-02-29.481982.json%3B+filename%3D%22results_2025-01-20T18-02-29.481982.json%22%3B&response-content-type=application%2Fjson&Expires=1742248938&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0MjI0ODkzOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzIxLzcxLzIxNzFmNTJkMzY4YjQ0YWE5N2M1M2YyZjQyMWI2YjY3YmU5MGRhYjQ3OWU3MGQ3ZTUzMTQyZTIxNTZhZDc5M2YvYmM0YzVhZDIwMmEzZjE2OWNkNjIwN2IyOGQ5MDc5NjlhOGRlOWRmNjFkNzVjOTEwNDg3MjBhNzQyNjRlZDdmMz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=hVG7WF5PeZZuqzfxQXFMIoUUtKtAImNCWsKxYBqvPf4-Y-cpiXXNhziKOzXGlIDa5VUHfJXlHaJGmTHVS3Vl1fqHQhspKQu1Sm7rlP8O6p2MbbxPO0HDC21RU-j5pgweT2uSMA5KnY%7ENNYJtf5dL%7EdIu6Eb7XaXIhNI%7EREp0r7OfJHk8t3CQyNbnBiVI-62--m5in9CslXsI23bVEO6l7q%7Eabs8v%7E2oJUvCpprOCD38YB56idxu12ietxzgMVMWC0yjZnimk4TSmeXDcHtwRLLhCWlNeTjmxkdMmkTK9e9%7EzMONIo%7EZImrfTbii4HSNpyQvyKpkzKg4q8ALstGkbpQ__&Key-Pair-Id=K24J24Z295AEI9

max_tokens_thinking=auto budget forces to ~30K, while this command is supposed to not use any budget forcing. The auto one corresponds to this line:

OPENAI_API_KEY=YOUR_OPENAI_KEY PROCESSOR=gpt-4o-mini lm_eval --model vllm --model_args pretrained=simplescaling/s1-32B,tokenizer=Qwen/Qwen2.5-32B-Instruct,dtype=float32,tensor_parallel_size=8 --tasks aime24_figures,aime24_nofigures,openai_math,gpqa_diamond_openai --batch_size auto --apply_chat_template --output_path forcingauto --log_samples --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto"
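In case "auto" is unclear, here is roughly how I'd describe the budget resolution (a sketch based on this thread, not the harness's actual code):

def resolve_thinking_budget(value, prompt_len, context_limit=32768):
    # "auto" fills whatever room the context leaves after the prompt,
    # which is why it budget-forces around ~30K thinking tokens here.
    if value == "auto":
        return context_limit - prompt_len
    return int(value)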

So I guess to get the thinking token to still be appended but not budget-force, you could do something like:

OPENAI_API_KEY=YOUR_OPENAI_KEY PROCESSOR=gpt-4o-mini lm_eval --model vllm --model_args pretrained=simplescaling/s1-32B,tokenizer=Qwen/Qwen2.5-32B-Instruct,dtype=float32,tensor_parallel_size=8 --tasks aime24_figures,aime24_nofigures,openai_math,gpqa_diamond_openai --batch_size auto --apply_chat_template --output_path nottc --log_samples --gen_kwargs max_gen_toks=32768,max_tokens_thinking=33000

though I haven't tested this.
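The reason 33000 should behave like no budget forcing (my reading, untested like the command above): generation is already capped at max_gen_toks, so the thinking budget can never be reached:

max_gen_toks = 32768        # hard cap on generated tokens from the command
max_tokens_thinking = 33000
# The budget exceeds what generation can ever produce, so the forced cutoff
# never fires; the thinking delimiter is still appended at the start.
assert max_tokens_thinking > max_gen_toks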

@kothasuhas
Author

Makes sense. I tested locally, and making the thinking budget longer works as well; I committed that change.
