I’m using this command to evaluate my fine-tuned 7B model on AIME24 “no figures”:
```bash
lm_eval --model vllm \
    --model_args pretrained=../../train/ckpts/s1-20250509_212555,dtype=float32,tensor_parallel_size=1 \
    --tasks aime24_nofigures \
    --batch_size auto \
    --apply_chat_template \
    --output_path s1.1forcingauto \
    --log_samples \
    --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto"
```
The model loads in about 7 seconds, but then spends roughly 20 minutes on the first example alone, whereas an evaluation of the original Qwen-7B model under the same settings finishes in a couple of minutes. If I switch to dtype=float16, throughput improves dramatically, but at the cost of noticeable accuracy degradation. Is there something I'm missing in my configuration or checkpoint structure that could explain this drastic slowdown under float32?
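One thing I did try was checking which dtype the checkpoint was actually saved in, in case dtype=float32 forces an upcast of every weight tensor at load time. Here is a minimal sketch of that check (it assumes a standard Hugging Face checkpoint layout with a config.json next to the weights; the path is the same one used in the command above):

```python
# Quick sanity check (assumes a standard Hugging Face checkpoint layout):
# read config.json next to the weights and print the dtype they were saved in.
import json
from pathlib import Path

ckpt_dir = Path("../../train/ckpts/s1-20250509_212555")
config = json.loads((ckpt_dir / "config.json").read_text())

# If this prints "bfloat16", passing dtype=float32 makes vLLM upcast the
# weights, roughly doubling memory traffic versus the native precision.
print(config.get("torch_dtype"))
```

If the checkpoint turns out to be natively bf16, I assume dtype=bfloat16 would be a reasonable middle ground between float32 and float16, but I would still like to understand why float32 is this much slower here than on the base model.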