Model | Mode | Acc (%) | No answer (%) | Total | Reason Lens |
---|---|---|---|---|---|
o1-preview-2024-09-12 | greedy | 93.5 | 1.25 | 800 | 504.17 |
o1-mini-2024-09-12 | greedy | 91.38 | 0.38 | 800 | 468.3 |
gpt-4o-2024-08-06 | greedy | 85 | 0.12 | 800 | 760.7 |
chatgpt-4o-latest-24-09-07 | greedy | 84.25 | 0 | 800 | 683.45 |
gpt-4o-2024-05-13 | greedy | 83.62 | 0.25 | 800 | 611.46 |
claude-3-5-sonnet-20240620 | greedy | 78.75 | 0 | 800 | 518.28 |
gemini-1.5-pro-exp-0827 | greedy | 77.62 | 0.25 | 800 | 581.76 |
gpt-4-turbo-2024-04-09 | greedy | 76.75 | 0 | 800 | 566.57 |
gpt-4o-mini-2024-07-18 | greedy | 73.5 | 0.12 | 800 | 391.32 |
gemini-1.5-pro-exp-0801 | greedy | 73 | 0.12 | 800 | 436.82 |
Mistral-Large-2 | greedy | 72.88 | 0.25 | 800 | 469.91 |
gemini-1.5-flash-exp-0827 | greedy | 72.5 | 0.38 | 800 | 631.85 |
gpt-4-0314 | greedy | 72.38 | 0 | 800 | 404.28 |
Llama-3.1-405B-Inst-fp8@together | greedy | 72.12 | 2.62 | 800 | 300.62 |
Qwen2.5-72B-Instruct | greedy | 71.88 | 0 | 800 | 531.01 |
Llama-3.1-405B-Inst@hyperbolic | greedy | 71.5 | 1.12 | 800 | 345.76 |
Llama-3.1-405B-Inst@sambanova | greedy | 71.25 | 0.12 | 800 | 414.28 |
claude-3-opus-20240229 | greedy | 68.62 | 0 | 800 | 521.62 |
deepseek-v2-chat-0628 | greedy | 68.5 | 0 | 800 | 568.12 |
deepseek-v2.5-0908 | greedy | 68.12 | 0.12 | 800 | 524.02 |
Qwen2.5-32B-Instruct | greedy | 68.12 | 0.38 | 800 | 545.23 |
deepseek-v2-coder-0724 | greedy | 67.75 | 0 | 800 | 564.88 |
gemini-1.5-pro | greedy | 66.25 | 0.25 | 800 | 385.66 |
claude-3-sonnet-20240229 | greedy | 64.75 | 0 | 800 | 749.15 |
Meta-Llama-3.1-70B-Instruct | greedy | 62.62 | 0.5 | 800 | 493.74 |
gemini-1.5-flash | greedy | 61.88 | 0.25 | 800 | 514.44 |
yi-large-preview | greedy | 58.63 | 0 | 800 | 689.52 |
yi-large | greedy | 58.38 | 0 | 800 | 628.25 |
Qwen2-72B-Instruct | greedy | 57.38 | 0 | 800 | 444.5 |
Meta-Llama-3-70B-Instruct | greedy | 57.12 | 0 | 800 | 431.53 |
gemma-2-27b-it | greedy | 55.88 | 0 | 800 | 421.67 |
claude-3-haiku-20240307 | greedy | 53.62 | 0.12 | 800 | 708.22 |
gpt-3.5-turbo-0125 | greedy | 53.25 | 0.25 | 800 | 405.27 |
Qwen2.5-7B-Instruct | greedy | 51.25 | 0.5 | 800 | 531.07 |
Athene-70B | greedy | 49.75 | 0 | 800 | 283.62 |
reka-core-20240501 | greedy | 45 | 0 | 800 | 525.5 |
gemma-2-9b-it | greedy | 44.88 | 0 | 800 | 484.51 |
Yi-1.5-9B-Chat | greedy | 43.75 | 1.62 | 800 | 593.18 |
Phi-3-mini-4k-instruct | greedy | 43.5 | 0.75 | 800 | 539.63 |
Mixtral-8x7B-Instruct-v0.1 | greedy | 43.5 | 0.25 | 800 | 463.08 |
Yi-1.5-34B-Chat | greedy | 42.88 | 0 | 800 | 561.47 |
Phi-3.5-mini-instruct | greedy | 40.88 | 3 | 800 | 625.13 |
Meta-Llama-3.1-8B-Instruct | greedy | 38.75 | 0.62 | 800 | 535.85 |
Qwen2-7B-Instruct | greedy | 36.75 | 0.12 | 800 | 368.51 |
Meta-Llama-3-8B-Instruct | greedy | 36.62 | 0.25 | 800 | 411.52 |
reka-flash-20240226 | greedy | 33.25 | 0 | 800 | 565.61 |
Qwen2.5-3B-Instruct | greedy | 32.12 | 1 | 800 | 502.87 |
gemma-2-2b-it | greedy | 20.75 | 0 | 800 | 351.05 |