Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
Solving grid puzzles involves a significant amount of logical reasoning, which makes them a good domain for evaluating a model's reasoning capability and, in turn, for guiding improvements to that capability. However, most existing works evaluate only the final predicted answer of a puzzle, without an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or finer-grained metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures in order to accurately assess the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles of varying complexity. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2. We then develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains. Evaluating reasoning chains from these LLMs leads to several interesting findings. We further show that existing prompting methods used to enhance models' reasoning abilities do not improve performance on GridPuzzle. This highlights the importance of understanding fine-grained errors and presents a challenge for future research: enhancing LLMs' puzzle-solving abilities by developing methods that address these errors.
Data Release
Please take a look at the ./data folder to access the GridPuzzle dataset and the data for all the experiments.
Scope of the dataset: The dataset consists of GridPuzzle with the original reasoning chains, the data for auto-evaluation performed by GPT-4o, the data for both metrics (Accuracy and PuzzleEval), and finally the data for the mitigation strategies.
The data/ folder contains the GridPuzzle dataset; each entry has the following fields:
key: The grid size of each puzzle along with the difficulty level.
id: Unique identifier for each data entry.
question: The puzzle question prompt using Zero-shot-CoT.
answer: The gold solution table for the corresponding puzzle.
Mistral-7b: Response generated by Mistral-7b for the prompt in the question field.
Llama-13b: Response generated by Llama-13b for the prompt in the question field.
Gemini-pro: Response generated by Gemini-pro for the prompt in the question field.
GPT-4-turbo: Response generated by GPT-4-turbo for the prompt in the question field.
Claude-3: Response generated by Claude-3 for the prompt in the question field.
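For quick inspection, here is a minimal sketch of loading these fields with pandas. The file name and Excel format are assumptions; point it at whichever file in ./data actually carries these columns.

```python
# Minimal sketch: load the GridPuzzle dataset and inspect one entry.
# NOTE: the file name below is an assumption, not the repo's actual path.
import pandas as pd

df = pd.read_excel("data/GridPuzzle.xlsx")

# One response column per evaluated model, as documented above.
model_columns = ["Mistral-7b", "Llama-13b", "Gemini-pro", "GPT-4-turbo", "Claude-3"]

entry = df.iloc[0]
print(entry["key"])                # grid size + difficulty level
print(entry["question"])           # Zero-shot-CoT puzzle prompt
print(entry["GPT-4-turbo"][:200])  # start of one model's reasoning chain
```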
Excel file format for Auto-Evaluation
key
question
answer
reasoning chain
prompt
annotated_RC
Column Descriptions
key: The grid size of each puzzle along with the difficulty level.
question: The puzzle question prompt using Zero-shot-CoT.
answer: The gold solution table for the corresponding puzzle.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
prompt: The user prompt containing the reasoning chain, which follows the fixed system prompt for auto-evaluation.
annotated_RC: The annotations given by the auto-evaluator for the corresponding reasoning chain.
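The sketch below pairs each reasoning chain with its auto-evaluator annotation, assuming one Excel sheet per model; the file name is hypothetical.

```python
# Minimal sketch: inspect the Auto-Evaluation data for one model.
# NOTE: the file name is an assumption; substitute the actual file in ./data.
import pandas as pd

auto_eval = pd.read_excel("data/auto_eval_GPT-4-turbo.xlsx")

row = auto_eval.iloc[0]
print(row["key"])                    # grid size + difficulty level
print(row["reasoning chain"][:300])  # model-generated reasoning chain
print(row["annotated_RC"][:300])     # error annotations from the GPT-4o auto-evaluator
```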
Excel file format for Accuracy
puzzle
reasoning chain
gold solution
model solution
Validation(Correct/Incorrect)
Column Descriptions
puzzle: The puzzle question prompt using Zero-shot-CoT.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
gold solution: The correct or 'gold standard' solution to the puzzle.
model solution: The final solution provided by each model, used for comparison against the gold standard.
Validation(Correct/Incorrect): A field indicating whether the model solution matches the gold standard; marked as 'Correct' or 'Incorrect'.
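Final-answer accuracy can be recomputed directly from the Validation column. This is a sketch, assuming one Excel file per model (hypothetical file name) and that the column stores the literal strings 'Correct' / 'Incorrect' as described above.

```python
# Minimal sketch: recompute final-answer accuracy from the Accuracy sheet.
# NOTE: the file name is an assumption; the column names follow the format above.
import pandas as pd

acc_df = pd.read_excel("data/accuracy_GPT-4-turbo.xlsx")

is_correct = acc_df["Validation(Correct/Incorrect)"].str.strip() == "Correct"
print(f"Final-answer accuracy: {is_correct.mean():.3f} over {len(acc_df)} puzzles")
```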
Excel file format for PuzzleEval
key
question
gold solution
reasoning chain
Labelled Steps
Final Conclusions
Pair-wise Relation
Validate Relations
Step Correctness Score
Average Correctness
Column Descriptions
key: The grid size of each puzzle along with the difficulty level.
question: The puzzle question prompt using Zero-shot-CoT.
gold solution: The correct or 'gold standard' solution to the puzzle.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
Labelled Steps: Breakdown of the reasoning chain into distinct, labeled steps.
Final Conclusions: The ultimate conclusions drawn from each step of the reasoning chain.
Pair-wise Relation: The extracted pair-wise relations from each step.
Validate Relations: The results of comparing the extracted pair-wise relations with the gold solution table.
Step Correctness Score: The per-step score: each extracted pair is marked 1 if correct and 0 if incorrect, and these pair scores are averaged within each step.
Average Correctness: The average of the step correctness scores across all steps of the reasoning chain.
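As a sanity check, the per-chain Average Correctness values can be aggregated into a single score per model. The sketch below assumes a hypothetical file name, that Average Correctness is stored as a numeric value per chain, and that averaging over chains is the intended aggregation.

```python
# Minimal sketch: aggregate PuzzleEval's per-chain Average Correctness scores.
# NOTE: the file name is an assumption; the column names follow the format above.
import pandas as pd

pe_df = pd.read_excel("data/puzzleeval_GPT-4-turbo.xlsx")

# Mean Average Correctness over all reasoning chains for this model.
print(f"Mean Average Correctness: {pe_df['Average Correctness'].mean():.3f}")

# Breakdown by grid size / difficulty, using the key column.
print(pe_df.groupby("key")["Average Correctness"].mean())
```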