Solving grid puzzles involves a significant amount of logical reasoning, which makes them a good domain for evaluating the reasoning capabilities of models and, in turn, for guiding efforts to improve those capabilities. However, most existing works evaluate only the final predicted answer of a puzzle, without an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or finer-grained metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures in order to accurately assess the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles of varying complexity. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2. We then develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains. Evaluating reasoning chains from LLMs leads to several interesting findings. We further show that existing prompting methods used to enhance models' reasoning abilities do not improve performance on GridPuzzle. This highlights the importance of understanding fine-grained errors and presents a challenge for future research to enhance LLMs' puzzle-solving abilities by developing methods that address these errors.
Please take a look at the ./data folder to access the GridPuzzle dataset and the data for all the experiments.
Scope of the dataset: The data includes the GridPuzzle puzzles with the original reasoning chains, the auto-evaluation annotations produced by GPT-4o, the data for both metrics (Accuracy and PuzzleEval), and the data for the mitigation strategies.
The `data/` folder contains the following files:

```
├── ...
├── data/
│   ├── GridPuzzle.csv
│   ├── Auto-Evaluation
│   ├── Metrics
│   │   ├── Accuracy
│   │   └── PuzzleEval
│   └── Mitigation
│       ├── Mitigation Results
│       └── PuzzleEval Results
```
`GridPuzzle.csv` contains the puzzles along with the raw response of each model. Its columns are:

| key | id | question | answer | Mistral-7b | Llama-13b | Gemini-pro | GPT-4-turbo | Claude-3 |
|---|---|---|---|---|---|---|---|---|
- key: The grid size of each puzzle along with the difficulty level.
- id: Unique identifier for each data entry.
- question: The puzzle question prompt using Zero-shot-CoT.
- answer: The gold solution table for the corresponding puzzle.
- Mistral-7b: Model response for the prompt in the question.
- Llama-13b: Model response for the prompt in the question.
- Gemini-pro: Model response for the prompt in the question.
- GPT-4-turbo: Model response for the prompt in the question.
- Claude-3: Model response for the prompt in the question.
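For reference, here is a minimal sketch of how the file might be loaded with pandas, assuming the column names listed above (verify the exact CSV layout against the file itself):

```python
import pandas as pd

# Load the GridPuzzle dataset; the path follows the folder layout shown above.
df = pd.read_csv("data/GridPuzzle.csv")

# Each row holds one puzzle plus the raw response of each of the five models.
row = df.iloc[0]
print(row["key"])          # grid size and difficulty level
print(row["question"])     # Zero-shot-CoT puzzle prompt
print(row["GPT-4-turbo"])  # that model's response to the prompt
```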
The Auto-Evaluation files contain, for each model, the reasoning chains annotated by the GPT-4o auto-evaluator. Their columns are:

| key | question | answer | reasoning chain | prompt | annotated_RC |
|---|---|---|---|---|---|
- key: The grid size of each puzzle along with the difficulty level.
- question: The puzzle question prompt using Zero-shot-CoT.
- answer: The gold solution table for the corresponding puzzle.
- reasoning chain: The reasoning chain generated by the respective model using the question prompt.
- prompt: The user prompt containing the reasoning chain, which follows the fixed system prompt used for auto-evaluation.
- annotated_RC: The annotations given by the auto-evaluator for the corresponding reasoning chain.
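As an illustration only, the auto-evaluation call could look roughly like the sketch below. The `SYSTEM_PROMPT` placeholder and the `annotate_reasoning_chain` helper are hypothetical; the actual prompts used for annotation are the ones stored in the `prompt` column above.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the fixed system prompt used during auto-evaluation (not reproduced here).
SYSTEM_PROMPT = "..."

def annotate_reasoning_chain(user_prompt: str) -> str:
    """Send one reasoning-chain prompt (the 'prompt' column) to GPT-4o and return its annotation."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```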
The Accuracy files under `Metrics/Accuracy` record whether each model's final solution matches the gold solution. Their columns are:

| puzzle | reasoning chain | gold solution | model solution | Validation(Correct/Incorrect) |
|---|---|---|---|---|
- puzzle: The puzzle question prompt using Zero-shot-CoT.
- reasoning chain: The reasoning chain generated by the respective model using the question prompt.
- gold solution: The correct or 'gold standard' solution to the puzzle.
- model solution: The final solution provided by each model, used for comparison against the gold standard.
- Validation(Correct/Incorrect): A field indicating whether the model solution matches the gold standard; marked as 'Correct' or 'Incorrect'.
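A quick sketch of how the accuracy metric could be computed from one of these files; the file name used here is hypothetical, and only the `Validation(Correct/Incorrect)` column from the table above is assumed:

```python
import pandas as pd

# Hypothetical file name; the actual CSVs live under data/Metrics/Accuracy/.
df = pd.read_csv("data/Metrics/Accuracy/accuracy_results.csv")

# Accuracy = fraction of puzzles whose final model solution matches the gold solution.
accuracy = (df["Validation(Correct/Incorrect)"] == "Correct").mean()
print(f"Final-answer accuracy: {accuracy:.3f}")
```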
The PuzzleEval files under `Metrics/PuzzleEval` contain the step-level evaluation of each reasoning chain. Their columns are:

| key | question | gold solution | reasoning chain | Labelled Steps | Final Conclusions | Pair-wise Relation | Validate Relations | Step Correctness Score | Average Correctness |
|---|---|---|---|---|---|---|---|---|---|
- key: The grid size of each puzzle along with the difficulty level.
- question: The puzzle question prompt using Zero-shot-CoT.
- gold solution: The correct or 'gold standard' solution to the puzzle.
- reasoning chain: The reasoning chain generated by the respective model using the question prompt.
- Labelled Steps: Breakdown of the reasoning chain into distinct, labeled steps.
- Final Conclusions: The ultimate conclusions drawn from each step of the reasoning chain.
- Pair-wise Relation: The pair-wise relations extracted from each step.
- Validate Relations: The result of comparing each pair-wise relation against the gold solution table.
- Step Correctness Score: The per-step score, where each extracted pair is marked 1 if correct and 0 if incorrect, and these values are averaged within the step.
- Average Correctness: The average of the step correctness scores across all steps in the reasoning chain.
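To make the scoring concrete, here is a small sketch of how PuzzleEval could aggregate validated pair-wise relations into step and chain scores. The nested-list input format (and skipping steps with no extracted relations) is an assumption for illustration, not the exact file layout:

```python
from typing import List

def puzzle_eval(validated_relations: List[List[int]]) -> float:
    """Compute a PuzzleEval-style score for one reasoning chain.

    validated_relations[i] holds 1/0 flags for the pair-wise relations
    extracted from step i (1 = matches the gold solution table, 0 = does not).
    """
    step_scores = []
    for step in validated_relations:
        if step:  # assumption: steps with no extracted relations are skipped
            step_scores.append(sum(step) / len(step))  # Step Correctness Score
    # Average Correctness: mean of the per-step correctness scores.
    return sum(step_scores) / len(step_scores) if step_scores else 0.0

# Example: a chain with 3 steps and their validated pair-wise relations.
chain = [[1, 1], [1, 0, 0], [1]]
print(puzzle_eval(chain))  # (1.0 + 0.333... + 1.0) / 3 ≈ 0.778
```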