Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
Solving grid puzzles involves a significant amount of logical reasoning, which makes them a good domain for evaluating a model's reasoning capability and, in turn, for guiding improvements to that capability. However, most existing works evaluate only the final predicted answer of a puzzle, without an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or finer-grained metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures in order to accurately assess the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles of varying complexity. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2. We then develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains. Evaluating reasoning chains from these LLMs leads to several interesting findings. We further show that existing prompting methods used to enhance models' reasoning abilities do not improve performance on GridPuzzle. This highlights the importance of understanding fine-grained errors and presents a challenge for future research: enhancing LLMs' puzzle-solving abilities by developing methods that address these errors.
Data Release
Please take a look at the ./data folder to access the GridPuzzle dataset and the data for all the experiments.
Scope of the dataset: The dataset consists of GridPuzzle with the original reasoning chains, the data for auto-evaluation performed by GPT-4o, the data for both metrics (Accuracy and PuzzleEval), and finally the data for the mitigation strategies.
The data/ folder contains the GridPuzzle dataset; each entry has the following fields:
key: The grid size of each puzzle along with the difficulty level.
id: Unique identifier for each data entry.
question: The puzzle question prompt using Zero-shot-CoT.
answer: The gold solution table for the corresponding puzzle.
Mistral-7b: Response generated by Mistral-7b for the prompt in the question field.
Llama-13b: Response generated by Llama-13b for the prompt in the question field.
Gemini-pro: Response generated by Gemini-pro for the prompt in the question field.
GPT-4-turbo: Response generated by GPT-4-turbo for the prompt in the question field.
Claude-3: Response generated by Claude-3 for the prompt in the question field.
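For quick inspection, here is a minimal sketch of loading these fields with pandas. The file name and Excel format are assumptions; point it at whichever file in ./data actually carries these columns.

```python
# Minimal sketch: load the GridPuzzle dataset and inspect one entry.
# NOTE: the file name below is an assumption, not the repo's actual path.
import pandas as pd

df = pd.read_excel("data/GridPuzzle.xlsx")

# One response column per evaluated model, as documented above.
model_columns = ["Mistral-7b", "Llama-13b", "Gemini-pro", "GPT-4-turbo", "Claude-3"]

entry = df.iloc[0]
print(entry["key"])                # grid size + difficulty level
print(entry["question"])           # Zero-shot-CoT puzzle prompt
print(entry["GPT-4-turbo"][:200])  # start of one model's reasoning chain
```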
Excel file format for Auto-Evaluation
key
question
answer
reasoning chain
prompt
annotated_RC
Column Descriptions
key: The grid size of each puzzle along with the difficulty level.
question: The puzzle question prompt using Zero-shot-CoT.
answer: The gold solution table for the corresponding puzzle.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
prompt: The user prompt containing the reasoning chain, which follows the fixed system prompt for auto-evaluation.
annotated_RC: The annotations given by the auto-evaluator for the corresponding reasoning chain.
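The sketch below pairs each reasoning chain with its auto-evaluator annotation, assuming one Excel sheet per model; the file name is hypothetical.

```python
# Minimal sketch: inspect the Auto-Evaluation data for one model.
# NOTE: the file name is an assumption; substitute the actual file in ./data.
import pandas as pd

auto_eval = pd.read_excel("data/auto_eval_GPT-4-turbo.xlsx")

row = auto_eval.iloc[0]
print(row["key"])                    # grid size + difficulty level
print(row["reasoning chain"][:300])  # model-generated reasoning chain
print(row["annotated_RC"][:300])     # error annotations from the GPT-4o auto-evaluator
```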
Excel file format for Accuracy
puzzle
reasoning chain
gold solution
model solution
Validation(Correct/Incorrect)
Column Descriptions
puzzle: The puzzle question prompt using Zero-shot-CoT.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
gold solution: The correct or 'gold standard' solution to the puzzle.
model solution: The final solution provided by each model, used for comparison against the gold standard.
Validation(Correct/Incorrect): A field indicating whether the model solution matches the gold standard; marked as 'Correct' or 'Incorrect'.
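Final-answer accuracy can be recomputed directly from the Validation column. This is a sketch, assuming one Excel file per model (hypothetical file name) and that the column stores the literal strings 'Correct' / 'Incorrect' as described above.

```python
# Minimal sketch: recompute final-answer accuracy from the Accuracy sheet.
# NOTE: the file name is an assumption; the column names follow the format above.
import pandas as pd

acc_df = pd.read_excel("data/accuracy_GPT-4-turbo.xlsx")

is_correct = acc_df["Validation(Correct/Incorrect)"].str.strip() == "Correct"
print(f"Final-answer accuracy: {is_correct.mean():.3f} over {len(acc_df)} puzzles")
```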
Excel file format for PuzzleEval
key
question
gold solution
reasoning chain
Labelled Steps
Final Conclusions
Pair-wise Relation
Validate Relations
Step Correctness Score
Average Correctness
Column Descriptions
key: The grid size of each puzzle along with the difficulty level.
question: The puzzle question prompt using Zero-shot-CoT.
gold solution: The correct or 'gold standard' solution to the puzzle.
reasoning chain: The reasoning chain generated by the respective model for the question prompt.
Labelled Steps: Breakdown of the reasoning chain into distinct, labeled steps.
Final Conclusions: The ultimate conclusions drawn from each step of the reasoning chain.
Pair-wise Relation: The extracted pair-wise relations from each step.
Validate Relations: The results of comparing the extracted pair-wise relations with the gold solution table.
Step Correctness Score: The per-step score: each extracted pair is marked 1 if correct and 0 if incorrect, and these pair scores are averaged within each step.
Average Correctness: The average of the step correctness scores across all steps of the reasoning chain.
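As a sanity check, the per-chain Average Correctness values can be aggregated into a single score per model. The sketch below assumes a hypothetical file name, that Average Correctness is stored as a numeric value per chain, and that averaging over chains is the intended aggregation.

```python
# Minimal sketch: aggregate PuzzleEval's per-chain Average Correctness scores.
# NOTE: the file name is an assumption; the column names follow the format above.
import pandas as pd

pe_df = pd.read_excel("data/puzzleeval_GPT-4-turbo.xlsx")

# Mean Average Correctness over all reasoning chains for this model.
print(f"Mean Average Correctness: {pe_df['Average Correctness'].mean():.3f}")

# Breakdown by grid size / difficulty, using the key column.
print(pe_df.groupby("key")["Average Correctness"].mean())
```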