Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?

Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate the reasoning capability of a model, which can then guide us in improving the reasoning ability of models. However, most existing works evaluate only the final predicted answer of a puzzle, without delving into an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or providing any finer metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures in order to accurately evaluate the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles with different complexities. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2. Then, we develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains. Evaluating reasoning chains from LLMs leads to several interesting findings. We further show that existing prompting methods used for enhancing models' reasoning abilities do not improve performance on GridPuzzle. This highlights the importance of understanding fine-grained errors and presents a challenge for future research to enhance LLMs' puzzle-solving abilities by developing methods that address these errors.

Data Release

Please take a look at the ./data folder to access the GridPuzzle dataset and the data for all the experiments.

Scope of the dataset: The dataset consists of GridPuzzle with the original reasoning chains, the data for the Auto-evaluation done by GPT-4o, the data for both metrics (Accuracy and PuzzleEval), and finally the data for the Mitigation strategies.
The data/ folder contains the following files:

├── ...
├── data/
│   ├── GridPuzzle.csv
│   ├── Auto-Evaluation
│   ├── Metrics
│   │   ├── Accuracy
│   │   └── PuzzleEval
│   └── Mitigation
│       ├── Mitigation Results
│       └── PuzzleEval Results

CSV file format for GridPuzzle

key | id | question | answer | Mistral-7b | Llama-13b | Gemini-pro | GPT-4-turbo | Claude-3

Column Descriptions

  • key: The grid size of each puzzle along with the difficulty level.
  • id: Unique identifier for each data entry.
  • question: The puzzle question prompt using Zero-shot-CoT.
  • answer: The gold solution table for the corresponding puzzle.
  • Mistral-7b: Model response for the prompt in the question.
  • Llama-13b: Model response for the prompt in the question.
  • Gemini-pro: Model response for the prompt in the question.
  • GPT-4-turbo: Model response for the prompt in the question.
  • Claude-3: Model response for the prompt in the question.
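
To make the layout concrete, below is a minimal sketch (not part of the repository) of loading GridPuzzle.csv with pandas and reading one puzzle together with each model's reasoning chain. The relative path assumes the script is run from the repository root; the column names follow the descriptions above.

```python
import pandas as pd

# Load the GridPuzzle dataset (path assumes the repo layout shown above).
df = pd.read_csv("data/GridPuzzle.csv")

model_columns = ["Mistral-7b", "Llama-13b", "Gemini-pro", "GPT-4-turbo", "Claude-3"]

# Inspect one puzzle: its grid-size/difficulty key, the Zero-shot-CoT prompt,
# the gold solution table, and each model's generated reasoning chain.
row = df.iloc[0]
print(row["key"], row["id"])
print(row["question"])
print(row["answer"])
for model in model_columns:
    print(f"--- {model} ---")
    print(row[model])
```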

Excel file format for Auto-Evaluation

key | question | answer | reasoning chain | prompt | annotated_RC

Column Descriptions

  • key: The grid size of each puzzle along with the difficulty level.
  • question: The puzzle question prompt using Zero-shot-CoT.
  • answer: The gold solution table for the corresponding puzzle.
  • reasoning chain: The reasoning chain generated by the respective model using the question prompt.
  • prompt: The user prompt containing the reasoning chain, which follows the fixed system prompts for Auto-evaluation.
  • annotated_RC: The annotations given by the Auto-evaluator for the corresponding reasoning chain.
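
For reference, the sketch below shows how the released prompt column could be sent to GPT-4o to reproduce an Auto-evaluation annotation. It is only an illustration under assumptions: the system prompt and the Excel file name are placeholders, since the exact values are not reproduced in this README.

```python
import pandas as pd
from openai import OpenAI

# Illustrative file name; replace with an actual sheet from data/Auto-Evaluation.
df = pd.read_excel("data/Auto-Evaluation/gpt-4-turbo.xlsx")

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder: the actual fixed system prompts are not included in this README.
SYSTEM_PROMPT = "You are an evaluator. Annotate each reasoning step of the chain."

row = df.iloc[0]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        # The "prompt" column already embeds the reasoning chain to be evaluated.
        {"role": "user", "content": row["prompt"]},
    ],
)
print(response.choices[0].message.content)  # compare against the released annotated_RC
```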

Excel file format for Accuracy

puzzle | reasoning chain | gold solution | model solution | Validation(Correct/Incorrect)

Column Descriptions

  • puzzle: The puzzle question prompt using Zero-shot-CoT.
  • reasoning chain: The reasoning chain generated by the respective model using the question prompt.
  • gold solution: The correct or 'gold standard' solution to the puzzle.
  • model solution: The final solution provided by each model, used for comparison against the gold standard.
  • Validation(Correct/Incorrect): A field indicating whether the model solution matches the gold standard; marked as 'Correct' or 'Incorrect'.
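
As a rough illustration, final-answer accuracy can be recomputed from the Validation(Correct/Incorrect) column. The file name below is a placeholder for one of the sheets under data/Metrics/Accuracy, and the column header is assumed to match the listing above.

```python
import pandas as pd

# Illustrative file name; point this at an actual Accuracy sheet.
df = pd.read_excel("data/Metrics/Accuracy/gpt-4-turbo.xlsx")

# Normalize the Correct/Incorrect labels and take the fraction marked Correct.
validation = df["Validation(Correct/Incorrect)"].astype(str).str.strip().str.lower()
accuracy = (validation == "correct").mean()
print(f"Final-answer accuracy: {accuracy:.2%}")
```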

Excel file format for PuzzleEval

key | question | gold solution | reasoning chain | Labelled Steps | Final Conclusions | Pair-wise Relation | Validate Relations | Step Correctness Score | Average Correctness

Column Descriptions

  • key: The grid size of each puzzle along with the difficulty level.
  • question: The puzzle question prompt using Zero-shot-CoT.
  • gold solution: The correct or 'gold standard' solution to the puzzle.
  • reasoning chain: The reasoning chain generated by the respective model using the question prompt.
  • Labelled Steps: Breakdown of the reasoning chain into distinct, labeled steps.
  • Final Conclusions: The ultimate conclusions drawn from each step of the reasoning chain.
  • Pair-wise Relation: The pair-wise relations extracted from each step.
  • Validate Relations: The results of comparing the pair-wise relations against the gold solution table.
  • Step Correctness Score: The per-step score, where each pair-wise relation is marked 1 if correct and 0 if incorrect; these scores are averaged within each step.
  • Average Correctness: The average of the step correctness scores over all steps in the reasoning chain.
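
As a rough sketch, the Average Correctness value can be recomputed from the per-step scores. The snippet below assumes the Step Correctness Score column stores a list-like string of per-step scores (e.g., "[1.0, 0.5, 1.0]"), which may differ from the actual serialization in the released sheets; the file name is likewise a placeholder.

```python
import ast
import pandas as pd

# Illustrative file name; point this at an actual PuzzleEval sheet.
df = pd.read_excel("data/Metrics/PuzzleEval/gpt-4-turbo.xlsx")

def average_correctness(cell):
    # Parse the per-step scores and average them over all steps in the chain.
    scores = ast.literal_eval(str(cell))
    return sum(scores) / len(scores) if scores else 0.0

df["Recomputed Average"] = df["Step Correctness Score"].apply(average_correctness)
print(df[["key", "Average Correctness", "Recomputed Average"]].head())
```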