[Sudoku-Bench puzzle dataset]
[Sudoku-CTC-Reasoning dataset]
[Blog Post]
Welcome to Sudoku-Bench from SakanaAI
Table of Contents |
---|
Introduction |
The Sudoku-Bench puzzle dataset |
&nbsp;&nbsp;challenge_100 puzzle dataset |
&nbsp;&nbsp;nikoli_100 puzzle dataset |
&nbsp;&nbsp;ctc puzzle dataset |
Two ways to play |
&nbsp;&nbsp;Method 1: Text-only |
&nbsp;&nbsp;Method 2: SudokuPad app |
Getting started |
Partnerships |
Citation |
Sudoku-Bench features the kind of Sudoku puzzles showcased on Cracking the Cryptic (CTC). These Sudoku variants employ unique rulesets that call for creative problem solving, and we believe they are the perfect evaluation benchmark for AI reasoning models.
In this repository we provide tools for interfacing with these Sudoku variants and are releasing:
- The Sudoku-Bench dataset: a new evaluation benchmark for reasoning models
- Baseline evaluation code: baseline code for evaluating current LLMs in multi-turn Sudoku solving
- SudokuPad tools: for interfacing with the SudokuPad app created by Sven Neumann
- Cracking the Cryptic reasoning traces: thousands of hours of reasoning traces from Cracking the Cryptic, including verbal reasoning transcripts and SudokuPad state-action sequences extracted directly from YouTube videos
The Sudoku-Bench puzzle dataset includes 3 subsets:
- `challenge_100`: 100 Sudoku puzzles designed as a core benchmark for evaluating reasoning models
- `nikoli_100`: 100 handmade standard Sudoku puzzles offered by Nikoli
- `ctc`: 2565 puzzles featured in the CTC channel
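As a minimal sketch, the three subsets could be loaded from Hugging Face with the `datasets` library. The repository ID `SakanaAI/Sudoku-Bench` is the one named in this README; treating each subset name as a dataset configuration and using a split called `test` are assumptions that should be checked against the dataset card. The helper name `load_subset` is hypothetical.

```python
# Subset names as listed in this README.
SUBSETS = ("challenge_100", "nikoli_100", "ctc")


def load_subset(name: str, split: str = "test"):
    """Load one Sudoku-Bench subset from the Hugging Face Hub.

    Assumption: each subset is exposed as a dataset configuration and
    has a split called "test"; adjust if the dataset card says otherwise.
    """
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset  # pip install datasets

    return load_dataset("SakanaAI/Sudoku-Bench", name, split=split)
```

Calling `load_subset("challenge_100")` would then download that subset on first use; nothing is fetched until the function is called.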
`challenge_100` includes:
- 15 4×4 puzzles
- 15 6×6 puzzles
- 50 9×9 puzzles
- 20 of the more difficult standard Sudoku puzzles from the `nikoli_100` set
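The composition above can be sanity-checked by tallying puzzles per grid size. The sketch below assumes, hypothetically, that each puzzle record stores `initial_board` as a flat string with one character per cell (so 16, 36, or 81 characters). The records shown are made-up stand-ins, not real puzzles, and the function names are illustrative.

```python
from collections import Counter
from math import isqrt


def grid_size(initial_board: str) -> int:
    """Return N for an N x N board stored as a flat string of N*N cells."""
    n = isqrt(len(initial_board))
    if n * n != len(initial_board):
        raise ValueError(f"board length {len(initial_board)} is not a perfect square")
    return n


def tally_sizes(puzzles) -> Counter:
    """Count puzzles by grid size."""
    return Counter(grid_size(p["initial_board"]) for p in puzzles)


# Stand-in records (empty boards) used only to exercise the functions.
demo = [
    {"initial_board": "." * 16},
    {"initial_board": "." * 81},
    {"initial_board": "." * 81},
]
print(tally_sizes(demo))  # Counter({9: 2, 4: 1})
```

Run over the real `challenge_100` records, the counts should sum to 100 if the flat-string assumption holds.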
The `challenge_100` subset is designed to evaluate models on a diverse range of logical and creative reasoning. The 50 9×9 puzzles were selected by the hosts of CTC. These puzzles range broadly in difficulty, from elegantly simple to extremely challenging, to thoroughly test AI reasoning capabilities.
The 4×4 and 6×6 puzzles offer an easier on-ramp for reasoning models: they keep the creative qualities of the larger puzzles while reducing the need for long-context reasoning.
We partnered with Nikoli, the Japanese puzzle company that popularized Sudoku in the 1980s, to curate a set of 100 beautiful handmade standard Sudoku puzzles.
The `ctc` dataset contains 2565 puzzles featured in the CTC channel.
We provide two ways to interact with Sudoku-Bench puzzles.
- Use a text representation of each puzzle. This is the simplest way to evaluate any LLM on Sudoku-Bench.
- Use the SudokuPad game engine in the loop. This allows for:
  - Screenshots of the Sudoku board for VLM-based models
  - Note-taking methods commonly used by human solvers, such as pencil marks for candidate digits and color-coding of cells
"Differences Count - part 1" by Sujoyku and Marty Sears
Sudoku variants like those seen in Sudoku-Bench contain unique rules and visual elements.
Each puzzle in Sudoku-Bench includes a structured text representation of its visual elements using `rxcy` coordinate notation. This is stored in the `visual_elements` field of each puzzle. Combined with the `rules` and `initial_board` fields, it is enough to construct a text-only prompt. This text-based representation makes it easy to evaluate and integrate LLMs without any visual processing.
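A minimal sketch of both ideas follows: parsing an `rxcy` coordinate (read here as 1-indexed row x, column y, which is an assumption) and assembling a text-only prompt from the `rules`, `initial_board`, and `visual_elements` fields. The prompt layout, the function names, and the example record are hypothetical; the repository's own prompt construction lives in `src/eval`.

```python
import re

_RXCY = re.compile(r"r(\d+)c(\d+)")


def parse_rxcy(coord: str) -> tuple[int, int]:
    """Convert 'r3c7' to (3, 7). Assumes 1-indexed rows and columns."""
    match = _RXCY.fullmatch(coord.strip().lower())
    if match is None:
        raise ValueError(f"not an rxcy coordinate: {coord!r}")
    return int(match.group(1)), int(match.group(2))


def build_prompt(puzzle: dict) -> str:
    """Assemble a text-only prompt from three puzzle fields (illustrative layout)."""
    return "\n\n".join(
        [
            "## Rules\n" + str(puzzle["rules"]),
            "## Visual elements\n" + str(puzzle["visual_elements"]),
            "## Initial board\n" + str(puzzle["initial_board"]),
        ]
    )


# Hypothetical record; real records come from the dataset.
example = {
    "rules": "Normal 4x4 sudoku rules apply.",
    "visual_elements": "line: r1c1 r1c2 r2c2",
    "initial_board": "." * 16,
}
print(parse_rxcy("r3c7"))  # (3, 7)
print(build_prompt(example))
```

Keeping the three fields in separate labeled sections is one possible choice; it lets a model refer back to coordinates in the visual elements while reading the board.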
See the end-to-end evaluation example in `src/eval`.
SudokuPad, created by Sven Neumann, is a popular puzzle app that hosts thousands of Sudoku puzzles. It supports standard puzzle-solving strategies, such as pencil-marking candidate digits and color-coding cells.
See the SudokuPad tools provided in `src/sudokupad_interaction`.
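To make the note-taking vocabulary concrete, the sketch below models a single cell carrying a placed digit, pencil-marked candidates, and color labels. It is an illustrative data structure only: the class and method names are hypothetical and do not reflect the SudokuPad API or the tools in `src/sudokupad_interaction`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CellNotes:
    """Solver-side notes for a single cell (hypothetical structure)."""

    value: Optional[int] = None                        # placed digit, if any
    candidates: set[int] = field(default_factory=set)  # pencil marks
    colors: set[str] = field(default_factory=set)      # color-coding labels

    def toggle_candidate(self, digit: int) -> None:
        """Add the pencil mark if absent, remove it if present."""
        self.candidates ^= {digit}

    def place(self, digit: int) -> None:
        """Place a digit and clear the pencil marks for this cell."""
        self.value = digit
        self.candidates.clear()


cell = CellNotes()
cell.toggle_candidate(3)
cell.toggle_candidate(7)
cell.toggle_candidate(3)  # toggling again removes the mark
cell.colors.add("green")
print(cell.candidates)    # {7}
```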
To get started, navigate to
- Sudoku-Bench on Hugging Face: SakanaAI/Sudoku-Bench
- Cracking the Cryptic (CTC) reasoning traces on Hugging Face: SakanaAI/Sudoku-CTC-Reasoning
- src/eval for an example of how to evaluate LLMs on Sudoku-Bench text representations of puzzles
- src/sudokupad_interaction for tools for interacting with the SudokuPad app
- src/ctc_processing for processing the CTC reasoning traces into an LM-compatible format
This project is in partnership with Cracking the Cryptic. We are grateful to Nikoli for curating the `nikoli_100` dataset. Puzzle creators featured on CTC and in this repository are gratefully acknowledged in acknowledgements.md. We are also grateful to Sven Neumann for his help.
We welcome contributions to Sudoku-Bench. Please note that we will not accept pull requests that contain data scraped from YouTube channels, or that link to such data, without evidence of explicit permission from the channel owner.
@misc{seely2025sudoku-bench,
title={{Sudoku-Bench}},
author={Seely, Jeffrey and Imajuku, Yuki and Zhao, Tianyu and Cetin, Edoardo and Jones, Llion},
howpublished = {\url{https://github.com/SakanaAI/Sudoku-Bench}},
year={2025}
}