Sudoku-Bench

🤗 [Sudoku-Bench puzzle dataset]
🤗 [Sudoku-CTC-Reasoning dataset]
📝 [Blog Post]

Welcome to Sudoku-Bench from SakanaAI

🧩 Table of Contents 🧩
🐟 Introduction
🐟 The Sudoku-Bench puzzle dataset
🐡 challenge_100 puzzle dataset
🐡 nikoli_100 puzzle dataset
🐡 ctc puzzle dataset
🐟 Two ways to play
🐡 Method 1: Text-only
🐡 Method 2: SudokuPad app
🐟 Getting started
🐟 Partnerships
🐟 Citation

Introduction

Sudoku-Bench features the kind of Sudoku puzzles showcased on Cracking the Cryptic (CTC). These Sudoku variants employ unique rulesets that call for creative problem solving, and we believe they are the perfect evaluation benchmark for AI reasoning models.

In this repository we provide tools for interfacing with these Sudoku variants and are releasing:

- The Sudoku-Bench dataset: a new evaluation benchmark for reasoning models
- Baseline evaluation code: baseline code for evaluating current LLMs in multi-turn Sudoku solving
- SudokuPad tools: for interfacing with the SudokuPad app created by Sven Neumann
- Cracking the Cryptic reasoning traces: thousands of hours of reasoning traces from Cracking the Cryptic, including verbal reasoning transcripts and SudokuPad state-action sequences extracted directly from YouTube videos

The Sudoku-Bench puzzle dataset

The Sudoku-Bench puzzle dataset includes 3 subsets:

- challenge_100: 100 Sudoku puzzles designed as a core benchmark for evaluating reasoning models
- nikoli_100: 100 handmade standard Sudoku puzzles offered by Nikoli
- ctc: 2565 puzzles featured in the CTC channel

`challenge_100` puzzle dataset

challenge_100 includes:

- 15 4×4 puzzles
- 15 6×6 puzzles
- 50 9×9 puzzles
- 20 of the more difficult standard Sudoku puzzles from the nikoli_100 set

The challenge_100 set is designed to evaluate models on a diverse range of logical and creative reasoning tasks. The 50 9×9 puzzles were selected by the hosts of CTC. These puzzles range broadly in difficulty, from elegantly simple to extremely challenging, to thoroughly test AI reasoning capabilities.

The 4×4 and 6×6 puzzles offer an easier on-ramp for reasoning models: they keep the creative qualities of the larger puzzles while reducing the need for long-context reasoning.

`nikoli_100` puzzle dataset

We partnered with Nikoli, the Japanese puzzle company that popularized Sudoku in the 1980s, to curate a set of 100 beautiful handmade standard Sudoku puzzles.

`ctc` puzzle dataset

The ctc dataset contains 2565 puzzles featured in the CTC channel.

Two ways to play

We provide two ways to interact with Sudoku-Bench puzzles.

Text-only

Use a text representation of each puzzle. This is the simplest way to evaluate any LLM on Sudoku-Bench.

SudokuPad app

Use the SudokuPad game engine in-the-loop. This allows for:
- Screenshots of the sudoku board for VLM-based models
- Note-taking methods that are commonly used by human solvers such as pencil marks for candidate digits and color-coding of cells

Method 1: Text-only

Figure: "Differences Count - part 1" by Sujoyku and Marty Sears

Sudoku variants like those seen in Sudoku-Bench contain unique rules and visual elements.

Each puzzle in Sudoku-Bench includes a structured text representation of its visual elements using rxcy coordinate notation. This is stored in the visual_elements field of each puzzle. Together with the rules and initial_board fields, one can construct a text-only prompt. This text-based representation facilitates easy evaluation and integration with LLMs without the need for visual processing.

See the end-to-end evaluation example in src/eval.
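As a rough illustration of the idea above, the following sketch assembles a text-only prompt from the rules, initial_board, and visual_elements fields. It is not the code in src/eval. Several details are assumptions made for the example: that initial_board is a flat string with `.` for empty cells, that the record also carries rows and cols fields, and that visual_elements can be included verbatim as text. The helper names (`parse_rxcy`, `format_board`, `build_prompt`) and the sample record are hypothetical.

```python
import re


def parse_rxcy(coord: str) -> tuple[int, int]:
    """Parse an 'r3c7'-style coordinate into a 1-indexed (row, col) pair."""
    match = re.fullmatch(r"r(\d+)c(\d+)", coord.strip().lower())
    if match is None:
        raise ValueError(f"not an rxcy coordinate: {coord!r}")
    return int(match.group(1)), int(match.group(2))


def format_board(initial_board: str, rows: int, cols: int) -> str:
    """Render a flat board string as one space-separated line per row.

    Assumption: the board is stored row by row, one character per cell.
    """
    if len(initial_board) != rows * cols:
        raise ValueError("board length does not match rows * cols")
    lines = [
        " ".join(initial_board[r * cols:(r + 1) * cols]) for r in range(rows)
    ]
    return "\n".join(lines)


def build_prompt(puzzle: dict) -> str:
    """Combine the three text fields into a single prompt string."""
    board = format_board(puzzle["initial_board"], puzzle["rows"], puzzle["cols"])
    return (
        f"Rules:\n{puzzle['rules']}\n\n"
        f"Visual elements:\n{puzzle['visual_elements']}\n\n"
        f"Initial board ('.' marks an empty cell):\n{board}\n"
    )


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "rules": "Normal 4x4 sudoku rules apply.",
    "initial_board": "12..34......21..",
    "rows": 4,
    "cols": 4,
    "visual_elements": "thermometer: r1c1 r1c2 r2c2",
}
prompt = build_prompt(example)
```

Printing `prompt` shows the rules, the visual elements, and a 4×4 grid in that order. A real evaluation loop would also need to parse the model's reply and compare it against the solution, which this sketch leaves out.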

Method 2: SudokuPad app

SudokuPad, created by Sven Neumann, is a popular puzzle app that hosts thousands of Sudoku puzzles. It supports standard puzzle-solving strategies, such as pencil-marking candidate digits and color-coding cells.

See the SudokuPad tools provided in src/sudokupad_interaction.

Getting started

To get started, navigate to one of the following resources.

Datasets

- Sudoku-Bench on Hugging Face: SakanaAI/Sudoku-Bench
- Cracking the Cryptic (CTC) reasoning traces on Hugging Face: SakanaAI/Sudoku-CTC-Reasoning
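One possible way to pull a subset from the Hugging Face Hub is with the `datasets` library, sketched below. The repository ID SakanaAI/Sudoku-Bench comes from the list above. Treating the three subset names as dataset configuration names, and using a split called `test`, are assumptions that should be checked against the dataset card. The `load_subset` helper is hypothetical.

```python
SUBSETS = ("challenge_100", "nikoli_100", "ctc")


def load_subset(name: str, split: str = "test"):
    """Load one Sudoku-Bench subset from the Hugging Face Hub.

    Assumptions: each subset is exposed as a dataset configuration with the
    same name, and the split is called "test".
    """
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the module can be loaded without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("SakanaAI/Sudoku-Bench", name, split=split)


# Example usage (requires network access and `pip install datasets`):
# puzzles = load_subset("challenge_100")
# print(puzzles[0]["rules"])
```

Validating the subset name before touching the network keeps typos from turning into slow or confusing download errors.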

This repository

- src/eval: an example of how to evaluate LLMs on the Sudoku-Bench text representations of puzzles
- src/sudokupad_interaction: tools for interacting with the SudokuPad app
- src/ctc_processing: code for processing the CTC reasoning traces into an LM-compatible format

Partnerships

This project is a partnership with Cracking the Cryptic. We are grateful to Nikoli for curating the nikoli_100 dataset. Puzzle creators featured on CTC and in this repository are gratefully acknowledged in acknowledgements.md. We also thank Sven Neumann for his help.

Contribute

We welcome contributions to Sudoku-Bench. Please note that we will not accept pull requests that contain data scraped from YouTube channels, or that link to such data, without evidence of explicit permission from the channel owner.

Citation

@misc{seely2025sudoku-bench,
  title={{Sudoku-Bench}},
  author={Seely, Jeffrey and Imajuku, Yuki and Zhao, Tianyu and Cetin, Edoardo and Jones, Llion},
  howpublished = {\url{https://github.com/SakanaAI/Sudoku-Bench}},
  year={2025}
}
