An AI benchmark for creative, human-like problem solving using Sudoku variants

SakanaAI/Sudoku-Bench

Sudoku-Bench

🤗 [Sudoku-Bench puzzle dataset]
🤗 [Sudoku-CTC-Reasoning dataset]
📝 [Blog Post]

Welcome to Sudoku-Bench from SakanaAI

🧩 Table of Contents 🧩
🐟 Introduction
🐟 The Sudoku-Bench puzzle dataset
  🐡 challenge_100 puzzle dataset
  🐡 nikoli_100 puzzle dataset
  🐡 ctc puzzle dataset
🐟 Two ways to play
  🐡 Method 1: Text-only
  🐡 Method 2: SudokuPad app
🐟 Getting started
🐟 Partnerships
🐟 Citation

Introduction

Sudoku-Bench features the kind of Sudoku variants showcased on Cracking the Cryptic (CTC). These variants employ unique rulesets that call for creative problem solving, and we believe they are the perfect evaluation benchmark for AI reasoning models.

In this repository we provide tools for interfacing with these Sudoku variants and are releasing:

  • The Sudoku-Bench dataset: a new evaluation benchmark for reasoning models
  • Baseline evaluation code: baseline code for evaluating current LLMs in multi-turn Sudoku solving
  • SudokuPad tools: for interfacing with the SudokuPad app created by Sven Neumann
  • Cracking the Cryptic reasoning traces: thousands of hours of reasoning traces from Cracking the Cryptic, including verbal reasoning transcripts and SudokuPad state-action sequences extracted directly from YouTube videos

The Sudoku-Bench puzzle dataset

The Sudoku-Bench puzzle dataset includes 3 subsets:

  • challenge_100: 100 Sudoku puzzles designed as a core benchmark for evaluating reasoning models
  • nikoli_100: 100 handmade standard Sudoku puzzles offered by Nikoli
  • ctc: 2565 puzzles featured on the CTC channel

challenge_100 puzzle dataset

challenge_100 includes:

  • 15 4×4 puzzles
  • 15 6×6 puzzles
  • 50 9×9 puzzles
  • 20 of the more difficult standard Sudoku puzzles from the nikoli_100 set

The challenge_100 subset is designed to evaluate models on a diverse set of logical and creative reasoning tasks. The 50 9×9 puzzles were selected by the hosts of CTC. They range broadly in difficulty, from elegantly simple to extremely challenging, to thoroughly test AI reasoning capabilities.

The 4×4 and 6×6 puzzles offer an easier on-ramp for reasoning models: they keep the creative qualities of the larger puzzles while reducing the need for long-context reasoning.

nikoli_100 puzzle dataset

We partnered with Nikoli, the Japanese puzzle company that popularized Sudoku in the 1980s, to curate a set of 100 beautiful handmade standard Sudoku puzzles.

ctc puzzle dataset

The ctc dataset contains 2565 puzzles featured on the CTC channel.

Two ways to play

We provide two ways to interact with Sudoku-Bench puzzles.

Text-only

  • Use a text representation of each puzzle. The simplest way to evaluate any LLM on Sudoku-Bench.

SudokuPad app

  • Use the SudokuPad game engine in-the-loop. This allows for:
    • Screenshots of the sudoku board for VLM-based models
    • Note-taking methods that are commonly used by human solvers such as pencil marks for candidate digits and color-coding of cells

Method 1: Text-only

Image: "Differences Count - part 1" by Sujoyku and Marty Sears

Sudoku variants like those seen in Sudoku-Bench contain unique rules and visual elements.

Each puzzle in Sudoku-Bench includes a structured text representation of its visual elements using rxcy coordinate notation. This is stored in the visual_elements field of each puzzle. Together with the rules and initial_board fields, one can construct a text-only prompt. This text-based representation facilitates easy evaluation and integration with LLMs without the need for visual processing.
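As a rough illustration of how the three fields could be combined, here is a minimal sketch that builds a text-only prompt. Only the field names rules, initial_board, and visual_elements and the rxcy notation come from the description above. The helper names, the example puzzle, the prompt layout, and the assumption that initial_board is a flat row-major string with "." for empty cells are all hypothetical and may not match the actual dataset or the code in src/eval.

```python
import re

# Matches coordinates such as "r3c7" (row 3, column 7).
RXCY = re.compile(r"r(\d+)c(\d+)")


def parse_rxcy(cell: str) -> tuple[int, int]:
    """Convert an rxcy coordinate like 'r3c7' to a 1-indexed (row, col) pair."""
    match = RXCY.fullmatch(cell.strip().lower())
    if match is None:
        raise ValueError(f"not an rxcy coordinate: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def format_board(initial_board: str, size: int) -> str:
    """Render a flat board string as `size` rows of space-separated cells.

    Assumption (not confirmed by the README): one character per cell,
    row-major order, '.' marking an empty cell.
    """
    if len(initial_board) != size * size:
        raise ValueError("board length does not match size * size")
    rows = [initial_board[i:i + size] for i in range(0, size * size, size)]
    return "\n".join(" ".join(row) for row in rows)


def build_prompt(puzzle: dict, size: int) -> str:
    """Join the three puzzle fields into one text-only prompt (hypothetical layout)."""
    return "\n\n".join([
        "Rules:\n" + puzzle["rules"],
        "Visual elements:\n" + puzzle["visual_elements"],
        "Initial board:\n" + format_board(puzzle["initial_board"], size),
    ])


# Made-up 4x4 puzzle used purely for illustration; it is not from the dataset.
example = {
    "rules": "Normal 4x4 sudoku rules apply.",
    "visual_elements": "line through cells r1c1 r1c2 r2c2",
    "initial_board": "1..............4",
}
prompt = build_prompt(example, size=4)
print(prompt)
```

Keeping the coordinate parser separate from the prompt builder means a model's reply (for example, a move stated as an rxcy cell plus a digit) can be parsed with the same helper.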

See the end-to-end evaluation example in src/eval.

Method 2: SudokuPad app

SudokuPad, created by Sven Neumann, is a popular puzzle app that hosts thousands of Sudoku puzzles. It supports standard puzzle-solving techniques, such as pencil-marking candidate digits and color-coding cells.

See the SudokuPad tools provided in src/sudokupad_interaction.

Getting started

To get started, navigate to

Datasets

This repository

  • src/eval for an example of how to evaluate LLMs on Sudoku-Bench text representations of puzzles
  • src/sudokupad_interaction for tools for interacting with the SudokuPad app
  • src/ctc_processing for processing the CTC reasoning traces into an LM-compatible format

Partnerships

This project is in partnership with Cracking the Cryptic. We are grateful to Nikoli for curating the nikoli_100 dataset. Puzzle creators featured on CTC and in this repository are gratefully acknowledged in acknowledgements.md. We are grateful to Sven Neumann for his help.

Contribute

We welcome contributions to Sudoku-Bench. Please note that we will not accept pull requests that contain data scraped from YouTube channels, or that link to such data, without evidence that the contributor has explicit permission from the channel owner.

Citation

@misc{seely2025sudoku-bench,
  title={{Sudoku-Bench}},
  author={Seely, Jeffrey and Imajuku, Yuki and Zhao, Tianyu and Cetin, Edoardo and Jones, Llion},
  howpublished = {\url{https://github.com/SakanaAI/Sudoku-Bench}},
  year={2025}
}
