CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

[📄 Paper][🤗 CriticLeanGPT][🧐 CriticLeanInstruct][📊 CriticLeanBench][📦 FineLeanCorpus]

Repo for "CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization"

💡 Introduction

The formalization of mathematical statements into executable code for theorem provers like Lean 4 remains a critical challenge in computational mathematics. While existing methods have made progress in autoformalization, they often struggle with accuracy—particularly for complex problems requiring deep semantic understanding. This paper argues that the underexplored "critic" phase, which evaluates whether generated formalizations align with original mathematical intent, is essential for advancing this field.

To address these limitations, we introduce CriticLean, a framework prioritizing a critic-driven pipeline. At its core is CriticLeanGPT, a reinforcement learning-based model trained to assess the semantic fidelity of Lean 4 outputs against natural language statements. Our iterative process combines feedback from both the Lean compiler and CriticLeanGPT to refine formalizations, enhancing robustness through successive improvements.
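
A minimal sketch of this refinement loop, assuming injected callables for the generator, the Lean compiler check, and the critic (these names are illustrative placeholders, not the repository's actual API):

```python
from typing import Callable, Optional, Tuple

def refine_formalization(
    nl_statement: str,
    generate: Callable[[str, str], str],              # autoformalizer LLM
    compile_lean: Callable[[str], Tuple[bool, str]],  # Lean 4 compile check
    critic: Callable[[str, str], Tuple[bool, str]],   # CriticLeanGPT verdict
    max_rounds: int = 4,
) -> Optional[str]:
    """Iteratively refine a Lean 4 formalization with compiler + critic feedback."""
    feedback = ""
    for _ in range(max_rounds):
        lean_code = generate(nl_statement, feedback)
        ok, log = compile_lean(lean_code)
        if not ok:
            # Syntactic failure: feed the compiler log back to the generator.
            feedback = f"Compiler error:\n{log}"
            continue
        correct, critique = critic(nl_statement, lean_code)
        if correct:
            # Compiles and is judged semantically faithful: accept.
            return lean_code
        # Semantic failure: feed the critic's critique back to the generator.
        feedback = f"Critic feedback:\n{critique}"
    return None  # no accepted formalization within the round budget
```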

We also present CriticLeanBench, a benchmark containing 500 paired natural language and Lean 4 statements (250 correct, 250 incorrect), enabling rigorous evaluation of critic models. Experiments show CriticLeanGPT outperforms state-of-the-art open-source and closed-source models in accuracy.

Building on this pipeline, we develop FineLeanCorpus, an open-source dataset of 285,957 verified Lean 4 statements. Compared to prior datasets, it offers greater diversity in mathematical domains, difficulty levels, and quality, validated through iterative critical feedback.

By centering the formalization process on semantic validation, our work advances the reliability of automated theorem proving and provides scalable resources for future research in mathematical AI.


Figure 1: Illustration of the CriticLean framework, based on critic-guided reinforcement learning for mathematical autoformalization.

Framework Overview

💻 CriticLeanGPT Model

📋 CriticLeanInstruct

The CriticLeanInstruct dataset suite consists of several JSONL files, each serving a specific purpose in the model training pipeline. Here's a breakdown of each file:

| File Name | Description |
| --- | --- |
| CriticLean_12K | All of our critic Lean data. |
| CriticLean_4K | Seed data: a subset of CriticLean_12K, used as our reinforcement learning (RL) dataset. |
| CriticLean_Mix_48K | An expanded mixed dataset combining CriticLean_12K with 18k math samples from OpenR1-Math-220k and 18k code samples from OpenThoughts-114k-Code_decontaminated. |
| CriticLean_Mix_16K | A mixed dataset combining CriticLean_4K with 6k math samples from OpenR1-Math-220k and 6k code samples from OpenThoughts-114k-Code_decontaminated. |
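
Each split is distributed as JSONL, so it can be inspected with standard-library Python alone (the filename below is an assumption about the release layout; the field names depend on the actual files):

```python
import json

# Load one CriticLeanInstruct split (filename assumed for illustration).
with open("CriticLean_4K.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records))       # number of training examples in the split
print(records[0].keys())  # inspect the schema of the first record
```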

🍭 Associated Model Variants

We've trained several model variants using the CriticLean dataset to demonstrate its effectiveness. Below is a summary of the training configurations:

| Base Model | SFT Applied? | SFT Data | RL Applied? | RL Data | CriticLeanGPT Model Name |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Instruct | Yes | CriticLean_4K | No | * | Qwen2.5-Instruct-SFT(Critic Only) |
| Qwen2.5-Instruct | Yes | CriticLean_Mix_16K | No | * | Qwen2.5-Instruct-SFT(16K) |
| Qwen2.5-Instruct | Yes | CriticLean_Mix_48K | No | * | Qwen2.5-Instruct-SFT |
| Qwen2.5-Instruct | Yes | CriticLean_Mix_48K | Yes | CriticLean_4K | Qwen2.5-Instruct-SFT-RL |
| Qwen2.5-Instruct | No | * | Yes | CriticLean_4K | Qwen2.5-Instruct-RL |
| Qwen3 | No | * | Yes | CriticLean_4K | Qwen3-RL |
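
If the released checkpoints follow the standard Hugging Face interface (an assumption; the model ID and prompt below are illustrative placeholders, not confirmed names), a variant could be queried as a critic roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "multimodal-art-projection/CriticLeanGPT"  # hypothetical ID; see the 🤗 collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the model to judge a (problem, Lean 4 statement) pair.
messages = [{
    "role": "user",
    "content": (
        "Judge whether the Lean 4 statement faithfully formalizes the problem.\n"
        "Problem: <natural language statement>\n"
        "Lean 4: <formalization>"
    ),
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```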

📋 CriticLeanBench

Overview

CriticLeanBench is a comprehensive benchmark designed to evaluate the critical reasoning capabilities of models in translating natural language mathematical statements into formally verified theorem declarations in Lean 4. This benchmark focuses on assessing whether generated Lean 4 code accurately reflects the semantic intent of original mathematical statements, addressing a key gap in existing evaluation frameworks for mathematical autoformalization.

  • Dual Focus: Integrates both critic capabilities and Lean 4 formal verification, distinguishing itself from existing benchmarks
  • Diverse Coverage: Includes problems from various mathematical domains and difficulty levels
  • Structured Errors: Incorrect samples are designed to exhibit common error patterns in mathematical formalization
  • Human-Verified: All samples undergo rigorous human validation to ensure quality and accuracy


Figure 2: An overview of the CriticLeanBench construction.

Statistics

With 500 carefully curated problem pairs (250 correct and 250 incorrect formalizations), CriticLeanBench provides a rigorous testbed for models tasked with evaluating the correctness of mathematical formalizations. Each problem features relatively lengthy inputs (average 700.94 tokens), challenging models to process and understand complex mathematical reasoning.

CriticLeanBench is unique in its integration of Lean 4 specific evaluation and its focus on mathematical formalization correctness.


Table 1: Dataset statistics and comparison of various code benchmark datasets.

Dataset Structure

Each entry in CriticLeanBench contains (a schematic example follows the list):

  • A natural language mathematical statement
  • A corresponding Lean 4 formalization
  • A label indicating whether the formalization is semantically correct
  • Detailed error annotations for incorrect samples (where applicable)
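
Schematically, an entry can be pictured as follows (the field names are illustrative guesses, not the benchmark's actual keys):

```python
# Hypothetical shape of one CriticLeanBench entry (keys illustrative).
entry = {
    "natural_language": "Prove that the sum of two even integers is even.",
    "lean4_statement": (
        "theorem sum_even (a b : ℤ) (ha : Even a) (hb : Even b) : "
        "Even (a + b) := by sorry"
    ),
    "label": "correct",        # 250 correct / 250 incorrect in the benchmark
    "error_annotation": None,  # populated only for incorrect samples
}
```
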
Evaluation Metrics

We recommend using the following metrics when evaluating models on CriticLeanBench; a reference implementation is sketched after the list:

  • Accuracy (ACC)
  • True Positive Rate (TPR)
  • False Positive Rate (FPR)
  • True Negative Rate (TNR)
  • False Negative Rate (FNR)
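
Because CriticLeanBench is a balanced binary classification task (treating "the formalization is correct" as the positive class), all five metrics follow directly from the confusion-matrix counts. A minimal reference implementation:

```python
def critic_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Confusion-matrix metrics; True = formalization judged semantically correct."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return {
        "ACC": (tp + tn) / len(y_true),
        "TPR": tp / (tp + fn),  # correct formalizations accepted
        "FPR": fp / (fp + tn),  # incorrect formalizations wrongly accepted
        "TNR": tn / (fp + tn),  # incorrect formalizations rejected
        "FNR": fn / (tp + fn),  # correct formalizations wrongly rejected
    }
```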

Evaluation of SOTA LLMs


Table 2: Performance on CriticLeanBench. The best, second-best, and third-best scores for each indicator are boxed, bold, and underlined, respectively.


Figure 3: Scaling Analysis of LLMs on CriticLeanBench. ˆ denotes closed-source LLMs.

📋 FineLeanCorpus

Overview

FineLeanCorpus is the largest high-quality dataset of natural language mathematical statements paired with their formalizations in Lean 4, comprising 285,957 entries. This dataset is designed to advance research in mathematical autoformalization—the translation of natural language mathematics into formal, machine-verifiable code.

The corpus is distinguished by its:

  • Scale: 285K entries, significantly larger than existing Lean 4 problem sets
  • Quality: 84% accuracy verified through manual sampling
  • Diversity: Coverage across multiple mathematical domains and difficulty levels
  • Rigorous validation: Each entry passes both syntactic checks (via the Lean compiler) and semantic verification (via the CriticLean model); see the illustrative example below
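
For intuition, each entry pairs a natural language problem with a Lean 4 statement that has passed both checks. A toy illustration of such a pair (written for this README, not drawn from the corpus):

```lean
import Mathlib

-- Natural language: "The sum of two even integers is even."
-- Illustrative Lean 4 formalization of the statement (proof elided).
theorem sum_of_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry
```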

Statistics

FineLeanCorpus draws from diverse mathematical sources:


Table 3: Sources and dataset statistics of FineLeanCorpus.

Comparison with the Lean-Workbook Dataset


Table 4: Comparison of dataset statistics. FineLeanCorpus offers a transparent critic process, a higher proportion of top-tier problems, and a more balanced and diverse topic distribution compared to the highly skewed Lean-Workbook.


Figure 4: Comparison of dataset statistics. FineLeanCorpus offers a transparent critic process, a higher proportion of top-tier problems, and a more balanced and diverse topic distribution compared to the highly skewed Lean-Workbook.


Figure 5: Math Domain Distributions Comparison: FineLeanCorpus vs. Lean-Workbook.

Human Evaluation


Table 5: Human evaluation accuracy results for autoformalization performance: comparing the base model, the base model with compiler feedback, and our autoformalization pipeline. The best score is highlighted in a box.


Table 6: Results of manual sampling inspection on dataset accuracy rates.

🔗 Referenced Datasets & Links

FineLeanCorpus includes a limited portion of data sourced from the following datasets:


📜 License

Apache-2.0 license

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@misc{peng2025criticleancriticguidedreinforcementlearning,
      title={CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization}, 
      author={Zhongyuan Peng and Yifan Yao and Kaijing Ma and Shuyue Guo and Yizhe Li and Yichi Zhang and Chenchen Zhang and Yifan Zhang and Zhouliang Yu and Luming Li and Minghao Liu and Yihang Xia and Jiawei Shen and Yuchen Wu and Yixin Cao and Zhaoxiang Zhang and Wenhao Huang and Jiaheng Liu and Ge Zhang},
      year={2025},
      eprint={2507.06181},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.06181}, 
}
