Add nvidia/OpenMathReasoning dataset (cot, tir, genselect splits) #2316

XinyuGuan · 2026-01-10T19:36:26Z

Description

Partially fixes #1848

This PR adds the nvidia/OpenMathReasoning dataset to Marin's SFT dataset collection. OpenMathReasoning is a high-quality mathematical reasoning dataset containing 5.67 million samples, featuring chain-of-thought solutions to olympiad-level math problems. Qwen2.5-32B-Instruct was used to preprocess the problems, and DeepSeek-R1 and QwQ-32B were used to generate the solutions.

Dataset Details

Source: https://huggingface.co/datasets/nvidia/OpenMathReasoning
License: CC-BY-4.0
Revision: d3d0866

Splits Added

Split	Samples	Description
cot	3.2M	Chain-of-thought reasoning solutions
tir	1.72M	Tool-integrated reasoning (code execution)
genselect	566K	Generative solution selection (GenSelect) samples
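
As a quick sanity check, the split sizes above can be summed and compared with the 5.67 million figure in the description. This is only a sketch using the rounded counts from the table plus the 193K additional problems mentioned later in this thread; the exact row counts on the Hub may differ, and the variable names are illustrative.

```python
# Rounded sample counts taken from the table above (not exact row counts).
SPLIT_SIZES = {"cot": 3_200_000, "tir": 1_720_000, "genselect": 566_000}

# Size of the unlabeled additional_problems split, as quoted in the review thread.
ADDITIONAL_PROBLEMS = 193_000

# Samples covered by the three splits this PR adds.
added_by_this_pr = sum(SPLIT_SIZES.values())

# The same total with the split that this PR leaves out.
full_dataset = added_by_this_pr + ADDITIONAL_PROBLEMS

print(f"added by this PR: {added_by_this_pr:,}")
print(f"including additional_problems: {full_dataset:,}")
```

With these rounded figures, the three added splits come to roughly 5.49M samples, so the 5.67M total appears to count additional_problems as well.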

Field Mapping

problem → instruction column
generated_solution → response column
Metadata columns preserved: expected_answer, problem_source, problem_type, generation_model, inference_mode, pass_rate_72b_tir, used_in_kaggle
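
The mapping above can be sketched in plain Python. This is a minimal illustration, not Marin's actual config or transform code: the helper `to_sft_example`, the output keys, and the sample record are all hypothetical, and only the column names come from the list above.

```python
from typing import Any

# Column names taken from the field mapping above.
INSTRUCTION_FIELD = "problem"
RESPONSE_FIELD = "generated_solution"
METADATA_FIELDS = [
    "expected_answer",
    "problem_source",
    "problem_type",
    "generation_model",
    "inference_mode",
    "pass_rate_72b_tir",
    "used_in_kaggle",
]


def to_sft_example(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map one raw record to instruction/response columns."""
    return {
        "instruction": record[INSTRUCTION_FIELD],
        "response": record[RESPONSE_FIELD],
        # Missing metadata columns become None instead of raising KeyError.
        "metadata": {name: record.get(name) for name in METADATA_FIELDS},
    }


# A made-up record for illustration only; it is not taken from the dataset.
sample_record = {
    "problem": "What is 2 + 2?",
    "generated_solution": "<think>2 + 2 = 4</think> The answer is 4.",
    "expected_answer": "4",
    "inference_mode": "cot",
}

example = to_sft_example(sample_record)
print(example["instruction"])
print(example["metadata"]["inference_mode"])
```

Keeping the metadata in a separate mapping leaves the instruction and response columns clean while preserving fields such as `inference_mode` for later filtering.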

Testing Performed

Pre-commit checks passed
pytest: 1 passed, 3 skipped, 19 deselected
Verified config loads correctly and matches structure of existing configs (e.g., meta-math/MetaMathQA)
Verified dataset field names (problem, generated_solution) exist via HuggingFace streaming
End-to-end SFT training test with 5,000 samples on 8×A800 GPUs, starting from distilgpt2 (82M parameters):
- Loss decreased from 2.87 → 1.52 over 15 epochs
- Model successfully learned the CoT reasoning format from training data
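
The field-name check via HuggingFace streaming could look roughly like the sketch below. It is not the exact script used for this PR: `missing_fields` and `check_split` are hypothetical helper names, and `check_split` needs network access and the `datasets` package, so the demonstration at the bottom runs on in-memory records instead.

```python
from itertools import islice
from typing import Any, Iterable, Mapping

# Fields this PR maps to the instruction and response columns.
REQUIRED_FIELDS = ("problem", "generated_solution")


def missing_fields(
    records: Iterable[Mapping[str, Any]],
    required: tuple[str, ...] = REQUIRED_FIELDS,
    limit: int = 5,
) -> list[str]:
    """Return required field names absent from any of the first `limit` records."""
    missing: set[str] = set()
    for record in islice(records, limit):
        missing.update(name for name in required if name not in record)
    return sorted(missing)


def check_split(split: str) -> list[str]:
    """Stream one split from the Hub without downloading it in full."""
    from datasets import load_dataset  # imported lazily; requires network access

    stream = load_dataset("nvidia/OpenMathReasoning", split=split, streaming=True)
    return missing_fields(stream)


# Offline demonstration with made-up records.
complete = [{"problem": "p", "generated_solution": "s"}]
incomplete = [{"problem": "p"}]
print(missing_fields(complete))
print(missing_fields(incomplete))
```

With network access, calling `check_split("cot")`, `check_split("tir")`, and `check_split("genselect")` would be expected to return empty lists if the field names are present.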

Example:

Question: How many lines can be drawn that are equidistant from 3 non-collinear points in the plane?

Before SFT:

The solution is to use the plane geometry, then fill the plane.
#include <math.h> #include <std.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.

After SFT:

<think>
Okay, so I need to figure out how many lines are equidistant from three non-collinear points in the plane. Hmm, let's see. The problem is asking for the number of lines that can be drawn that are equidistant from 3 non-collinear points in the plane. The condition is that the line is equidistant from all three non-collinear points in that plane.

First, maybe I should recall some properties of line-drawing. Since all lines have at least one intersection point, but the problem is asking for the number of lines that aren't equidistant at all, maybe there's a pattern or a way to relate these lines to other points.

Since all lines are equidistant from all three non-collinear points, maybe the line that's equidistant from all three non-collinear points in that plane is equidistant from all three non-collinear points in that plane. Since all lines are equidistant from all three non-collinear points, that means all three non-collinear points must be equidistant from all three non-collinear points in that plane.

I remember that when dealing with lines equidistant from a plane, the intersection points of the sides of two lines are the plane's transversals, and the line that's equidistant from all three non-

Checklist

[ Passed ] You ran uv run python infra/pre-commit.py --all-files to lint/format your code
[ 1 passed, 3 skipped, 19 deselected ] You ran 'pytest' to test your code

dlwh

thanks so much for your contribution! This looks good to me!

Double checking that you meant to leave out additional_problems

XinyuGuan · 2026-01-12T09:46:44Z

> thanks so much for your contribution! This looks good to me!
>
> Double checking that you meant to leave out additional_problems

Hi,

Thank you for your comment! The additional 193K problems don't have solutions (labels) so I didn't add them. If you think they are useful, I can add the split next week. I can also add other datasets mentioned in #1848.

Add nvidia/OpenMathReasoning dataset (cot, tir, genselect splits)

058cf30

dlwh approved these changes Jan 12, 2026

dlwh merged commit defcca3 into marin-community:main Jan 12, 2026
8 checks passed
