Contributor

@XinyuGuan XinyuGuan commented Jan 10, 2026

Description

Partially fixes #1848

This PR adds the nvidia/OpenMathReasoning dataset to Marin's SFT dataset collection. OpenMathReasoning is a high-quality mathematical reasoning dataset of 5.67 million samples, featuring chain-of-thought solutions to olympiad-level math problems. Qwen2.5-32B-Instruct was used to preprocess the problems, and DeepSeek-R1 and QwQ-32B were used to generate the solutions.

Dataset Details

Splits Added

| Split | Samples | Description |
| --- | --- | --- |
| cot | 3.2M | Chain-of-thought reasoning solutions |
| tir | 1.72M | Tool-integrated reasoning (code execution) |
| genselect | 566K | Generated selection samples |

Field Mapping

  • problem → instruction column
  • generated_solution → response column
  • Metadata columns preserved: expected_answer, problem_source, problem_type, generation_model, inference_mode, pass_rate_72b_tir, used_in_kaggle
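
As a rough illustration of the mapping above, the following plain-Python sketch converts one dataset row into an instruction/response record. It is not Marin's configuration code: the names `METADATA_COLUMNS` and `to_sft_record`, the nested `metadata` key, and the toy row are all assumptions made for this example.

```python
# Illustrative only: METADATA_COLUMNS and to_sft_record are hypothetical names,
# not part of Marin's codebase.
METADATA_COLUMNS = (
    "expected_answer",
    "problem_source",
    "problem_type",
    "generation_model",
    "inference_mode",
    "pass_rate_72b_tir",
    "used_in_kaggle",
)


def to_sft_record(row: dict) -> dict:
    """Map one OpenMathReasoning row to an instruction/response pair plus metadata."""
    record = {
        "instruction": row["problem"],
        "response": row["generated_solution"],
    }
    # Copy over whichever metadata columns are present on this row.
    record["metadata"] = {k: row[k] for k in METADATA_COLUMNS if k in row}
    return record


# Toy row invented for this example; real rows come from the dataset itself.
example_row = {
    "problem": "What is 2 + 2?",
    "generated_solution": "<think>2 + 2 = 4</think> The answer is 4.",
    "expected_answer": "4",
    "inference_mode": "cot",
}
sft_record = to_sft_record(example_row)
```

Here `sft_record["instruction"]` holds the problem text and `sft_record["response"]` holds the generated solution, while the metadata columns travel alongside without entering the training text.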

Testing Performed

  • Pre-commit checks passed
  • pytest: 1 passed, 3 skipped, 19 deselected
  • Verified config loads correctly and matches structure of existing configs (e.g., meta-math/MetaMathQA)
  • Verified dataset field names (problem, generated_solution) exist via HuggingFace streaming
  • End-to-end SFT training test on 5,000 samples with 8×A800 GPUs, starting from distilgpt2 (82M parameters):
    • Loss decreased from 2.87 to 1.52 over 15 epochs
    • The model learned the CoT reasoning format from the training data
  • Example:
    • Question: How many lines can be drawn that are equidistant from 3 non-collinear points in the plane?

    • Before SFT:

      The solution is to use the plane geometry, then fill the plane.
      #include <math.h> #include <std.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.h> #include <math.
      
    • After SFT:

      <think>
      Okay, so I need to figure out how many lines are equidistant from three non-collinear points in the plane. Hmm, let's see. The problem is asking for the number of lines that can be drawn that are equidistant from 3 non-collinear points in the plane. The condition is that the line is equidistant from all three non-collinear points in that plane.
      
      First, maybe I should recall some properties of line-drawing. Since all lines have at least one intersection point, but the problem is asking for the number of lines that aren't equidistant at all, maybe there's a pattern or a way to relate these lines to other points.
      
      Since all lines are equidistant from all three non-collinear points, maybe the line that's equidistant from all three non-collinear points in that plane is equidistant from all three non-collinear points in that plane. Since all lines are equidistant from all three non-collinear points, that means all three non-collinear points must be equidistant from all three non-collinear points in that plane.
      
      I remember that when dealing with lines equidistant from a plane, the intersection points of the sides of two lines are the plane's transversals, and the line that's equidistant from all three non-
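
The field-name check listed under Testing Performed (confirming that `problem` and `generated_solution` exist via HuggingFace streaming) could look roughly like the sketch below. This is not the script used for the PR: the helper names `missing_fields` and `check_split` are hypothetical, and `check_split` assumes the `datasets` library is installed and that the Hub exposes a split named `cot`, as in the table above.

```python
from itertools import islice
from typing import Iterable, Mapping

REQUIRED_FIELDS = ("problem", "generated_solution")


def missing_fields(rows: Iterable[Mapping], required=REQUIRED_FIELDS, limit: int = 5) -> set:
    """Return the required field names absent from any of the first `limit` rows."""
    missing = set()
    for row in islice(rows, limit):
        missing.update(name for name in required if name not in row)
    return missing


def check_split(split: str = "cot") -> set:
    """Stream a few rows of one split from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True avoids downloading the full split before inspecting rows.
    stream = load_dataset("nvidia/OpenMathReasoning", split=split, streaming=True)
    return missing_fields(stream)
```

Calling `check_split("cot")` returns an empty set when both fields are present on the sampled rows. Keeping the pure check in `missing_fields` lets it be exercised offline against hand-written rows.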
      

Checklist

  • [ Passed ] You ran uv run python infra/pre-commit.py --all-files to lint/format your code
  • [ 1 passed, 3 skipped, 19 deselected ] You ran 'pytest' to test your code

Member

@dlwh dlwh left a comment

thanks so much for your contribution! This looks good to me!

Double checking that you meant to leave out additional_problems

@dlwh dlwh merged commit defcca3 into marin-community:main Jan 12, 2026
8 checks passed
@XinyuGuan
Contributor Author

> thanks so much for your contribution! This looks good to me!
>
> Double checking that you meant to leave out additional_problems

Hi,

Thank you for your comment! The additional 193K problems don't have solutions (labels), so I didn't add them. If you think they are useful, I can add that split next week. I can also add the other datasets mentioned in #1848.

Development

Successfully merging this pull request may close these issues.

Add best SFT datasets
