Add nvidia/OpenMathReasoning dataset (cot, tir, genselect splits) #2316
Description
Partially fixes #1848.
This PR adds the nvidia/OpenMathReasoning dataset to Marin's SFT dataset collection. OpenMathReasoning is a high-quality mathematical reasoning dataset containing 5.67 million samples with chain-of-thought solutions to olympiad-level math problems. Qwen2.5-32B-Instruct was used to preprocess the problems, and DeepSeek-R1 and QwQ-32B were used to generate the solutions.
Dataset Details
Splits Added
Field Mapping
problem → instruction column
generated_solution → response column
expected_answer, problem_source, problem_type, generation_model, inference_mode, pass_rate_72b_tir, used_in_kaggle
Testing Performed
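The field mapping above, together with a check that the mapped fields are present, can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`missing_fields`, `to_sft_example`, `first_row`) and the output key `metadata` are hypothetical, and carrying the remaining listed columns along unchanged is an assumption. The split names are taken from the PR title.

```python
from typing import Any

# Column names as listed in the field mapping.
INSTRUCTION_FIELD = "problem"
RESPONSE_FIELD = "generated_solution"
# Assumption: the remaining listed columns are passed through unchanged.
EXTRA_FIELDS = (
    "expected_answer", "problem_source", "problem_type", "generation_model",
    "inference_mode", "pass_rate_72b_tir", "used_in_kaggle",
)


def missing_fields(row: dict[str, Any]) -> list[str]:
    """Return the mapped field names that are absent from a row."""
    return [f for f in (INSTRUCTION_FIELD, RESPONSE_FIELD) if f not in row]


def to_sft_example(row: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map one dataset row to instruction/response form."""
    absent = missing_fields(row)
    if absent:
        raise KeyError(f"row is missing mapped fields: {absent}")
    return {
        "instruction": row[INSTRUCTION_FIELD],
        "response": row[RESPONSE_FIELD],
        # Keep only the extra columns that are actually present in this row.
        "metadata": {k: row[k] for k in EXTRA_FIELDS if k in row},
    }


def first_row(split: str) -> dict[str, Any]:
    """Fetch one row of a split by streaming, without a full download."""
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset("nvidia/OpenMathReasoning", split=split, streaming=True)
    return next(iter(stream))
```

Under these assumptions, calling `missing_fields(first_row(split))` for each of `cot`, `tir`, and `genselect` would return an empty list when both mapped fields exist in that split.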
Verified that the fields (problem, generated_solution) exist via HuggingFace streaming
Question: How many lines can be drawn that are equidistant from 3 non-collinear points in the plane?
Before SFT:
Click to view output
After SFT:
Click to view output
Checklist
uv run python infra/pre-commit.py --all-files to lint/format your code