
Prepare v2 training dataset and preprocessing workflow#19

Merged
Sudhendra merged 3 commits into main from feat/v2-data-pipeline-prep
Feb 13, 2026

Conversation

@Sudhendra
Owner

Summary

  • Add scripts/preprocess_synthetic.py (with tests) to strip <think>/<tool_call> artifacts and reject low-quality synthetic pairs via char/token ratio thresholds.
  • Update data/training/{train,valid,test}.jsonl to the sanitized v2 splits and refresh the runbook/docs for the new local training + MLflow logging workflow.
  • Update .gitignore to exclude generated archives and unsanitized training byproducts (data/archive/, data/training/unsanitized_*.jsonl) while keeping the sanitized training files tracked.
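
The cleaning pass described above can be sketched roughly as follows. This is a hypothetical illustration of what scripts/preprocess_synthetic.py might do, not the repo's actual implementation: the regex, the whitespace-based token approximation, and the min_ratio/max_ratio thresholds are all illustrative assumptions.

```python
import re

# Matches leftover generation artifacts; assumes tags are well-formed pairs.
ARTIFACT_RE = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def strip_artifacts(text: str) -> str:
    """Remove <think>/<tool_call> blocks left over from synthetic generation."""
    return ARTIFACT_RE.sub("", text).strip()

def passes_quality(text: str, min_ratio: float = 2.0, max_ratio: float = 12.0) -> bool:
    """Reject pairs whose chars-per-token ratio falls outside a plausible band.

    Tokens are approximated by whitespace splitting here, and the threshold
    defaults are illustrative assumptions, not the repo's real values.
    """
    tokens = text.split()
    if not tokens:
        return False
    ratio = len(text) / len(tokens)
    return min_ratio <= ratio <= max_ratio
```

In this sketch, each JSONL record's text fields would be run through strip_artifacts first, and the pair dropped whenever passes_quality returns False (empty after stripping, or an implausible chars-per-token ratio in either direction).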

Test Plan

  • ruff check scripts/preprocess_synthetic.py tests/test_preprocess_synthetic.py
  • pytest tests/test_preprocess_synthetic.py


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 129ad88b68


Sudhendra merged commit ec896ad into main on Feb 13, 2026.
3 checks passed.