
Prepare v2 training dataset and preprocessing workflow#19

Merged
Sudhendra merged 3 commits into main from feat/v2-data-pipeline-prep
Feb 13, 2026

Conversation

@Sudhendra
Owner

Summary

  • Add scripts/preprocess_synthetic.py (with tests) to strip <think>/<tool_call> artifacts and reject low-quality synthetic pairs via char/token ratio thresholds.
  • Update data/training/{train,valid,test}.jsonl to the sanitized v2 splits and refresh the runbook/docs for the new local training + MLflow logging workflow.
  • Update .gitignore to exclude generated archives and unsanitized training byproducts (data/archive/, data/training/unsanitized_*.jsonl) while keeping the sanitized training files tracked.
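
The cleaning pass described above can be sketched roughly as follows. This is a hypothetical illustration of what scripts/preprocess_synthetic.py might do, not the repo's actual implementation: the regex, the whitespace-based token approximation, and the min_ratio/max_ratio thresholds are all illustrative assumptions.

```python
import re

# Matches leftover generation artifacts; assumes tags are well-formed pairs.
ARTIFACT_RE = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def strip_artifacts(text: str) -> str:
    """Remove <think>/<tool_call> blocks left over from synthetic generation."""
    return ARTIFACT_RE.sub("", text).strip()

def passes_quality(text: str, min_ratio: float = 2.0, max_ratio: float = 12.0) -> bool:
    """Reject pairs whose chars-per-token ratio falls outside a plausible band.

    Tokens are approximated by whitespace splitting here, and the threshold
    defaults are illustrative assumptions, not the repo's real values.
    """
    tokens = text.split()
    if not tokens:
        return False
    ratio = len(text) / len(tokens)
    return min_ratio <= ratio <= max_ratio
```

In this sketch, each JSONL record's text fields would be run through strip_artifacts first, and the pair dropped whenever passes_quality returns False (empty after stripping, or an implausible chars-per-token ratio in either direction).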

Test Plan

  • ruff check scripts/preprocess_synthetic.py tests/test_preprocess_synthetic.py
  • pytest tests/test_preprocess_synthetic.py


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 129ad88b68


Sudhendra merged commit ec896ad into main on Feb 13, 2026.
3 checks passed.