4 changes: 4 additions & 0 deletions .gitignore
@@ -63,6 +63,8 @@ data/synthetic/
data/validated/
data/final/
data/cache/
data/archive/
data/training/unsanitized_*.jsonl
data/*.jsonl
data/*.json
data/costs.log
@@ -130,3 +132,5 @@ temp/
.windsurf/
.zencoder/
skills/
data/training_v1/
data/validated_v1/
2,718 changes: 2,497 additions & 221 deletions data/training/test.jsonl

Large diffs are not rendered by default.

21,584 changes: 19,835 additions & 1,749 deletions data/training/train.jsonl

Large diffs are not rendered by default.

2,690 changes: 2,472 additions & 218 deletions data/training/valid.jsonl

Large diffs are not rendered by default.

25 changes: 23 additions & 2 deletions docs/AGENTS.md
@@ -119,6 +119,12 @@ gh pr merge --squash
| `src/training/train_mlx.py` | Local MLX training |
| `src/inference/compressor.py` | Production inference service |
| `configs/training.yaml` | Model & hyperparameters |
| `scripts/train_local.py` | CLI for MLX training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Task-Specific Instructions

@@ -201,6 +207,8 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Quick Commands
@@ -216,15 +224,28 @@ gh pr create --title "Phase X: Description" --body "## Summary\n- bullet points"
# Check PR CI status
gh pr checks

# === DATA PIPELINE ===
# Preprocess synthetic pairs (strip artifacts, filter by ratio)
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl --rejected data/validated/rejected_nl_pairs.jsonl --max-char-ratio 0.95 --max-token-ratio 0.95

# Format into train/valid/test splits
python scripts/format_training_data.py --input data/validated --output data/training --train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 --seed 42

# Sanitize training data
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# === LOCAL INFERENCE (MLX) ===
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# === LOCAL TRAINING (MLX) ===
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
python scripts/train_local.py --train

# === CLOUD TRAINING (Tinker) ===
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# === POST-TRAINING LOGGING ===
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra --dagshub-repo compression-layer

# === VALIDATION ===
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
26 changes: 20 additions & 6 deletions docs/CLAUDE.md
@@ -4,8 +4,8 @@

Universal semantic compression layer for LLM inputs. Compresses memories, code, context before API calls while preserving reasoning equivalence across Claude/GPT/Gemini.

**Model**: Qwen3-8B (via Unsloth)
**Training**: Unsloth + LoRA (2-5x faster, 70% less VRAM)
**Model**: Qwen3-4B-Instruct-2507-8bit (local MLX) / Qwen3-8B (Tinker cloud)
**Training**: MLX LoRA (local) + Tinker (cloud production)
**License**: Apache 2.0
**Repo**: https://github.com/Sudhendra/compression-layer

@@ -75,11 +75,12 @@ gh pr merge --squash

- `src/validation/` — Equivalence testing across models
- `src/generation/` — Compression pair synthesis
- `src/training/` — Unsloth fine-tuning on Qwen3-8B
- `src/training/` — MLX local + Tinker cloud fine-tuning
- `src/inference/` — Production compressor
- `data/raw/` — Source corpora (gitignored)
- `data/validated/` — Cross-model validated pairs
- `models/` — LoRA checkpoints, GGUF exports
- `data/synthetic/` — Raw synthetic pairs from v1 adapter (gitignored)
- `data/validated/` — Preprocessed and filtered pairs (gitignored)
- `data/training/` — Final train/valid/test splits in chat format (gitignored)
- `models/` — LoRA checkpoints, run logs, GGUF exports (gitignored)

## Code Conventions

@@ -115,8 +116,21 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Key Scripts

| Script | Purpose |
|--------|---------|
| `scripts/train_local.py` | MLX LoRA training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Common Tasks

**Change base model**: Update `configs/training.yaml` → `cloud.model`
114 changes: 114 additions & 0 deletions docs/MLX_TRAINING.md
@@ -106,3 +106,117 @@ tail -f models/runs/mlx/latest/train.err
control whether LoRA adapters are saved to disk.
- Adapter checkpoints are saved according to `save_every` and live under
`models/runs/mlx/<timestamp>/adapter/`.

## Data preprocessing pipeline

Before training, raw synthetic data must be preprocessed and sanitized. The full
pipeline is:

### Step 1: Preprocess synthetic pairs

Strip generation artifacts (`<think>`, `<tool_call>` tags) and filter by
compression ratio:

```bash
# NL pairs (stricter filtering: 0.95 char/token ratio)
python scripts/preprocess_synthetic.py \
--input data/synthetic/nl_v2.jsonl \
--output data/validated/nl_pairs.jsonl \
--rejected data/validated/rejected_nl_pairs.jsonl \
--max-char-ratio 0.95 \
--max-token-ratio 0.95

# Code pairs (default 1.0 thresholds)
python scripts/preprocess_synthetic.py \
--input data/synthetic/code_v2.jsonl \
--output data/validated/code_pairs.jsonl \
--rejected data/validated/rejected_code_pairs.jsonl \
--max-char-ratio 1.0 \
--max-token-ratio 1.0
```
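
For orientation, here is a minimal sketch of what the ratio filter does. It is an assumption-laden illustration, not the script's actual code: the record field names (`original`, `compressed`) and the artifact regex are hypothetical, and the real script additionally checks a token-level ratio.

```python
import json
import re

# Hypothetical artifact pattern; the script's actual stripping logic may differ.
ARTIFACT_RE = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def clean_and_filter(line: str, max_char_ratio: float = 0.95) -> dict | None:
    """Return the cleaned record, or None if it fails the ratio check."""
    record = json.loads(line)
    original = record["original"]      # assumed field name
    compressed = ARTIFACT_RE.sub("", record["compressed"]).strip()
    # Reject pairs that did not actually compress enough.
    if len(compressed) / max(len(original), 1) > max_char_ratio:
        return None
    record["compressed"] = compressed
    return record
```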

### Step 2: Format into train/valid/test splits

```bash
python scripts/format_training_data.py \
--input data/validated \
--output data/training \
--train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 \
--seed 42
```
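
As a sketch of the split logic under a fixed seed (the exact shuffling and rounding in `format_training_data.py` may differ; this only illustrates why `--seed 42` makes the splits reproducible):

```python
import random

def split_pairs(pairs: list[dict], seed: int = 42,
                train_ratio: float = 0.8, valid_ratio: float = 0.1):
    """Deterministic 80/10/10 split: same seed and input -> same splits."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    return (shuffled[:n_train],                    # train
            shuffled[n_train:n_train + n_valid],   # valid
            shuffled[n_train + n_valid:])          # test (remainder)
```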

### Step 3: Sanitize all splits

```bash
for split in train valid test; do
python scripts/data_sanitization.py \
--input "data/training/${split}.jsonl" \
--sanitized "data/training/sanitized_${split}.jsonl" \
--unsanitized "data/training/unsanitized_${split}.jsonl"
done

# Promote sanitized files
for split in train valid test; do
mv "data/training/sanitized_${split}.jsonl" "data/training/${split}.jsonl"
done
```
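
The checks are roughly of this shape. This is a sketch of the assumed structure/role/encoding checks, not the script's implementation; the `messages` key follows the standard chat format used by the splits:

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def is_sane(line: str) -> bool:
    """Structure, role, and encoding checks on one JSONL record."""
    try:
        record = json.loads(line)                # structure: must be valid JSON
        roles = tuple(m["role"] for m in record["messages"])
        if roles != EXPECTED_ROLES:              # role: exact (system, user, assistant) order
            return False
        for message in record["messages"]:
            message["content"].encode("utf-8")   # encoding: rejects lone surrogates etc.
            if not message["content"].strip():
                return False
        return True
    except (json.JSONDecodeError, KeyError, TypeError,
            AttributeError, UnicodeEncodeError):
        return False
```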

### Current data state (v2)

| File | Count |
|------|-------|
| `data/training/train.jsonl` | 19,845 |
| `data/training/valid.jsonl` | 2,473 |
| `data/training/test.jsonl` | 2,497 |

## Post-training: MLflow/DagsHub logging

After a training run completes, log metrics and artifacts to DagsHub/MLflow using
`scripts/mlflow_logger.py`. This is **not** called automatically by the training
scripts; you must run it manually.

### Prerequisites

```bash
# Ensure dagshub and mlflow are installed
pip install dagshub mlflow

# Or install via optional dependency group
pip install -e ".[mlflow]"
```

### Usage

```bash
# Log the latest MLX run (auto-detected)
python scripts/mlflow_logger.py \
--experiment-name "compression-v2" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer

# Log a specific run directory
python scripts/mlflow_logger.py \
--run-dir models/runs/mlx/2026-01-30_17-14-36 \
--experiment-name "compression-v1" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer
```

The logger reads `run.json` and `train.log` from the run directory and logs (see the sketch after this list):
- **Params**: model, git_sha, lora_rank, batch_size, learning_rate, etc.
- **Metrics**: train_loss, val_loss, tokens_per_sec, peak_mem_gb (per step)
- **Artifacts**: run.json, train.log, adapter weights, loss curve plot
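
A condensed sketch of that flow. The `dagshub.init` / `mlflow.*` calls are real library APIs, but the `run.json` handling and the literal metric value are placeholders, not what `mlflow_logger.py` actually does:

```python
import json
from pathlib import Path

import dagshub
import mlflow

run_dir = Path("models/runs/mlx/2026-01-30_17-14-36")  # example run directory
meta = json.loads((run_dir / "run.json").read_text())

# Route MLflow tracking to the DagsHub-hosted server for this repo.
dagshub.init(repo_owner="Sudhendra", repo_name="compression-layer", mlflow=True)
mlflow.set_experiment("compression-v1")

with mlflow.start_run():
    # Params: assumed to come from run.json (model, git_sha, lora_rank, ...).
    mlflow.log_params({k: v for k, v in meta.items()
                       if isinstance(v, (str, int, float))})
    # Metrics: the real script parses per-step values out of train.log.
    mlflow.log_metric("train_loss", 1.23, step=100)  # placeholder value
    # Artifacts: files from the run directory.
    mlflow.log_artifact(str(run_dir / "run.json"))
    mlflow.log_artifact(str(run_dir / "train.log"))
```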

### Environment variables

You can set defaults via environment variables instead of CLI flags:

```bash
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
MLFLOW_TRACKING_URI=https://dagshub.com/Sudhendra/compression-layer.mlflow
```

**Note**: The default `--dagshub-owner` in the script is `Gautam-Galada` (the
contributor who wrote it). Override with `--dagshub-owner Sudhendra` or set the
`DAGSHUB_OWNER` env var.
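
A plausible way those defaults resolve (an assumption about the script's internals, shown with the standard argparse pattern): an explicit CLI flag wins, then the env var, then the built-in default.

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Env var overrides the built-in default; an explicit CLI flag overrides both.
parser.add_argument("--dagshub-owner",
                    default=os.environ.get("DAGSHUB_OWNER", "Gautam-Galada"))
args = parser.parse_args(["--dagshub-owner", "Sudhendra"])
print(args.dagshub_owner)  # -> Sudhendra
```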
34 changes: 25 additions & 9 deletions docs/SETUP.md
@@ -63,19 +63,22 @@ python -m mlx_lm.generate \
For quick iteration with Qwen3-4B:

```bash
# Fine-tune with MLX LoRA (local)
# Fine-tune with MLX LoRA (recommended wrapper script)
python scripts/train_local.py --train

# Or run mlx_lm directly
python -m mlx_lm.lora \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--train \
--data ./data/validated \
--iters 100 \
--data ./data/training \
--iters 500 \
--batch-size 2 \
--lora-rank 8

# Test adapter
python -m mlx_lm.generate \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--adapter-path ./adapters \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--adapter-path models/runs/mlx/latest/adapter \
--prompt "Compress: ..."
```

@@ -192,6 +195,11 @@ HF_HUB_ENABLE_HF_TRANSFER=1

# Tinker (cloud training)
TINKER_API_KEY=tk_...

# DagsHub/MLflow (experiment tracking, optional)
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
# MLFLOW_TRACKING_URI is auto-derived from owner/repo if omitted
```
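
The auto-derivation presumably follows the DagsHub URI pattern shown in the MLflow docs above; a one-line sketch of the assumed rule, not verified against the script:

```python
owner, repo = "Sudhendra", "compression-layer"
# Matches the MLFLOW_TRACKING_URI format used elsewhere in these docs.
tracking_uri = f"https://dagshub.com/{owner}/{repo}.mlflow"
```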

---
@@ -219,14 +227,22 @@ TINKER_API_KEY=tk_...

```bash
# Local inference
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# Local training (small scale)
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
# Local training (wrapper script with run storage)
python scripts/train_local.py --train

# Cloud training (production)
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# Post-training MLflow logging
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra

# Data preprocessing pipeline
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl
python scripts/format_training_data.py --input data/validated --output data/training
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# Run validation
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
59 changes: 46 additions & 13 deletions docs/TASKS.md
@@ -154,19 +154,52 @@ See: `docs/plans/2026-01-31-v2-production-training.md`
- Train production model on Tinker (Qwen3-8B)
- Release model on HuggingFace

### Tasks
- [ ] Create adapter-based compression generator
- [ ] Generate 5K+ code pairs using v1 adapter
- [ ] Generate 5K+ NL pairs using v1 adapter
- [ ] Validate synthetic pairs (Claude + GPT)
- [ ] Train v2 model on Tinker
- [ ] Evaluate and compare with v1
- [ ] Package and release model

### Estimated Costs
- Validation: $40-80
- Tinker training: $10-20
- Total: ~$55-110
### Data Generation ✅ COMPLETE
- [x] Generate synthetic code pairs using v1 adapter → **17,315 pairs** (`data/synthetic/code_v2.jsonl`)
- [x] Generate synthetic NL pairs using v1 adapter → **17,056 pairs** (`data/synthetic/nl_v2.jsonl`)
- [x] Total raw synthetic: **34,371 pairs**

### Data Preprocessing & Sanitization ✅ COMPLETE
- [x] Build heuristic preprocessing script (`scripts/preprocess_synthetic.py`)
- Strips `<think>`/`<tool_call>` generation artifacts
- Filters by character ratio and token ratio thresholds
- Writes clean + rejected outputs with rejection reasons
- [x] Preprocess code pairs (max-char-ratio 1.0, max-token-ratio 1.0) → **17,125 passed**, 190 rejected
- [x] Preprocess NL pairs (max-char-ratio 0.95, max-token-ratio 0.95) → **10,505 passed**, 6,551 rejected
- [x] Format into train/valid/test splits (80/10/10, seed 42) via `scripts/format_training_data.py`
- [x] Sanitize all splits via `scripts/data_sanitization.py` (structure, role, encoding checks)
- [x] Validate final data: 0 bad JSON, 0 bad messages, 0 bad roles across all splits

### V2 Data Summary
| Split | Count | Status |
|-------|-------|--------|
| `data/training/train.jsonl` | 19,845 | Sanitized, chat format ✅ |
| `data/training/valid.jsonl` | 2,473 | Sanitized, chat format ✅ |
| `data/training/test.jsonl` | 2,497 | Sanitized, chat format ✅ |
| Unsanitized (rejected by sanitizer) | 2,815 | Archived |

All training files have correct `(system, user, assistant)` message structure.
Prompt masking verified: loss computed only on assistant tokens (`--mask-prompt`).
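
Each line in the split files is one chat-format record shaped like the following (values are placeholders; only the assistant turn contributes to the loss under `--mask-prompt`):

```python
example_record = {
    "messages": [
        {"role": "system", "content": "<compression instructions>"},  # masked
        {"role": "user", "content": "<text to compress>"},            # masked
        {"role": "assistant", "content": "<compressed output>"},      # loss computed here
    ]
}
```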

### MLflow/DagsHub Logging ✅ COMPLETE
- [x] Post-training MLflow logger (`scripts/mlflow_logger.py`)
- Reads `run.json` + `train.log` from MLX run directories
- Logs params, step metrics, artifacts, and loss curve plots to DagsHub/MLflow
- CLI: `--experiment-name`, `--dagshub-owner`, `--dagshub-repo`
- [x] Dependencies installed: `dagshub` (v0.6.5), `mlflow` (v3.9.0)

### Remaining Tasks
- [ ] Train v2 model locally (MLX) on 19,845 examples
- [ ] Train v2 model on Tinker (Qwen3-8B) for production
- [ ] Evaluate v2 and compare with v1
- [ ] Log v2 training run to DagsHub/MLflow
- [ ] Package and release model on HuggingFace

### Cost Notes
- Synthetic generation: done via v1 adapter (free, local)
- Preprocessing/sanitization: heuristic pipeline (free, no API calls)
- Multi-model validation skipped in favor of heuristic filtering (saved ~$40-80)
- Tinker training: ~$10-20 estimated

---

2 changes: 1 addition & 1 deletion scripts/mlflow_logger.py
@@ -249,4 +249,4 @@ def main() -> None:


if __name__ == "__main__":
main()
main()