Merged
Changes from 2 commits
4 changes: 4 additions & 0 deletions .gitignore
@@ -63,6 +63,8 @@ data/synthetic/
data/validated/
data/final/
data/cache/
data/archive/
data/training/unsanitized_*.jsonl
data/*.jsonl
data/*.json
data/costs.log
@@ -130,3 +132,5 @@ temp/
.windsurf/
.zencoder/
skills/
data/training_v1/
data/validated_v1/
2,718 changes: 2,497 additions & 221 deletions data/training/test.jsonl

Large diffs are not rendered by default.

21,584 changes: 19,835 additions & 1,749 deletions data/training/train.jsonl

Large diffs are not rendered by default.

2,690 changes: 2,472 additions & 218 deletions data/training/valid.jsonl

Large diffs are not rendered by default.

25 changes: 23 additions & 2 deletions docs/AGENTS.md
@@ -119,6 +119,12 @@ gh pr merge --squash
| `src/training/train_mlx.py` | Local MLX training |
| `src/inference/compressor.py` | Production inference service |
| `configs/training.yaml` | Model & hyperparameters |
| `scripts/train_local.py` | CLI for MLX training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Task-Specific Instructions

@@ -201,6 +207,8 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Quick Commands
@@ -216,15 +224,28 @@ gh pr create --title "Phase X: Description" --body "## Summary\n- bullet points"
# Check PR CI status
gh pr checks

# === DATA PIPELINE ===
# Preprocess synthetic pairs (strip artifacts, filter by ratio)
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl --rejected data/validated/rejected_nl_pairs.jsonl --max-char-ratio 0.95 --max-token-ratio 0.95

# Format into train/valid/test splits
python scripts/format_training_data.py --input data/validated --output data/training --train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 --seed 42

# Sanitize training data
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# === LOCAL INFERENCE (MLX) ===
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# === LOCAL TRAINING (MLX) ===
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
python scripts/train_local.py --train

# === CLOUD TRAINING (Tinker) ===
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# === POST-TRAINING LOGGING ===
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra --dagshub-repo compression-layer

# === VALIDATION ===
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
26 changes: 20 additions & 6 deletions docs/CLAUDE.md
@@ -4,8 +4,8 @@

Universal semantic compression layer for LLM inputs. Compresses memories, code, context before API calls while preserving reasoning equivalence across Claude/GPT/Gemini.

**Model**: Qwen3-8B (via Unsloth)
**Training**: Unsloth + LoRA (2-5x faster, 70% less VRAM)
**Model**: Qwen3-4B-Instruct-2507-8bit (local MLX) / Qwen3-8B (Tinker cloud)
**Training**: MLX LoRA (local) + Tinker (cloud production)
**License**: Apache 2.0
**Repo**: https://github.com/Sudhendra/compression-layer

@@ -75,11 +75,12 @@ gh pr merge --squash

- `src/validation/` — Equivalence testing across models
- `src/generation/` — Compression pair synthesis
- `src/training/` — Unsloth fine-tuning on Qwen3-8B
- `src/training/` — MLX local + Tinker cloud fine-tuning
- `src/inference/` — Production compressor
- `data/raw/` — Source corpora (gitignored)
- `data/validated/` — Cross-model validated pairs
- `models/` — LoRA checkpoints, GGUF exports
- `data/synthetic/` — Raw synthetic pairs from v1 adapter (gitignored)
- `data/validated/` — Preprocessed and filtered pairs (gitignored)
- `data/training/` — Final train/valid/test splits in chat format (gitignored)
- `models/` — LoRA checkpoints, run logs, GGUF exports (gitignored)

## Code Conventions

@@ -115,8 +116,21 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Key Scripts

| Script | Purpose |
|--------|---------|
| `scripts/train_local.py` | MLX LoRA training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Common Tasks

**Change base model**: Update `configs/training.yaml` → `cloud.model`
114 changes: 114 additions & 0 deletions docs/MLX_TRAINING.md
@@ -106,3 +106,117 @@ tail -f models/runs/mlx/latest/train.err
control whether LoRA adapters are saved to disk.
- Adapter checkpoints are saved according to `save_every` and live under
`models/runs/mlx/<timestamp>/adapter/`.

## Data preprocessing pipeline

Before training, raw synthetic data must be preprocessed and sanitized. The full
pipeline is:

### Step 1: Preprocess synthetic pairs

Strip generation artifacts (`<think>`, `<tool_call>` tags) and filter by
compression ratio:

```bash
# NL pairs (stricter filtering: 0.95 char/token ratio)
python scripts/preprocess_synthetic.py \
--input data/synthetic/nl_v2.jsonl \
--output data/validated/nl_pairs.jsonl \
--rejected data/validated/rejected_nl_pairs.jsonl \
--max-char-ratio 0.95 \
--max-token-ratio 0.95

# Code pairs (default 1.0 thresholds)
python scripts/preprocess_synthetic.py \
--input data/synthetic/code_v2.jsonl \
--output data/validated/code_pairs.jsonl \
--rejected data/validated/rejected_code_pairs.jsonl \
--max-char-ratio 1.0 \
--max-token-ratio 1.0
```
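
Under the hood, the filtering is a simple per-record pass over the JSONL file. The sketch below shows the general shape of that logic rather than the script itself: the field names (`original`, `compressed`) and the whitespace-based token count are assumptions, and the real `scripts/preprocess_synthetic.py` also records a rejection reason for each dropped pair.

```python
# Illustrative sketch only; field names and tokenization are assumptions.
import json
import re

ARTIFACT_RE = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def keep_pair(pair: dict, max_char_ratio: float, max_token_ratio: float) -> bool:
    """Keep the pair only if the compressed text is short enough vs. the original."""
    original = pair["original"]
    compressed = ARTIFACT_RE.sub("", pair["compressed"]).strip()
    char_ratio = len(compressed) / max(len(original), 1)
    token_ratio = len(compressed.split()) / max(len(original.split()), 1)
    return char_ratio <= max_char_ratio and token_ratio <= max_token_ratio

with open("data/synthetic/nl_v2.jsonl") as src, \
     open("data/validated/nl_pairs.jsonl", "w") as kept, \
     open("data/validated/rejected_nl_pairs.jsonl", "w") as rejected:
    for line in src:
        pair = json.loads(line)
        out = kept if keep_pair(pair, 0.95, 0.95) else rejected
        out.write(json.dumps(pair) + "\n")
```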

### Step 2: Format into train/valid/test splits

```bash
python scripts/format_training_data.py \
--input data/validated \
--output data/training \
--train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 \
--seed 42
```
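
Conceptually, the formatting step shuffles the validated pairs with the fixed seed, slices them 80/10/10, and writes each example as a chat-format record. A minimal sketch, assuming the pair field names and a generic system prompt (the actual template and options live in `scripts/format_training_data.py`):

```python
# Minimal sketch; pair field names and the system prompt are assumptions.
import json
import random
from pathlib import Path

pairs = []
for path in sorted(Path("data/validated").glob("*_pairs.jsonl")):
    if path.name.startswith("rejected_"):
        continue  # skip the rejected outputs written in step 1
    pairs.extend(json.loads(line) for line in path.open())

random.Random(42).shuffle(pairs)  # fixed seed keeps the splits reproducible
n = len(pairs)
splits = {
    "train": pairs[: int(0.8 * n)],
    "valid": pairs[int(0.8 * n): int(0.9 * n)],
    "test": pairs[int(0.9 * n):],
}

out_dir = Path("data/training")
out_dir.mkdir(parents=True, exist_ok=True)
for name, rows in splits.items():
    with (out_dir / f"{name}.jsonl").open("w") as f:
        for pair in rows:
            record = {"messages": [
                {"role": "system", "content": "Compress the input while preserving meaning."},
                {"role": "user", "content": pair["original"]},
                {"role": "assistant", "content": pair["compressed"]},
            ]}
            f.write(json.dumps(record) + "\n")
```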

### Step 3: Sanitize all splits

```bash
for split in train valid test; do
python scripts/data_sanitization.py \
--input "data/training/${split}.jsonl" \
--sanitized "data/training/sanitized_${split}.jsonl" \
--unsanitized "data/training/unsanitized_${split}.jsonl"
done

# Promote sanitized files
for split in train valid test; do
mv "data/training/sanitized_${split}.jsonl" "data/training/${split}.jsonl"
done
```
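
Each sanitization pass is a per-line gate: records that fail any structure, role, or encoding check are routed to the `unsanitized_*` file, and everything else passes through unchanged. A rough sketch of those checks (the exact rules in `scripts/data_sanitization.py` may differ):

```python
# Rough sketch of the per-record checks; the real script may apply more rules.
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def is_clean(line: str) -> bool:
    try:
        record = json.loads(line)  # must be valid JSON
        line.encode("utf-8")       # must be cleanly encodable
    except (json.JSONDecodeError, UnicodeEncodeError):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False               # structure check: exactly three messages
    if tuple(m.get("role") for m in messages) != EXPECTED_ROLES:
        return False               # role check: system, user, assistant in order
    return all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages)

with open("data/training/train.jsonl") as src, \
     open("data/training/sanitized_train.jsonl", "w") as ok, \
     open("data/training/unsanitized_train.jsonl", "w") as bad:
    for line in src:
        (ok if is_clean(line) else bad).write(line)
```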

### Current data state (v2)

| File | Count |
|------|-------|
| `data/training/train.jsonl` | 19,845 |
| `data/training/valid.jsonl` | 2,473 |
| `data/training/test.jsonl` | 2,497 |

## Post-training: MLflow/DagsHub logging

After a training run completes, log metrics and artifacts to DagsHub/MLflow using
`scripts/mlflow_logger.py`. This is **not** called automatically by the training
scripts; you must run it manually.

### Prerequisites

```bash
# Ensure dagshub and mlflow are installed
pip install dagshub mlflow

# Or install via optional dependency group
pip install -e ".[mlflow]"
```

### Usage

```bash
# Log the latest MLX run (auto-detected)
python scripts/mlflow_logger.py \
--experiment-name "compression-v2" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer

# Log a specific run directory
python scripts/mlflow_logger.py \
--run-dir models/runs/mlx/2026-01-30_17-14-36 \
--experiment-name "compression-v1" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer
```

The logger reads `run.json` and `train.log` from the run directory and logs:
- **Params**: model, git_sha, lora_rank, batch_size, learning_rate, etc.
- **Metrics**: train_loss, val_loss, tokens_per_sec, peak_mem_gb (per step)
- **Artifacts**: run.json, train.log, adapter weights, loss curve plot
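
In outline, the logger is a `dagshub.init` call followed by standard MLflow logging inside a single run. The sketch below assumes `run.json` exposes flat hyperparameter keys plus a `steps` list; the actual parsing in `scripts/mlflow_logger.py` may differ:

```python
# Condensed sketch; the run.json field names here are assumptions.
import json
from pathlib import Path

import dagshub
import mlflow

run_dir = Path("models/runs/mlx/2026-01-30_17-14-36")
run_meta = json.loads((run_dir / "run.json").read_text())

dagshub.init(repo_owner="Sudhendra", repo_name="compression-layer", mlflow=True)
mlflow.set_experiment("compression-v2")

with mlflow.start_run(run_name=run_dir.name):
    mlflow.log_params({k: run_meta[k]
                       for k in ("model", "git_sha", "lora_rank", "batch_size", "learning_rate")
                       if k in run_meta})
    for step in run_meta.get("steps", []):  # assumed shape: [{"iter": ..., "train_loss": ...}, ...]
        mlflow.log_metric("train_loss", step["train_loss"], step=step["iter"])
    mlflow.log_artifact(str(run_dir / "run.json"))
    mlflow.log_artifact(str(run_dir / "train.log"))
```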

### Environment variables

You can set defaults via environment variables instead of CLI flags:

```bash
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
MLFLOW_TRACKING_URI=https://dagshub.com/Sudhendra/compression-layer.mlflow
```

**Note**: The default `--dagshub-owner` in the script is `Gautam-Galada` (the
contributor who wrote it). Override with `--dagshub-owner Sudhendra` or set the
`DAGSHUB_OWNER` env var.
34 changes: 25 additions & 9 deletions docs/SETUP.md
@@ -63,19 +63,22 @@ python -m mlx_lm.generate \
For quick iteration with Qwen3-4B:

```bash
# Fine-tune with MLX LoRA (local)
# Fine-tune with MLX LoRA (recommended wrapper script)
python scripts/train_local.py --train

# Or run mlx_lm directly
python -m mlx_lm.lora \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--train \
--data ./data/validated \
--iters 100 \
--data ./data/training \
--iters 500 \
--batch-size 2 \
--lora-rank 8

# Test adapter
python -m mlx_lm.generate \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--adapter-path ./adapters \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--adapter-path models/runs/mlx/latest/adapter \
--prompt "Compress: ..."
```
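
To exercise the adapter from Python rather than the CLI, `mlx_lm` exposes the same functionality programmatically. A minimal sketch, assuming the run layout produced by the wrapper script and an `mlx_lm` version whose `generate` accepts `max_tokens`:

```python
# Hedged sketch; mlx_lm's API has shifted between versions, so check your install.
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Qwen3-4B-Instruct-2507-8bit",
    adapter_path="models/runs/mlx/latest/adapter",
)

# The CLI examples above pass a raw prompt, so this mirrors that; for chat-tuned
# models you may prefer tokenizer.apply_chat_template(...) first.
prompt = "Compress: The quarterly report shows revenue grew twelve percent year over year..."
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```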

@@ -192,6 +195,11 @@ HF_HUB_ENABLE_HF_TRANSFER=1

# Tinker (cloud training)
TINKER_API_KEY=tk_...

# DagsHub/MLflow (experiment tracking, optional)
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
# MLFLOW_TRACKING_URI is auto-derived from owner/repo if omitted
```
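
If you do set `MLFLOW_TRACKING_URI` yourself, the derived value appears to follow DagsHub's usual pattern (this matches the example tracking URI shown in `docs/MLX_TRAINING.md`):

```python
import os

# Assumed pattern, mirroring the example URI in docs/MLX_TRAINING.md.
owner = os.environ.get("DAGSHUB_OWNER", "Sudhendra")
repo = os.environ.get("DAGSHUB_REPO", "compression-layer")
tracking_uri = f"https://dagshub.com/{owner}/{repo}.mlflow"
```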

---
@@ -219,14 +227,22 @@ TINKER_API_KEY=tk_...

```bash
# Local inference
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# Local training (small scale)
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
# Local training (wrapper script with run storage)
python scripts/train_local.py --train

# Cloud training (production)
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# Post-training MLflow logging
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra

# Data preprocessing pipeline
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl
python scripts/format_training_data.py --input data/validated --output data/training
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# Run validation
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
59 changes: 46 additions & 13 deletions docs/TASKS.md
@@ -154,19 +154,52 @@ See: `docs/plans/2026-01-31-v2-production-training.md`
- Train production model on Tinker (Qwen3-8B)
- Release model on HuggingFace

### Tasks
- [ ] Create adapter-based compression generator
- [ ] Generate 5K+ code pairs using v1 adapter
- [ ] Generate 5K+ NL pairs using v1 adapter
- [ ] Validate synthetic pairs (Claude + GPT)
- [ ] Train v2 model on Tinker
- [ ] Evaluate and compare with v1
- [ ] Package and release model

### Estimated Costs
- Validation: $40-80
- Tinker training: $10-20
- Total: ~$55-110
### Data Generation ✅ COMPLETE
- [x] Generate synthetic code pairs using v1 adapter → **17,315 pairs** (`data/synthetic/code_v2.jsonl`)
- [x] Generate synthetic NL pairs using v1 adapter → **17,056 pairs** (`data/synthetic/nl_v2.jsonl`)
- [x] Total raw synthetic: **34,371 pairs**

### Data Preprocessing & Sanitization ✅ COMPLETE
- [x] Build heuristic preprocessing script (`scripts/preprocess_synthetic.py`)
- Strips `<think>`/`<tool_call>` generation artifacts
- Filters by character ratio and token ratio thresholds
- Writes clean + rejected outputs with rejection reasons
- [x] Preprocess code pairs (max-char-ratio 1.0, max-token-ratio 1.0) → **17,125 passed**, 190 rejected
- [x] Preprocess NL pairs (max-char-ratio 0.95, max-token-ratio 0.95) → **10,505 passed**, 6,551 rejected
- [x] Format into train/valid/test splits (80/10/10, seed 42) via `scripts/format_training_data.py`
- [x] Sanitize all splits via `scripts/data_sanitization.py` (structure, role, encoding checks)
- [x] Validate final data: 0 bad JSON, 0 bad messages, 0 bad roles across all splits

### V2 Data Summary
| Split | Count | Status |
|-------|-------|--------|
| `data/training/train.jsonl` | 19,845 | Sanitized, chat format ✅ |
| `data/training/valid.jsonl` | 2,473 | Sanitized, chat format ✅ |
| `data/training/test.jsonl` | 2,497 | Sanitized, chat format ✅ |
| Unsanitized (rejected by sanitizer) | 2,815 | Archived |

All training files have correct `(system, user, assistant)` message structure.
Prompt masking verified: loss computed only on assistant tokens (`--mask-prompt`).

### MLflow/DagsHub Logging ✅ COMPLETE
- [x] Post-training MLflow logger (`scripts/mlflow_logger.py`)
- Reads `run.json` + `train.log` from MLX run directories
- Logs params, step metrics, artifacts, and loss curve plots to DagsHub/MLflow
- CLI: `--experiment-name`, `--dagshub-owner`, `--dagshub-repo`
- [x] Dependencies installed: `dagshub` (v0.6.5), `mlflow` (v3.9.0)

### Remaining Tasks
- [ ] Train v2 model locally (MLX) on 19,845 examples
- [ ] Train v2 model on Tinker (Qwen3-8B) for production
- [ ] Evaluate v2 and compare with v1
- [ ] Log v2 training run to DagsHub/MLflow
- [ ] Package and release model on HuggingFace

### Cost Notes
- Synthetic generation: done via v1 adapter (free, local)
- Preprocessing/sanitization: heuristic pipeline (free, no API calls)
- Multi-model validation skipped in favor of heuristic filtering (saved ~$40-80)
- Tinker training: ~$10-20 estimated

---

2 changes: 1 addition & 1 deletion scripts/mlflow_logger.py
@@ -249,4 +249,4 @@ def main() -> None:


if __name__ == "__main__":
main()
main()