4 changes: 4 additions & 0 deletions .gitignore
@@ -63,6 +63,8 @@ data/synthetic/
data/validated/
data/final/
data/cache/
data/archive/
data/training/unsanitized_*.jsonl
data/*.jsonl
data/*.json
data/costs.log
@@ -130,3 +132,5 @@ temp/
.windsurf/
.zencoder/
skills/
data/training_v1/
data/validated_v1/
2,718 changes: 2,497 additions & 221 deletions data/training/test.jsonl

Large diffs are not rendered by default.

21,584 changes: 19,835 additions & 1,749 deletions data/training/train.jsonl

Large diffs are not rendered by default.

2,690 changes: 2,472 additions & 218 deletions data/training/valid.jsonl

Large diffs are not rendered by default.

25 changes: 23 additions & 2 deletions docs/AGENTS.md
@@ -119,6 +119,12 @@ gh pr merge --squash
| `src/training/train_mlx.py` | Local MLX training |
| `src/inference/compressor.py` | Production inference service |
| `configs/training.yaml` | Model & hyperparameters |
| `scripts/train_local.py` | CLI for MLX training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Task-Specific Instructions

@@ -201,6 +207,8 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Quick Commands
@@ -216,15 +224,28 @@ gh pr create --title "Phase X: Description" --body "## Summary\n- bullet points"
# Check PR CI status
gh pr checks

# === DATA PIPELINE ===
# Preprocess synthetic pairs (strip artifacts, filter by ratio)
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl --rejected data/validated/rejected_nl_pairs.jsonl --max-char-ratio 0.95 --max-token-ratio 0.95

# Format into train/valid/test splits
python scripts/format_training_data.py --input data/validated --output data/training --train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 --seed 42

# Sanitize training data
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# === LOCAL INFERENCE (MLX) ===
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# === LOCAL TRAINING (MLX) ===
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
python scripts/train_local.py --train

# === CLOUD TRAINING (Tinker) ===
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# === POST-TRAINING LOGGING ===
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra --dagshub-repo compression-layer

# === VALIDATION ===
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
26 changes: 20 additions & 6 deletions docs/CLAUDE.md
@@ -4,8 +4,8 @@

Universal semantic compression layer for LLM inputs. Compresses memories, code, context before API calls while preserving reasoning equivalence across Claude/GPT/Gemini.

**Model**: Qwen3-8B (via Unsloth)
**Training**: Unsloth + LoRA (2-5x faster, 70% less VRAM)
**Model**: Qwen3-4B-Instruct-2507-8bit (local MLX) / Qwen3-8B (Tinker cloud)
**Training**: MLX LoRA (local) + Tinker (cloud production)
**License**: Apache 2.0
**Repo**: https://github.com/Sudhendra/compression-layer

@@ -75,11 +75,12 @@ gh pr merge --squash

- `src/validation/` — Equivalence testing across models
- `src/generation/` — Compression pair synthesis
- `src/training/` — Unsloth fine-tuning on Qwen3-8B
- `src/training/` — MLX local + Tinker cloud fine-tuning
- `src/inference/` — Production compressor
- `data/raw/` — Source corpora (gitignored)
- `data/validated/` — Cross-model validated pairs
- `models/` — LoRA checkpoints, GGUF exports
- `data/synthetic/` — Raw synthetic pairs from v1 adapter (gitignored)
- `data/validated/` — Preprocessed and filtered pairs (gitignored)
- `data/training/` — Final train/valid/test splits in chat format (gitignored)
- `models/` — LoRA checkpoints, run logs, GGUF exports (gitignored)

## Code Conventions

@@ -115,8 +116,21 @@ OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN= # For model downloads
TINKER_API_KEY= # For cloud training
DAGSHUB_OWNER= # For MLflow logging (default: Sudhendra)
DAGSHUB_REPO= # For MLflow logging (default: compression-layer)
```

## Key Scripts

| Script | Purpose |
|--------|---------|
| `scripts/train_local.py` | MLX LoRA training with run storage |
| `scripts/preprocess_synthetic.py` | Heuristic filtering of synthetic pairs |
| `scripts/format_training_data.py` | Format validated pairs into train/valid/test splits |
| `scripts/data_sanitization.py` | Structure/role/encoding checks on training data |
| `scripts/mlflow_logger.py` | Post-training MLflow/DagsHub logging |
| `scripts/evaluate_adapter.py` | Cross-model equivalence evaluation |

## Common Tasks

**Change base model**: Update `configs/training.yaml` → `cloud.model`
114 changes: 114 additions & 0 deletions docs/MLX_TRAINING.md
@@ -106,3 +106,117 @@ tail -f models/runs/mlx/latest/train.err
control whether LoRA adapters are saved to disk.
- Adapter checkpoints are saved according to `save_every` and live under
`models/runs/mlx/<timestamp>/adapter/`.

## Data preprocessing pipeline

Before training, raw synthetic data must be preprocessed and sanitized. The full
pipeline is:

### Step 1: Preprocess synthetic pairs

Strip generation artifacts (`<think>`, `<tool_call>` tags) and filter by
compression ratio:

```bash
# NL pairs (stricter filtering: 0.95 char/token ratio)
python scripts/preprocess_synthetic.py \
--input data/synthetic/nl_v2.jsonl \
--output data/validated/nl_pairs.jsonl \
--rejected data/validated/rejected_nl_pairs.jsonl \
--max-char-ratio 0.95 \
--max-token-ratio 0.95

# Code pairs (default 1.0 thresholds)
python scripts/preprocess_synthetic.py \
--input data/synthetic/code_v2.jsonl \
--output data/validated/code_pairs.jsonl \
--rejected data/validated/rejected_code_pairs.jsonl \
--max-char-ratio 1.0 \
--max-token-ratio 1.0
```
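
For orientation, here is a minimal sketch of what the ratio filter does. It is an assumption-laden illustration, not the script's actual code: the record field names (`original`, `compressed`) and the artifact regex are hypothetical, and the real script additionally checks a token-level ratio.

```python
import json
import re

# Hypothetical artifact pattern; the script's actual stripping logic may differ.
ARTIFACT_RE = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def clean_and_filter(line: str, max_char_ratio: float = 0.95) -> dict | None:
    """Return the cleaned record, or None if it fails the ratio check."""
    record = json.loads(line)
    original = record["original"]      # assumed field name
    compressed = ARTIFACT_RE.sub("", record["compressed"]).strip()
    # Reject pairs that did not actually compress enough.
    if len(compressed) / max(len(original), 1) > max_char_ratio:
        return None
    record["compressed"] = compressed
    return record
```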

### Step 2: Format into train/valid/test splits

```bash
python scripts/format_training_data.py \
--input data/validated \
--output data/training \
--train-ratio 0.8 --valid-ratio 0.1 --test-ratio 0.1 \
--seed 42
```
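
As a sketch of the split logic under a fixed seed (the exact shuffling and rounding in `format_training_data.py` may differ; this only illustrates why `--seed 42` makes the splits reproducible):

```python
import random

def split_pairs(pairs: list[dict], seed: int = 42,
                train_ratio: float = 0.8, valid_ratio: float = 0.1):
    """Deterministic 80/10/10 split: same seed and input -> same splits."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    return (shuffled[:n_train],                    # train
            shuffled[n_train:n_train + n_valid],   # valid
            shuffled[n_train + n_valid:])          # test (remainder)
```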

### Step 3: Sanitize all splits

```bash
for split in train valid test; do
python scripts/data_sanitization.py \
--input "data/training/${split}.jsonl" \
--sanitized "data/training/sanitized_${split}.jsonl" \
--unsanitized "data/training/unsanitized_${split}.jsonl"
done

# Promote sanitized files
for split in train valid test; do
mv "data/training/sanitized_${split}.jsonl" "data/training/${split}.jsonl"
done
```
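
The checks are roughly of this shape. This is a sketch of the assumed structure/role/encoding checks, not the script's implementation; the `messages` key follows the standard chat format used by the splits:

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")

def is_sane(line: str) -> bool:
    """Structure, role, and encoding checks on one JSONL record."""
    try:
        record = json.loads(line)                # structure: must be valid JSON
        roles = tuple(m["role"] for m in record["messages"])
        if roles != EXPECTED_ROLES:              # role: exact (system, user, assistant) order
            return False
        for message in record["messages"]:
            message["content"].encode("utf-8")   # encoding: rejects lone surrogates etc.
            if not message["content"].strip():
                return False
        return True
    except (json.JSONDecodeError, KeyError, TypeError,
            AttributeError, UnicodeEncodeError):
        return False
```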

### Current data state (v2)

| File | Count |
|------|-------|
| `data/training/train.jsonl` | 19,845 |
| `data/training/valid.jsonl` | 2,473 |
| `data/training/test.jsonl` | 2,497 |

## Post-training: MLflow/DagsHub logging

After a training run completes, log metrics and artifacts to DagsHub/MLflow using
`scripts/mlflow_logger.py`. This is **not** called automatically by the training
scripts; you must run it manually.

### Prerequisites

```bash
# Ensure dagshub and mlflow are installed
pip install dagshub mlflow

# Or install via optional dependency group
pip install -e ".[mlflow]"
```

### Usage

```bash
# Log the latest MLX run (auto-detected)
python scripts/mlflow_logger.py \
--experiment-name "compression-v2" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer

# Log a specific run directory
python scripts/mlflow_logger.py \
--run-dir models/runs/mlx/2026-01-30_17-14-36 \
--experiment-name "compression-v1" \
--dagshub-owner Sudhendra \
--dagshub-repo compression-layer
```

The logger reads `run.json` and `train.log` from the run directory and logs (see the sketch after this list):
- **Params**: model, git_sha, lora_rank, batch_size, learning_rate, etc.
- **Metrics**: train_loss, val_loss, tokens_per_sec, peak_mem_gb (per step)
- **Artifacts**: run.json, train.log, adapter weights, loss curve plot
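
A condensed sketch of that flow. The `dagshub.init` / `mlflow.*` calls are real library APIs, but the `run.json` handling and the literal metric value are placeholders, not what `mlflow_logger.py` actually does:

```python
import json
from pathlib import Path

import dagshub
import mlflow

run_dir = Path("models/runs/mlx/2026-01-30_17-14-36")  # example run directory
meta = json.loads((run_dir / "run.json").read_text())

# Route MLflow tracking to the DagsHub-hosted server for this repo.
dagshub.init(repo_owner="Sudhendra", repo_name="compression-layer", mlflow=True)
mlflow.set_experiment("compression-v1")

with mlflow.start_run():
    # Params: assumed to come from run.json (model, git_sha, lora_rank, ...).
    mlflow.log_params({k: v for k, v in meta.items()
                       if isinstance(v, (str, int, float))})
    # Metrics: the real script parses per-step values out of train.log.
    mlflow.log_metric("train_loss", 1.23, step=100)  # placeholder value
    # Artifacts: files from the run directory.
    mlflow.log_artifact(str(run_dir / "run.json"))
    mlflow.log_artifact(str(run_dir / "train.log"))
```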

### Environment variables

You can set defaults via environment variables instead of CLI flags:

```bash
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
MLFLOW_TRACKING_URI=https://dagshub.com/Sudhendra/compression-layer.mlflow
```

**Note**: The default `--dagshub-owner` in the script is `Gautam-Galada` (the
contributor who wrote it). Override with `--dagshub-owner Sudhendra` or set the
`DAGSHUB_OWNER` env var.
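
A plausible way those defaults resolve (an assumption about the script's internals, shown with the standard argparse pattern): an explicit CLI flag wins, then the env var, then the built-in default.

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Env var overrides the built-in default; an explicit CLI flag overrides both.
parser.add_argument("--dagshub-owner",
                    default=os.environ.get("DAGSHUB_OWNER", "Gautam-Galada"))
args = parser.parse_args(["--dagshub-owner", "Sudhendra"])
print(args.dagshub_owner)  # -> Sudhendra
```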
34 changes: 25 additions & 9 deletions docs/SETUP.md
@@ -63,19 +63,22 @@ python -m mlx_lm.generate \
For quick iteration with Qwen3-4B:

```bash
# Fine-tune with MLX LoRA (local)
# Fine-tune with MLX LoRA (recommended wrapper script)
python scripts/train_local.py --train

# Or run mlx_lm directly
python -m mlx_lm.lora \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--train \
--data ./data/validated \
--iters 100 \
--data ./data/training \
--iters 500 \
--batch-size 2 \
--lora-rank 8

# Test adapter
python -m mlx_lm.generate \
--model mlx-community/Qwen3-4B-Instruct-4bit \
--adapter-path ./adapters \
--model mlx-community/Qwen3-4B-Instruct-2507-8bit \
--adapter-path models/runs/mlx/latest/adapter \
--prompt "Compress: ..."
```

@@ -192,6 +195,11 @@ HF_HUB_ENABLE_HF_TRANSFER=1

# Tinker (cloud training)
TINKER_API_KEY=tk_...

# DagsHub/MLflow (experiment tracking, optional)
DAGSHUB_OWNER=Sudhendra
DAGSHUB_REPO=compression-layer
# MLFLOW_TRACKING_URI is auto-derived from owner/repo if omitted
```
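
The auto-derivation presumably follows the DagsHub URI pattern shown in the MLflow docs above; a one-line sketch of the assumed rule, not verified against the script:

```python
owner, repo = "Sudhendra", "compression-layer"
# Matches the MLFLOW_TRACKING_URI format used elsewhere in these docs.
tracking_uri = f"https://dagshub.com/{owner}/{repo}.mlflow"
```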

---
@@ -219,14 +227,22 @@ TINKER_API_KEY=tk_...

```bash
# Local inference
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-4bit --prompt "..."
python -m mlx_lm.generate --model mlx-community/Qwen3-4B-Instruct-2507-8bit --prompt "..."

# Local training (small scale)
python -m mlx_lm.lora --model mlx-community/Qwen3-4B-Instruct-4bit --train --data ./data
# Local training (wrapper script with run storage)
python scripts/train_local.py --train

# Cloud training (production)
python scripts/train_tinker.py --config configs/training.yaml --output models/adapters/tinker

# Post-training MLflow logging
python scripts/mlflow_logger.py --experiment-name "compression-v2" --dagshub-owner Sudhendra

# Data preprocessing pipeline
python scripts/preprocess_synthetic.py --input data/synthetic/nl_v2.jsonl --output data/validated/nl_pairs.jsonl
python scripts/format_training_data.py --input data/validated --output data/training
python scripts/data_sanitization.py --input data/training/train.jsonl --sanitized data/training/sanitized_train.jsonl --unsanitized data/training/unsanitized_train.jsonl

# Run validation
python scripts/validate_batch.py --input data/seed/pairs.jsonl
```
59 changes: 46 additions & 13 deletions docs/TASKS.md
@@ -154,19 +154,52 @@ See: `docs/plans/2026-01-31-v2-production-training.md`
- Train production model on Tinker (Qwen3-8B)
- Release model on HuggingFace

### Tasks
- [ ] Create adapter-based compression generator
- [ ] Generate 5K+ code pairs using v1 adapter
- [ ] Generate 5K+ NL pairs using v1 adapter
- [ ] Validate synthetic pairs (Claude + GPT)
- [ ] Train v2 model on Tinker
- [ ] Evaluate and compare with v1
- [ ] Package and release model

### Estimated Costs
- Validation: $40-80
- Tinker training: $10-20
- Total: ~$55-110
### Data Generation ✅ COMPLETE
- [x] Generate synthetic code pairs using v1 adapter → **17,315 pairs** (`data/synthetic/code_v2.jsonl`)
- [x] Generate synthetic NL pairs using v1 adapter → **17,056 pairs** (`data/synthetic/nl_v2.jsonl`)
- [x] Total raw synthetic: **34,371 pairs**

### Data Preprocessing & Sanitization ✅ COMPLETE
- [x] Build heuristic preprocessing script (`scripts/preprocess_synthetic.py`)
- Strips `<think>`/`<tool_call>` generation artifacts
- Filters by character ratio and token ratio thresholds
- Writes clean + rejected outputs with rejection reasons
- [x] Preprocess code pairs (max-char-ratio 1.0, max-token-ratio 1.0) → **17,125 passed**, 190 rejected
- [x] Preprocess NL pairs (max-char-ratio 0.95, max-token-ratio 0.95) → **10,505 passed**, 6,551 rejected
- [x] Format into train/valid/test splits (80/10/10, seed 42) via `scripts/format_training_data.py`
- [x] Sanitize all splits via `scripts/data_sanitization.py` (structure, role, encoding checks)
- [x] Validate final data: 0 bad JSON, 0 bad messages, 0 bad roles across all splits

### V2 Data Summary
| Split | Count | Status |
|-------|-------|--------|
| `data/training/train.jsonl` | 19,845 | Sanitized, chat format ✅ |
| `data/training/valid.jsonl` | 2,473 | Sanitized, chat format ✅ |
| `data/training/test.jsonl` | 2,497 | Sanitized, chat format ✅ |
| Unsanitized (rejected by sanitizer) | 2,815 | Archived |

All training files have correct `(system, user, assistant)` message structure.
Prompt masking verified: loss computed only on assistant tokens (`--mask-prompt`).
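
Each line in the split files is one chat-format record shaped like the following (values are placeholders; only the assistant turn contributes to the loss under `--mask-prompt`):

```python
example_record = {
    "messages": [
        {"role": "system", "content": "<compression instructions>"},  # masked
        {"role": "user", "content": "<text to compress>"},            # masked
        {"role": "assistant", "content": "<compressed output>"},      # loss computed here
    ]
}
```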

### MLflow/DagsHub Logging ✅ COMPLETE
- [x] Post-training MLflow logger (`scripts/mlflow_logger.py`)
- Reads `run.json` + `train.log` from MLX run directories
- Logs params, step metrics, artifacts, and loss curve plots to DagsHub/MLflow
- CLI: `--experiment-name`, `--dagshub-owner`, `--dagshub-repo`
- [x] Dependencies installed: `dagshub` (v0.6.5), `mlflow` (v3.9.0)

### Remaining Tasks
- [ ] Train v2 model locally (MLX) on 19,845 examples
- [ ] Train v2 model on Tinker (Qwen3-8B) for production
- [ ] Evaluate v2 and compare with v1
- [ ] Log v2 training run to DagsHub/MLflow
- [ ] Package and release model on HuggingFace

### Cost Notes
- Synthetic generation: done via v1 adapter (free, local)
- Preprocessing/sanitization: heuristic pipeline (free, no API calls)
- Multi-model validation skipped in favor of heuristic filtering (saved ~$40-80)
- Tinker training: ~$10-20 estimated

---

2 changes: 1 addition & 1 deletion scripts/mlflow_logger.py
@@ -249,4 +249,4 @@ def main() -> None:


if __name__ == "__main__":
main()
main()