Training improvements: LoRA tuning, early stopping, loss normalization by Sudhendra · Pull Request #23 · Sudhendra/compression-layer

Sudhendra · 2026-02-16T22:52:25Z

Summary

Fixes severe overfitting observed in Qwen3-8B training (sum-reduced val loss rose from ~94 to ~130 by epoch 3). The changes are based on research into Qwen3 LoRA best practices.

- LoRA config tuned: r=64→16, alpha=128→32, dropout=0→0.05 (matches the Axolotl Qwen3 config and QLoRA paper recommendations)
- Early stopping added: patience=5 evals, threshold=0.01; stops training when val loss plateaus
- Loss normalization: training/validation loss is now reported as per-token cross-entropy (~2-5 range) instead of sum-reduced (~100-400 range)
- Mid-epoch eval enabled: eval_interval_steps=250 (was 0, i.e. disabled)
- Tinker visualizer: new `scripts/visualize_tinker_training.py` parses metrics.jsonl and plots loss curves with epoch boundaries, best val loss, and early stopping markers
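A note on the LoRA change in the summary: under the standard LoRA formulation the low-rank update is scaled by alpha / r, so going from r=64, alpha=128 to r=16, alpha=32 would keep that factor unchanged while shrinking the adapter. The sketch below only checks that arithmetic. It assumes the standard alpha / r scaling (not rsLoRA); the function names and the 4096 layer width are illustrative and not taken from this repo.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted weight: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Old vs. new settings from the PR summary.
old_scale = lora_scaling(rank=64, alpha=128)
new_scale = lora_scaling(rank=16, alpha=32)

# Hypothetical 4096 x 4096 projection, used only to illustrate the ratio.
old_params = lora_param_count(4096, 4096, rank=64)
new_params = lora_param_count(4096, 4096, rank=16)

print(f"scaling: {old_scale} -> {new_scale}")
print(f"adapter params per matrix: {old_params} -> {new_params}")
```

If that assumption holds, the change reduces adapter capacity (and adds dropout) without altering the effective update scale, which is consistent with the goal of curbing overfitting.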

Files Changed

| File | Change |
| --- | --- |
| `configs/training.yaml` | LoRA params, epochs=2, eval_interval, early stopping config, updated cost estimates |
| `src/training/train_tinker.py` | Early stopping logic, loss normalization, `_check_early_stopping` helper |
| `scripts/visualize_tinker_training.py` | New Tinker training curve visualizer |
| `tests/test_train_tinker.py` | 9 new tests for early stopping + loss normalization |
| `tests/test_visualize_tinker_training.py` | 12 new tests for metrics parser |

Test Plan

All 242 tests pass
Lint clean (ruff check)
Visualizer generates plot from existing training run
Run new training with these settings and verify early stopping triggers appropriately

- LoRA: rank 64→16, alpha 128→32, dropout 0→0.05 (research-backed)
- Epochs: 3→2 with early stopping (patience=5, threshold=0.01)
- Enable mid-epoch validation every 250 steps (was disabled)
- Add `_check_early_stopping` pure helper with proper edge case handling
- Check early stopping at both mid-epoch and epoch-end validation
- Log early stopping events to train.log and metrics.jsonl
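The early stopping rule described in this commit can be sketched as a pure function. This is a minimal illustration, not the PR's `_check_early_stopping`: the signature, the `EarlyStopState` type, and the reading of `threshold` as a minimum absolute improvement are all assumptions.

```python
import math
from typing import NamedTuple, Optional


class EarlyStopState(NamedTuple):
    best_val_loss: Optional[float]
    evals_without_improvement: int
    should_stop: bool


def check_early_stopping(
    val_loss: float,
    best_val_loss: Optional[float],
    evals_without_improvement: int,
    patience: int = 5,
    threshold: float = 0.01,
) -> EarlyStopState:
    """Pure helper: returns the updated state and never mutates its inputs.

    Assumption: an eval counts as an improvement only if it beats the best
    loss by more than `threshold`. NaN losses never count as improvements,
    and patience <= 0 disables stopping.
    """
    improved = not math.isnan(val_loss) and (
        best_val_loss is None or val_loss < best_val_loss - threshold
    )
    if improved:
        return EarlyStopState(val_loss, 0, False)
    bad_evals = evals_without_improvement + 1
    return EarlyStopState(best_val_loss, bad_evals, patience > 0 and bad_evals >= patience)


# Usage: feed every validation result (mid-epoch or epoch-end) through the helper.
losses = [3.0, 2.5, 2.495, 2.6, 2.7, 2.8, 2.9, 3.0]
state = EarlyStopState(None, 0, False)
stopped_at = None
for i, loss in enumerate(losses):
    state = check_early_stopping(loss, state.best_val_loss, state.evals_without_improvement)
    if state.should_stop:
        stopped_at = i
        break
```

Keeping the helper pure means the same call can serve both the mid-epoch and the epoch-end check, and edge cases (first eval, NaN, disabled patience) can be unit-tested without a training loop.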

- `_iter_training_batches` returns (batch, total_tokens, completion_tokens)
- Training loss divided by completion token count for interpretable values
- Validation loss accumulated as total_loss/total_completion_tokens
- metrics.jsonl logs both train_loss (per-token) and train_loss_total (raw)
- final_loss and logger.info use per-token values
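To illustrate the normalization described above, the sketch below divides a sum-reduced loss by the completion token count and accumulates validation loss as total loss over total completion tokens. The function names and the NaN fallback for empty batches are assumptions; only the `train_loss` and `train_loss_total` keys come from the commit message.

```python
from typing import Iterable, Tuple


def per_token_loss(total_loss: float, completion_tokens: int) -> float:
    """Convert a sum-reduced loss into per-token cross-entropy."""
    if completion_tokens <= 0:
        # Assumed fallback: no completion tokens means the value is undefined.
        return float("nan")
    return total_loss / completion_tokens


def validation_loss(batches: Iterable[Tuple[float, int]]) -> float:
    """Token-weighted validation loss: sum of losses over sum of completion tokens."""
    total_loss = 0.0
    total_tokens = 0
    for batch_loss, completion_tokens in batches:
        total_loss += batch_loss
        total_tokens += completion_tokens
    return per_token_loss(total_loss, total_tokens)


# Two hypothetical validation batches: (sum-reduced loss, completion tokens).
val_batches = [(300.0, 100), (90.0, 20)]
val = validation_loss(val_batches)

# Averaging per-batch means instead would over-weight the short batch.
mean_of_means = sum(per_token_loss(l, n) for l, n in val_batches) / len(val_batches)

# A training record carrying both values, as the commit describes.
record = {"train_loss": per_token_loss(300.0, 100), "train_loss_total": 300.0}
```

Accumulating totals before dividing keeps the validation number independent of how examples happen to be batched, which matters when completion lengths vary widely.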

- Parses metrics.jsonl (train, val, early_stop entry types)
- Two-subplot layout: loss curves + throughput
- EMA smoothing for noisy train loss (configurable alpha)
- Marks epoch boundaries, best val loss, early stopping
- Handles both old (sum-reduced) and new (per-token) formats
- CLI: --metrics, --output, --dpi, --ema-alpha
- 12 parser tests
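The parsing and smoothing steps listed above might look roughly like the following. The JSON field names (`type`, `step`, `train_loss`) are assumptions about the metrics.jsonl schema, and the EMA convention (alpha weighting the newest value) may differ from what the script actually uses; plotting is omitted.

```python
import json
from typing import Dict, Iterable, List


def parse_metrics(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group JSONL entries by their assumed `type` field, skipping malformed lines."""
    grouped: Dict[str, List[dict]] = {"train": [], "val": [], "early_stop": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated final line from an interrupted run
        if isinstance(entry, dict) and entry.get("type") in grouped:
            grouped[entry["type"]].append(entry)
    return grouped


def ema(values: Iterable[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average, seeded with the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


sample = [
    '{"type": "train", "step": 1, "train_loss": 4.0}',
    '{"type": "train", "step": 2, "train_loss": 2.0}',
    '{"type": "val", "step": 2, "val_loss": 3.1}',
    '{"type": "early_stop", "step": 2}',
    '{"type": "train", "step": 3, "train_lo',  # truncated line, skipped
]
metrics = parse_metrics(sample)
smoothed_train = ema([e["train_loss"] for e in metrics["train"]], alpha=0.5)
```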

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76afa9afec

scripts/visualize_tinker_training.py

src/training/train_tinker.py

Sudhendra added 6 commits February 16, 2026 02:42

test: add early stopping and loss normalization tests

b43f6a9

docs: update cost estimates for 2-epoch training with early stopping

8c00126

style: fix lint issues in visualizer tests

76afa9a

chatgpt-codex-connector bot reviewed Feb 16, 2026

fix: address CI and codex review feedback

42f67e5

- Fix mypy type error in step_loss_per_token assignment
- Guard best_val_loss lookup against missing exact match (P1)
- Persist early stopping state across checkpoint resumes (P2)
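For the P2 item, one way to keep early stopping state across resumes is to store it next to the checkpoint and reload it with safe defaults. This is a sketch under assumptions: the file name `early_stop_state.json`, the key names, and both function names are hypothetical and may not match how `train_tinker.py` stores its state.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = "early_stop_state.json"  # hypothetical file name
DEFAULT_STATE = {"best_val_loss": None, "evals_without_improvement": 0}


def save_early_stop_state(checkpoint_dir, best_val_loss, evals_without_improvement) -> Path:
    """Write the state via a temp file so a crash cannot leave a half-written file."""
    path = Path(checkpoint_dir) / STATE_FILE
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({
        "best_val_loss": best_val_loss,
        "evals_without_improvement": evals_without_improvement,
    }))
    tmp.replace(path)
    return path


def load_early_stop_state(checkpoint_dir) -> dict:
    """Return the saved state, or defaults if the file is missing or unreadable."""
    path = Path(checkpoint_dir) / STATE_FILE
    if not path.exists():
        return dict(DEFAULT_STATE)
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return dict(DEFAULT_STATE)
    if not isinstance(data, dict):
        return dict(DEFAULT_STATE)
    return {**DEFAULT_STATE, **data}


# Usage: a fresh run starts from defaults; a resumed run picks up where it left off.
with tempfile.TemporaryDirectory() as ckpt_dir:
    fresh = load_early_stop_state(ckpt_dir)
    save_early_stop_state(ckpt_dir, best_val_loss=2.41, evals_without_improvement=3)
    resumed = load_early_stop_state(ckpt_dir)
```

Without something like this, a resumed run would reset its patience counter and best loss, so it could train well past the point where the original run would have stopped.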

Sudhendra merged commit 937f575 into main Feb 16, 2026
3 checks passed

Sudhendra merged 7 commits into main from feat/training-improvements
