feat: add resumable validation/checkpoint flow for Tinker ServiceClient#22
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52356c6d7a
```python
state["latest_checkpoint_path"] = checkpoint_path
state["checkpoints"] = checkpoints
state["updated_at"] = _utc_now_iso()
```
Persist completed_steps when checkpointing
This checkpoint branch updates latest_checkpoint_path but does not persist completed_steps = current_step. If checkpoint_interval_steps is not aligned with log_interval_steps, a crash after checkpointing will resume model weights from a later step while batch skipping still uses an older step count, causing duplicated optimizer updates and inconsistent resumed training.
Fixed in ccd2d8e. Checkpoint persistence now also writes state["completed_steps"] = current_step before saving run state in the checkpoint branch, so resume skip logic stays aligned even when checkpoint and log intervals differ.
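A minimal sketch of what the fixed checkpoint branch could look like. The helper name `persist_checkpoint_state` and the inline `_utc_now_iso` definition are assumptions for illustration; the actual PR only shows the state-dict assignments.

```python
from datetime import datetime, timezone


def _utc_now_iso() -> str:
    # Assumed stand-in for the helper referenced in the diff.
    return datetime.now(timezone.utc).isoformat()


def persist_checkpoint_state(state, checkpoint_path, checkpoints, current_step):
    """Hypothetical helper mirroring the fixed checkpoint branch."""
    state["latest_checkpoint_path"] = checkpoint_path
    state["checkpoints"] = checkpoints
    # The fix: record completed_steps at checkpoint time so a resumed run
    # skips exactly the batches the restored weights have already seen,
    # even when checkpoint_interval_steps != log_interval_steps.
    state["completed_steps"] = current_step
    state["updated_at"] = _utc_now_iso()
    return state
```

With this in place, resume logic can trust `state["completed_steps"]` to match the saved weights rather than the last logged step.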
```python
training_client = service_client.create_lora_training_client(
    base_model=config.model,
    rank=config.lora.rank,
)
```
Reset progress when falling back to fresh training client
When a prior latest_checkpoint_path exists but the SDK lacks create_training_client_from_state_with_optimizer, this code starts a fresh LoRA client without clearing previously loaded completed_steps. The later skip logic then omits early batches on a brand-new model, silently under-training the run.
Fixed in ccd2d8e. When a prior checkpoint exists but the SDK restore API is unavailable, the code now falls back to a fresh client and explicitly resets latest_checkpoint_path and completed_steps, so the skip logic no longer omits batches on a new model.
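A sketch of the fallback shape described above. The stub config dict, the `build_training_client` name, and the exact restore-method signature are assumptions; the real code uses the Tinker SDK's `create_training_client_from_state_with_optimizer` when present.

```python
def build_training_client(service_client, config, state):
    """Hypothetical fallback: restore from checkpoint when the SDK
    supports it, otherwise start fresh and reset resume bookkeeping."""
    restore = getattr(
        service_client, "create_training_client_from_state_with_optimizer", None
    )
    if state.get("latest_checkpoint_path") and restore is not None:
        return restore(state["latest_checkpoint_path"])
    # Fresh client: clear stale progress so later batch-skipping logic
    # does not silently under-train a brand-new model.
    state["latest_checkpoint_path"] = None
    state["completed_steps"] = 0
    return service_client.create_lora_training_client(
        base_model=config["model"], rank=config["lora_rank"]
    )
```

The key design point is that resume bookkeeping (`completed_steps`) must be reset in the same branch that abandons the checkpoint, never left for a later code path to notice.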
```diff
  final_loss = step_loss

- if current_step % 10 == 0 or current_step == total_steps:
+ if current_step % config.log_interval_steps == 0 or current_step == total_steps:
```
Validate log interval before modulo arithmetic
log_interval_steps is now user-configurable, but this condition performs % config.log_interval_steps unguarded. Setting the interval to 0 in YAML/CLI will raise ZeroDivisionError during training instead of providing a clear config error or disabling logging.
Fixed in ccd2d8e. Added a guard at the start of _train_with_service_client_sdk that returns a clear error when log_interval_steps <= 0, preventing modulo-by-zero during training.
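A minimal sketch of such a guard, separated from the logging predicate for clarity. The function names `validate_log_interval` and `should_log` are hypothetical; the PR places the equivalent check at the start of `_train_with_service_client_sdk`.

```python
def validate_log_interval(log_interval_steps: int) -> None:
    """Fail fast on a non-positive interval instead of raising
    ZeroDivisionError mid-training."""
    if log_interval_steps <= 0:
        raise ValueError(
            f"log_interval_steps must be a positive integer, "
            f"got {log_interval_steps}"
        )


def should_log(current_step: int, total_steps: int, log_interval_steps: int) -> bool:
    # Always log the final step, regardless of interval alignment.
    return current_step % log_interval_steps == 0 or current_step == total_steps
```

Validating once before the loop turns a confusing runtime crash into an actionable config error at startup.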
Summary
(tinker_run.json, run.json, metrics.jsonl, train.log) and write MLflow-compatible train/val log lines.
Test Plan