
feat: add resumable validation/checkpoint flow for Tinker ServiceClient #22

Merged
Sudhendra merged 2 commits into main from feat/tinker-validation-checkpointing on Feb 13, 2026

Conversation

@Sudhendra
Owner

Summary

  • Add ServiceClient training controls for periodic logging, checkpointing, validation cadence, and checkpoint resume behavior.
  • Persist service-mode artifacts (tinker_run.json, run.json, metrics.jsonl, train.log) and write MLflow-compatible train/val log lines; a sketch of the resumable run state follows this list.
  • Add regression tests for artifact generation and checkpoint resume, and update Tinker setup docs/CLI defaults.
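
For orientation, the resume flow hinges on a small run-state document written next to the other artifacts. The sketch below is illustrative only: the key names (completed_steps, latest_checkpoint_path, checkpoints, updated_at) are the ones quoted in the review threads further down, while the file layout, helper names, and metrics.jsonl schema are assumptions rather than the exact code in src/training/train_tinker.py.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def _utc_now_iso() -> str:
    """UTC timestamp used for the updated_at field."""
    return datetime.now(timezone.utc).isoformat()


def load_run_state(run_dir: Path) -> dict:
    """Load prior run state from run.json, or start fresh when none exists."""
    path = run_dir / "run.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"completed_steps": 0, "latest_checkpoint_path": None, "checkpoints": []}


def save_run_state(run_dir: Path, state: dict) -> None:
    """Persist the resume bookkeeping so a restart can skip completed batches."""
    state["updated_at"] = _utc_now_iso()
    (run_dir / "run.json").write_text(json.dumps(state, indent=2))


def append_metric(run_dir: Path, step: int, split: str, loss: float) -> None:
    """Append one train/val metric line to metrics.jsonl (schema is an assumption)."""
    line = {"step": step, "split": split, "loss": loss, "timestamp": _utc_now_iso()}
    with (run_dir / "metrics.jsonl").open("a") as fh:
        fh.write(json.dumps(line) + "\n")
```

On resume, the loaded completed_steps value is what drives how many already-processed batches the training loop skips before issuing new optimizer updates.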

Test Plan

  • ruff check src/training/train_tinker.py tests/test_train_tinker.py scripts/train_tinker.py docs/SETUP.md
  • source .venv/bin/activate && mypy src/training/train_tinker.py --ignore-missing-imports
  • pytest tests/test_train_tinker.py tests/test_tinker_sdk_client.py tests/test_tinker_data.py -q


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52356c6d7a


Comment on lines +834 to +836
state["latest_checkpoint_path"] = checkpoint_path
state["checkpoints"] = checkpoints
state["updated_at"] = _utc_now_iso()


P1: Persist completed_steps when checkpointing

This checkpoint branch updates latest_checkpoint_path but does not persist completed_steps = current_step. If checkpoint_interval_steps is not aligned with log_interval_steps, a crash after checkpointing will resume model weights from a later step while batch skipping still uses an older step count, causing duplicated optimizer updates and inconsistent resumed training.


@Sudhendra (Owner, Author)

Fixed in ccd2d8e. Checkpoint persistence now also writes state["completed_steps"] = current_step before saving run state in the checkpoint branch, so resume skip logic stays aligned even when checkpoint and log intervals differ.
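
In sketch form, the corrected checkpoint branch now does roughly the following (the helper name, signature, and run.json write are illustrative; the state keys match the snippet quoted above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_checkpoint(state: dict, run_dir: Path, checkpoint_path: str, current_step: int) -> None:
    """Update resume bookkeeping immediately after a checkpoint is written.

    Persisting completed_steps alongside latest_checkpoint_path keeps resume-time
    batch skipping aligned with the restored weights even when
    checkpoint_interval_steps and log_interval_steps differ.
    """
    state.setdefault("checkpoints", []).append(checkpoint_path)
    state["latest_checkpoint_path"] = checkpoint_path
    state["completed_steps"] = current_step
    state["updated_at"] = datetime.now(timezone.utc).isoformat()
    (run_dir / "run.json").write_text(json.dumps(state, indent=2))
```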

Comment on lines +657 to +660
training_client = service_client.create_lora_training_client(
base_model=config.model,
rank=config.lora.rank,
)


P1: Reset progress when falling back to a fresh training client

When a prior latest_checkpoint_path exists but the SDK lacks create_training_client_from_state_with_optimizer, this code starts a fresh LoRA client without clearing previously loaded completed_steps. The later skip logic then omits early batches on a brand-new model, silently under-training the run.


@Sudhendra (Owner, Author)

Fixed in ccd2d8e. When a prior checkpoint exists but the SDK restore API is unavailable, the code now falls back to a fresh client and explicitly resets latest_checkpoint_path and completed_steps to avoid skipping batches on a new model.
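
A rough sketch of that fallback, assuming the client-construction calls quoted in the diff above (the helper name and the restore call's signature are guesses, not the repo's actual API):

```python
from typing import Any


def build_training_client(service_client: Any, state: dict, config: Any) -> Any:
    """Restore from the latest checkpoint when the SDK supports it; otherwise
    start a fresh LoRA client and clear stale resume bookkeeping so the later
    skip logic does not drop early batches on a brand-new model."""
    latest = state.get("latest_checkpoint_path")
    restore = getattr(service_client, "create_training_client_from_state_with_optimizer", None)
    if latest and restore is not None:
        return restore(latest)  # assumed to accept the saved state/checkpoint path
    state["latest_checkpoint_path"] = None
    state["completed_steps"] = 0
    return service_client.create_lora_training_client(
        base_model=config.model,
        rank=config.lora.rank,
    )
```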

final_loss = step_loss

- if current_step % 10 == 0 or current_step == total_steps:
+ if current_step % config.log_interval_steps == 0 or current_step == total_steps:


P2: Validate log interval before modulo arithmetic

log_interval_steps is now user-configurable, but this condition performs % config.log_interval_steps unguarded. Setting the interval to 0 in YAML/CLI will raise ZeroDivisionError during training instead of providing a clear config error or disabling logging.


@Sudhendra (Owner, Author)

Fixed in ccd2d8e. Added a guard at the start of _train_with_service_client_sdk that returns a clear error when log_interval_steps <= 0, preventing modulo-by-zero during training.
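
The guard amounts to something like the following, shown here as a standalone check rather than the in-function early return the fix actually uses (error type and message wording are illustrative):

```python
def validate_log_interval(log_interval_steps: int) -> None:
    """Reject a non-positive interval up front instead of letting the training
    loop hit ZeroDivisionError on current_step % log_interval_steps."""
    if log_interval_steps <= 0:
        raise ValueError(
            f"log_interval_steps must be a positive integer, got {log_interval_steps}"
        )
```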

Sudhendra merged commit fd7d020 into main on Feb 13, 2026
3 checks passed