feat: add resumable validation/checkpoint flow for Tinker ServiceClient#22
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52356c6d7a
```python
state["latest_checkpoint_path"] = checkpoint_path
state["checkpoints"] = checkpoints
state["updated_at"] = _utc_now_iso()
```
Persist completed_steps when checkpointing
This checkpoint branch updates latest_checkpoint_path but does not persist completed_steps = current_step. If checkpoint_interval_steps is not aligned with log_interval_steps, a crash after checkpointing will resume model weights from a later step while batch skipping still uses an older step count, causing duplicated optimizer updates and inconsistent resumed training.
Fixed in ccd2d8e. Checkpoint persistence now also writes state["completed_steps"] = current_step before saving run state in the checkpoint branch, so resume skip logic stays aligned even when checkpoint and log intervals differ.
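A minimal sketch of what the fixed checkpoint branch could look like. The helper name `persist_checkpoint_state` and the inline `_utc_now_iso` definition are assumptions for illustration; the actual PR only shows the state-dict assignments.

```python
from datetime import datetime, timezone


def _utc_now_iso() -> str:
    # Assumed stand-in for the helper referenced in the diff.
    return datetime.now(timezone.utc).isoformat()


def persist_checkpoint_state(state, checkpoint_path, checkpoints, current_step):
    """Hypothetical helper mirroring the fixed checkpoint branch."""
    state["latest_checkpoint_path"] = checkpoint_path
    state["checkpoints"] = checkpoints
    # The fix: record completed_steps at checkpoint time so a resumed run
    # skips exactly the batches the restored weights have already seen,
    # even when checkpoint_interval_steps != log_interval_steps.
    state["completed_steps"] = current_step
    state["updated_at"] = _utc_now_iso()
    return state
```

With this in place, resume logic can trust `state["completed_steps"]` to match the saved weights rather than the last logged step.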
```python
training_client = service_client.create_lora_training_client(
    base_model=config.model,
    rank=config.lora.rank,
)
```
Reset progress when falling back to fresh training client
When a prior latest_checkpoint_path exists but the SDK lacks create_training_client_from_state_with_optimizer, this code starts a fresh LoRA client without clearing previously loaded completed_steps. The later skip logic then omits early batches on a brand-new model, silently under-training the run.
Fixed in ccd2d8e. When a prior checkpoint exists but the SDK restore API is unavailable, the code now falls back to a fresh client and explicitly resets latest_checkpoint_path and completed_steps, so the skip logic no longer omits batches on a new model.
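A sketch of the fallback shape described above. The stub config dict, the `build_training_client` name, and the exact restore-method signature are assumptions; the real code uses the Tinker SDK's `create_training_client_from_state_with_optimizer` when present.

```python
def build_training_client(service_client, config, state):
    """Hypothetical fallback: restore from checkpoint when the SDK
    supports it, otherwise start fresh and reset resume bookkeeping."""
    restore = getattr(
        service_client, "create_training_client_from_state_with_optimizer", None
    )
    if state.get("latest_checkpoint_path") and restore is not None:
        return restore(state["latest_checkpoint_path"])
    # Fresh client: clear stale progress so later batch-skipping logic
    # does not silently under-train a brand-new model.
    state["latest_checkpoint_path"] = None
    state["completed_steps"] = 0
    return service_client.create_lora_training_client(
        base_model=config["model"], rank=config["lora_rank"]
    )
```

The key design point is that resume bookkeeping (`completed_steps`) must be reset in the same branch that abandons the checkpoint, never left for a later code path to notice.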
```diff
  final_loss = step_loss

- if current_step % 10 == 0 or current_step == total_steps:
+ if current_step % config.log_interval_steps == 0 or current_step == total_steps:
```
Validate log interval before modulo arithmetic
log_interval_steps is now user-configurable, but this condition performs % config.log_interval_steps unguarded. Setting the interval to 0 in YAML/CLI will raise ZeroDivisionError during training instead of providing a clear config error or disabling logging.
Fixed in ccd2d8e. Added a guard at the start of _train_with_service_client_sdk that returns a clear error when log_interval_steps <= 0, preventing modulo-by-zero during training.
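A minimal sketch of such a guard, separated from the logging predicate for clarity. The function names `validate_log_interval` and `should_log` are hypothetical; the PR places the equivalent check at the start of `_train_with_service_client_sdk`.

```python
def validate_log_interval(log_interval_steps: int) -> None:
    """Fail fast on a non-positive interval instead of raising
    ZeroDivisionError mid-training."""
    if log_interval_steps <= 0:
        raise ValueError(
            f"log_interval_steps must be a positive integer, "
            f"got {log_interval_steps}"
        )


def should_log(current_step: int, total_steps: int, log_interval_steps: int) -> bool:
    # Always log the final step, regardless of interval alignment.
    return current_step % log_interval_steps == 0 or current_step == total_steps
```

Validating once before the loop turns a confusing runtime crash into an actionable config error at startup.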
Summary
(tinker_run.json, run.json, metrics.jsonl, train.log) and write MLflow-compatible train/val log lines.
Test Plan