Commit 11defe5

Merge pull request #305 from roboflow/docs/result-checkpoints
document RF-DETR result checkpoints
2 parents 0b047e4 + ba77819

File tree: 1 file changed

docs/learn/train.md

Lines changed: 24 additions & 4 deletions
@@ -164,6 +164,30 @@ Different GPUs have different VRAM capacities, so adjust batch_size and grad_acc
 
 </details>
 
+### Result checkpoints
+
+During training, multiple model checkpoints are saved to the output directory:
+
+- `checkpoint.pth` – the most recent checkpoint, saved at the end of the latest epoch.
+
+- `checkpoint_<number>.pth` – periodic checkpoints saved every N epochs (default is every 10).
+
+- `checkpoint_best_ema.pth` – best checkpoint based on validation score, using the EMA (Exponential Moving Average) weights. EMA weights are a smoothed version of the model’s parameters across training steps, often yielding better generalization.
+
+- `checkpoint_best_regular.pth` – best checkpoint based on validation score, using the raw (non-EMA) model weights.
+
+- `checkpoint_best_total.pth` – final checkpoint selected for inference and benchmarking. It contains only the model weights (no optimizer state or scheduler) and is chosen as the better of the EMA and non-EMA models based on validation performance.
+
+??? note "Checkpoint file sizes"
+
+    Checkpoint sizes vary based on what they contain:
+
+    - **Training checkpoints** (e.g. `checkpoint.pth`, `checkpoint_<number>.pth`) include model weights, optimizer state, scheduler state, and training metadata. Use these to resume training.
+
+    - **Evaluation checkpoints** (e.g. `checkpoint_best_ema.pth`, `checkpoint_best_regular.pth`) store only the model weights — either EMA or raw — and are used to track the best-performing models. These may come from different epochs depending on which version achieved the highest validation score.
+
+    - **Stripped checkpoint** (e.g. `checkpoint_best_total.pth`) contains only the final model weights and is optimized for inference and deployment.
+
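As an illustration of the split described above (not part of this commit), the checkpoint files can be inspected with plain PyTorch. This is a minimal sketch: the `output/` directory and any specific dictionary keys are assumptions about the checkpoint layout, not documented guarantees.

```python
# Sketch: inspect what each checkpoint file actually stores.
# The output/ path and any specific keys (e.g. "model", "optimizer", "ema") are
# assumptions for illustration; print the keys rather than relying on them.
import torch

for name in ["checkpoint.pth", "checkpoint_best_total.pth"]:
    ckpt = torch.load(f"output/{name}", map_location="cpu", weights_only=False)
    if isinstance(ckpt, dict):
        # Full training checkpoints carry extra state; stripped ones are mostly weights.
        print(name, "->", sorted(ckpt.keys()))
    else:
        print(name, "->", type(ckpt))
```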
 ### Resume training
 
 You can resume training from a previously saved checkpoint by passing the path to the `checkpoint.pth` file using the `resume` argument. This is useful when training is interrupted or you want to continue fine-tuning an already partially trained model. The training loop will automatically load the weights and optimizer state from the provided checkpoint file.
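For illustration (not part of this commit), a resume call might look like the sketch below. It assumes the `rfdetr` Python API with an `RFDETRBase` class and a `train()` method; only the `resume` argument and the `checkpoint.pth` file name come from the text above, and the other parameter names and paths are placeholders.

```python
# Sketch: resume an interrupted run from the most recent full training checkpoint.
# RFDETRBase, dataset_dir, and the paths are illustrative assumptions; `resume`
# expects a training checkpoint (with optimizer state), not a stripped weights file.
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="path/to/dataset",
    epochs=50,
    resume="output/checkpoint.pth",
)
```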
@@ -214,10 +238,6 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py
 
 Replace `8` in the `--nproc_per_node` argument with the number of GPUs you want to use. This approach creates one training process per GPU and splits the workload automatically. Note that your effective batch size is multiplied by the number of GPUs, so you may need to adjust your `batch_size` and `grad_accum_steps` to maintain the same overall batch size.
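To make the batch-size note concrete, here is a small sketch of the arithmetic; the variable names are illustrative only, not CLI flags or API parameters.

```python
# Sketch: keep the overall (effective) batch size constant when adding GPUs.
# effective_batch = batch_size * grad_accum_steps * num_gpus
num_gpus = 8
target_effective_batch = 16   # e.g. tuned on one GPU with batch_size=4, grad_accum_steps=4
batch_size = 1                # per-GPU batch size that fits in VRAM
grad_accum_steps = target_effective_batch // (batch_size * num_gpus)  # -> 2
assert batch_size * grad_accum_steps * num_gpus == target_effective_batch
```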
 
-### Result checkpoints
-
-During training, two model checkpoints (the regular weights and an EMA-based set of weights) will be saved in the specified output directory. The EMA (Exponential Moving Average) file is a smoothed version of the model’s weights over time, often yielding better stability and generalization.
-
 ### Logging with TensorBoard
 
 [TensorBoard](https://www.tensorflow.org/tensorboard) is a powerful toolkit that helps you visualize and track training metrics. With TensorBoard set up, you can train your model and keep an eye on the logs to monitor performance, compare experiments, and optimize model training. To enable logging, simply pass `tensorboard=True` when training the model.
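As an illustration (not part of this commit), enabling the flag from Python might look like the sketch below; `RFDETRBase`, `dataset_dir`, and `output_dir` are assumptions, and only `tensorboard=True` comes from the text above. The resulting event files can then be viewed with the standard `tensorboard --logdir <output dir>` command.

```python
# Sketch: enable TensorBoard logging during training, then run
# `tensorboard --logdir output` in a separate terminal to view the metrics.
# RFDETRBase, dataset_dir, and output_dir are illustrative assumptions.
from rfdetr import RFDETRBase

model = RFDETRBase()
model.train(
    dataset_dir="path/to/dataset",
    epochs=50,
    output_dir="output",
    tensorboard=True,  # write event files for TensorBoard to visualize
)
```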
