Description
Hi MONAI team,
I'm training a segmentation model from scratch using Auto3Dseg. Training remains efficient (less than 1 minute per image), but the per-image validation time increases dramatically in later epochs, and the issue worsens with each epoch. For example, during training epoch 294 the validation images were processed in roughly 80–250 seconds each, but by epoch 298 some validation images took over 3300 s (~55 minutes) each!
Sample Output:
Epoch 294 vs 298 Validation Timing:
Final training 294/299 loss: 0.7182 acc_avg: 0.7918 acc [ 0.667 0.917] time 82.54s lr: 2.0133e-07
Val 294/300 0/12 ... time 80.45s
Val 294/300 3/12 ... time 233.68s
Val 294/300 5/12 ... time 253.24s
Final training 298/299 loss: 0.6506 acc_avg: 0.7909 acc [ 0.631 0.951] time 84.18s lr: 2.2377e-08
Val 298/300 5/12 ... time 3393.77s
Val 298/300 6/12 ... time 1763.08s
Val 298/300 8/12 ... time 2419.77s
Troubleshooting done:
- Confirmed it's not a data I/O bottleneck (the validation data is static)
- Tried emptying the CUDA cache before/after validation steps (see the sketch after this list)
- Checked the Segmenter class's validate function and the postprocessing for memory leaks or side effects
- Observed no large RAM/VRAM spikes
- Dataloader settings are unchanged across training
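For reference, this is roughly how I emptied the CUDA cache and checked for VRAM spikes around validation. It is a minimal sketch, not the actual Auto3Dseg code: `model`, `val_loader`, and the `batch["image"]` key are placeholders standing in for the Segmenter's model and validation dataloader.

```python
import time
import torch

def timed_validation(model, val_loader, device="cuda"):
    """Hypothetical helper mirroring the checks above: empties the CUDA cache
    around validation and records per-image wall time and peak VRAM."""
    model.eval()
    torch.cuda.empty_cache()                    # clear cached blocks before validation
    torch.cuda.reset_peak_memory_stats(device)  # start peak-VRAM tracking from zero
    with torch.no_grad():
        for i, batch in enumerate(val_loader):
            image = batch["image"].to(device)
            torch.cuda.synchronize(device)      # so timings reflect finished GPU work
            start = time.perf_counter()
            _ = model(image)
            torch.cuda.synchronize(device)
            elapsed = time.perf_counter() - start
            peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
            print(f"val image {i}: {elapsed:.2f}s, peak VRAM {peak_gb:.2f} GB")
    torch.cuda.empty_cache()                    # release cache again after validation

```

With this instrumentation, the peak VRAM stayed flat across epochs while the per-image wall time kept growing, which is why I don't think it is a memory issue.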
I am training on Windows 11 with an NVIDIA A40 (48 GB) GPU, PyTorch 2.5.1, and CUDA 12.4. After training, inference takes ~0.37 seconds per image.
Could you please help with this processing time issue? Thank you for your time and for the great framework!