
Checkpointing before the session times out #10

Open · qAp opened this issue on Jan 15, 2022 · 1 comment
Comments

@qAp (Owner) commented on Jan 15, 2022

At the moment, it takes well over 9 hours (around 14 hours) to train a single epoch, so not a single checkpoint is saved before the Kaggle session times out. This makes resuming training impossible.

@qAp (Owner, Author) commented on Jan 24, 2022

Normally, validation is carried out after each epoch of training, and model checkpoints are saved by monitoring some validation metric. This means validation runs once per epoch, so checkpoints are saved at most once per epoch.

In PyTorch Lightning, Trainer.val_check_interval sets how often validation is carried out. If it is set to 0.25, validation runs after each quarter of a training epoch, for a total of 4 validation runs per epoch. This allows checking how well the model is doing more frequently.
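A minimal sketch of that setting (0.25 is the fractional example above; per the Lightning docs an integer value counts training batches instead):

```python
from pytorch_lightning import Trainer

# Run validation after every quarter of a training epoch
# (4 validation runs per epoch instead of 1).
trainer = Trainer(val_check_interval=0.25)

# An integer instead runs validation every N training batches, which can be
# easier to reason about when a single epoch takes many hours.
# trainer = Trainer(val_check_interval=2000)
```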

And, for the version of PyTorch Lightning currently in use, ModelCheckpoint's every_n_val_epochs argument should, together with its monitor argument, make the callback check the specified validation metric after every validation run and decide whether to save a checkpoint.
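Putting the two together, a minimal sketch (assuming the LightningModule logs a metric named val_loss; the actual metric name in this project may differ, and later Lightning releases renamed every_n_val_epochs to every_n_epochs):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# "val_loss" is a hypothetical name; it must match a metric the
# LightningModule logs via self.log(...) during validation.
checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",       # validation metric to track
    mode="min",               # lower loss is better
    save_top_k=1,             # keep only the best checkpoint seen so far
    every_n_val_epochs=1,     # consider saving after every validation run
)

trainer = Trainer(
    val_check_interval=0.25,  # 4 validation runs per training epoch
    callbacks=[checkpoint_cb],
)
# trainer.fit(model)
```

With a ~14-hour epoch, 4 validation runs per epoch mean a checkpoint opportunity roughly every 3.5 hours, comfortably inside the 9-hour session limit.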
