
Fix validation and checkpointing interval #224

Merged
merged 4 commits on Aug 3, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -11,7 +11,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Checkpoints now include model parameters, allowing for mismatches with the provided configuration file.
- `accelerator` parameter now controls the accelerator (CPU, GPU, etc) that is used.
- `devices` parameter controls the number of accelerators used.
-- `every_n_train_steps` parameter now controls the frequency of both validation epochs and model checkpointing during training.
+- `val_check_interval` parameter now controls the frequency of both validation epochs and model checkpointing during training.

### Changed

@@ -22,6 +22,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Validation performance metrics are logged (and added to tensorboard) at the validation epoch, and training loss is logged at the end of training epoch, i.e. training and validation metrics are logged asynchronously.
- Irrelevant warning messages on the console output and in the log file are no longer shown.
- Nicely format logged warnings.
+- `every_n_train_steps` has been renamed to `val_check_interval` in accordance with the corresponding PyTorch Lightning parameter.

### Removed

2 changes: 1 addition & 1 deletion casanovo/config.py
@@ -65,7 +65,7 @@ class Config:
train_from_scratch=bool,
save_top_k=int,
model_save_folder_path=str,
-every_n_train_steps=int,
+val_check_interval=int,
accelerator=str,
devices=int,
calculate_precision=bool,
2 changes: 1 addition & 1 deletion casanovo/config.yaml
@@ -115,7 +115,7 @@ save_top_k: 5
# Path to saved checkpoints
model_save_folder_path: ""
# Model validation and checkpointing frequency in training steps
-every_n_train_steps: 50_000
+val_check_interval: 50_000
# Calculate peptide and amino acid precision during training. this
# is expensive, so we recommend against it.
calculate_precision: False
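
Because the config key changes outright, an older `config.yaml` that still sets `every_n_train_steps` will presumably need to be updated to the new name. A hypothetical migration sketch (not part of this PR; the file path and fallback logic are illustrative only):

```python
# Hypothetical helper (not Casanovo code): carry an old-style config key
# over to its new name before handing the dict to the rest of the pipeline.
import yaml

with open("config.yaml") as f:          # illustrative path to a user config
    cfg = yaml.safe_load(f)

if "every_n_train_steps" in cfg:        # old key from pre-rename configs
    cfg["val_check_interval"] = cfg.pop("every_n_train_steps")
```
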
3 changes: 2 additions & 1 deletion casanovo/denovo/model_runner.py
@@ -189,7 +189,8 @@ def initialize_trainer(self, train: bool) -> None:
max_epochs=self.config.max_epochs,
num_sanity_val_steps=self.config.num_sanity_val_steps,
strategy=self._get_strategy(),
-val_check_interval=self.config.every_n_train_steps,
+val_check_interval=self.config.val_check_interval,
+check_val_every_n_epoch=None,
)
trainer_cfg.update(additional_cfg)

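
The added `check_val_every_n_epoch=None` line matters here: when `val_check_interval` is an integer, PyTorch Lightning by default interprets it as a number of training batches within each epoch, and only with `check_val_every_n_epoch=None` does the interval count training batches across epoch boundaries. A minimal standalone sketch of the resulting Trainer setup (illustrative values, not Casanovo's actual `initialize_trainer`):

```python
# Minimal sketch: step-based validation scheduling in PyTorch Lightning.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="cpu",
    devices=1,
    max_epochs=30,
    val_check_interval=50_000,     # validate every 50,000 training batches
    check_val_every_n_epoch=None,  # count the interval across epochs, not per epoch
)
```
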
4 changes: 2 additions & 2 deletions docs/faq.md
@@ -60,8 +60,8 @@ Using the filename (column "filename") you can then retrieve the corresponding p

By default, Casanovo saves a snapshot of the model weights after every 50,000 training steps.
Note that the number of samples that are processed during a single training step depends on the batch size.
-Therefore, when using the default training batch size of 32, this correspond to saving a model snapshot after every 1.6 million training samples.
-You can optionally modify the snapshot frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `every_n_train_steps`), depending on your dataset size.
+Therefore, when using the default training batch size of 32, this corresponds to saving a model snapshot after every 1.6 million training samples.
+You can optionally modify the snapshot (and validation) frequency in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (parameter `val_check_interval`), depending on your dataset size.
Note that taking very frequent model snapshots will result in somewhat slower training time because Casanovo will evaluate its performance on the validation data for every snapshot.

When saving a model snapshot, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file.
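
The FAQ's 1.6 million figure is just the interval multiplied by the batch size; a quick sanity check of that arithmetic (illustrative only, nothing Casanovo-specific):

```python
# Samples seen between snapshots = batch size * validation/checkpoint interval.
batch_size = 32
val_check_interval = 50_000  # training steps between validation + checkpointing

samples_per_snapshot = batch_size * val_check_interval
print(f"{samples_per_snapshot:,}")  # 1,600,000 -> the FAQ's "1.6 million samples"
```
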
2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -193,7 +193,7 @@ def tiny_config(tmp_path):
"warmup_iters": 1,
"max_iters": 1,
"max_epochs": 20,
"every_n_train_steps": 1,
"val_check_interval": 1,
"model_save_folder_path": str(tmp_path),
"accelerator": "cpu",
}