All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Please add your functional changes to the appropriate section in the PR. Keep it human-readable, your future self will thank you!
- Do not update the NaN weight mask for the loss function when using the remapper and no imputer #178
- Don't crash when using the profiler if certain env vars aren't set #180
- Remove saving of metadata to training checkpoint #57
- Fixes to callback plots #182 (power spectrum large numpy array error and precip cmap for cases where precip is prognostic).
- GraphTrainableParameters callback will log a warning when no trainable parameters are specified #173
- Fixes to checkpoint saving: ensure the last checkpoint is saved when using max_steps #191
- Identify stretched grid models based on graph rather than configuration file #204
- Introduce variable to configure: transfer_learning -> bool, True if loading checkpoint in a transfer learning setting.
- TRANSFER LEARNING: enabled new functionality. You can now load checkpoints from different models and different training runs.
- Effective batch size: (config.dataloader.batch_size["training"] * config.hardware.num_gpus_per_node * config.hardware.num_nodes) // config.hardware.num_gpus_per_model. Used for experiment reproducibility across different computing configurations.
- Added a check for the variable sorting on pre-trained/finetuned models #120
- Added default configuration files for stretched grid and limited area model experiments #173
- Added new metrics for stretched grid models to track losses inside/outside the regional domain #199
- Add supporting arrays (numpy) to checkpoint
- Support for masking out unconnected nodes in LAM #171
- Improved validation metrics, allow 'all' to be scaled #202
- Removed the resolution config entry #120
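
As a rough illustration of the effective batch size entry above, the formula can be evaluated with plain integers. This is a minimal sketch: the function name effective_batch_size and the example values are hypothetical, and only the arithmetic is taken from the entry.

```python
def effective_batch_size(
    batch_size: int,
    num_gpus_per_node: int,
    num_nodes: int,
    num_gpus_per_model: int,
) -> int:
    """Hypothetical helper mirroring the formula quoted in the changelog entry."""
    # Per-loader batch size times the total GPU count, integer-divided by
    # the number of GPUs that jointly hold one model instance.
    return (batch_size * num_gpus_per_node * num_nodes) // num_gpus_per_model


# Example values are made up for illustration only.
single_node = effective_batch_size(2, 4, 1, 1)  # one GPU per model
sharded = effective_batch_size(2, 4, 2, 4)      # model sharded over 4 GPUs
print(single_node, sharded)
```

Two hardware layouts that give the same value here would, under this formula, be treated as equivalent for reproducibility purposes.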
0.3.1 - AIFS v0.3 Compatibility - 2024-11-28
- Perform full shuffle of training dataset #153
- Update n_pixel used by datashader to better adapt across resolutions #152
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable #165
- Introduce variable to configure (Cosine Annealing) optimizer warm up #155
- Add reader groups to reduce CPU memory usage and increase dataloader throughput #76
- Bump anemoi-graphs version to 0.4.1 #159
0.3.0 - Loss & Callback Refactors - 2024-11-14
- Rename loss_scaling to variable_loss_scaling #138
- Refactored callbacks. #60
- Refactored rollout #87
- Enable longer validation rollout than training
- Expand iterables in logging #91
- Save entire config in mlflow
- Included more loss functions and allowed configuration #70
- Include option to use datashader and optimised asynchronous callbacks #102
- Fix that applies the metric_ranges in the post-processed variable space #116
- Allow updates to scalars #137
- Add without subsetting in ScaleTensor
- Sub-hour datasets #63
- Add synchronisation workflow #92
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report #38
- Feat: Save a gif for longer rollouts in validation #65
- New limited area config file added, limited_area.yaml. #134
- New stretched grid config added, stretched_grid.yaml #133
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. #136
- Custom System monitor for Nvidia and AMD GPUs #147
- Renamed frequency keys in callbacks configuration. #118
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default #67
- Merged node & edge trainable feature callbacks into one. #135
- Increase the default MlFlow HTTP max retries #111
0.2.2 - Maintenance: pin python <3.13 - 2024-10-28
- Lock python version <3.13 #107
0.2.1 - Bugfix: resuming mlflow runs - 2024-10-24
- Mlflow-sync to include new tag for server to server syncing #83
- Mlflow-sync to include functionality to resume and fork server2server runs #83
- Rollout training for Limited Area Models. #79
- Feature: New Boolean1DMask class. Enables rollout training for limited area models. #79
- Fix pre-commit regex
- Mlflow-sync to handle creation of new experiments in the remote server #83
- Fix for multi-gpu when using mlflow due to refactoring of _get_mlflow_run_params function #99
- ci: fix pyshtools install error #100
- Fix __version__ import in init
- Update copyright notice
0.2.0 - Feature release - 2024-10-16
- Make pin_memory of the Dataloader configurable (#64)
- Add anemoi-transform link to documentation
- Codeowners file (#56)
- Changelog merge strategy (#56)
- Contributors file (#106)
-
Introduction of remapper to anemoi-models leads to changes in the data indices. Some preprocessors cannot be applied in-place anymore.
-
Variable Bounding as configurable model layers #13
- Enable the callback for plotting a histogram for variables containing NaNs
- Enforce same binning for histograms comparing true data to predicted data
- Fix: Inference checkpoints are now saved according to the frequency settings defined in the config #37
- Feature: Add configurable models #50
- Feature: Authentication support for mlflow sync - #51
- Feature: Support training for datasets with missing time steps #48
- Feature: AnemoiMlflowClient, an mlflow client with authentication support #86
- Long Rollout Plots
- Mask NaN values in training loss function #72 and #271
- Fix TypeError raised when trying to JSON serialise datetime.timedelta object - #43
- Bugfixes for CI (#56)
- Fix mlflow subcommand on python 3.9 #62
- Show correct subcommand in MLFlow - Addresses #39 in #61
- Fix interactive multi-GPU training #82
- Allow 500 characters in mlflow logging #88
- Updated configuration examples in documentation and corrected links - #46
- Remove credential prompt from mlflow login, replace with seed refresh token via web - #78
- Update CODEOWNERS
- Change how mlflow measures CPU Memory usage - #94
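
For context on the datetime.timedelta entry in the list above (#43): the standard json module raises TypeError for timedelta values unless a fallback encoder is supplied. The sketch below shows one generic way to handle this. It is an assumption for illustration, not the fix applied in this project, and the helper name to_jsonable and the frequency key are hypothetical.

```python
import datetime
import json


def to_jsonable(obj: object) -> object:
    """Hypothetical fallback encoder for values json cannot serialise natively."""
    if isinstance(obj, datetime.timedelta):
        # Represent the duration as a number of seconds (a float).
        return obj.total_seconds()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = {"frequency": datetime.timedelta(hours=6)}

# Without a fallback, json.dumps raises TypeError on the timedelta value.
try:
    json.dumps(payload)
    raised = False
except TypeError:
    raised = True

# With the fallback, the timedelta is encoded as seconds.
encoded = json.dumps(payload, default=to_jsonable)
print(raised, encoded)
```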
0.1.0 - Anemoi training - First release - 2024-08-16
- Subcommand for training anemoi-training train
- Subcommand for config generation
- Subcommand for mlflow: login and sync
- Subcommand for checkpoint handling
- Searchpaths for Hydra configs, to enable configs in CWD, ANEMOI_CONFIG_PATH env, and .config/anemoi/training in addition to package defaults
- MlFlow token authentication
- Configurable pressure level scaling
- Downstream CI to test all dependencies with changes
- Changelog Status check
- Readthedocs PR builder
- Changelog Release Updater Workflow
- Extended ruff Ruleset
- Added Docsig pre-commit hook
- __future__ annotations for typehints
- Added Typehints where missing
- Added Changelog
- Correct errors in callback plots
- fix error in the default config
- example slurm config
- ability to configure precip-type plots
- Fixed PyPI packaging
- Use of Anemoi models
- Use of Anemoi graphs
- Adjusted tests to work with new Anemoi ecosystem
- Adjusted configs to reasonable common defaults
- Changed hardware-specific keys from configs to ??? to trigger "missing"
- __len__ of NativeGridDataset
- Configurable dropout in attention layer
- First draft on Read the Docs
- Fixed docstrings
- Moved callbacks into folder to facilitate future refactor
- Adjusted PyPI release infrastructure to common ECMWF workflow
- Bumped versions in Pre-commit hooks
- Fix crash when logging hyperparameters with missing values in the config
- Fixed "null" tracker metadata when tracking is disabled, now returns an empty dict
- Pinned numpy<2 until we can test all migration
- (ci): path ignore of docs for downstream ci
- (ci): remove yaml anchor, unsupported by Github
- ci: make python QA reusable
- ci: permissions on changelog updater
- Dependency on mlflow-export-import
- Specific user configs
- len function of NativeGridDataset as it led to bugs
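
The Hydra search path entry in the 0.1.0 list above names three user-level config locations: the current working directory, the ANEMOI_CONFIG_PATH environment variable, and .config/anemoi/training. A minimal sketch of collecting those candidates follows. The helper name candidate_config_dirs, the ordering, and the choice to resolve .config/anemoi/training under the home directory are all assumptions; the entry does not state precedence or the base directory.

```python
from pathlib import Path
from typing import List, Mapping


def candidate_config_dirs(cwd: Path, home: Path, env: Mapping[str, str]) -> List[Path]:
    """Hypothetical helper listing the user-level config locations named in the entry."""
    dirs = [cwd]
    # Only include the environment variable location when it is set and non-empty.
    env_path = env.get("ANEMOI_CONFIG_PATH")
    if env_path:
        dirs.append(Path(env_path))
    # Assumption: the .config directory is resolved relative to the user's home.
    dirs.append(home / ".config" / "anemoi" / "training")
    return dirs


# Illustrative paths only; nothing is read from disk.
with_env = candidate_config_dirs(
    Path("/work/run"), Path("/home/user"), {"ANEMOI_CONFIG_PATH": "/shared/configs"}
)
without_env = candidate_config_dirs(Path("/work/run"), Path("/home/user"), {})
print(len(with_env), len(without_env))
```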