This example demonstrates how to train a neural global weather forecast model. The recipe is set up so that modifying the model architecture, data, or the training procedure is straightforward.
The conf directory contains the configuration files for the model, data, training, etc. The configs are written in YAML and managed with the omegaconf library. Several example configs are provided for generating different datasets, models, and training procedures. For example, AFNO and GraphCast each come with corresponding training and dataset configs. The default configs download and train on only a tiny dataset and can be run on an 8 GB GPU. To train larger models, adjust conf/config.yaml according to the comments in that file.
In this example we provide scripts to obtain the ERA5 dataset from ARCO ERA5 and to perform the needed curation and remapping steps. ARCO ERA5 contains a complete lat-lon gridded dataset of the ERA5 reanalysis, including single-level and pressure-level data. When training a model on ERA5, a temporal and channel subset is often used. For example, FourCastNet (AFNO) is trained on a 20-channel subset of ERA5 at 6-hour temporal resolution. Remapping from lat-lon grids may also be needed, as is the case with the DLWP model. Given these requirements, we provide the following workflow for generating the needed datasets, which works for most applications.
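As a small illustration of the temporal subsetting described above, the following sketch compares hourly time steps (ERA5's native resolution) with a 6-hourly subset for a single day. The helper name subsample_times is hypothetical and is not part of this example's scripts; the dataset configs remain the place where the actual time range and step are set.

```python
from datetime import datetime, timedelta


def subsample_times(start, end, step_hours):
    """Return timestamps from start (inclusive) to end (exclusive) every step_hours.

    Hypothetical helper for illustration only; it does not read any ERA5 data.
    """
    times = []
    current = start
    while current < end:
        times.append(current)
        current += timedelta(hours=step_hours)
    return times


# One day of hourly steps versus the 6-hourly subset mentioned in the text.
day_start = datetime(2020, 1, 1)
day_end = datetime(2020, 1, 2)
hourly = subsample_times(day_start, day_end, 1)
six_hourly = subsample_times(day_start, day_end, 6)

print(len(hourly), len(six_hourly))  # prints: 24 4
```

A 6-hourly subset therefore keeps one in six time steps, which is part of why a temporal subset reduces storage and download cost so much.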
We recommend first downloading a temporal and single-level subset of ERA5 from ARCO ERA5. This can be done using the download_era5.py script; configs for it can be found in ./conf/dataset/. With the non-tiny configs this script requires ~40 TB of storage, but the configs can be adjusted to download a smaller subset of the data. On a 2.5 Gb/s connection the full download takes ~1.5 days. The default configs download only ~100 GB.
python download_era5.py
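The download-time figure quoted above can be sanity-checked with quick arithmetic. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and a link that stays fully saturated for the whole transfer; the function name transfer_days is hypothetical.

```python
def transfer_days(size_tb, link_gbps):
    """Estimate transfer time in days for size_tb terabytes over a link_gbps link.

    Assumes decimal units and a fully saturated link (an idealization).
    """
    size_bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = size_bits / (link_gbps * 1e9)   # bits / (bits per second)
    return seconds / 86400                    # seconds -> days


# Full (non-tiny) download: ~40 TB at 2.5 Gb/s.
print(round(transfer_days(40, 2.5), 2))  # prints: 1.48

# Default tiny configs: ~100 GB at the same rate, expressed in minutes.
print(round(transfer_days(0.1, 2.5) * 24 * 60, 1))  # prints: 5.3
```

Real transfers will usually be slower than this bound, since cloud storage throughput and request overhead rarely saturate the link.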
Once the ERA5 dataset is downloaded, you can generate a curated dataset for training. This can be done using the curate_era5.py script; configs for it can be found in ./conf/curated_dataset/. This script generates the zarr dataset needed for training, applying any required transformations such as regridding.
python curate_era5.py
In theory, one should be able to perform curation directly from ARCO ERA5. This works, but it carries a significant penalty because the pressure levels are chunked together in ARCO ERA5: to extract a single pressure level you need to download all 37 levels. If you plan to test multiple transforms or channel subsamplings, this becomes prohibitively expensive. For this reason we recommend following the workflow described above. We have also raised an issue on ARCO ERA5 about this chunking, and we will update these instructions if it is resolved.
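The cost of the chunking described above can be sketched as a read-amplification estimate. This is a simplified model, assuming that requesting any pressure level for a given variable and time step transfers the whole 37-level chunk; all function names are hypothetical and the unit is "levels' worth of data" per variable and time step.

```python
LEVELS_PER_CHUNK = 37  # pressure levels stored together, per the text above


def levels_fetched(levels_wanted, levels_per_chunk=LEVELS_PER_CHUNK):
    """Levels actually transferred when levels_wanted levels are requested."""
    if not 0 <= levels_wanted <= levels_per_chunk:
        raise ValueError("levels_wanted must be between 0 and levels_per_chunk")
    # Touching any level in the chunk pulls the entire chunk.
    return levels_per_chunk if levels_wanted > 0 else 0


def amplification(levels_wanted, levels_per_chunk=LEVELS_PER_CHUNK):
    """Ratio of data transferred to data actually used."""
    if levels_wanted < 1:
        raise ValueError("levels_wanted must be at least 1")
    return levels_fetched(levels_wanted, levels_per_chunk) / levels_wanted


def remote_levels_fetched(num_experiments, levels_wanted, stage_locally):
    """Remote transfer across repeated curation runs.

    With local staging the remote read happens once; otherwise every run
    pays the full chunk cost again.
    """
    per_pass = levels_fetched(levels_wanted)
    return per_pass if stage_locally else per_pass * num_experiments


print(amplification(1))                                   # prints: 37.0
print(remote_levels_fetched(5, 1, stage_locally=False))   # prints: 185
print(remote_levels_fetched(5, 1, stage_locally=True))    # prints: 37
```

Under this model the one-time download still pays the chunk cost, but repeated curation experiments then read from local storage instead of paying it again on every run.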
Install the required dependencies by running:
pip install -r requirements.txt
Apart from the dataset configs, the main configs for training are model, training, and validation. Adjust these as needed, then train the model by running:
python train.py
Progress can be monitored using MLflow. Open a new terminal, navigate to the training directory, and run:
mlflow ui -p 2458
View progress in a browser at http://127.0.0.1:2458
Data parallelism is also supported for multi-GPU runs. To launch multi-GPU training, run:
mpirun -np <num_GPUs> python train.py
If running inside a Docker container, you may need to add the --allow-run-as-root flag to the multi-GPU run command.
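The single-GPU and multi-GPU launch commands above can be assembled programmatically, which also makes the flag placement explicit: --allow-run-as-root is an mpirun option, so it must appear before the python program name. This is a minimal sketch; build_launch_command is a hypothetical helper, not part of this example's scripts, and it only builds the argument list without running anything.

```python
import shlex


def build_launch_command(num_gpus, as_root=False):
    """Build the argv list for launching train.py on num_gpus GPUs.

    Hypothetical helper: one MPI rank per GPU, mirroring the commands
    shown in the text.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        # Single-GPU training needs no MPI launcher.
        return ["python", "train.py"]
    command = ["mpirun", "-np", str(num_gpus)]
    if as_root:
        # Launcher option, so it goes before the program being launched.
        command.append("--allow-run-as-root")
    command += ["python", "train.py"]
    return command


print(shlex.join(build_launch_command(4, as_root=True)))
# prints: mpirun -np 4 --allow-run-as-root python train.py
```

The returned list can be passed to subprocess.run if launching from Python is preferred over typing the command in a shell.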
One of the showcased models available in the configs is the Spherical Fourier Neural Operator (SFNO), from the paper Spherical Fourier Neural Operators: Learning Stable Dynamics on the Sphere. To train the SFNO model, PhysicsNeMo Makani needs to be installed, which allows the model to be added to PhysicsNeMo's model registry. For more information on this process, refer to the PhysicsNeMo model registry.
git clone git@github.com:NVIDIA/makani.git
cd makani
pip install -e .
The config file can be modified to train the SFNO model by uncommenting all SFNO configs. Following the prior dataset fetching and curation steps, the model can be trained by running:
python train.py