Replication Package: Designing probabilistic AI monsoon forecasts to inform agricultural decision-making
Replication code for probabilistic Indian monsoon onset forecasts that blend rainfall observations (IMD) with AI weather prediction model forecasts (NeuralGCM, AIFS) through multinomial blending models. This package reproduces all results from data preparation through cross-validated evaluation to publication figures. See ``Running the Pipeline'' below for more information.
The data used in the original paper can be found at https://doi.org/10.5281/zenodo.18894299. Before running this pipeline, all three folders found in this dataset need to be added to Monsoon_Data/raw_nc.
Most of this pipeline is designed to be reusable beyond the specific datasets used in the paper. All pipeline behavior is controlled by YAML spec files rather than hardcoded values, so you can point the pipeline at your own data sources without modifying any R code. The code that produces figures is currently designed to produce the figures appearing in the paper, so will need further modification. The summary_global files in the results folder contain all of our primary metrics.
- Different ground truth rainfall: Create a new spec in
specs/raw_data/modeled afterimd.yml. Pointinput.nc_folderat your NetCDF files and configure the variable name, dimension mappings, and onset thresholds for your grid. - Different AI forecast models: Create a new spec in
specs/raw_data/modeled afterngcm.yml(for ensembles) oraifs.yml. The pipeline handles ensemble forecasts in NetCDF format with configurable dimension names and variable mappings. - Different grid resolutions: The pipeline supports regridding via weighted linear combinations. Set
cell_transform_enabled: truein your raw data spec and provide a weights file mapping source grid cells to target grid cells. - Different onset definitions or time periods: Edit the
climatologiessection of your ground truth spec to change training/test year ranges, onset column names, and season boundaries. - Different blending model formulas: Edit the
models.formulassection of thespecs/2025_blend/cv_models*.ymlspecs to change which predictors enter the multinomial logistic regression. - Different forecast systems in the MME: Edit the
mme.blend_modelslist inspecs/2025_blend/cv_models*.ymlto add, remove, or reconfigure which calibrated forecast systems are blended.
The fully commented spec files in specs/ serve as the primary reference for all configurable options. See Spec-Driven Design below for details on the architecture.
If you change any of the inputs, you will need to add new "combine" and "2025_blend" spec files to ensure they are included in the combined climatology+ground truth + weather model dataset, and that their outputs are included in the blending code.
MO_Forecast_Code/
├── R/
│ ├── _shared/ Core utilities (spec parsing, scoring, misc)
│ │ ├── misc.R Null-coalescing, softmax, logit helpers
│ │ └── read_spec.R Spec loading and validation
│ ├── prepare_data/ Data preparation helpers
│ │ ├── nc_utils.R NetCDF reading, regridding, wide-table construction
│ │ ├── onset_utils.R Onset detection, MOK dates, threshold loading
│ │ ├── climatology_utils.R KDE fitting, climatological forecasts
│ │ └── combine_forecasts_utils.R Merging climatology + forecast families
│ └── 2025_blending_process/ Blending pipeline helpers
│ ├── connect_utils.R Day-to-week aggregation, logit transforms
│ ├── blend_evaluation_utils.R CV evaluation, nnet multinomial, Platt calibration
│ ├── evaluation_2025_utils.R Out-of-sample scoring utilities
│ └── blend_figure_utils.R Figure generation utilities
├── pipelines/ Each pipelines folder has its own README.md file for more information
│ ├── prepare_data/ Stages 1-3: NetCDF -> onset tables -> climatology -> combined.
│ └── 2025_blending_process/ Stages 0-3: weekly-bin blending + evaluation + figures
├── specs/
│ ├── raw_data/ NetCDF input specs (aifs, ngcm, imd variants)
│ ├── combine/ Data combination templates (for combining different weather model + climatology combinations)
│ └── 2025_blend/ Blended model specs (formulas, MME config, connect specs)
├── submit/ Shell wrappers and PBS job scripts
├── Monsoon_Data/ Data directory (not tracked in git)
│ ├── raw_nc/ Raw NetCDF inputs (IMD, NGCM, AIFS)
│ ├── reference/ Onset thresholds, MOK dates
│ ├── Processed_Data/
│ │ ├── Models/ Per-system onset tables
│ │ ├── Climatology/ KDE climatology forecasts
│ │ ├── Combined/ Merged modeling-ready wide tables
│ │ └── 2025_pipeline_input/ Weekly-bin data for blending
│ ├── results/
│ │ └── 2025_model_evaluation/ Model metrics, blend weights, figures
│ ├── evaluation_2025/ Out-of-sample forecast + ground truth files
│ └── maps/ India boundary data for maps
└── figures/ Figures (PDF/SVG). Note that PDF figures will only appear if Cairo is installed (for Arial fonts)
The pipeline expects the following data to be in place before running:
| Path | Description |
|---|---|
Monsoon_Data/raw_nc/IMD_2by2/ |
IMD gridded rainfall NetCDF files (data_YYYY.nc), one per year |
Monsoon_Data/raw_nc/ngcm/ |
NeuralGCM ensemble forecast NetCDF files, one per year |
Monsoon_Data/raw_nc/aifs/ |
AIFS ensemble forecast NetCDF files, one per year |
Monsoon_Data/reference/thresholds_df.csv |
Per-grid-cell onset rainfall thresholds (lat, lon, onset_threshold) |
Monsoon_Data/reference/MOK Onset.csv |
Monsoon Onset Kerala dates by year (Year, MOK) |
Monsoon_Data/maps/ |
India boundary shapefiles (for map figures) |
Monsoon_Data/evaluation_2025/ |
Out-of-sample forecast and ground truth CSVs (for stage 2) |
install.packages(c(
# Core
"optparse", "yaml", "data.table", "dplyr", "tidyr", "purrr", "stringr",
"lubridate", "tibble", "tidyverse", "conflicted",
# Parallel
"furrr", "future", "future.apply", "parallel",
# Data I/O
"ncdf4", "readr",
# Modeling
"nnet",
# Scoring
"pROC", "ModelMetrics", "matrixStats",
# Visualization
"ggplot2", "sf", "forcats", "colorspace", "patchwork", "maps",
# Utilities
"tictoc", "here"
))
# Optional (used only if available):
# install.packages(c("svglite", "R.matlab"))All scripts must be run from the repository root (MO_Forecast_Code/). Scripts use relative source("R/...") paths that break from other working directories. On a PBS server, the entire pipeline on the original dataset can be run using qsub submit/run_full_2025.pbs from the MO_Forecast_Code directory. The two .sh files in that folder will run Stage 1 and Stage 2 separately. The spec_ids called will need to be modified if you use new data sources.
Processes raw NetCDF files into onset tables, fits KDE climatologies, and combines everything into modeling-ready wide tables.
# Process raw NetCDF files into per-system onset tables
Rscript pipelines/prepare_data/1_process_raw_nc_files.R --spec_id imd
Rscript pipelines/prepare_data/1_process_raw_nc_files.R --spec_id ngcm
Rscript pipelines/prepare_data/1_process_raw_nc_files.R --spec_id aifs
# Build KDE climatology forecasts from IMD onset dates
Rscript pipelines/prepare_data/2_build_climatology.R --spec_id imd
# Combine ground truth, model forecasts, and climatology into wide tables
Rscript pipelines/prepare_data/3_combine_datasets.R --spec_id combine_template_2025Outputs: Monsoon_Data/Processed_Data/Combined/combine_template_2025_combined_wide.rds
Converts daily probabilities to weekly bins, runs cross-validated multinomial model evaluation and multi-model ensemble (MME) weight optimization, scores against out-of-sample 2025 data, and produces publication figures.
# Convert daily data to weekly bins (one call per onset filter variant)
Rscript pipelines/2025_blending_process/0_connect_prepare_data_to_2025_pipeline.R --spec_id connect_mok
Rscript pipelines/2025_blending_process/0_connect_prepare_data_to_2025_pipeline.R --spec_id connect_clim_mok_date
Rscript pipelines/2025_blending_process/0_connect_prepare_data_to_2025_pipeline.R --spec_id connect_no_mok_filter
# Cross-validated model evaluation + MME weight optimization
Rscript pipelines/2025_blending_process/1_blend_evaluation.R
#If you have modified the pipeline to use new models, stop here and look in the Monsoon_Data/results folder! The next two scripts
# will not function correctly
# Out-of-sample 2025 evaluation
Rscript pipelines/2025_blending_process/2_2025_evaluation.R
# Publication figures
Rscript pipelines/2025_blending_process/3_produce_figures.ROutputs: Monsoon_Data/results/2025_model_evaluation/ (metrics CSVs, blend weights, reliability plots, publication figures)
bash submit/run_prepare_data.sh
bash submit/run_2025_pipeline.shqsub submit/run_full_2025.pbsRaw NetCDF (Monsoon_Data/raw_nc/)
│
▼
┌──────────────────────────────────┐
│ 1_process_raw_nc_files.R │ specs/raw_data/*.yml
│ → Processed_Data/Models/*.rds │
└──────────┬───────────────────────┘
│
├──▶ 2_build_climatology.R
│ → Processed_Data/Climatology/*.rds
│ │
▼ ▼
┌──────────────────────────────────┐
│ 3_combine_datasets.R │ specs/combine/*.yml
│ → Processed_Data/Combined/*.rds │
└──────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 0_connect (day → week bins) │ specs/2025_blend/connect_*.yml
│ → Processed_Data/2025_pipeline_input/ │
└──────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 1_blend_evaluation.R │ specs/2025_blend/cv_models*.yml
│ Cross-validated nnet multinomial │ specs/2025_blend/hindcast_*.yml
│ Platt calibration, MME optimization │
│ → results/2025_model_evaluation/ │
└──────────┬───────────────────────────────┘
│
├──▶ 2_2025_evaluation.R
│ Out-of-sample scoring (Brier, RPS, AUC)
│ │
▼ ▼
┌──────────────────────────────────────────┐
│ 3_produce_figures.R │
│ → figures/ (PDF/SVG) │
└──────────────────────────────────────────┘
Three onset filter variants control which data points are included in training and evaluation, depending on how the monsoon onset date is defined relative to the Monsoon Onset Kerala (MOK):
| Variant | Spec suffix | Description |
|---|---|---|
| mok | (default) | Only issue dates after the observed MOK date each year |
| clim_mok_date | _clim_mok_date |
Only issue dates after a fixed climatological MOK date (June 1) |
| no_mok_filter | _no_mok_filter |
No MOK-based filtering; all issue dates from May 1 onward |
Each variant has its own set of connect, CV, and hindcast specs in specs/2025_blend/.
All pipeline behavior is controlled by YAML specs in specs/. Specs define input paths, variable selection, modeling options, and output configuration. Output basenames are derived from the spec filename (spec_id), not from a field in the YAML. Pipeline scripts are thin orchestration layers: parse args -> load spec -> call helpers -> write artifacts.
| Directory | Purpose | Used by |
|---|---|---|
specs/raw_data/ |
NetCDF input configuration (paths, variables, thresholds, MOK) | 1_process_raw_nc_files.R, 2_build_climatology.R |
specs/combine/ |
Which processed datasets to merge into wide tables | 3_combine_datasets.R |
specs/2025_blend/connect_*.yml |
Day-to-week conversion settings (forecast models, rain predictors) | 0_connect_prepare_data_to_2025_pipeline.R |
specs/2025_blend/cv_models*.yml |
Model formulas, MME configuration, forecast calibration | 1_blend_evaluation.R |
specs/2025_blend/hindcast_*.yml |
Out-of-sample hindcast evaluation configuration | 1_blend_evaluation.R |
models.formulas: Named multinomial logistic regression formulas usingnnet::multinommme: Multi-model ensemble configuration — which calibrated models to blend, weight optimization settingsmme.blend_models: List of models entering the MME, each with aname,source(clim or forecast), and optionalcal_variantextras.forecasts: Per-system Platt calibration and raw/calibrated scoring optionsextras.clim_logits: Climatology baseline configurations with optional training-window variants
- Spatial key:
id = paste0(lat, "_", lon) - Time columns:
time(POSIXct/Date),year - Outcome categories:
week1throughweek4pluslater(5 weekly bins) - Forecast probabilities stored in wide format with system-specific prefixes
- All scripts must be run from the repository root (
MO_Forecast_Code/)