Crashes when attempting to run with custom high-resolution ERA5 Land meteorological forcing #190

@korsbakken

Description

Hi @mvdebolskiy, @kjetilaas, and @rosiealice. Here's an update on the high-resolution regional NorSink grid and the crashes with the ERA5 Land forcing data, as discussed in the CTSM Friday meeting on January 23. I posted it in the Teams meeting chat, but I'm not sure everybody gets notified or has access, so I'm moving it here. Let me know if you prefer some other channel.

I have tested several more things since the meeting. There are still some avenues left to try, but it would be great if you could read the questions at the end, @mvdebolskiy, and reply to those you know the answers to. Sorry about the long post; there are lots of possibly relevant details.

Path to the case and repo directory are in the first comment.

To recap: I have produced a mesh for a minimal rectangular 0.1x0.1 degree grid that covers Norway and all river outflows, plus a surface data set for the same mesh. I have also processed ERA5 Land meteorological data into a 3-stream forcing data set on the same grid at 1-hourly resolution, converting it to the same format as CRUNCEP, CRUJRA, etc. Finally, I added the new mesh, surface data files and stream data files to the XML database files so that I can use them with create_newcase and the CIME scripts.
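For context, the three-stream split follows the CRUNCEP-style layout (a solar stream, a precipitation stream, and the remaining near-surface state variables). The sketch below is only illustrative; the exact variable-to-stream assignment is an assumption on my part, not taken from the actual processing code:

```python
# Illustrative sketch of a CRUNCEP-style three-stream split for the processed
# ERA5 Land forcing. The variable names and their grouping below are
# assumptions for illustration, not the real processing pipeline.
STREAMS = {
    "Solar": ["FSDS"],
    "Precip": ["PRECTmms"],
    "TPQW": ["TBOT", "QBOT", "PSRF", "WIND", "FLDS"],
}

def stream_for(variable):
    """Return which of the three stream files a forcing variable belongs to."""
    for stream, names in STREAMS.items():
        if variable in names:
            return stream
    raise KeyError(f"no stream defined for {variable!r}")

print(stream_for("FSDS"))  # the variable atm.log reports reading first
print(stream_for("TBOT"))
```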

The modifications are in a fork under my own user, in the branch norsink_inputdata_main, here: https://github.com/korsbakken/CTSM/tree/norsink_inputdata_main. I have put a clone set to the right commit in my work folder on betzy (see first comment).

The modifications are in XML files in CTSM, ccs_config and cdeps, and in some files in the mksurfdat pipeline. I don't think I have changed anything in code that executes during model runs.

I have created, built and successfully run cases with the new mesh and surface data and standard CRUJRA forcing data (CRUJRA2024). But when I swap in the ERA5 Land-based forcing data I created, the model crashes after about 40 seconds, with no discernible error messages in the logs, other than cesm.log stating that MPI_ABORT was invoked in MPI_COMM_WORLD with error code -1, and no indication of which component or function caused the abort. In the stack traces, the only discernible function name (at the third stack level) is PIOc_read_darray+0x41b; the rest are just hex addresses.

The other log files don't indicate much activity beyond initialization before they simply stop, with no error messages. The only exception is the atm.log file, which logs a lot of initialization activity and then states that it opens the first forcing file for the first time step and attempts to read FSDS. It doesn't appear to finish. You can see the details in the run subdirectory in the case directory.

I have experimented with adding some debug writes to cdeps/streams/dshr_strdata_readstrm.F90 to pinpoint exactly where the crash happens, but that's very slow going. I haven't found a smoking gun yet, and I can't be totally sure that the actual crash doesn't happen in a completely different component that is executing in parallel.

Of the things I haven't corrected or tested yet, there are two main candidates for the cause and two less likely ones. @mvdebolskiy, if you know, could you weigh in on whether any of these could produce silent crashes of the type I describe above?

  1. The ERA5 Land data contains missing values at some points that are not masked out, and for some points it also depends on time whether they have missing values or not. I plan to test filling these with nearest-neighbor values to see what happens.

  2. The processed ERA5 Land files end up using NaN for fill values and missing values, whereas the standard data sets appear to use large float values like 1e36 instead. This means the missing values above also end up as NaNs. I don't know whether the model requires a non-NaN float as a fill value, but I can try to make the processing code use 1e36 instead.

  3. The processed files record time as hours from the start of the month, while the standard datasets use days. The metadata in the files indicate this in a way that should be compliant with CF standards, and they get parsed correctly by the xarray and netCDF4 libraries in Python at least. But does the model require using fractional days instead?

  4. The processed ERA5 Land files use the proleptic_gregorian calendar rather than noleap. As far as I understand, DATM will just ignore February 29 in this case, and there's no leap year in the test runs I have done anyway. But I can test converting the data to noleap if having a different calendar ID in the input files is in fact an issue.
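The data-side fixes among the candidates above can be sketched with plain NumPy and the standard library. The helper names and the sample FSDS values below are mine, not from the processing pipeline, and the 1e36 fill value simply mirrors what the standard data sets appear to use:

```python
import datetime
import numpy as np

# Fill value the standard CRUJRA/CRUNCEP streams appear to use (assumption).
FILL = np.float32(1.0e36)

def nan_to_fill(arr):
    """Candidates 1-2: replace NaN fill/missing values with 1e36."""
    out = np.asarray(arr, dtype=np.float32).copy()
    out[np.isnan(out)] = FILL
    return out

def hours_to_days(hours):
    """Candidate 3: re-express a time coordinate in fractional days
    (the reference date in the CF units attribute stays the same)."""
    return np.asarray(hours, dtype=np.float64) / 24.0

def drop_feb29(dates):
    """Candidate 4: drop February 29 entries when converting from
    proleptic_gregorian to a noleap calendar."""
    return [d for d in dates if not (d.month == 2 and d.day == 29)]

fsds = np.array([210.0, np.nan, 180.0], dtype=np.float32)  # made-up sample values
print(nan_to_fill(fsds))
print(hours_to_days([0, 1, 24]))
print(drop_feb29([datetime.date(2020, 2, 28), datetime.date(2020, 2, 29)]))
```

Each helper is deliberately independent, so the candidates can be tested one at a time against the crash rather than all at once.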
