Open
Description
Current behavior (describe the bug)
Random ctests have started stalling intermittently on several machines. The problem seems to be more common on Jet (see #419), but we are now seeing similar stalls on Hera (#478, #479).
For example, the rrfs_fv3jedi_2024052700_getkf_solver test can stall while reading the hofx files for a random member, and the job is then cancelled when it hits the time limit:
Test : H(x) for member 26:
Test : aircar_airTemperature_133 nobs= 32382 Min=211.1625378057457, Max=307.7765640042287, RMS=259.7695377376046
Test : aircar_specificHumidity_133 nobs= 3291 Min=-3.15865270137522e-11, Max=0.02095278773004437, RMS=0.0066298079263217
Test : aircar_winds_233 nobs= 68710 Min=-44.06984241979016, Max=54.03912256852514, RMS=14.57889524621921
LinearModel<MODEL>::forecastTL: Starting
----------------------------------------------------------------------------------------------------
Increment print | number of fields = 8 | cube sphere face size: C420
eastward_wind | Min:-1.292832e+01 Max:+1.165059e+01 RMS:+8.788393e-01
northward_wind | Min:-2.483585e+01 Max:+1.555412e+01 RMS:+1.020776e+00
air_temperature | Min:-5.620034e+00 Max:+5.375475e+00 RMS:+5.463021e-01
air_pressure_thickness | Min:-5.494182e+00 Max:+7.098684e+00 RMS:+5.691151e-01
water_vapor_mixing_ratio_wrt_moist_air | Min:-5.993849e-03 Max:+7.774201e-03 RMS:+4.394907e-04
cloud_liquid_ice | Min:-3.154406e-07 Max:+2.800794e-06 RMS:+1.116570e-08
cloud_liquid_water | Min:-4.723777e-04 Max:+1.116509e-03 RMS:+1.138239e-05
ozone_mass_mixing_ratio | Min:-2.046351e-09 Max:+4.929310e-09 RMS:+1.802833e-11
----------------------------------------------------------------------------------------------------
LinearModel<MODEL>::forecastTL: Finished
----------------------------------------------------------------------------------------------------
slurmstepd: error: *** STEP 18492759.0 ON h14c49 CANCELLED AT 2025-11-06T19:38:39 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
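When comparing several stalled runs, it may help to pull out which member was last reported and when Slurm cancelled the step. The sketch below is only an illustration based on the two message formats visible in the log above ("H(x) for member N:" and "CANCELLED AT ... DUE TO TIME LIMIT"); the function name `summarize_stall` is hypothetical and not part of any existing tool.

```python
import re

# Patterns taken from the log excerpt in this issue; other tests may print differently.
MEMBER_RE = re.compile(r"H\(x\) for member (\d+):")
CANCEL_RE = re.compile(r"CANCELLED AT (\S+) DUE TO TIME LIMIT")


def summarize_stall(log_text):
    """Return (last_member_seen, cancel_timestamp); each is None if not found."""
    last_member = None
    cancelled_at = None
    for line in log_text.splitlines():
        member_match = MEMBER_RE.search(line)
        if member_match:
            # Keep overwriting so we end up with the last member reported.
            last_member = int(member_match.group(1))
        cancel_match = CANCEL_RE.search(line)
        if cancel_match:
            cancelled_at = cancel_match.group(1)
    return last_member, cancelled_at
```

Running this over the logs from several stalled jobs would show whether the stall always follows the same member or moves around, which is what "a random member" suggests.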
Steps to Reproduce (if applicable)
Run ctests
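Because the stall is intermittent, a single ctest run may not reproduce it. One possible approach is to run the affected test several times with a wall-clock limit and count how often it passes, fails, or stalls. The helper below is a minimal sketch under that assumption: `run_repeatedly`, the repeat count, and the 1800-second limit are illustrative choices, not values from this repository.

```python
import subprocess


def run_repeatedly(cmd, repeats=10, timeout_s=1800):
    """Run cmd `repeats` times and count passes, failures, and timeouts."""
    counts = {"passed": 0, "failed": 0, "timed_out": 0}
    for _ in range(repeats):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # The direct child is killed on timeout; processes it launched
            # (for example via srun) may need separate cleanup.
            counts["timed_out"] += 1
            continue
        if result.returncode == 0:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


# Example call (assumes ctest is on PATH and this is run from the build directory):
# run_repeatedly(["ctest", "-R", "rrfs_fv3jedi_2024052700_getkf_solver"], repeats=5)
```

Recent CMake versions also provide `ctest --repeat until-fail:<n>` and `ctest --timeout <seconds>`, which may cover the same need without a wrapper script.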
Expected behavior
If a ctest passes once, it should always pass.
Additional information
This isn't the highest priority right now, since most of our fv3-jedi development has moved to WCOSS2, where the tests do not seem to have this problem.