[bug] Ctests sometimes stall on Hera and Jet #480

@SamuelDegelia-NOAA

Description

Current behavior (describe the bug)

Random ctests are intermittently stalling on several machines. The problem seems more common on Jet (see #419), but similar stalls are now appearing on Hera (#478, #479).

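For example, the rrfs_fv3jedi_2024052700_getkf_solver test can stall while reading the hofx files for a random member. In the log below, member 26 is the last one printed before Slurm cancels the job step at its time limit: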

Test     : H(x) for member 26:
Test     : aircar_airTemperature_133 nobs= 32382 Min=211.1625378057457, Max=307.7765640042287, RMS=259.7695377376046
Test     : aircar_specificHumidity_133 nobs= 3291 Min=-3.15865270137522e-11, Max=0.02095278773004437, RMS=0.0066298079263217
Test     : aircar_winds_233 nobs= 68710 Min=-44.06984241979016, Max=54.03912256852514, RMS=14.57889524621921
LinearModel<MODEL>::forecastTL: Starting
----------------------------------------------------------------------------------------------------
Increment print | number of fields = 8 | cube sphere face size: C420
eastward_wind                                | Min:-1.292832e+01 Max:+1.165059e+01 RMS:+8.788393e-01
northward_wind                               | Min:-2.483585e+01 Max:+1.555412e+01 RMS:+1.020776e+00
air_temperature                              | Min:-5.620034e+00 Max:+5.375475e+00 RMS:+5.463021e-01
air_pressure_thickness                       | Min:-5.494182e+00 Max:+7.098684e+00 RMS:+5.691151e-01
water_vapor_mixing_ratio_wrt_moist_air       | Min:-5.993849e-03 Max:+7.774201e-03 RMS:+4.394907e-04
cloud_liquid_ice                             | Min:-3.154406e-07 Max:+2.800794e-06 RMS:+1.116570e-08
cloud_liquid_water                           | Min:-4.723777e-04 Max:+1.116509e-03 RMS:+1.138239e-05
ozone_mass_mixing_ratio                      | Min:-2.046351e-09 Max:+4.929310e-09 RMS:+1.802833e-11
----------------------------------------------------------------------------------------------------
LinearModel<MODEL>::forecastTL: Finished
----------------------------------------------------------------------------------------------------
slurmstepd: error: *** STEP 18492759.0 ON h14c49 CANCELLED AT 2025-11-06T19:38:39 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
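
To compare where different runs stall, the log can be scanned for the last member header and the Slurm cancellation time. The following is only a rough sketch: the function name `summarize_stall` is hypothetical, and the two patterns are assumptions based solely on the excerpt above, so they may need adjusting for other tests or log formats.

```python
import re

# Patterns inferred from the log excerpt above (assumptions, not an official format).
MEMBER_RE = re.compile(r"H\(x\) for member (\d+):")
CANCEL_RE = re.compile(r"CANCELLED AT (\S+) DUE TO TIME LIMIT")


def summarize_stall(log_text):
    """Return the last ensemble member printed and the Slurm cancellation time, if any."""
    members = [int(m) for m in MEMBER_RE.findall(log_text)]
    cancel = CANCEL_RE.search(log_text)
    return {
        "last_member": members[-1] if members else None,
        "cancelled_at": cancel.group(1) if cancel else None,
    }


# A few lines copied from the log in this issue.
sample = """\
Test     : H(x) for member 26:
Test     : aircar_winds_233 nobs= 68710 Min=-44.06984241979016, Max=54.03912256852514, RMS=14.57889524621921
LinearModel<MODEL>::forecastTL: Finished
slurmstepd: error: *** STEP 18492759.0 ON h14c49 CANCELLED AT 2025-11-06T19:38:39 DUE TO TIME LIMIT ***
"""

print(summarize_stall(sample))
```

Running this over the logs from several stalled runs would show whether the last member printed really is random or clusters around particular members.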

Steps to Reproduce (if applicable)

Run the ctests on Hera or Jet. The stall is intermittent, so it may take several runs to reproduce.
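
Since the stall does not happen on every run, one way to reproduce it might be to repeat a single test with a per-attempt timeout and count how often it passes, fails, or stalls. This is a minimal sketch, not part of the repository: `run_once` and `repeat_test` are hypothetical helpers, and the attempt count and timeout in the commented example are placeholder values.

```python
import subprocess


def run_once(cmd, timeout_s):
    """Run cmd once; classify the attempt as 'passed', 'failed', or 'stalled'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed once the timeout expires.
        return "stalled"
    return "passed" if result.returncode == 0 else "failed"


def repeat_test(cmd, attempts, timeout_s):
    """Run cmd several times and count the outcome of each attempt."""
    counts = {"passed": 0, "failed": 0, "stalled": 0}
    for _ in range(attempts):
        counts[run_once(cmd, timeout_s)] += 1
    return counts


# Example call from inside the build directory (placeholder attempts and timeout):
# repeat_test(["ctest", "-R", "rrfs_fv3jedi_2024052700_getkf_solver"],
#             attempts=5, timeout_s=1800)
```

Note that the timeout only kills the process started directly; ranks launched underneath it through srun may need separate cleanup. ctest's own `--timeout <seconds>` and `--repeat until-fail:<n>` options may be a simpler route if the installed CMake version supports them.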

Expected behavior

If a ctest passes once, it should always pass.

Additional information

This isn't the highest priority right now, since most of our fv3-jedi development has moved to WCOSS2, where the tests do not seem to have this problem.
