Restarting the system
Instructions for propping back up the operational system when it (inevitably) falls over.
Update 09/2019
Now that the suites have been tidied up, the easiest way to restart things if everything has fallen over for an external reason (e.g. phorcys crashing) rather than an internal suite reason (e.g. a task failing) is to use
cylc restart [suite-name]
(You can also use rose suite-restart from the suite home directory. There appears to be no difference, although for some reason to do with updating the messaging you have to run a cylc command, e.g. gcylc [suite-name], before running rose suite-restart for the relevant suite.)
It is still important to restart the suites in the right order, as below.
End of update
If the failure point is adjust_soil_levels in the WRF suite, check out the note here.
If the whole set of suites has failed, they have to be restarted in this order:
- download
- wrf
- fvcom (e.g. fvcom_rosa, fvcom_tamar)
- regrid (e.g. tamar_regrid)
- plotting / mycoast website / housekeeping
This is also the dependency order, so if something higher up the stack has failed, the suites below it should either be stopped and restarted once the prior suites are up and running or, if they are just hanging on the trigger dependency, have only the trigger task reset (go into the gui with gcylc suite-name, right click on the task and use 'Reset state' to set it to either waiting or succeeded).
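The restart order above can be written down as a small dry-run helper. This is only a sketch: the variable RESTART_ORDER and the function print_restart_plan are made up for illustration, and the suite names are the examples from this page. It echoes the commands rather than running them, so the order can be checked before anything is restarted.

```shell
#!/usr/bin/env bash
# Dependency order from the list above, upstream first.
# The suite names are examples from this page; adjust to what is deployed.
RESTART_ORDER="download wrf fvcom_tamar fvcom_rosa tamar_regrid"

# Print the restart command for each suite in dependency order.
# Nothing is executed: this is a dry run that only echoes the commands.
print_restart_plan() {
    for suite in $RESTART_ORDER; do
        echo "cylc restart $suite"
    done
}

print_restart_plan
```

Each printed command should only be run once the suite before it is up and running.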
The general steps for restarting a suite called suite-name are:
- Log in to phorcys as modop and navigate to the suite directory (in ./Rose_suites/)
- Check the gui for which task has failed using
gcylc suite-name
- To see the log files for individual tasks either use the gui or look in ~/cylc-run/suite-name/log/job/cycle-point/job-name/01/ on phorcys or CETO (depending on where the task has run)
- Make sure the suite is fully stopped
cylc stop --now --now suite-name
- For WRF only, clean the suite directories using
rose suite-clean
If there is a dependency on another suite, it is possible there are still orphaned log tasks running, which will stop it being cleaned up properly. One way of getting rid of these is to go to ~/cylc-run/ and use
lsof +D suite-name | awk '{print $2}' | tail -n +2 | xargs kill -9
(This whole step is no longer necessary for the fvcom suites, but still needs to be done for WRF for reasons I haven't got to the bottom of yet.)
- Change the INITIAL_START_DATE in rose-suite.conf to the relevant start date, usually the day after it last ran successfully. The time with the date must be midnight (i.e. 00:00:00) for all suites, otherwise the inter-suite triggers won't work.
- Restart the suite using
rose suite-run
If you get an error 'Device or resource busy' then you need to run the lsof command described in the WRF cleaning step above.
- Any suites downstream of the suite you've just restarted (e.g. the regrid and plotting suites if you've restarted an fvcom model suite) will need their trigger tasks reset. To do this go into the gui (gcylc suite-name), right click on the trigger task, and click 'Reset State' -> 'waiting'.
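Since a non-midnight start time silently breaks the inter-suite triggers, it may be worth checking the value before restarting. The sketch below is an assumption-heavy illustration: is_midnight is a hypothetical helper, and it assumes the date is written in an ISO-like form such as 2019-09-01T00:00:00 or 2019-09-01 00:00:00. The actual format used in rose-suite.conf is not stated on this page, so check it before relying on this.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the timestamp's time part is midnight.
# Assumes an ISO-like timestamp (date, then 'T' or a space, then hh:mm:ss,
# optionally followed by 'Z'); adjust the patterns to the format actually
# used for INITIAL_START_DATE.
is_midnight() {
    case "$1" in
        *T00:00:00 | *T00:00:00Z | *" 00:00:00") return 0 ;;
        *) return 1 ;;
    esac
}

# Example use with an illustrative value:
if is_midnight "2019-09-01T00:00:00"; then
    echo "start time is midnight"
else
    echo "start time is NOT midnight: inter-suite triggers will not work"
fi
```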
download: Even though this suite runs every 12 hours and would backfill data, it must be started from midnight of the first day you are running the WRF/FVCOM suite (or a whole number of 12-hour periods before that), otherwise it will fail to trigger the other model runs.
regrid: Make sure the start date is the same as or later than the fvcom start date, otherwise it may be waiting forever for an fvcom run that won't ever happen.
housekeeping: Should be robust to starting whenever and with whatever start date. Currently deletes relative to system date (not the Rose cycle point date), so if you are running the forecast starting from a while ago (e.g. > 1 week) then don't restart this immediately as it might purge things on ceto too early.
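The start-date rule for the download suite comes down to a little arithmetic: the download start must be at, or a whole number of 12-hour periods before, the WRF/FVCOM start. A minimal sketch follows, assuming a date command that accepts -d with "YYYY-MM-DD hh:mm:ss" (as GNU date does). The function name download_start_ok and the example dates are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download_start_ok MODEL_START DOWNLOAD_START
# Succeeds if DOWNLOAD_START is at, or a whole number of 12-hour periods
# before, MODEL_START. Assumes `date -d` is available (e.g. GNU date).
download_start_ok() {
    model_epoch=$(date -u -d "$1" +%s) || return 2
    download_epoch=$(date -u -d "$2" +%s) || return 2
    diff=$(( model_epoch - download_epoch ))
    # 43200 seconds = 12 hours
    [ "$diff" -ge 0 ] && [ $(( diff % 43200 )) -eq 0 ]
}

# Example use with illustrative dates:
if download_start_ok "2019-09-01 00:00:00" "2019-08-31 12:00:00"; then
    echo "download start is aligned with the model start"
else
    echo "download start is not aligned; model runs would not trigger"
fi
```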
Since there are some redundant suites in modop/Rose_suites, this is the list of suites expected to be running operationally:
- download
- wrf
- fvcom_tamar
- fvcom_rosa
- tamar_regrid
- my_coast_suite (different directory structure to the above; in mycoast_plot/mycoast-website-plots/roses)
- pylag_viirs
- pylag_olci
- rosa_regrid
- wider_tamar_regrid
For the adrift SAR model additionally:
- tamar_regrid_wind
- wider_tamar_wind
- rosa_regrid_wind
The plotting suites are redundant now that we have the mycoast website.