Restarting the system

mikebedington edited this page Jun 13, 2022 · 16 revisions

Instructions for propping back up the operational system when it (inevitably) falls over.

Update 09/2019

Now that the suites have been tidied up, the easiest way to restart things if everything has fallen over for an external reason (e.g. phorcys crashing) is to use

cylc restart [suite-name]

(You can also use rose suite-restart from the suite home directory; there appears to be no difference. However, for some reason to do with updating the messaging, you have to run a cylc command, e.g. gcylc [suite-name], before running rose suite-restart for the relevant suite.)

It is still important to restart the suites in the right order, as below.

End of update

If the failure point is adjust_soil_levels in the WRF suite, check out the note here.

If the whole set of suites has failed, they have to be restarted in this order:

  • download
  • wrf
  • fvcom (e.g. fvcom_rosa, fvcom_tamar)
  • regrid (e.g. tamar_regrid)
  • plotting / mycoast website / housekeeping

This is also the dependency order, so if something higher up the stack has failed, the suites below it should either be stopped and restarted once the prior suites are up and running or, if they are hanging on the trigger dependency, just that task can be reset (by going into the gui with gcylc suite-name, then right-clicking on the task and using 'Reset state' to set it to either waiting or succeeded).
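As a rough sketch of restarting in dependency order, the loop below walks the suite names from the operational list on this page and calls cylc restart on each. The restart_all function and the DRY_RUN switch are made up for illustration, and the loop does not wait for a suite to be fully up before moving on, so it is still worth checking each one in gcylc as you go. The plotting, mycoast website and housekeeping suites are left out because their exact names are not given here.

```shell
# Suites in dependency order (names from the operational list on this page).
SUITES="download wrf fvcom_tamar fvcom_rosa tamar_regrid"

# Print the restart commands (DRY_RUN=1, the default) or run them (DRY_RUN=0).
# DRY_RUN and restart_all are illustrative names, not part of cylc or rose.
restart_all() {
    for suite in $SUITES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "cylc restart $suite"
        else
            # Stop at the first failure so later suites are not started
            # on top of a broken dependency.
            cylc restart "$suite" || {
                echo "restart of $suite failed, stopping" >&2
                return 1
            }
        fi
    done
}
```

Calling restart_all on its own only prints the commands; DRY_RUN=0 restart_all would actually run them.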

The general steps for restarting a suite called suite-name are:

  • Log in to phorcys as modop and navigate to the suite directory (in ./Rose_suites/)
  • Check the gui for which task has failed using gcylc suite-name
  • To see the log files for individual tasks, either use the gui or look in ~/cylc-run/suite-name/log/job/cycle-point/job-name/01/ on phorcys or CETO (depending on where the task has run)
  • Make sure the suite is fully stopped using cylc stop --now --now suite-name
  • For WRF only, clean the suite directories using rose suite-clean. If there is a dependency on another suite, there may still be orphaned log tasks running which will stop it being cleaned up properly; one way of getting rid of these is to go to ~/cylc-run/ and use lsof +D suite-name | awk '{print $2}' | tail -n +2 | xargs kill -9 (This whole step is no longer necessary for the fvcom suites, but still needs to be done for WRF for reasons I haven't got to the bottom of yet.)
  • Change the INITIAL_START_DATE in rose-suite.conf to the relevant start date, usually the day after it last ran successfully. The time with the date must be midnight (i.e. 00:00:00) for all suites, otherwise the inter-suite triggers won't work.
  • Restart the suite using rose suite-run
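The steps above can be sketched as a small checklist printer, which only echoes the commands in order rather than running them. The function names restart_plan and is_midnight are hypothetical, and the timestamp format used in the usage line is an assumption (the check only requires that the value ends in 00:00:00), so compare it with the value already in rose-suite.conf before copying anything.

```shell
# Succeeds only if the timestamp ends in 00:00:00 (midnight).
# Assumes the value ends with hh:mm:ss; adjust if rose-suite.conf differs.
is_midnight() {
    case "$1" in
        *00:00:00) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the restart steps for one suite, in order. Nothing is executed.
# Usage: restart_plan SUITE START_DATE
restart_plan() {
    suite=$1
    start=$2
    if ! is_midnight "$start"; then
        echo "start date $start is not midnight" >&2
        return 1
    fi
    echo "cylc stop --now --now $suite"
    # Only the WRF suite still needs cleaning before a restart.
    if [ "$suite" = "wrf" ]; then
        echo "rose suite-clean"
    fi
    echo "set INITIAL_START_DATE in rose-suite.conf to $start"
    echo "rose suite-run"
}
```

For example, restart_plan wrf 2022-06-13T00:00:00 prints four steps, while the same call for an fvcom suite skips the rose suite-clean line. The rose commands are meant to be run from the suite directory.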

Individual suite notes:

download Even though this suite runs every 12 hours and would backfill data, it must be run from midnight of the first day you are running the WRF/FVCOM suite (or an integer number of 12-hour periods before), otherwise it will fail to trigger the other model runs.

regrid Make sure the start date is the same as or later than the fvcom start date, otherwise it may wait forever for an fvcom run that will never happen.
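A quick way to sanity-check this before restarting is to compare the two dates. The helper below is only a sketch: regrid_start_ok is a made-up name, and it assumes both dates are written as YYYY-MM-DD, which may not match how they appear in rose-suite.conf.

```shell
# Succeeds if the regrid start date is the same as or later than the
# fvcom start date. Both arguments are assumed to be YYYY-MM-DD.
regrid_start_ok() {
    # Dropping the hyphens turns each date into an integer (e.g. 20220613),
    # so a plain numeric comparison gives the right ordering.
    regrid=$(echo "$1" | tr -d '-')
    fvcom=$(echo "$2" | tr -d '-')
    [ "$regrid" -ge "$fvcom" ]
}
```

For example, regrid_start_ok 2022-06-12 2022-06-13 fails, because the regrid suite would be waiting on an fvcom cycle that never runs.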

housekeeping Should be robust to starting whenever and with whatever start date. It currently deletes relative to the system date (not the Rose cycle point date), so if you are running the forecast starting from a while ago (e.g. > 1 week), don't restart this immediately as it might purge things on ceto too early.
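To judge whether a start date counts as "a while ago", something like the following could be used. It is a sketch only: start_is_old is a hypothetical helper, the 7-day threshold just mirrors the "> 1 week" example above, the dates are assumed to be YYYY-MM-DD, and it relies on the -d option of GNU date.

```shell
# Succeeds if START is more than 7 days before TODAY (both YYYY-MM-DD).
# TODAY defaults to the current system date. Needs GNU date for -d.
start_is_old() {
    start=$1
    today=${2:-$(date -u +%Y-%m-%d)}
    # Convert both dates to seconds since the epoch (UTC).
    start_s=$(date -u -d "$start" +%s) || return 2
    today_s=$(date -u -d "$today" +%s) || return 2
    [ $(( (today_s - start_s) / 86400 )) -gt 7 ]
}
```

If start_is_old succeeds for the date the forecast is being rerun from, it is probably worth holding off on restarting housekeeping until the other suites have caught up.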

Current list of operational suites:

Since there are some redundant suites in modop/Rose_suites, this is the list of suites expected to be running operationally:

  • download
  • wrf
  • fvcom_tamar
  • fvcom_rosa
  • tamar_regrid