Restarting the system

mikebedington edited this page Nov 16, 2022 · 16 revisions

Instructions for propping back up the operational system when it (inevitably) falls over.

*** Update 11/2022 *** You now need to log in to phorcys as a normal user and then sudo as modop.

Update 09/2019

Now that the suites have been tidied up, the easiest way to restart things when everything has fallen over for an external reason (e.g. phorcys crashing) rather than an internal suite reason (e.g. a task failing) is to use

cylc restart [suite-name]

(You can also use rose suite-restart from the suite home directory. There appears to be no difference, although for some reason to do with updating the messaging you have to run a cylc command, e.g. gcylc [suite-name], before running rose suite-restart for the relevant suite.)

It is still important to restart the suites in the right order, as listed below.

End of update

If the failure point is adjust_soil_levels in the WRF suite check out the note here.

If the whole set of suites has failed, they have to be restarted in this order:

  • download
  • wrf
  • fvcom (e.g. fvcom_rosa, fvcom_tamar)
  • regrid (e.g. tamar_regrid)
  • plotting / mycoast website / housekeeping

This is also the dependency order, so if something higher up the stack has failed, the suites below it should either be stopped and restarted once the prior suites are up and running or, if they are hanging on the trigger dependency, just the trigger task can be reset (go into the gui with gcylc suite-name, then right click on the task and use 'Reset state' to set it to either waiting or succeeded).
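A minimal sketch of restarting everything in dependency order after an external failure. The suite names are taken from the operational list on this page; restart_all and DRY_RUN are made-up names for illustration. By default it only prints the commands it would run. It does not wait for one suite to catch up before starting the next, and my_coast_suite and the pylag suites are left out, so treat it as a starting point rather than a tested tool.

```shell
#!/bin/sh
# Sketch: restart suites in dependency order with `cylc restart`.
# SUITES follows the order download -> wrf -> fvcom -> regrid.
# DRY_RUN and restart_all are hypothetical names, not part of cylc or rose.

SUITES="download wrf fvcom_tamar fvcom_rosa tamar_regrid rosa_regrid wider_tamar_regrid"
DRY_RUN="${DRY_RUN:-1}"

restart_all() {
    for suite in $SUITES; do
        if [ "$DRY_RUN" = "1" ]; then
            # Only show what would be run.
            echo "cylc restart $suite"
        else
            # Stop at the first failure so downstream suites are not
            # started against a broken upstream suite.
            cylc restart "$suite" || {
                echo "restart of $suite failed" >&2
                return 1
            }
        fi
    done
}
```

Calling restart_all prints the seven cylc restart commands in order; running it with DRY_RUN=0 set in the environment executes them instead.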

The general steps for restarting a suite called suite-name are:

  • Log in to phorcys (as a normal user, then sudo as modop; see the 11/2022 update above) and navigate to the suite directory (in ./Rose_suites/)
  • Check the gui for which task has failed using gcylc suite-name
  • To see the log files for individual tasks either use the gui or look in ~/cylc-run/suite-name/log/job/cycle-point/job-name/01/ on phorcys or CETO (depending on where the task has run)
  • Make sure the suite is fully stopped: cylc stop --now --now suite-name
  • For WRF only, clean the suite directories using rose suite-clean. If there is a dependency on another suite, it is possible there are still orphaned log tasks running, which will stop it being cleaned up properly. One way of getting rid of these is to go to ~/cylc-run/ and use lsof +D suite-name | awk '{print $2}' | tail -n +2 | xargs kill -9 (This whole step is no longer necessary for the fvcom suites, but still needs to be done for WRF for reasons I haven't got to the bottom of yet.)
  • Change the INITIAL_START_DATE in rose-suite.conf to the relevant start date, usually the day after it last ran successfully. The time with the date must be midnight (i.e. 00:00:00) for all suites, otherwise the inter-suite triggers won't work.
  • Restart the suite using rose suite-run. If you get an error 'Device or resource busy' then you need to run the lsof command described in the WRF clean step above.
  • Any suites downstream of the suite you've just restarted (e.g. the regrid and plotting suites if you've restarted an fvcom model suite) will need their trigger tasks reset. To do this go into the gui (gcylc suite-name), right click on the trigger task, and click 'Reset State' -> 'waiting'
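The non-gui steps above can be sketched as a small script. This is only an illustration: set_initial_start_date and restart_suite are made-up helper names, and it assumes rose-suite.conf contains a single line of the form INITIAL_START_DATE=<value>. The new value is written exactly as passed in, so give it in whatever format and quoting the suite already uses, with the time at midnight. The WRF-only clean step and the downstream trigger reset are not covered.

```shell
#!/bin/sh
# Sketch of the stop / edit start date / rerun steps for one suite.
# Both function names are hypothetical helpers, not cylc or rose commands.

# Replace the INITIAL_START_DATE line in a rose-suite.conf.
# Assumes one line of the form INITIAL_START_DATE=<value>.
set_initial_start_date() {
    conf="$1"
    value="$2"
    grep -q '^INITIAL_START_DATE=' "$conf" || {
        echo "INITIAL_START_DATE not found in $conf" >&2
        return 1
    }
    tmp="$conf.tmp.$$"
    # Rewrite only the matching line; every other line is copied unchanged.
    awk -v v="$value" '
        /^INITIAL_START_DATE=/ { print "INITIAL_START_DATE=" v; next }
        { print }
    ' "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Stop the suite, set the new start date, and rerun it.
restart_suite() {
    suite_dir="$1"   # e.g. a directory under ./Rose_suites/
    suite="$2"
    start="$3"
    # Stopping may fail if the suite is already down; carry on regardless.
    cylc stop --now --now "$suite" || true
    set_initial_start_date "$suite_dir/rose-suite.conf" "$start" || return 1
    # rose suite-run is run from the suite directory, as in the steps above.
    (cd "$suite_dir" && rose suite-run)
}
```

Usage would be along the lines of restart_suite ./Rose_suites/suite-name suite-name <start-date>, checking in the gui first which task failed.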

Individual suite notes and common problems:

download: Even though this suite runs every 12 hours and would backfill data, it must be run from midnight of the first day you are running the WRF/FVCOM suite (or a whole number of 12 hour periods before that), otherwise it will fail to trigger the other model runs.

regrid: Make sure the start date is the same as or later than the fvcom start date, otherwise it may wait forever for an fvcom run that won't ever happen.
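A quick way to sanity check the regrid start date before restarting, as a sketch only. regrid_start_ok is a made-up helper, and it assumes both dates are passed as YYYY-MM-DD, which is not necessarily the format used in rose-suite.conf. Since all suites start at midnight, comparing the dates alone is enough.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the regrid start date is the same as or
# later than the fvcom start date. Hypothetical helper; both arguments
# are assumed to be YYYY-MM-DD.
regrid_start_ok() {
    # Strip the dashes so the dates compare as integers (YYYYMMDD).
    fvcom=$(printf '%s' "$1" | tr -d '-')
    regrid=$(printf '%s' "$2" | tr -d '-')
    [ "$regrid" -ge "$fvcom" ]
}
```

For example, regrid_start_ok <fvcom-start> <regrid-start> || echo "regrid would wait forever" flags a regrid start date that is too early.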

housekeeping: Should be robust to starting whenever and with whatever start date. It currently deletes relative to the system date (not the Rose cycle point date), so if you are running the forecast starting from a while ago (e.g. > 1 week) then don't restart this immediately, as it might purge things on ceto too early.

Current list of operational suites:

Since there are some redundant suites in modop/Rose_suites, this is the list of suites expected to be running operationally:

  • download
  • wrf
  • fvcom_tamar
  • fvcom_rosa
  • tamar_regrid
  • my_coast_suite (different directory structure to the above; in mycoast_plot/mycoast-website-plots/roses)
  • pylag_viirs
  • pylag_olci
  • rosa_regrid
  • wider_tamar_regrid
  • housekeeping (currently broken)

For the adrift SAR model additionally:

  • tamar_regrid_wind
  • wider_tamar_wind
  • rosa_regrid_wind

The plotting suites are redundant now we have the mycoast website

Restart types

'Cold start' means starting from a cmems interpolated restart file. It only applies to the first day of forecast, after which each run hot starts from the previous FVCOM run.

If it's fallen over and you want to keep running from yesterday's forecast, then it depends on whether or not you have had to nuke the ~/cylc-run directories as part of the solution (e.g. by doing rose suite-clean).

If you have COLD_START=False and ARCHIVE_RESTART=False then it will attempt to retrieve the restart from the previous day from the relevant folder in ~/cylc-run/fvcom_xxx/share/cycle/DDDD/ on ceto, and fall over if that's been deleted.

If you have COLD_START=False and ARCHIVE_RESTART=True then it will attempt to retrieve an already made restart file for the previous day from the location defined by RESTART_ARCHIVE_DIR. At the moment the suites should be set up to automatically dump the ongoing restart files in the correct restart folder, so you should always be able to restart in this way if the model has run to completion (but not, e.g., if the model failed to transfer output because sthenno is full).
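The combinations above can be summarised as a small lookup. This is a sketch of the logic as described on this page, not code from the suites. restart_source is a made-up name, and treating COLD_START=True as taking priority over ARCHIVE_RESTART is an assumption.

```shell
#!/bin/sh
# Print where the restart file is expected to come from for a given
# COLD_START / ARCHIVE_RESTART combination. Hypothetical helper that
# only mirrors the description in the text.
restart_source() {
    cold="$1"      # value of COLD_START (True or False)
    archive="$2"   # value of ARCHIVE_RESTART (True or False)
    if [ "$cold" = "True" ]; then
        # Assumption: a cold start ignores ARCHIVE_RESTART.
        echo "cold: cmems interpolated restart file"
    elif [ "$archive" = "True" ]; then
        echo "archive: location defined by RESTART_ARCHIVE_DIR"
    else
        # Fails if ~/cylc-run has been cleaned, e.g. by rose suite-clean.
        echo "previous: ~/cylc-run/fvcom_xxx/share/cycle/DDDD/ on ceto"
    fi
}
```

So after a rose suite-clean, restart_source False False points at a folder that no longer exists, which is the case where switching to ARCHIVE_RESTART=True is needed.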
