Watchdog Timer upgrade discussion #863

mikaelsky (Collaborator) · 2024-03-24T15:23:34Z

Looong, sorry in advance :)

The current watchdog timer, now sans IRQ, could benefit from a few upgrades.

Before talking about which upgrades, let's first define why we have a watchdog timer and what purpose it serves.

:puts on professor hat:
A watchdog is there to ensure that if the device it monitors, in our case a processor, enters an undefined state, this can be discovered and reacted to. We achieve this by way of a monitoring function that the watchdog uses to determine that the monitored device is still in a well-defined state. The reason we use the word "device" here is that watchdogs are used for more than just processors; they are often found as system monitors where, e.g., a device has to perform a function every so often, which is then monitored. If the device fails to do this, an alarm will sound.

For a processor running firmware the monitoring function is often a timer that the firmware has to set/clear prior to the timer running out.
:takes off professor hat:

Keeping all this in mind we can now talk about a watchdog timer (monitor function) for our processor.

First thing to note is that watchdog timers never use IRQs and ISRs, which has now been fixed, yeah :) We don't use ISRs to service watchdogs because the point of the watchdog is to determine whether the processor/firmware is in an undefined state, e.g. executing something at random. The challenge here is that even when a processor has entered an undefined state, the IRQ subsystem is still fully functional. Hence, if we use an ISR to service a watchdog, it defeats the watchdog's main function, as the ISR will almost always be called correctly even if the processor is in an undefined state.
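To make the argument concrete, here is a minimal behavioral sketch in Python (not hardware or firmware code). It assumes a simple tick-based watchdog and a firmware main loop that stops making progress at some point while a periodic timer interrupt keeps firing. All names and numbers are hypothetical.

```python
class Watchdog:
    """Behavioral model: counts ticks and trips when not refreshed in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.count = 0
        self.tripped = False

    def refresh(self):
        self.count = 0

    def tick(self):
        self.count += 1
        if self.count >= self.timeout:
            self.tripped = True


def run(kick_from_isr, hang_at=20, ticks=200, kick_period=5, timeout=10):
    """Simulate firmware whose main loop stops making progress at `hang_at`.

    Returns the tick at which the watchdog trips, or None if it never does.
    """
    wdt = Watchdog(timeout)
    for t in range(ticks):
        main_loop_alive = t < hang_at
        if kick_from_isr:
            # The periodic IRQ keeps firing even though the main loop is stuck.
            if t % kick_period == 0:
                wdt.refresh()
        elif main_loop_alive and t % kick_period == 0:
            # The refresh only happens as part of normal main-loop progress.
            wdt.refresh()
        wdt.tick()
        if wdt.tripped:
            return t
    return None


print(run(kick_from_isr=True))   # None: the hang is never detected
print(run(kick_from_isr=False))  # 24: detected shortly after the hang
```

In this model the ISR-serviced watchdog never notices the hang, while the main-loop-serviced one trips within one timeout period of the last good refresh.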

Watchdogs also have lock bits, which I believe we already have. The lock bit means that once the watchdog is enabled, nothing but the refresh register is writable.
For RISC-V it's also useful to have additional enable controls for sleep and debug modes. In particular, the ability to disable the watchdog in debug mode is quite useful. Again, I believe we already have this.
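As a rough illustration of the lock bit and the debug/sleep enables, the following Python sketch models the register behavior described above. It is not the actual register layout of any implementation; the class, field, and method names are made up for the example.

```python
class LockableWatchdog:
    """Behavioral model of a watchdog with a lock bit and mode-dependent halts."""

    def __init__(self):
        self.enable = False
        self.lock = False
        self.timeout = 0
        self.halt_in_debug = False
        self.halt_in_sleep = False
        self.count = 0

    def write_ctrl(self, *, enable, lock, timeout,
                   halt_in_debug=False, halt_in_sleep=False):
        """Write the control register; ignored once enabled and locked."""
        if self.enable and self.lock:
            return False
        self.enable = enable
        self.lock = lock
        self.timeout = timeout
        self.halt_in_debug = halt_in_debug
        self.halt_in_sleep = halt_in_sleep
        return True

    def refresh(self):
        """The refresh register stays writable even when locked."""
        self.count = 0

    def tick(self, debug=False, sleep=False):
        """Advance one tick; returns True when the timeout is reached."""
        if not self.enable:
            return False
        if (debug and self.halt_in_debug) or (sleep and self.halt_in_sleep):
            return False  # counter is frozen in this mode
        self.count += 1
        return self.count >= self.timeout


wdt = LockableWatchdog()
wdt.write_ctrl(enable=True, lock=True, timeout=3, halt_in_debug=True)
# Runaway code trying to switch the watchdog off has no effect:
print(wdt.write_ctrl(enable=False, lock=False, timeout=1000))  # False
```

Whether a rejected write is silently ignored (as here) or treated as a fault is a design choice this sketch does not try to settle.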

An additional set of watchdog enable/control bits tied to system power states is also useful. These would control the watchdog's behavior during the various power states. The simplest is adding support for a "deep sleep" power state. In this state you would want to disable the watchdog timer to 1) prevent it from triggering during deep sleep and 2) prevent the watchdog from coming up in an unknown state after we exit deep sleep.
Similar controls can beneficially be added to the machine timer.

From an SoC perspective it is also useful to have the option of letting the watchdog either reset the processor - if we are the primary core - or, more commonly, indicate to an external system controller that the watchdog has tripped and this processor is in an undefined state.
Why do we not want to always reset the processor by default? In a multi-core system it's dangerous to have a random core just reset itself without notifying the other processors in the system that it happened.
E.g. imagine a system with 4 motor controllers that are all controlled by a system controller based on inputs provided by a user. Now imagine what would happen if the watchdog of 1 of the 4 motor controllers timed out and it just reset its processor. In this scenario the system controller wouldn't know that it just lost control of 1 motor, nor that it has to reboot the controller of said motor to regain control of it.
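The difference between the two timeout actions can be sketched as a toy Python model of the motor-controller example. The `SystemController` class and the `TimeoutAction` names are purely illustrative and do not correspond to any existing interface.

```python
from enum import Enum


class TimeoutAction(Enum):
    RESET_CORE = "reset_core"                # core silently resets itself
    NOTIFY_CONTROLLER = "notify_controller"  # signal the system controller


class SystemController:
    """Hypothetical supervisor of several motor-controller cores."""

    def __init__(self, num_motors):
        self.in_control = [True] * num_motors
        self.reboots = []

    def on_watchdog_trip(self, motor_id):
        # The controller learns that this motor is no longer under control...
        self.in_control[motor_id] = False
        # ...and can perform a coordinated reboot and re-initialization.
        self.reboots.append(motor_id)
        self.in_control[motor_id] = True


def watchdog_timeout(motor_id, action, controller):
    """Return True if the system controller was informed of the failure."""
    if action is TimeoutAction.NOTIFY_CONTROLLER:
        controller.on_watchdog_trip(motor_id)
        return True
    # RESET_CORE: the core restarts on its own and nobody else is told.
    return False


ctrl = SystemController(num_motors=4)
print(watchdog_timeout(2, TimeoutAction.RESET_CORE, ctrl), ctrl.reboots)
print(watchdog_timeout(2, TimeoutAction.NOTIFY_CONTROLLER, ctrl), ctrl.reboots)
```

With `RESET_CORE` the controller's view of the system is stale after the event; with `NOTIFY_CONTROLLER` it can sequence the recovery itself.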

Another useful feature of a watchdog timer is variable timeouts. We like to have a variable timeout because different firmware states and functions have different requirements. E.g. an OS scheduler can use a watchdog to determine whether an application has timed out.
More often, though, variable timeouts are used in power management to ensure we are woken up in time by an external event. E.g. during normal operation any compute interval should never exceed 100 ms, but during a period of sleep it should never exceed 10 s.
For this feature it's normal to have a count-down timer where firmware keeps updating the count-down value with the current inter-event period.

Up vs down counters. Both can be used for a watchdog. For variable-interval watchdogs, down counters can be beneficial as they require only 1 register write: that write serves both as a clear of the watchdog timer and as the setting of a new count-down value. Using a down counter will also save a tiny bit of area, as the sign bit can be used to indicate time-out vs the general compare circuit required by an up counter.
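A small Python model of the two counter styles may help here. It assumes a hypothetical 16-bit down counter with one extra sign bit and a 1 ms tick, so the 100 ms and 10 s intervals from the example above become reload values of 100 and 10000. None of this reflects an actual implementation.

```python
WIDTH = 16                              # magnitude bits of the down counter
FULL_MASK = (1 << (WIDTH + 1)) - 1      # magnitude plus one sign bit


class DownCounterWatchdog:
    """One write both clears the watchdog and sets the next interval."""

    def __init__(self):
        self.count = 0

    def write_reload(self, value):
        self.count = value & ((1 << WIDTH) - 1)

    def tick(self):
        # Stop counting once the sign bit is set so the timeout is sticky.
        if not self.timed_out:
            self.count = (self.count - 1) & FULL_MASK

    @property
    def timed_out(self):
        # Timeout is just the sign bit: set when the count goes below zero.
        return bool(self.count >> WIDTH)


class UpCounterWatchdog:
    """Needs a full-width comparator and two writes to change the interval."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def set_timeout(self, threshold):
        self.threshold = threshold

    def refresh(self):
        self.count = 0

    def tick(self):
        self.count += 1

    @property
    def timed_out(self):
        return self.count >= self.threshold


down = DownCounterWatchdog()
down.write_reload(100)      # normal operation: next event due within 100 ticks
down.write_reload(10_000)   # going to sleep: a single write changes the interval

up = UpCounterWatchdog(100)
up.set_timeout(10_000)      # the up counter needs a threshold write...
up.refresh()                # ...plus a separate clear
```

Note that in this model a reload value of N trips on tick N+1, since the sign bit only sets once the count passes below zero.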

There are obviously even more safety features that can be built in, like wrapping the watchdog timer update function in a jump-over field to prevent a sequential execution walk through memory from triggering the watchdog register update.
Another is using "<" or ">" operators, rather than an exact match, for the watchdog compare so that a clock/power glitch cannot prevent the watchdog from triggering. Such glitches can cause 1 or more flops to flip, which can make the watchdog timer register skip over the compare point.
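The skip-over failure can be reproduced with a few lines of Python. This is only a behavioral sketch: it assumes an up counter with a threshold of 100 and a single injected bit flip, with all parameter names and values chosen for illustration.

```python
import operator


def trips(compare, threshold=100, flip_at=None, flip_bit=5,
          ticks=300, width=16):
    """Return True if the watchdog compare fires within `ticks` cycles.

    `compare(count, threshold)` is the timeout condition. If `flip_at` is
    given, one counter bit is flipped in that cycle to model a glitch.
    """
    mask = (1 << width) - 1
    count = 0
    for t in range(ticks):
        count = (count + 1) & mask
        if t == flip_at:
            count ^= 1 << flip_bit   # glitch: 91 becomes 123, past the threshold
        if compare(count, threshold):
            return True
    return False


# Without a glitch both compare styles work.
print(trips(operator.eq), trips(operator.ge))                          # True True
# With a glitch that jumps over the threshold, "==" never fires.
print(trips(operator.eq, flip_at=90), trips(operator.ge, flip_at=90))  # False True
```

With the exact-match compare the counter would have to wrap all the way around before hitting the threshold again, so the timeout is effectively lost.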

I've seen "password" writes used to prevent an "accidental write" from clearing the watchdog timeout. This is a somewhat hardware-centric view: the firmware that updates a watchdog would include said password write as part of the update sequence, and as noted above the most likely culprit here would be an accidental jump into the watchdog update function, which a "password" won't protect us from.
So what's left is that the "password" protects us from an accidental random write, to 1 memory location in our 1 giga-word memory space, of a value that would then be counted from. The likelihood of such a write is quite low, and the watchdog would still count using the new value. For a "password" to be truly useful the failure mode has to be a consistent accidental write to our specific register, which more often than not ends up being the actual watchdog update function being consistently jumped to, and we are back to the firmware case again.
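For readers who have not met the idea, a "password" refresh can be modeled in Python roughly as below. The magic value and all names are hypothetical; the sketch also shows the limitation discussed above, namely that the legitimate update routine always carries the password.

```python
MAGIC = 0xA5C3  # hypothetical password value


class PasswordWatchdog:
    """Behavioral model: only a write carrying the password clears the count."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.count = 0

    def write_refresh(self, word):
        if word != MAGIC:
            return False   # stray write with a random value is ignored
        self.count = 0
        return True

    def tick(self):
        self.count += 1
        return self.count >= self.timeout


def feed_watchdog(wdt):
    """The firmware's legitimate update routine, password included."""
    return wdt.write_refresh(MAGIC)


wdt = PasswordWatchdog(timeout=5)
wdt.tick()
print(wdt.write_refresh(0x1234))  # False: a random stray write is rejected
print(feed_watchdog(wdt))         # True: but runaway code landing here still feeds
```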

There are debates on whether it is smart or not to put watchdog functions in ROM vs RAM. As ROM is immutable, it cannot be corrupted by undefined behavior from the processor. Which could be a good thing... or it could be a bad thing :)

stnolting (Maintainer) · 2024-03-26T20:52:29Z

> :puts on professor hat:
> A watchdog is ... [...]
> :takes off professor hat:

Great definitions. Full ACK! 👍

> First thing to note is that watchdog timers never use IRQs and ISRs, which has now been fixed, yeah :)

In my defense, I did not invent this "bark and bite" concept. Here you will find a good overview of the general functions of a watchdog with really fantastic illustration examples: https://en-support.renesas.com/knowledgeBase/16978434

> First thing to note is that watchdog timers never use IRQs and ISRs, which has now been fixed, yeah :)

✔️ This is already implemented.

> For RISC-V it's also useful to have additional enable controls for sleep and debug modes. In particular, the ability to disable the watchdog in debug mode is quite useful. Again, I believe we already have this.

✔️ This is also already implemented.

> An additional set of watchdog enable/control bits that are tied to system power states are also useful.

Good point. We already have the option to turn off the WDT when the CPU is in sleep mode (the only sleep/power-down mode we have). Providing some kind of power-monitoring would require technology-specific IP, which should be a no-go here. However, we could provide some external signals for this.

> Similar controls can beneficially be added to the machine timer.

I think this might be something we could consider for the "general purpose timer". However, the machine timer is defined by the RISC-V spec. so we ~~can~~ should not modify that.

> From an SoC perspective it is also useful to have the option of letting the watchdog either reset the processor - if we are the primary core - or, more commonly, indicate to an external system controller that the watchdog has tripped and this processor is in an undefined state.

Hmm right... However, this is intended as a single-core setup (at least for now). So would you propose to "output" the watchdog's state as a user-specific output signal?

> Using a down counter will also save a tiny bit of area, as the sign bit can be used to indicate time-out vs the general compare circuit required by an up counter.

Actually, I was waiting for someone to point at this 🤣
I prefer this count-up-and-check-if-threshold-match style as the actual counter FFs need less logic in front of them (just the adder carry-chain plus a clear, but no direct write). But this is just me, trying to save each and every precious LUT 😅

> Using "<" or ">" operators for the watchdog compare to prevent a clock/power glitch from preventing the watchdog from triggering. These glitches can cause 1 or more flops to flip, which can make the watchdog timer register skip over the compare point.

So something like a "windowed" watchdog?? I.e. the application has to feed the watchdog within a specified time window?
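For reference, a windowed watchdog in the sense asked about here could be modeled roughly as follows in Python. This is only a behavioral sketch with made-up names and window values; feeding too early is treated as a fault in the same way as feeding too late.

```python
class WindowWatchdog:
    """Feeds are only accepted inside [window_open, window_close) ticks."""

    def __init__(self, window_open, window_close):
        if not 0 <= window_open < window_close:
            raise ValueError("need 0 <= window_open < window_close")
        self.window_open = window_open
        self.window_close = window_close
        self.count = 0
        self.tripped = False

    def tick(self):
        self.count += 1
        if self.count >= self.window_close:
            self.tripped = True   # fed too late, or not at all

    def feed(self):
        if self.count < self.window_open:
            self.tripped = True   # fed too early, e.g. a runaway feed loop
        else:
            self.count = 0


wdt = WindowWatchdog(window_open=5, window_close=10)
for _ in range(6):
    wdt.tick()
wdt.feed()
print(wdt.tripped)  # False: fed inside the window
wdt.feed()
print(wdt.tripped)  # True: second feed came immediately, i.e. too early
```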

Or are we talking about the actual implementation of the timeout logic? Maybe a Gray counter with a "sanity check" (only one bit toggles per cycle) would be the way to go?
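The Gray-counter idea can be tried out with a short Python model. It assumes the standard binary-reflected Gray code and a sticky fault flag; the class name and the glitch-injection parameter are invented for the example, and real hardware would hold the Gray state in flops rather than recompute it from a binary count.

```python
def to_gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def one_bit_step(prev, cur):
    """True if exactly one bit differs between prev and cur."""
    diff = prev ^ cur
    return diff != 0 and diff & (diff - 1) == 0


class GrayCounter:
    """Behavioral model of a Gray counter with a one-bit-per-cycle check."""

    def __init__(self, width):
        self.mask = (1 << width) - 1
        self.binary = 0
        self.gray = 0
        self.fault = False   # sticky: stays set once a bad step is seen

    def tick(self, glitch_mask=0):
        prev = self.gray
        self.binary = (self.binary + 1) & self.mask
        # glitch_mask flips extra bits to model a clock/power glitch.
        self.gray = to_gray(self.binary) ^ glitch_mask
        if not one_bit_step(prev, self.gray):
            self.fault = True


ctr = GrayCounter(width=4)
for _ in range(40):          # includes wrap-around, which is also a 1-bit step
    ctr.tick()
print(ctr.fault)             # False
ctr.tick(glitch_mask=0b0100)
print(ctr.fault)             # True: more than one bit changed in this cycle
```

A single flipped bit always shows up as either zero or two changed bits in that cycle, so the check catches it; multi-bit glitches that happen to look like a valid step would still slip through.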

> For a "password" to be truly useful the failure mode has to be a consistent accidental write to our specific register, which more often than not ends up being the actual watchdog update function being consistently jumped to, and we are back to the firmware case again.

Hm, how to solve that? A challenge-response approach combined with some time window?!
I just realized that I have no idea about all those safety implications... :D
