This post could end with a single word: watchdog. But designing a good watchdog is a challenging task.
A hardware chip that cuts the power supply to the main processor is indispensable for real reliability. This chip must be pinged at regular intervals; if a ping is missed, it power-cycles the processor. If well calibrated, this system can be effective enough for single-threaded applications running on microcontrollers. But for microprocessors with an operating system and several processes running, a software watchdog is needed too.
The Software Side
Obviously the watchdog process (WDP from now on) must be tied to its hardware counterpart: the WDP is the one that pings the chip. In this way, if the WDP crashes, the system will reboot, ensuring that the other processes are never left without a monitor. This is the easy part; how the WDP checks that everything is working fine is another kettle of fish.
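The coupling between the WDP and the chip can be sketched as below. This is a minimal sketch, assuming a Linux system where the hardware watchdog is exposed as a device file (commonly `/dev/watchdog`) whose timer is reset by writing to it. The names `supervise`, `checks_ok` and `max_rounds` are hypothetical and exist only for illustration.

```python
import time
from typing import BinaryIO, Callable, Optional


def supervise(device: BinaryIO,
              checks_ok: Callable[[], bool],
              interval_s: float,
              sleep: Callable[[float], None] = time.sleep,
              max_rounds: Optional[int] = None) -> int:
    """Feed the hardware watchdog only while the software checks pass.

    Returns how many times the hardware was fed. As soon as checks_ok()
    returns False the loop stops feeding, so the hardware timer expires
    and the chip resets the system.
    """
    fed = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        if not checks_ok():
            break  # stop feeding on purpose: let the hardware reset us
        device.write(b"\0")  # assumption: any write resets the timer
        device.flush()
        fed += 1
        sleep(interval_s)
    return fed
```

On a real target the device would be opened once, for example with `open("/dev/watchdog", "wb", buffering=0)`, and `interval_s` chosen comfortably shorter than the hardware timeout. If the WDP itself crashes or hangs, the writes simply stop and the chip does the rest.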
One solution may be to monitor the status of every process (ensuring that it is running and not in a zombie state) and to watch for abnormal usage of CPU and RAM. The hard part here is defining what "abnormal" means.
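The status check can be sketched as follows. This is a rough illustration, assuming a Linux target where `/proc/<pid>/stat` holds the process state as a single letter right after the command name in parentheses. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid: int, proc_root: Path = Path("/proc")) -> Optional[str]:
    """Return the state letter for pid, or None if the process is gone."""
    try:
        return parse_state((proc_root / str(pid) / "stat").read_text())
    except (FileNotFoundError, ProcessLookupError):
        return None


def is_healthy(state: Optional[str]) -> bool:
    """A process is unhealthy if it is missing, a zombie (Z) or dead (X)."""
    return state is not None and state not in ("Z", "X")
```

A check like `is_healthy(process_state(pid))` covers the "running and not a zombie" part only. Thresholds for CPU and RAM usage are left out on purpose, since what counts as abnormal depends on the application.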
Besides this rough check, we can make each process feed the WDP at regular intervals. The drawback is that each process becomes more complicated, because it must include code unrelated to its core business. If this does not seem like a big disadvantage, try to imagine the amount of code needed for a big process with tens of threads running concurrently, each of which has to report in. Unfortunately, this is the price to pay for a really reliable system.
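The bookkeeping on the WDP side might look like the sketch below. It is only an illustration under assumptions: the class `HeartbeatRegistry` and its methods are hypothetical, and the transport that carries a feed from a process to the WDP (socket, pipe, shared memory) is left out.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Tracks when each monitored client (process or thread) last fed us."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadline_s: Dict[str, float] = {}
        self._last_fed: Dict[str, float] = {}

    def register(self, name: str, deadline_s: float) -> None:
        """Start monitoring a client that promises to feed every deadline_s."""
        self._deadline_s[name] = deadline_s
        self._last_fed[name] = self._clock()

    def feed(self, name: str) -> None:
        """Record a heartbeat from a registered client."""
        if name not in self._deadline_s:
            raise KeyError(name)
        self._last_fed[name] = self._clock()

    def overdue(self) -> List[str]:
        """Return the sorted names of clients that missed their deadline."""
        now = self._clock()
        return sorted(name for name, fed_at in self._last_fed.items()
                      if now - fed_at > self._deadline_s[name])
```

In a multithreaded process, each thread would register under its own name, so a single stuck thread is noticed even when the process as a whole still looks alive.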
Management Of The Failure
OK, now you have your WDP running on your system along with other processes that feed it. The next step is to decide what to do in case of failure. If a program is about to consume all the system resources, the obvious thing to do is to kill it. And then?
The answer depends on the process and the system architecture. For some processes, the right solution may be to restart them; for others, a system reboot may be required. Additional rules may be set on the number of failures allowed within a certain time. In an average system, all these strategies will probably be applied, each to different processes.
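One possible way to express such per-process rules is sketched below. The names `FailurePolicy` and `Action` and the specific thresholds are hypothetical. The sketch assumes the WDP calls `on_failure` with a monotonic timestamp each time it has to kill or finds dead the process the policy belongs to.

```python
from collections import deque
from enum import Enum
from typing import Deque


class Action(Enum):
    RESTART = "restart"
    REBOOT = "reboot"


class FailurePolicy:
    """Restart a process on failure, but reboot if it fails too often."""

    def __init__(self, max_failures: int, window_s: float,
                 always_reboot: bool = False) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.always_reboot = always_reboot
        self._failures: Deque[float] = deque()

    def on_failure(self, now: float) -> Action:
        """Record a failure at time `now` and decide what to do about it."""
        if self.always_reboot:
            return Action.REBOOT
        self._failures.append(now)
        # Forget failures that fell out of the observation window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) > self.max_failures:
            return Action.REBOOT
        return Action.RESTART
```

For example, `FailurePolicy(max_failures=2, window_s=60)` tolerates two failures within a minute and asks for a reboot on the third, while `always_reboot=True` suits a process whose failure leaves the system in an unknown state.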
After the WDP has done its dirty work, the failure is likely to reappear. It may be caused by a bug, an unhandled situation, or a memory leak that is slowly consuming the RAM. A good way to understand what happened is to take a memory dump of the "bad" process for post-mortem debugging. But unfortunately, this is often not enough.
In a complex system where processes interact, a log that shows what happened in the last minutes before the WDP intervened can be really useful. This means yet more extra code added to the processes.
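A log limited to the last minutes can be kept in memory and written out only when the WDP steps in. The sketch below is one hedged way to do it. The class `RecentLog` is hypothetical, and it assumes timestamps are passed in non-decreasing order.

```python
from collections import deque
from typing import Deque, List, Tuple


class RecentLog:
    """Keeps only the log entries from the last keep_s seconds."""

    def __init__(self, keep_s: float) -> None:
        self.keep_s = keep_s
        self._entries: Deque[Tuple[float, str, str]] = deque()

    def _prune(self, now: float) -> None:
        # Entries are appended in time order, so old ones sit at the left.
        while self._entries and now - self._entries[0][0] > self.keep_s:
            self._entries.popleft()

    def record(self, now: float, source: str, message: str) -> None:
        """Store one message from a process, dropping expired entries."""
        self._entries.append((now, source, message))
        self._prune(now)

    def dump(self, now: float) -> List[str]:
        """Return the surviving entries as formatted lines, oldest first."""
        self._prune(now)
        return [f"{t:.1f} [{src}] {msg}" for t, src, msg in self._entries]
```

When the WDP kills or restarts a process, it could save the result of `dump()` next to the memory dump, so the two can be read together during post-mortem analysis.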
Designing an effective and reliable watchdog for embedded systems is a complex task, and it often requires additional code in the other processes. But believe me, it's worth the hassle.
Image of Marty Feldman from the movie Young Frankenstein by Insomnia Cured Here taken from Flickr licensed under the Creative Commons Attribution-ShareAlike 2.0 Generic license.