The Technology of Failure
We want everything to work perfectly all the time – it is just the way humans are. And as much as you might bemoan the failure of your smartphone to update contacts today, if it were a choice between that and the brakes failing on your car, the answer is pretty obvious.
So if ABS and airbag systems can be made so that they virtually never fail, why isn’t the same failure-busting technology applied to everyday items like smartphones? This blog entry aims to give an appreciation of the high price that is rightly paid to failure-proof electronics, and of how it is done.
Avoidance and Simplification
Safety needs to be considered early and holistically in the design of a product. While you can put a cage around a spinning fan blade so people don’t get their fingers and clothes caught, wouldn’t it be better not to have the spinning fan blade at all? Now, it may not be practical to eliminate the spinning fan, and the cage around the fan blade is a simple solution. Where there is a simple solution that works, it is the obvious preference; later we will see some reasons why more complex solutions are often necessary.
This avoidance and simplification is seen throughout the white goods industry. One of the biggest risks for motorised products is that a motor will be damaged or overloaded, and that the windings will overheat and not only start a fire but keep pumping heat into it, creating a situation where even non-flammable plastics will burn ferociously. But our newspapers are not filled with reports of such tragedies. Many motors are fitted with over-temperature fuses in the windings that go open circuit when the motor gets too hot. They are often one-time fuses, meaning that once one has tripped the motor is no more than a boat anchor and, given the disposable nature of white goods, it is likely the whole product is a boat anchor. This does not preclude the electronic control from detecting a potentially hazardous condition, for example an incorrectly loaded washing machine, and shutting down before anything fails, but the safety does not depend on it. The high cost of design and production control of safety critical parts is now confined to the over-temperature fuse and its installation in the motor. The electronic control still needs to be reliable to keep the customers happy and the brand reputation intact, but a failure is a warranty claim, not a human tragedy.
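The split between the convenience layer and the safety layer can be sketched in a few lines of Python. This is only a toy model: the `Motor` class, the `controller_step` function and both temperature thresholds are hypothetical values invented for illustration, not figures from any real appliance or standard.

```python
# Hypothetical thresholds, chosen only for illustration.
SOFTWARE_LIMIT_C = 110.0  # controller stops the motor here (recoverable)
FUSE_TRIP_C = 140.0       # one-time thermal fuse opens here (permanent)


class Motor:
    """Toy model of a motor with a one-time thermal fuse in its windings."""

    def __init__(self):
        self.fuse_intact = True
        self.commanded_on = False

    def update_fuse(self, winding_temp_c):
        # The fuse reacts to temperature alone; it never consults the software.
        if winding_temp_c >= FUSE_TRIP_C:
            self.fuse_intact = False  # one-time fuse: never reset

    def is_running(self):
        # Current only flows if the controller asks for it AND the fuse is intact.
        return self.commanded_on and self.fuse_intact


def controller_step(motor, measured_temp_c):
    """Software guard: keeps customers happy, but safety does not depend on it."""
    if measured_temp_c >= SOFTWARE_LIMIT_C:
        motor.commanded_on = False
```

If `controller_step` is never called, say because the control has crashed, a winding temperature above the fuse threshold still stops the motor. The software path only spares the fuse, and with it the product.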
So step one in making systems safe is to make the safety system simple and local to the cause, to address the real danger and to rely on proven technology. Going back to the smartphone example: even when your phone has crashed, the safety circuit that prevents the battery from exploding is still fully operational. In fact it is packed in with the battery itself, so it protects the battery even when it is out of the phone.
Fail Safe or Must Operate
The next matter that needs to be addressed is the profound difference between a system that can fail without tragedy as long as it does so safely, like a washing machine, and a system that must operate to avoid tragedy, like an airplane. While these examples are extreme cases, the line dividing them is not that simple. An airplane is still a fail safe system, but its fail safe state is the plane stationary on the tarmac with the engines shut down. So in the airplane there is no direct, simple path from 10 km above the earth to being safe on the ground. Complex operation is still required to make the plane safe when things go wrong.
Even the simplest of systems, though, can be divided into ‘must operate’ and ‘fail safe’ components. The over-temperature fuse in the motor in the earlier example must operate when it gets above temperature. We call these must operate components ‘infallible’, which is a misnomer to say the least. They are better described as almost infallible when operated within specification and in the correct environment. Most fail safe systems rely on infallible parts at some point.
The Fallibility of Components
One can always imagine a mechanism by which an infallible part could fail. When you get a group of engineers together to vote on whether a part is infallible or not, all you will determine is which of the engineers are optimists and which are pessimists. Fortunately, many products have standards that lay out which parts are subject to which possible faults.
The IEC 60730 standard for Automatic Electrical Controls has an often-cited Appendix H27 which lists most common parts and how they are considered to fail. Reference to this settles most discussions.
Software
If you’ve used a desktop computer it is hard to imagine ever trusting your life to software but in fact you do make this act of faith every day. There are two challenges for software in safety systems. Firstly, is the processor running the software operating correctly? Secondly, is the software free of bugs?
Again, standards like IEC 60730 provide guidance in both of these areas. As to whether the processor is operating properly, in some applications it may be enough for the processor to ‘self-check’ by working through a series of checks laid out in a table that runs over several pages. Some processor manufacturers actually provide software to do these checks, along with specially designed hardware in the processor. More demanding applications, the so-called ‘Class C’ applications, require comparison to another processor or other fairly extreme hardware controls.
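To give a flavour of what a self-check looks like, here is a minimal sketch of one common kind: confirming that the program memory still matches a checksum recorded when the software was built. It is an illustration only and is not taken from the standard. The function names and the `"RUN"` and `"SAFE_STATE"` labels are made up for this example, and real firmware would do this in C against the actual memory rather than a Python `bytes` object.

```python
import zlib


def image_checksum(image: bytes) -> int:
    """CRC-32 of the program image, as would be recorded at build time."""
    return zlib.crc32(image) & 0xFFFFFFFF


def rom_self_check(image: bytes, expected_crc: int) -> bool:
    """True if the image still matches the checksum stored at build time."""
    return image_checksum(image) == expected_crc


def periodic_self_check(image: bytes, expected_crc: int) -> str:
    # Hypothetical policy: any mismatch sends the control to its safe state.
    return "RUN" if rom_self_check(image, expected_crc) else "SAFE_STATE"
```

A check like this says nothing about whether the software is free of bugs. It only gives some confidence that the memory holding it has not been corrupted since it was built.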
The software is one of the biggest areas of differentiation between your ABS brakes and your smartphone. Could you imagine gathering the authors of all the various pieces of software on your phone and getting them to swear there are no bugs in it and, further, asking them to prove it? If we had applied this requirement to the first PC, it might just be available now.
So how does one prove that software is bug-free? There are some good tools and methodologies for doing this, but they are extremely time-consuming. That is one reason the software in critical systems is kept as simple as possible. The standards call for a mix of verification strategies to be applied because no single evaluation method is enough.
Redundancy
Redundancy in safety critical systems is different to carrying a spare tyre. All the redundant systems operate at the same time and a decision is made based upon the multiple results obtained. If the results from the multiple redundant systems are not the same then there is a problem. In a fail safe system, you simply fail safe. In a must operate system, you’d better have three or more systems so you can go with the majority.
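The two decision rules just described can be sketched as follows. This is a simplified illustration: the function names and status labels are invented for the example, and it assumes the channels produce values that can be compared for exact equality. Real sensor readings would need a tolerance band.

```python
from collections import Counter


def fail_safe_compare(a, b):
    """Two channels: use the result if they agree, otherwise go to the safe state."""
    if a == b:
        return ("OK", a)
    return ("SAFE_STATE", None)


def majority_vote(results):
    """Three or more channels: return the value a strict majority agrees on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return ("OK", value)
    return ("NO_MAJORITY", None)
```

With three channels, `majority_vote([5, 5, 7])` keeps operating on the value 5 despite one faulty channel, whereas `fail_safe_compare(5, 7)` can only give up and fail safe.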
Redundancy does not eliminate infallible components altogether. You need an infallible device to compare the results of the systems and make a final decision.
Redundancy can be taken a step further in a form called heterogeneous or diverse redundancy. This is effectively using more hardware to help overcome design problems. A redundant system consisting of multiple systems that are the same can still all agree on the same wrong result if the design is wrong. Diverse systems overcome this problem by having multiple systems of different design – preferably designed by different teams. The chance that multiple designs will make the same mistake is quite low.
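A small software analogy may help here. The sketch below computes the same quantity, an integer square root picked purely as a stand-in for some safety calculation, with two deliberately different algorithms and only trusts the answer when they agree. All of the names are hypothetical, and in a real diverse system the two channels would typically be separate hardware and software built by separate teams.

```python
def isqrt_newton(n: int) -> int:
    """Channel A: integer square root of n >= 0 by Newton's iteration."""
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n: int) -> int:
    """Channel B: integer square root of n >= 0 by binary search."""
    lo, hi = 0, n + 1  # invariant: lo*lo <= n < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def diverse_result(n, channel_a=isqrt_newton, channel_b=isqrt_bisect):
    """Accept the answer only if two independently designed channels agree."""
    a, b = channel_a(n), channel_b(n)
    if a == b:
        return ("OK", a)
    return ("SAFE_STATE", None)
```

If the same buggy routine is plugged into both channels, the comparison happily reports agreement on the wrong answer. Pair the buggy routine with an independent design and the disagreement is caught.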
Redundancy is an extremely powerful technique but also the most expensive and is usually only called in when absolutely needed.
So if you want that smartphone that is as reliable as your ABS brakes you can have it. It will be 4 times the size and you will have to wait until 2022.