With large electronic development projects, a disproportionate amount of time is spent troubleshooting bugs in “finished” devices.
In the worst cases, projects are wrapped up and products are sold to customers even though they still have problems. Consider the failure of Baxter Healthcare’s Sigma Spectrum Infusion Pumps. Some of these devices mistakenly detected an open door, halting critical treatment and forcing clinicians to reboot them. This led to a Class I recall, the FDA’s most serious classification.
Successfully troubleshooting these problems prior to product launch is in everybody’s best interest, and can be the most rewarding part of a project for an engineer. (There’s the pride from a sense of accomplishment and the relief from not having to worry about customers who are dissatisfied or, even worse, in danger.)
A Systematic Approach
Electronic devices involve complex interactions between hardware and software, and they expose only limited information to the designer. Even the most “simple” electronic device may have hundreds of nodes, which can generate many different sequences of events. Typically, only one of these sequences is desired: the one in which the design successfully monitors all inputs and controls all outputs.
Engineers expect that a complex design will not behave as desired when it first arrives from the manufacturer. The good news is that there are established methods for applying deductive reasoning to fix these problems. The bad news is that the process can take five minutes or five months, and it requires a systematic approach and a tremendous amount of patience, creativity, and focus.
Step 1: Reproduce the Failure Symptom
There is a systematic troubleshooting process that should always be used to identify a problem’s root cause. The first step is to reproduce the failure symptom. A symptom that can’t be reproduced can’t be confidently fixed. Here are some techniques for reproducing intermittent problems:
- Gather as much information as possible from the failure’s witnesses.
- Apply environmental stresses such as thermal, mechanical, and electrical shock.
- Review data logs to identify failure sequences.
- Simulate the circuit and test the effects of hypothetical open and short circuits at different locations and sequences.
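As an illustration of the log-review technique, here is a minimal Python sketch that counts which event sequences most often precede a failure. The log format (a flat list of event names), the event names themselves, and the function name `sequences_before_failure` are all assumptions made for this example; a real data log would need its own parsing.

```python
from collections import Counter


def sequences_before_failure(events, failure_event, window=3):
    """Count the event sequences that immediately precede each failure.

    events: list of event names in the order they were logged.
    failure_event: the name that marks the failure symptom.
    window: how many preceding events to capture for each failure.
    """
    counts = Counter()
    for i, event in enumerate(events):
        if event == failure_event and i >= window:
            counts[tuple(events[i - window:i])] += 1
    return counts


# Hypothetical log: these event names are invented for illustration.
log = [
    "power_on", "relay_on", "motor_start", "temp_high", "output_fault",
    "reset", "relay_on", "motor_start", "temp_high", "output_fault",
    "reset", "relay_on", "motor_start", "motor_stop",
]

common = sequences_before_failure(log, "output_fault")
for sequence, count in common.most_common():
    print(count, " -> ".join(sequence))
```

A sequence that shows up before most failures is a candidate recipe for reproducing the symptom on the bench.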
Schedule pressure often leads designers to make changes to address a symptom that can’t be recreated. Unfortunately, this wastes time and effort and instills a false sense of confidence, since there’s no way to verify the effect of the change. An engineer should never implement design changes to address a failure before it can be reproduced.
Step 2: Simplification
Once a failure has been reproduced, the second step in the troubleshooting process is simplification. This is done by removing functional blocks, one at a time, while monitoring behavior to see whether the failure symptom disappears.
Typically, the blocks are removed sequentially from output back to the input. This step requires creativity to determine how to remove a function but still monitor the circuit’s behavior. One of the most challenging parts of the process is to disconnect feedback loops, because their automatic control mechanisms may mask the root cause(s) of the problem.
In these cases, the feedback loops should be replaced by external sources to mimic their outputs without automatic adjustments. This simplification process should continue until the failure symptom has been eliminated.
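The simplification loop can be sketched in Python as follows. This is only a model of the procedure, not a test tool: the block names are invented, and `symptom_present` stands in for whatever bench measurement tells you the failure is still occurring.

```python
def simplify(blocks, symptom_present):
    """Remove blocks from the output back toward the input until the
    failure symptom disappears.

    blocks: block names ordered from input to output.
    symptom_present: callable that takes the list of active blocks and
        returns True if the symptom is observed (a bench test in practice).
    Returns (remaining_blocks, removed_blocks_in_removal_order).
    """
    active = list(blocks)
    removed = []
    while active and symptom_present(active):
        removed.append(active.pop())  # pop() takes the block nearest the output
    return active, removed


# Hypothetical circuit, listed from input to output.
circuit = ["power_supply", "sensor_frontend", "adc", "controller", "motor_driver"]

# Stand-in for a real measurement: assume the "adc" block is faulty.
faulty = {"adc"}


def bench_test(active):
    return any(block in faulty for block in active)


remaining, removed = simplify(circuit, bench_test)
print("symptom-free baseline:", remaining)
print("removed blocks:", removed)
```

The blocks left in `remaining` form the symptom-free baseline, and the `removed` list holds the candidates to examine next.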
Step 3: Reintroduction
After the symptom has been eliminated, the next step is to reintroduce the functional blocks to the circuit. Blocks should be added one at a time while the circuit’s output is monitored to see when the problem recurs. When the failure recurs, the block that was just added must contain at least one of the failure’s root causes and should be removed again. Continue reintroducing the remaining blocks to determine which are okay and which contain root causes of the failure.
These two steps (simplification and reintroduction) should be repeated within the problematic blocks to further narrow down the root cause. At the end of this sequence, there should be one or more components identified as root causes of the failure symptom.
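A rough Python sketch of the reintroduction pass is shown below. As with any model of a bench procedure, the names are hypothetical: `symptom_present` represents a real measurement, and the order in which blocks are added back is an assumption of this example.

```python
def reintroduce(baseline, candidates, symptom_present):
    """Add blocks back one at a time and sort them into suspects and cleared.

    baseline: blocks known to run without the symptom.
    candidates: blocks to add back, in the order they will be tried.
    symptom_present: callable that takes the list of active blocks and
        returns True if the symptom is observed (a bench test in practice).
    Returns (suspects, cleared).
    """
    active = list(baseline)
    suspects, cleared = [], []
    for block in candidates:
        active.append(block)
        if symptom_present(active):
            suspects.append(block)
            active.pop()  # take the suspect block back out before continuing
        else:
            cleared.append(block)
    return suspects, cleared


# Hypothetical example: assume two blocks each contain a root cause.
faulty_blocks = {"adc", "motor_driver"}


def measure(active):
    return any(block in faulty_blocks for block in active)


suspects, cleared = reintroduce(
    ["power_supply", "sensor_frontend"],
    ["adc", "controller", "motor_driver"],
    measure,
)
print("contain a root cause:", suspects)
print("okay:", cleared)
```

The same function can then be applied inside each suspect block, treating its sub-circuits as the new candidates, to narrow the search down toward individual components.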
Step 4: Fix Each Root Cause
Once all of the root causes have been identified, the next step is to fix each one. The solution depends on the cause, but it generally requires correcting a design mistake or making a component more robust.
Step 5: Verification
The final step of the troubleshooting process is verification. First, the fixed circuit should behave correctly under the conditions that previously caused the failure. Then, each design fix should be reverted to its original state while the circuit is monitored, to verify that the original symptom recurs. The purpose of this step is to confirm that every fix is actually necessary, so that only the minimum set of design changes is implemented, minimizing impact and implementation costs.
This troubleshooting approach is well established, but it’s often ignored due to time pressure or overconfidence. While skipping these steps may sometimes turn out all right for simple problems, shortcuts will be counterproductive for complex problems with multiple root causes.
(On-Ramp is an ongoing series explaining engineering techniques used in product development.)