Sometimes, when it comes to software failure, more than just money can be lost. Sometimes lives, livelihoods and personal property are on the line. These four incidents are among the most serious software failures of the last fifty years, and each offers a lesson in software testing that can help prevent similar losses.
1. American Airlines FNS, 1995
What Happened: In 1995, American Airlines Flight 965, en route from Miami, FL to Cali, Colombia, crashed into a mountain, killing 151 of its 155 passengers as well as the entire crew. Bad navigation decisions by the pilots were exacerbated by a problem with the flight navigation system (FNS): selecting a direct routing cleared all pre-existing radio beacons from the route. Since Cali relied on radio beacons to help pilots navigate the mountains, this left the pilots without important terrain information.
What We Can Learn: Though the fault was ultimately pilot error, more thorough negative testing of the FNS could have helped the pilots catch their mistakes before they became fatal. Users will sometimes make mistakes or use software in ways it was never intended to be used. This is why negative testing, which checks that the software recognizes a mistake and produces an error message, is just as important as positive testing. It’s not enough to make sure your software works as intended; make sure it won’t try to work in unintended ways, too.
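To make the difference between positive and negative testing concrete, here is a minimal sketch in Python. It is not the real flight navigation software: the `direct_to` function, the `RouteError` exception and the `confirm_discard` parameter are all hypothetical names invented for this illustration, and the rule that a direct routing must be confirmed before waypoints are dropped is an assumption.

```python
class RouteError(Exception):
    """Raised when a routing request is invalid or would silently drop waypoints."""


def direct_to(route, target, confirm_discard=False):
    """Return a new route that goes straight to `target`.

    Hypothetical rule for this sketch: refuse to drop intermediate
    waypoints unless the caller explicitly confirms the discard.
    """
    if target not in route:
        raise RouteError(f"unknown waypoint: {target!r}")
    index = route.index(target)
    dropped = route[:index]
    if dropped and not confirm_discard:
        raise RouteError(f"direct routing would discard waypoints: {dropped}")
    return route[index:]


def test_positive_case():
    # Positive test: the intended use works.
    assert direct_to(["A", "B", "C"], "C", confirm_discard=True) == ["C"]


def test_negative_cases():
    # Negative tests: each mistake must produce an error, not a silent change.
    bad_calls = [
        (["A", "B", "C"], "C"),  # would drop A and B without confirmation
        (["A", "B"], "Z"),       # waypoint is not on the route
    ]
    for route, target in bad_calls:
        try:
            direct_to(route, target)
        except RouteError:
            continue
        raise AssertionError(f"expected RouteError for {route}, {target!r}")


test_positive_case()
test_negative_cases()
```

The positive test only proves that the happy path works. The negative tests are the ones that would catch a function that quietly throws away data when the user makes a mistake.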
2. Canadian Cancer Therapy Machine (Therac-25), 1986
What Happened: Like many other major programming catastrophes, the six radiation-overdose accidents caused by the Therac-25 cancer therapy machine, several of them fatal, were the result of multiple smaller errors cascading into a major malfunction. The two main causes were determined to be the reuse of older software on new hardware, and a flag variable that occasionally overflowed and reset to zero, bypassing safety checks.
What We Can Learn: One, never assume that software will work on a new system just because it worked on an old one. Always thoroughly test new combinations of software and hardware. Two, watch for potential integer overflow, as it’s one of the most common software bugs.
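The overflow lesson can be shown with a simplified model. The sketch below is not the Therac-25 code; it assumes a one-byte flag that is incremented on every pass and treated as "run the safety check" whenever it is nonzero. The function names are hypothetical.

```python
def unsafe_check_needed(passes):
    """Model a one-byte flag that is incremented on every pass."""
    flag = 0
    for _ in range(passes):
        # Mask to 8 bits so the value wraps to 0 after 255,
        # the way a one-byte variable would.
        flag = (flag + 1) & 0xFF
    return flag != 0  # nonzero means "run the safety check"


def safe_check_needed(passes):
    """Set the flag to a fixed value instead of incrementing it."""
    flag = False
    for _ in range(passes):
        flag = True  # cannot overflow, no matter how many passes
    return flag


# Boundary tests around the wrap-around point.
assert unsafe_check_needed(255) is True
assert unsafe_check_needed(256) is False  # flag wrapped to 0: check is skipped
assert safe_check_needed(256) is True
```

A test suite that only tried a handful of passes would never see the failure. Testing at and just past the largest value a variable can hold is what exposes it.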
3. DIA Baggage System, 1993
What Happened: When Denver International Airport was first designed, the plan called for an automated baggage system serving all three concourses, to be finished within two years. Multiple experts warned that, since this was the most complicated automated baggage system that had ever been built, the scope and size of the project were much too large for a two-year deadline. Despite this, the project continued without a change to the timeline. After $560 million and 16 months past the deadline, the project was declared unsuccessful.
What We Can Learn: A large part of what drove up the cost and timeline of the project was the complexity of its interlocking parts. Trying to integrate all three concourses meant automating pickup and drop-off for 88 gates and additional locations. Even more of a problem, these interlocking parts had no proper backups or redundancies in place in case of a failure. Lastly, the project attempted to release the entire system at once, rather than through steady, smaller roll-outs. When planning a large, complex project, always break it down into smaller, more manageable parts and plan for failure. The more complex a software system is, the more likely you are to have issues between its individual parts, and the more important it is to test the interaction between those parts before release.
4. Northeast Blackout, 2003
What Happened: In 2003, large parts of the Northeast U.S. and Ontario, Canada lost power, affecting over 55 million people. Ultimately, the cause was traced to a delayed alarm following the failure of a voltage monitoring tool outside of Cincinnati, Ohio. As a result, the steadily increasing electrical demand was transferred to other transmission lines, which also became overloaded, causing a cascading overload throughout the Northeast.
What We Can Learn: The Northeast Blackout is one very, very large example of load capacity failure. When the first system became overloaded, it shut down and its demand was transferred to the next system, and then the next, with the demand on the remaining systems growing after each new shutdown. Knowing exactly what your maximum load capacity is will help you plan ahead and avoid reaching the point of overload. When you know what to watch out for, you can begin putting systems in place to avoid the problem.
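The cascading effect described above can be illustrated with a toy model. This sketch makes a strong simplifying assumption that does not reflect how a real power grid behaves: when a line trips, its load is split evenly among the lines still in service. The function name `simulate_cascade` and the numbers are hypothetical.

```python
def simulate_cascade(capacities, loads, first_failure):
    """Return the indices of the lines that trip, in the order they trip.

    Toy model: when a line trips, its load is split evenly among the
    lines still in service; any line pushed past its capacity trips next.
    """
    loads = list(loads)              # copy so the caller's list is untouched
    alive = set(range(len(loads)))
    tripped = []
    pending = [first_failure]
    while pending:
        line = pending.pop(0)
        if line not in alive:
            continue
        alive.remove(line)
        tripped.append(line)
        if not alive:
            break                    # nothing left to absorb the load
        share = loads[line] / len(alive)
        loads[line] = 0.0
        for other in sorted(alive):
            loads[other] += share
            if loads[other] > capacities[other] and other not in pending:
                pending.append(other)
    return tripped


# With enough headroom, one failure stays contained.
assert simulate_cascade([100, 100, 100], [60, 60, 60], first_failure=0) == [0]

# Running close to capacity, the same failure takes down every line.
assert simulate_cascade([100, 100, 100], [80, 80, 80], first_failure=0) == [0, 1, 2]
```

The two runs differ only in how close each line sits to its limit. That margin is what load testing is meant to measure before a real failure does it for you.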