Biggest Software Failures Software Testing Can Help Catch

Software failures can cost more than money: lives, livelihoods, and personal property can be on the line as well. The four failures below are some of the biggest of the last fifty years, and each carries a lesson in software testing that can help avoid similar losses.

American Airlines FNS, 1995

What Happened: In 1995, American Airlines Flight 965, en route from Miami, FL to Cali, Colombia, crashed into a mountain, killing 151 of its 155 passengers and the entire crew. Although the pilots made bad navigation decisions, a problem in the flight navigation system made matters worse: selecting a direct routing cleared all pre-existing radio beacons from the route. Because approaches to Cali relied on radio beacons to help pilots navigate the mountains, this left the pilots without important terrain information.

What We Can Learn: Though the fault was ultimately pilot error, smarter negative testing of the flight navigation system (FNS) could have helped the pilots catch their mistakes before those mistakes became fatal. Users do not always use software as intended, and they make simple mistakes. That is why negative testing, which checks that the software recognizes mistakes and produces an error message, is just as important as positive testing. To be clear, it’s not enough to make sure software works as intended; it must also not work in unintended ways.
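To make the idea of negative testing concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `WAYPOINTS` table, the `resolve_waypoint` function, and the beacon names are assumptions, not a model of the actual aircraft software. The point is the shape of the tests: alongside the one positive case, several cases check that bad input is refused with an error rather than silently accepted.

```python
class InvalidWaypointError(ValueError):
    """Raised when a waypoint identifier cannot be resolved safely."""


# Hypothetical lookup table: identifier -> every beacon that matches it.
WAYPOINTS = {
    "ALPHA": ["ALPHA (north approach)"],
    "BRAVO": ["BRAVO (beacon 1)", "BRAVO (beacon 2)"],  # ambiguous on purpose
}


def resolve_waypoint(identifier: str) -> str:
    """Return the single beacon matching the identifier, or raise an error."""
    key = identifier.strip().upper()
    if not key:
        raise InvalidWaypointError("empty identifier")
    matches = WAYPOINTS.get(key, [])
    if not matches:
        raise InvalidWaypointError(f"unknown waypoint: {key!r}")
    if len(matches) > 1:
        # Refuse to guess: make the user choose explicitly.
        raise InvalidWaypointError(f"ambiguous waypoint: {key!r}")
    return matches[0]


def rejects(identifier: str) -> bool:
    """Negative-test helper: True if the input is refused with an error."""
    try:
        resolve_waypoint(identifier)
    except InvalidWaypointError:
        return True
    return False


# Positive test: valid input works as intended.
assert resolve_waypoint(" alpha ") == "ALPHA (north approach)"

# Negative tests: invalid input must produce an error, not a guess.
assert rejects("")       # empty entry
assert rejects("ZULU")   # unknown identifier
assert rejects("BRAVO")  # ambiguous identifier
```

A test suite with only the first assertion would pass even if the function quietly picked one of the two ambiguous beacons, which is exactly the kind of gap negative tests are meant to close.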

Canadian Cancer Therapy Machine (Therac-25), 1986

What Happened: Like many other major programming catastrophes, the six radiation-overdose accidents caused by the Therac-25 cancer therapy machine, several of them fatal, were the result of multiple smaller errors cascading into a major malfunction. The two main causes were determined to be the reuse of older software on new hardware and an overflow problem that occasionally reset a flag value to zero, bypassing safety measures.

What We Can Learn: One, never assume that software will work on a new system just because it worked on an old one. Always thoroughly test new combinations of software and hardware. Two, beware of potential integer overflow, a common class of software bug.
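The overflow lesson can be illustrated with a short Python sketch. It assumes, purely for illustration, a flag stored in a single byte that is incremented on every pass and a safety check that runs only while the flag is nonzero. The function names are hypothetical and this is not the Therac-25 source code. Python integers do not overflow on their own, so the one-byte wraparound is simulated with a bit mask.

```python
def increment_8bit(value: int) -> int:
    """Increment a counter stored in one byte; 255 + 1 wraps around to 0."""
    return (value + 1) & 0xFF


def safety_check_runs(flag: int) -> bool:
    """Hypothetical rule: the safety check runs only when the flag is nonzero."""
    return flag != 0


def passes_until_bypass(start: int = 0) -> int:
    """Count increments until the flag wraps to zero and the check is skipped."""
    flag = start
    passes = 0
    while True:
        flag = increment_8bit(flag)
        passes += 1
        if not safety_check_runs(flag):
            return passes


def set_flag() -> int:
    """Overflow-free alternative: assign a fixed nonzero value instead of counting."""
    return 1


assert increment_8bit(255) == 0           # the wraparound itself
assert passes_until_bypass() == 256       # every 256th pass skips the check
assert safety_check_runs(set_flag())      # a fixed value can never wrap to zero
```

A bug like this is easy to miss in testing because the first 255 passes behave correctly. Boundary-value tests that deliberately drive counters to their maximum are one way to surface it.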

DIA Baggage System, 1993

What Happened: The initial concept of the Denver International Airport included an automated baggage system for all three concourses. The project was scheduled to take two years. However, this was the most complicated automated baggage system ever built, and the scope of the project was too broad for a two-year deadline. Despite this, the project continued without a changed timeline. Project managers finally deemed the endeavor unsuccessful, at a cost of 560 million dollars and 16 months past the deadline.

What We Can Learn: A large part of what drove up the cost and timeline of the project was the complexity of its interlocking parts. Trying to integrate all three concourses meant automating pickup and drop-off for 88 gates and additional locations. These interlocking parts also weren’t backed up properly, and no redundancies were put in place in case of software failure. Lastly, the project attempted to release the entire system at once, rather than working in steady, smaller roll-outs. When planning a large, complex project, always break it down into smaller, more manageable parts and plan for failure. The more complex a software system, the more likely you are to have issues between the individually functioning parts. Therefore, it is critical to test the interaction between those parts before release.
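As a rough sketch of what testing the interaction between parts can look like, the Python example below wires two hypothetical components together: one that prints a bag tag and one that routes a bag from its tag. The tag format, function names, and gate numbers are all assumptions made for illustration and do not describe the real baggage system. Each function could pass its own unit tests and the pair could still disagree on the tag format, so the integration check sends every destination through both.

```python
from typing import List, Tuple

CONCOURSES = {"A", "B", "C"}  # three concourses, as in the article


def make_tag(concourse: str, gate: int) -> str:
    """Component 1 (hypothetical): encode a destination as a bag tag, e.g. 'B07'."""
    return f"{concourse}{gate:02d}"


def route(tag: str) -> Tuple[str, int]:
    """Component 2 (hypothetical): decode a bag tag back into a destination."""
    concourse, gate = tag[:1], int(tag[1:])
    if concourse not in CONCOURSES:
        raise ValueError(f"unknown concourse in tag {tag!r}")
    return concourse, gate


def integration_failures(destinations: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return every destination that does not survive tag -> route unchanged."""
    return [d for d in destinations if route(make_tag(*d)) != d]


# Exercise the two components together across all concourses and a range
# of gate numbers (the range is arbitrary for this sketch).
all_destinations = [(c, g) for c in sorted(CONCOURSES) for g in range(1, 31)]
assert integration_failures(all_destinations) == []
```

Running a check like this after each small roll-out, rather than once for the whole system, keeps a format mismatch between two components from surfacing only at release.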

Northeast Blackout, 2003

What Happened: In 2003, large parts of the Northeast U.S. coast and Ontario, Canada lost power. The power loss affected over 55 million people. Ultimately, the cause was a delayed alarm following the failure of a voltage monitoring tool outside of Cincinnati, Ohio. That failure caused the continually increasing electrical demand to transfer to other transmission lines. Those lines also overloaded, causing cascading overloads throughout the Northeast.

What We Can Learn: The Northeast Blackout is a very large example of load capacity failure. When the first system overloaded, it shut down and its demand transferred to the next system, and so on down the line. The demand on the remaining systems increased with each new shutdown.

Knowing exactly what your maximum load capacity is will help you plan ahead to avoid reaching the point of overload. When you know what to watch out for, you can put systems in place to prevent the problem.
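The cascading pattern described above can be sketched with a toy model in Python. The model assumes that total demand is shared equally among the lines still in service and that a line trips as soon as its share exceeds its capacity. Real power grids do not distribute load this simply, and the capacities and demand figures are made-up numbers. The sketch shows how the load per line grows with each shutdown and how a known maximum safe demand marks the point to stay below.

```python
from typing import List, Tuple


def max_safe_demand(capacities: List[float]) -> float:
    """Largest total demand that no line's equal share would exceed."""
    return min(capacities) * len(capacities)


def simulate_cascade(capacities: List[float], total_demand: float) -> Tuple[List[float], int]:
    """Return (load per line in each round, number of lines left in service)."""
    in_service = list(capacities)
    history: List[float] = []
    while in_service:
        share = total_demand / len(in_service)  # equal-sharing assumption
        history.append(share)
        survivors = [c for c in in_service if c >= share]
        if len(survivors) == len(in_service):
            break  # stable: no line is over capacity
        in_service = survivors  # overloaded lines trip; demand moves on
    return history, len(in_service)


lines = [100.0, 120.0, 150.0, 200.0]  # hypothetical capacities

# At the safe limit, the system is stable after one round.
assert max_safe_demand(lines) == 400.0
assert simulate_cascade(lines, 400.0) == ([100.0], 4)

# Ten percent over the limit, every line eventually trips, and the load
# per surviving line rises each round.
history, lines_left = simulate_cascade(lines, 440.0)
assert lines_left == 0
assert history == sorted(history) and len(history) == 3
```

In this toy model a demand only 10 percent above the safe limit takes down every line, which is why load testing aims to find the limit before production traffic does.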

The impact of software failures varies greatly. Therefore, it is imperative to test software for things like functionality and usability regardless of potential risk. Software testing protects users from damage and protects a company and its stakeholders from liability and reputation loss. If you’ve got a public-facing software application or website, have it tested by a qualified third-party software testing team.

William Miller
