The Pros and Cons of Embedded SREs
A comparison of the two main SRE team models: Embedded SREs vs. standalone SRE teams.
August 26, 2021
5 min read
Even seemingly minor math bugs in software code can have outsize consequences.
Reliability engineering and math skills tend to go hand-in-hand. After all, if you end up with a job helping to design or manage software, you probably had to pass some advanced math classes along the way.
Yet even the most skilled developers and SREs are only human, and they sometimes make math errors -- or design systems that do. Under the right circumstances, those mistakes can lead to significant problems.
To prove the point, here’s a look at four incidents or issues that were caused, at least in part, by math errors. Not all of these errors led to critical outages or disruptions, but they nonetheless demonstrate the importance of triple-checking your math if you want to build highly reliable systems.
The metric system and the imperial system (the measurement system used in the United States, among certain other countries) have coexisted awkwardly for more than two centuries, complicating the lives of designers and programmers who have to convert data from one system to the other.
Performing these conversions is typically straightforward enough. But you still have to do it -- and if you don’t, expensive problems could result.
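That’s the takeaway from the story of NASA’s Mars Climate Orbiter, which was lost in 1999 while attempting to enter orbit around Mars due to a very simple math mistake: One piece of software reported thruster impulse in imperial units (pound-force seconds), while the software that consumed the data expected metric units (newton-seconds), and nothing converted between the two.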
Given that the Orbiter had cost NASA $125 million to build and deploy, this was a costly lesson in the importance of double-checking your math.
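One way to make this kind of mistake harder to commit is to stop passing bare numbers between components and attach the unit to the value instead. The following is only a minimal sketch of that idea, not a description of NASA's software: the `Impulse` class and the `total_impulse` helper are hypothetical names invented for illustration. The conversion factor is the standard definition of the pound-force in newtons.

```python
from dataclasses import dataclass

# 1 pound-force = 0.45359237 kg * 9.80665 m/s^2, exact by definition.
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical value type that always stores impulse in newton-seconds."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens exactly once, at the boundary.
        return cls(value * LBF_TO_NEWTON)


def total_impulse(burns: list[Impulse]) -> Impulse:
    """Sum impulses; every input is already in newton-seconds."""
    return Impulse(sum(b.newton_seconds for b in burns))


if __name__ == "__main__":
    burns = [
        Impulse.from_pound_force_seconds(10.0),  # data from an imperial source
        Impulse.from_newton_seconds(5.0),        # data from a metric source
    ]
    print(total_impulse(burns).newton_seconds)
```

Because callers must pick a named constructor, the unit is stated at the point where the number enters the system, and a reviewer can see at a glance which convention each data source uses.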
What happens when you take the square root of four, then subtract two? Most calculators -- and fifth-graders -- would tell you the answer is 0.
But for about a decade, Windows Calculator reported the result of this calculation as -1.068281969439142e-19 or -8.1648465955514287168521180122928e-39, depending on which mode the calculator was set to use. Both values are vanishingly close to zero, but neither is the exact answer that such a simple calculation should produce.
The error was caused by a bug in the square root algorithm that the calculator used. And while it never led to any known incidents or outages, what’s notable about this math failure is how long it persisted: It took Microsoft nearly a decade to fix the bug, even though it was a widely known issue.
The lesson here, we suppose, is that even seemingly simple math calculations can go wrong when you don’t check your software for bugs. And when the main purpose of your software is to perform math, it’s especially important to vet it for mathematical flaws.
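A cheap way to vet math code for this class of flaw is to test it against inputs whose exact answers you already know, such as perfect squares. The sketch below is illustrative only and says nothing about how Windows Calculator was actually implemented: `naive_sqrt` is a deliberately crude, hypothetical square root routine, and `failing_perfect_squares` is a hypothetical test helper.

```python
import math


def naive_sqrt(x: float, iterations: int = 3) -> float:
    """Deliberately crude Newton's-method square root (illustration only)."""
    guess = x / 2.0 if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def failing_perfect_squares(sqrt_fn, limit: int = 100) -> list[int]:
    """Return every n in 1..limit for which sqrt_fn(n * n) - n is not exactly 0."""
    return [n for n in range(1, limit + 1) if sqrt_fn(n * n) - n != 0]


if __name__ == "__main__":
    # The standard library routine should pass for these small inputs.
    print("math.sqrt failures:", failing_perfect_squares(math.sqrt))
    # The crude routine stops iterating too early and leaves a residual error.
    print("naive_sqrt failures:", len(failing_perfect_squares(naive_sqrt)))
```

A check like this takes seconds to run, and it catches exactly the kind of residual error that users notice first: a result that should be a round number but isn't.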
It turns out that buggy calculations are an issue not just for basic apps like Windows Calculator, but also for state-of-the-art hardware components.
In 1993, Intel introduced the Pentium line of processors, which represented a leap forward in terms of processing speed over their 486 predecessors and promised to prepare the way for the next generation of Windows operating systems.
Yet the chips suffered from a fundamental math problem: Due to a flaw in floating-point division, they couldn’t reliably perform certain calculations past the eighth decimal place.
The flaw didn’t cause any major technical incidents (apparently, Windows 95 worked just fine without being able to crunch numbers eight decimals deep). But it did turn into a PR disaster for Intel, which ended up spending $475 million to replace flawed chips. It also harmed Intel’s reputation at a crucial point in the personal computing revolution, just as Windows PCs were becoming standard fixtures in homes.
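A division flaw like this can be exposed with a simple round-trip check: divide, multiply back, and compare against the original number. The sketch below uses a pair of operands that was widely circulated at the time as a test for the flaw; on affected chips, the residual was reportedly 256 rather than 0. The function names and the tolerance are assumptions chosen for illustration.

```python
def division_residual(x: float, y: float) -> float:
    """Return x - (x / y) * y, which should be zero or very nearly zero."""
    return x - (x / y) * y


def division_looks_sound(
    x: float = 4195835.0,
    y: float = 3145727.0,
    tolerance: float = 1e-6,  # assumed threshold; far below the reported error
) -> bool:
    """Round-trip sanity check for floating-point division."""
    return abs(division_residual(x, y)) <= tolerance


if __name__ == "__main__":
    print("residual:", division_residual(4195835.0, 3145727.0))
    print("division looks sound:", division_looks_sound())
```

A single operand pair proves nothing about correctness in general, but as a smoke test it shows how little code it takes to catch an error that cost hundreds of millions of dollars to remediate.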
If you’re of a certain age, you’ll recall waiting with bated breath on New Year’s Eve 1999 to see if the world was about to end due to the so-called Y2K bug.
The nature of the Y2K bug was quite simple: Some mission-critical operating systems and applications, which had been deployed decades earlier but were still in use, had been programmed to represent years using two digits instead of four. Thus, there was widespread fear during the late 1990s that when the year 2000 rolled around, these systems would think it was actually the year 00, and all hell would break loose.
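To see why two-digit years are a problem, consider date arithmetic that spans the turn of the century. The sketch below is a hypothetical illustration, not code from any real system. It also shows "windowing," one common style of remediation in which two-digit years are expanded around a pivot; the pivot value of 50 here is an arbitrary assumption.

```python
def years_between_two_digit(start_yy: int, end_yy: int) -> int:
    """Naive arithmetic on two-digit years, as a legacy system might do it."""
    return end_yy - start_yy


def expand_two_digit_year(yy: int, pivot: int = 50) -> int:
    """Expand a two-digit year to four digits using a fixed window.

    Values at or above the pivot are treated as 19xx and values below it
    as 20xx. The pivot of 50 is an assumption made for this example.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    return 1900 + yy if yy >= pivot else 2000 + yy


def years_between(start_yy: int, end_yy: int, pivot: int = 50) -> int:
    """Year arithmetic after expanding both operands to four digits."""
    return expand_two_digit_year(end_yy, pivot) - expand_two_digit_year(start_yy, pivot)


if __name__ == "__main__":
    # An account opened in '65, evaluated in '00.
    print(years_between_two_digit(65, 0))  # -65: the account appears to predate itself
    print(years_between(65, 0))            # 35: the intended answer
```

Windowing only postpones the problem until dates cross the pivot, which is why storing four-digit years outright is the more durable fix.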
The anxiety that Y2K spawned was intense enough to generate a series of “survival guides” that promised to help preppers survive what some expected to be a post-Y2K apocalypse. The federal government also spent $50 million building a “Y2K war room” from which to manage incidents caused by the bug.
In the event, the Y2K bug didn’t lead to any riots, mass starvations or other apocalyptic events. Some would say the world averted disaster thanks to the considerable efforts that governments and businesses invested in updating their software prior to the dawn of the new millennium. Others contend that the problem was blown out of proportion and was never as serious as some claimed.
Yet on a smaller scale, the Y2K issue did cause some incidents, like satellite failures and isolated mistakes in banking systems.
There are perhaps two takeaways from the Y2K affair. One is that programming issues that may seem catastrophic don’t always turn out to be so devastating, and it’s important to keep them in perspective as you plan a response.
Another is that developers should design their systems to be reliable for decades into the future, even if they don’t expect them to remain in use for that long. Even when systems are resilient in a technical sense, cultural and political anxiety stemming from a lack of confidence in them can be much more disruptive than actual technical issues.
Most of these math-related bugs or flaws didn’t lead to major technical failures; the loss of the Mars Climate Orbiter is the exception. But several of them caused significant financial loss and/or reputational harm -- along with mass anxiety, in the case of the Y2K affair. They’re reminders that even seemingly trivial imperfections within software can have major consequences.