Practical Guide to SRE: Incident Severity Levels
Incident severity levels are a measurement of the impact an incident has on the business. Classifying the severity of an issue is critical to decide how quickly and efficiently problems get resolved.
July 6, 2021
From network problems to computer failures, a variety of incidents can disrupt operations for systems in outer space.
Managing incidents like network outages and server failures is tough enough when they happen in a data center.
But when those failures occur in outer space, the challenges can become stratospheric. When you can’t rely on conventional monitoring and management tools, and even physical access is impossible, you need to think more creatively about incident management.
With that reality in mind, here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.
Deployed in 1990, the Hubble Space Telescope has been snapping pictures and collecting other data from outer space for decades.
But that all stopped in mid-June 2021, when the telescope’s “payload computer,” which manages data collection devices, stopped responding.
At first, NASA engineers attempted the most generic of mitigations: They simply rebooted the computer. That resolved the issue briefly, but it recurred shortly after.
As of press time, engineers are still investigating the incident, although NASA reports that a hardware failure is the likely cause. If the failed computer can’t be brought back online, the telescope can fall back to a backup computer that was installed in 2009 but hasn’t actually been used in production.
From a reliability management perspective, this incident is interesting in that it reflects both foresight and lack of planning on NASA’s part. The fact that NASA installed a backup computer is great, and it may well turn out to be critical for keeping the $4.6 billion telescope operational. Having a backup system already in place is especially important in this context because planning a space mission to deploy a new replacement computer could take years.
On the other hand, the assessment process that engineers have followed to troubleshoot problems with the primary computer seems a bit slow and disorganized. It doesn’t appear that NASA had a playbook in place for working through an incident like this.
Of course, when you’re dealing with a totally unique device like the Hubble Telescope, you can’t expect a playbook for everything.
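For the incidents you can anticipate, though, a playbook can be as simple as an ordered list of mitigations to try before escalating to humans. Here is a minimal sketch in Python. The step names and the stubbed `restart_primary` and `switch_to_backup` functions are hypothetical stand-ins for illustration, not NASA's actual procedures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Mitigation:
    """One playbook step: a name and an action that reports whether it worked."""
    name: str
    action: Callable[[], bool]


def run_playbook(steps: List[Mitigation]) -> Optional[str]:
    """Try each mitigation in order; return the name of the first that succeeds."""
    for step in steps:
        print(f"trying: {step.name}")
        if step.action():
            return step.name
    return None  # every step failed, so escalate to humans


# Hypothetical actions standing in for real procedures.
def restart_primary() -> bool:
    return False  # pretend the restart did not hold


def switch_to_backup() -> bool:
    return True  # pretend the backup came up cleanly


playbook = [
    Mitigation("restart primary computer", restart_primary),
    Mitigation("fail over to backup computer", switch_to_backup),
]
resolved_by = run_playbook(playbook)
print(f"resolved by: {resolved_by}")
```

Writing the order down ahead of time is the point: responders move from the cheapest mitigation to the most disruptive one without debating the sequence in the middle of an incident.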
When the power fails in your data center, you can fall back to generators.
The fix is not so simple when electricity fails on the International Space Station, which has lost power at least partially on a number of occasions in recent years: 2015, 2019 and 2020.
None of these incidents turned out to be fatal to Space Station crew members or caused permanent damage. But they did disrupt many of the Station’s operations. For example, the 2019 incident prevented a SpaceX cargo mission from launching as scheduled.
Recurring power issues have also contributed to calls for the Space Station to be decommissioned, although its future seems to be safe for now.
While SREs no doubt wish that the Space Station had a more reliable power supply, you can’t really fault engineers on this point. It’s not as if power generation in space is easy (the Space Station uses solar panels, but they are managed by complex infrastructure that is prone to failure), and when something does go wrong, obtaining the supplies to fix it is no simple task. But these are the types of reliability issues that will need to be solved if humans are ever to conquer the final frontier of space definitively.
Minimizing network latency is hard enough when your users are just hundreds of miles from your data center.
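But what if their traffic has to travel roughly 22,000 miles each way? That is the case for astronauts on the Space Station: although the Station itself orbits only about 250 miles up, its communications are relayed through satellites in geosynchronous orbit, roughly 22,000 miles above Earth. You end up with some pretty big latency issues, it turns out, due to the sheer physical distance that packets must travel between Earth and the Station.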
The good news is that Internet bandwidth, apparently, is relatively good in space -- so good that astronauts can video chat with their families. It’s just ping times that are bad.
This is a challenge that NASA is already working to solve by switching to a laser-based connectivity system, which is expected to deliver much lower latency than the satellite-based system astronauts currently use. But until then, space remains a prime reminder that even when bandwidth is excellent, latency problems can lead to a pretty poor end-user experience.
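To see why distance alone dominates ping times, here is a rough estimate of propagation delay at the speed of light. The two-hop relay path and the 500-mile terrestrial path are illustrative assumptions, not NASA's actual link design, and the figures ignore processing and queuing delays entirely.

```python
from typing import Iterable

# Back-of-the-envelope propagation delay; all distances are rough assumptions.
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # speed of light in vacuum, miles per second


def one_way_delay_ms(distance_miles: float) -> float:
    """Minimum time for a signal to cover distance_miles, in milliseconds."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_S * 1000


def round_trip_ms(hop_distances_miles: Iterable[float]) -> float:
    """Minimum round-trip time over a path made of the given hops."""
    return 2 * sum(one_way_delay_ms(d) for d in hop_distances_miles)


# Hypothetical path: ground station -> relay satellite -> spacecraft.
relay_path = [22_000, 22_000]
# Hypothetical terrestrial path: user -> data center 500 miles away.
ground_path = [500]

print(f"via relay:   {round_trip_ms(relay_path):.0f} ms minimum round trip")
print(f"terrestrial: {round_trip_ms(ground_path):.1f} ms minimum round trip")
```

Under these assumptions the relay path costs close to half a second per round trip before any equipment adds its own delay, while the terrestrial path costs only a few milliseconds. No amount of extra bandwidth changes that floor.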
Opportunity, a Mars rover that launched in 2003 and landed in 2004, used a pretty obvious strategy for generating power: solar panels. For years, those panels kept Opportunity happily roving across Mars’s surface, collecting all manner of scientific data.
But in 2018, something happened that engineers hadn’t fully counted on: a major dust storm blocked so much sunlight that the rover’s solar panels could no longer keep it powered. Although system architects had planned for this scenario by programming the rover to hibernate when power ran low, Opportunity never woke back up. In 2019, NASA officially declared the mission over.
To be sure, Opportunity had a good run, and it’s hard to plan for every type of incident that may threaten operations when you’re dealing with a device roaming across the Red Planet. Still, one might wish that the rover had been designed to be just a little more robust in the event of a major dust storm.
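The hibernate-on-low-power pattern itself is easy to sketch. The example below is a generic illustration in Python, not a description of Opportunity's flight software; the `Mode` states, the watt-hour thresholds and the simulated readings are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    HIBERNATING = "hibernating"


# Hypothetical thresholds, in watt-hours of stored energy.
SLEEP_BELOW_WH = 100.0
WAKE_ABOVE_WH = 250.0


def next_mode(mode: Mode, battery_wh: float) -> Mode:
    """Hysteresis: sleep when energy is low, wake only once it has clearly recovered."""
    if mode is Mode.ACTIVE and battery_wh < SLEEP_BELOW_WH:
        return Mode.HIBERNATING
    if mode is Mode.HIBERNATING and battery_wh > WAKE_ABOVE_WH:
        return Mode.ACTIVE
    return mode


# Simulated readings: a storm drains the battery, then power slowly returns.
readings = [400.0, 180.0, 90.0, 120.0, 200.0, 260.0]
mode = Mode.ACTIVE
history = []
for wh in readings:
    mode = next_mode(mode, wh)
    history.append(mode)
    print(f"{wh:6.1f} Wh -> {mode.value}")
```

The gap between the two thresholds keeps the system from flapping between modes on a marginal power supply. It also shows the weak spot in the design: waking up depends entirely on stored energy climbing back past the wake threshold, and if that never happens, the system stays asleep.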
Incidents in outer space may seem like the stuff of science fiction -- and they are, for most SREs. But for engineers tasked with managing the systems that humans have deployed into orbit or onto other planets, incident response is just as important as it would be in any earthly setting.