Why and How SREs Can Benefit from Feature Flags
An overview of how SREs can benefit from feature flags to improve reliability.
August 5, 2022
Millions of Canadians offline. For SREs, the Rogers outage is a lesson in the importance of testing updates, building redundant infrastructure and having a crisis communications plan.
When, eight years from now, folks are creating lists of the top IT incidents of the 2020s, there's a good chance that they'll include the Rogers outage of 2022. The failure, which made Internet and cellular network service unavailable for more than 12 million users across Canada, was one of the most significant outages in memory, in terms of both the number of affected accounts and the level of service disruption.
For Site Reliability Engineers, there's a lot to learn from this incident, from both a technical perspective and a crisis-management perspective. Here's a look at our main takeaways.
Before diving into lessons for SREs, let's go over what happened during the Rogers outage and what we know about the cause of the incident.
Rogers is a major telecommunications provider that serves around 12 million customers in Canada. Early on July 8, it suffered an infrastructure incident that made Internet and wireless phone service unavailable for most or all of its customers. The outage was so complete that users couldn't even make 911 calls or withdraw money from ATMs, which means the failure brought down not just ordinary services but also resources that are essential to daily life.
Connectivity was restored for most users within about fifteen hours. However, some customers reported continuing disruptions for several days after the beginning of the outage.
At first, Rogers did not reveal much technical detail about what triggered the incident. A statement attributed to the company's CEO the day following the outage said that a "maintenance update in our core network…caused some of our routers to malfunction." But it didn't explain specifically which type of maintenance update the company was trying to apply, why the update didn't go as planned or which types of routers were affected.
It wasn't until several weeks later, when Rogers submitted a response to questions from the Canadian Radio-television and Telecommunications Commission, that more technical detail came to light. Rogers explained to the CRTC that, during an update process, the company ran code that deleted a routing filter. It's still unclear exactly how the bad code made its way into the update routine.
For SREs, the Rogers outage is an example of a worst-case scenario. It was an extremely disruptive incident, and given how severe it was, it took a relatively long time to mitigate.
That's why SRE teams should use Rogers' experience to assess what they can do to avoid issues like this, and to manage them more effectively when they do happen.
Test your updates
From a technical perspective, one of the biggest takeaways from the Rogers outage for SREs is that you should test updates thoroughly before applying them. If engineers had tested their update routine more carefully, they would presumably have detected the bad code that deleted the routing filter before the deletion actually took place.
To be fair, Rogers is hardly the first company to experience a major outage due to a bad update. Atlassian, for instance, had a similar issue earlier this year, when a buggy migration process caused a long-lasting outage.
If you're an SRE, you don't want your company to be the next one that makes headlines when something that should be a simple change, like an update, triggers a massive outage. Test, test and test again before bringing your changes into production.
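One way to make that kind of testing concrete is a pre-deployment check that compares the state before and after a proposed update and refuses to proceed if anything protected would disappear. The sketch below is a minimal illustration, not a description of Rogers' tooling: the `check_update` function, the `ChangeCheck` type and the filter names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeCheck:
    """Result of comparing the current state with the proposed state."""
    safe: bool
    removed_filters: frozenset


def check_update(current_filters, proposed_filters):
    """Flag any routing filter that exists now but is missing after the update."""
    removed = frozenset(current_filters) - frozenset(proposed_filters)
    return ChangeCheck(safe=not removed, removed_filters=removed)


# Hypothetical filter names, for illustration only.
current = {"limit-inbound-routes", "deny-bogons"}
proposed = {"deny-bogons"}

result = check_update(current, proposed)
if not result.safe:
    print("Blocked: update would delete", sorted(result.removed_filters))
```

A check like this could run in a staging environment or as a gate in the deployment pipeline, so that an unexpected deletion stops the rollout instead of reaching production.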
Build in redundancy
It also seems reasonable to conclude that Rogers didn't have redundancy built into its infrastructure in the form of backup routers that could take over when the primary routers crashed due to the bad update.
This is another reminder to SREs about why it's so critical to avoid putting all of your eggs in one basket or creating single points of failure. Redundancy may cost more, and it may not totally prevent disruptions. But it can make them less severe and easier to recover from. It's not hard to imagine the Rogers incident having been a lot less severe if even just 20 or 30 percent of its traffic could have been automatically redirected to backup routers. That might have been enough to keep critical services operational, at least.
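To put rough numbers on that idea, the sketch below estimates what share of traffic the surviving routers could carry after a failure. The router names, capacities and demand figure are made up for illustration, and `servable_fraction` is a hypothetical helper, not part of any real capacity-planning tool.

```python
def servable_fraction(demand, capacities, healthy):
    """Return the share of demand the healthy routers can carry, capped at 1.0."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    available = sum(cap for name, cap in capacities.items() if name in healthy)
    return min(1.0, available / demand)


# Illustrative capacities in arbitrary units; not real Rogers figures.
capacities = {"primary-a": 50, "primary-b": 50, "backup-a": 25}

# Both primaries fail and only the backup remains healthy.
print(servable_fraction(100, capacities, healthy={"backup-a"}))  # 0.25
```

Running the same calculation for each plausible failure scenario shows whether the remaining capacity would be enough to keep critical services, such as emergency calls, online.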
Have an external crisis communications plan
Rogers has received more than a little criticism for not communicating well with its customers during the outage. The company waited hours before releasing an initial statement about the failure. It was not very responsive to media queries early on, and, as noted above, it took several weeks before Rogers finally released technical details about what went wrong (and even then, the details emerged only because of a government inquiry into the incident).
The lack of rapid, transparent communication probably wasn't deliberate. We have to believe that Rogers didn't want to leave customers in the dark. Instead, the slow communication likely reflects the absence of a pre-planned strategy for external incident communication in a situation where Rogers' own networks were non-operational. Without a communications plan, Rogers presumably struggled to decide which information to share, or how to share it, in the midst of a major crisis.
If Rogers had had a playbook in place for communicating with the public and media about this type of issue before its network went down, it probably would have been able to share information more effectively – and it may not have faced as much of a backlash from customers and regulators.
Internal communications, at least, presumably were smoother. The company did manage to get service back up for most customers within about fifteen hours, which probably would not have happened if its engineers lacked a means of communicating with each other as they worked to resolve the incident.
For SREs, the Rogers outage of 2022 is a lesson in the importance of testing, designing redundant infrastructure and preparing for transparent communications before a crisis strikes. While more careful planning in these respects may not have totally prevented the incident, it would likely have at least reduced its severity and mitigated the backlash that Rogers now faces from angry customers and regulators.