What SREs Can Learn from Facebook’s Largest Outage
An SRE’s analysis of the October 2021 Facebook outage.
May 13, 2021
Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. The question is: what are SLOs, and how do you determine what yours should be? And once you've done that, how should you use them?
Illustration by Ashton Rodenhiser
Over the last few years, many companies have adopted the principles and practices of Site Reliability Engineering (SRE) to build more resilient and reliable services. Google has been a major proponent of SRE and has published a number of books and articles on the subject. In the "SRE Book", Google discusses SRE principles, specifically SLIs, SLOs, and SLAs, and how they should be used to manage service reliability.
You might be saying, "Well, that's all good and fine. My question is what are all these terms and acronyms, and what am I supposed to do with them?"
Before we talk about Service Level Objectives (SLOs), and Service Level Agreements (SLAs), it's important to understand their precursor, Service Level Indicators (SLIs).
According to Google, Service Level Indicators are "a carefully defined quantitative measure of some aspect of the level of service that is provided."
In the SRE book, Google also discusses what they have termed The Four Golden Signals: latency, traffic, errors, and saturation.
Google likes to describe the last of these, saturation, as the "fullness" of a service. CPU, disk I/O, and memory usage are common examples of ways to measure saturation, but there are many metrics that can be used for this signal.
If you monitor nothing else, the four golden signals would be the bare minimum to use for determining service health and would be the starting place for choosing your SLIs.
Service Level Objectives (SLOs) are targets for service quality and availability, expressed in terms of SLIs such as the golden signals above. In general, SLOs should be reasonable and achievable. An SLO also needs a time frame: a target only has meaning when it is measured over a specific period.
Asking for 99.999% uptime (only about 5 minutes and 15 seconds of downtime per year) is often not a reasonable SLO outside the most demanding or critical industries, if only because of the amount of resources, human or otherwise, it can take to achieve an SLO that stringent.
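To make the arithmetic behind an availability target concrete, here is a minimal Python sketch. The function name `allowed_downtime` and the 365-day year are assumptions made for illustration; they are not taken from the SRE book.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return how much downtime an availability target permits over a period.

    Hypothetical helper: assumes a 365-day year unless another period is given.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailable_fraction = (100 - slo_percent) / 100
    return period * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime(target)} of downtime per year")
```

Running it for a few targets shows how quickly the allowance shrinks with each added "nine", which is why the most stringent targets are so expensive to meet.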
There will always be some latency in an application, so setting an unattainably low latency target doesn't make sense. Under 500 milliseconds over a period of 10 minutes may be a reasonable SLO for some services, with 300 milliseconds or less over the same period being even better.
The same goes for errors. It is almost certain that an application with any normal amount of traffic will never be entirely free of errors. More likely, one would want to set the SLO for the error rate somewhere around 1-5% over five or ten minutes, but your mileage may vary depending on the application.
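As a rough sketch of how latency and error-rate targets like these might be evaluated for one window of traffic, consider the Python below. The `Request` type and `window_meets_slos` function are hypothetical, and using the mean latency is an assumption; a real service might prefer a percentile.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    is_error: bool


def window_meets_slos(requests: Sequence[Request],
                      max_avg_latency_ms: float = 500.0,
                      max_error_rate: float = 0.05) -> bool:
    """Check one window (e.g. 10 minutes) of requests against both targets."""
    if not requests:
        # Assumption: a window with no traffic violates neither target.
        return True
    avg_latency = mean(r.latency_ms for r in requests)
    error_rate = sum(r.is_error for r in requests) / len(requests)
    return avg_latency < max_avg_latency_ms and error_rate <= max_error_rate
```

The caller is expected to pass only the requests that fall inside the window being evaluated; how those are collected depends on the monitoring system in use.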
Service Level Agreements (SLAs) are different from, but based on, SLOs. Unlike SLOs, SLAs are typically agreements with paying customers that a service will meet certain availability targets. Often, SLAs are contractual in nature, with a refund or other penalty due for any time that a service falls outside of the SLA.
Since a deeper dive on constructing SLAs isn't the topic of this article, we'll plan on covering them in another post. Suffice it to say that if you have SLAs with customers, they will be based on some form of SLOs, and violating them is obviously not desirable due to customer impact.
We've made up a fictional company where we like to illustrate Site Reliability and DevOps principles in practice. We named it "Acme Technologies LLC".
Acme provides a SaaS application to millions of customers. Due to reliability issues in the past, Acme has recently started an SRE program, and hired a Site Reliability Manager to implement that program.
Since Acme had been having so many ongoing issues, one of the first things the Site Reliability Manager did was to start holding blameless postmortems to help analyze and mitigate some of the outages that had been occurring. Doing so was very successful for Acme, and many long-standing problems were reviewed and fixed.
It was time to start setting reliability goals for the teams that build the various services which comprise the Acme application to further improve availability. To keep things simple in the beginning, the Site Reliability Manager decided the SLOs would be based on Google's four golden signals.
She sat down with each team and reviewed the performance of their services over time, based on latency, traffic, errors, and saturation. While meeting with all of the teams, they discovered that not all services were built equally.
For the backend teams, their customers were the frontend teams and services. When it came to the frontend teams, it was the web clients and end users who were their customers.
Some backend services had a higher latency than other services, but this didn't adversely affect the frontend services. By comparison, the frontend services needed to respond much more quickly to keep the end users happy.
Based on the data they had, everyone decided that each team would establish an individual latency SLO for their service that was, at a minimum, tolerable for their service's consumers. Some services would meet a latency SLO of 300ms averaged over the course of an hour, and some would meet an SLO of 200ms over that same time frame.
When it came to traffic, things were most definitely different for each service. The frontend services almost always had the highest traffic, but a couple of critical backend services that almost everything else depended on handled even more.
In this case, all of the teams resolved that a hard SLO for traffic wouldn't be set, but that alerts would be triggered to page someone if the traffic fell far outside what was considered normal for a given service.
Since traffic varied over the course of a day, this became a sliding window where any deviation of twenty-five percent during the course of 15 minutes would page someone. Most often, this would be in HTTP requests, but in some cases it would be concurrent connections to the service.
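A traffic alert along these lines could be sketched as follows. The article does not say how Acme derived its "normal" traffic level, so the `baseline` argument here is an assumption (for example, the typical request count for the same 15-minute slot on previous days), and the function names are hypothetical.

```python
def traffic_deviation(observed: float, baseline: float) -> float:
    """Relative deviation of observed traffic from the expected baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline


def should_page_on_traffic(observed: float,
                           baseline: float,
                           threshold: float = 0.25) -> bool:
    """Page when traffic in the window deviates by the threshold or more.

    Both spikes and drops count, since either can signal a problem.
    The counts can be HTTP requests or concurrent connections.
    """
    return traffic_deviation(observed, baseline) >= threshold
```

Raising `threshold` or adjusting `baseline` for a planned event is one way to accommodate an anticipated spike without paging anyone.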
With experience, each team learned to adjust their alerts based on expected traffic spikes until pages were only triggered when something really unusual happened. When there was an anticipated traffic spike for a special event, alerts might be adjusted to accommodate the increase for a certain period of time.
When it came to error rate, the teams almost universally agreed that it should remain fairly low. An overall error rate of <5% over a time period of 10 minutes was considered acceptable until they had more time to analyze the types and frequency of each error and decide on a better SLO.
Saturation varied pretty wildly between services, and was often tied to traffic. While the traffic alerts could be considered a warning of an impending problem, high saturation of the application's most constrained resources was considered a serious problem since that meant a service might soon fail.
Each service was analyzed and tested to the point of failure. Metrics were collected based on the number of requests to the system, concurrent client connections, CPU, disk I/O, memory usage, bandwidth, and so forth.
Where there was a critical system resource required to properly service requests, an SLO was established for the metric associated with that resource. The SLO was based on what was considered a baseline for the service, so that when a service metric fell out of the baseline, alerts would be triggered to involve a human to investigate.
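One simple way to express per-resource baselines is a mapping from resource name to the highest value considered normal. The sketch below is illustrative only: the resource names, the limits, and the `breached_resources` helper are all hypothetical.

```python
from __future__ import annotations


def breached_resources(current: dict[str, float],
                       baseline_max: dict[str, float]) -> list[str]:
    """Return the resources whose current usage exceeds their baseline limit.

    Resources missing from `current` are treated as unused (assumption).
    """
    return sorted(
        name
        for name, limit in baseline_max.items()
        if current.get(name, 0.0) > limit
    )


# Hypothetical limits, e.g. derived from load testing a service to failure.
BASELINE_MAX = {"cpu_percent": 80.0, "memory_percent": 75.0, "disk_io_percent": 70.0}

if __name__ == "__main__":
    usage = {"cpu_percent": 91.5, "memory_percent": 60.0, "disk_io_percent": 72.0}
    print(breached_resources(usage, BASELINE_MAX))
```

A non-empty result would be the trigger to involve a human.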
All of the Acme development teams had now established their SLOs. They decided to set alerts based on their SLOs, so that if they were at risk of violating an SLO, the team would be paged to respond and investigate.
The teams also decided that pages would only be sent on actionable alerts. As an example, if there was a slight spike in latency that wouldn't cause one team's 300ms average over 60 minutes to be violated, a page wouldn't be sent out. However, if there was a more sustained spike that caused that average to go outside of their SLO, the engineer on call would be paged and they would dig in to find out what was happening.
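The difference between a slight spike and a sustained one can be illustrated with a rolling average. This is a minimal sketch, assuming one latency sample per minute with roughly equal request volume in each minute; the class name `RollingLatencySLO` is hypothetical.

```python
from collections import deque


class RollingLatencySLO:
    """Track a rolling average of per-minute latency against an SLO."""

    def __init__(self, slo_ms: float = 300.0, window_minutes: int = 60):
        self.slo_ms = slo_ms
        # Samples older than the window are dropped automatically.
        self.samples = deque(maxlen=window_minutes)

    def record(self, minute_avg_latency_ms: float) -> bool:
        """Add one per-minute sample; return True if on-call should be paged.

        Until the window fills, the average covers only the samples seen so far.
        """
        self.samples.append(minute_avg_latency_ms)
        rolling_avg = sum(self.samples) / len(self.samples)
        return rolling_avg > self.slo_ms
```

With this approach, a single slow minute in an otherwise healthy hour barely moves the average, while many slow minutes in a row eventually push it past the SLO and trigger a page.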
Once the alerts were normalized, there were fewer spurious alerts. The Site Reliability Manager began discussing the use of error budgets with the teams to help them balance prioritizing development work alongside reliability work.
Each team was given a budget for a number of "allowable" errors, based on the previously established SLOs. So long as they didn't exceed their error budget, development work could continue normally. If a team exceeded its error budget, or was at risk of doing so, it would stop any ongoing new feature work to deal with the reliability issues that were cropping up.
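An error budget of this kind can be sketched as below, using the 5% error-rate figure from earlier as the default. The `ErrorBudget` class and the `at_risk_fraction` cutoff (how much remaining budget counts as "at risk") are assumptions for illustration, not Acme's actual policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one SLO period (hypothetical model)."""
    total_requests: int
    failed_requests: int
    max_error_rate: float = 0.05  # allowed fraction of failed requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO permits for this traffic volume."""
        return self.total_requests * self.max_error_rate

    @property
    def remaining(self) -> float:
        """Failures still available; negative once the budget is overspent."""
        return self.allowed_failures - self.failed_requests

    def feature_work_allowed(self, at_risk_fraction: float = 0.1) -> bool:
        """False once the budget is spent or less than at_risk_fraction is left."""
        return self.remaining > self.allowed_failures * at_risk_fraction
```

For example, a team with 100,000 requests and 1,000 failures still has most of its budget and keeps shipping features, while a team with 4,800 failures is close enough to the limit that reliability work takes priority.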
Additional automation was added to help manage things like the on-call schedules and alert routing, easing some of the manual toil involved in administering the systems. The teams began to investigate ways to automate some of the more common tasks that would come up, such as managing communication channels, generating work tickets, and gathering postmortem data.
After a while, it became a matter of pride that a team would meet their SLOs and work to make them better. Acme's software engineering teams worked hard to maintain their service reliability. Over time, the SLOs were adjusted as certain reliability milestones were met.
Of course, this post doesn't take into account all of the mitigations that each service team would need to implement to meet their SLOs. In some cases, this meant implementing autoscaling of systems to meet varying traffic needs. With others, teams worked to reduce latency through code optimization. For some, it meant implementing better error handling to reduce the error rate.
Additional metrics were added, each with their own SLOs, to help the company further meet SLAs with its customers. End users were happier, and over time, Acme saw an increase in the reliability of the services it provides.
From May 17-20, 2021, Rootly is sponsoring SLOconf hosted by Nobl9, a conference designed for while-you’re-working participation, where you can learn more about using SLOs to improve your application reliability.