De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job
4 best practices for breaking down silos and establishing a culture of shared responsibility toward reliability.
February 4, 2022
From the first alert, through the response, to the post-incident review, great communication is the key to effective incident response.
When a production incident occurs, quick and efficient communication is key to a short MTTR. This involves more than communication among the engineers who triage and resolve the incident. Communicating with stakeholders both inside and outside your organization is equally important, but it happens at a different pace and cadence and needs to be handled differently. If communication breaks down, the result can be longer times to resolution, violated SLOs and SLAs, and even fines and lost revenue.
The first communications during an incident are the ones that go out when something breaks. They usually originate from one of three places:
The first scenario is the most desirable. It usually means that you discovered the issue before anyone else, which lets engineers get in front of the problem before it has a major impact on your customers. The second scenario is still acceptable, but should definitely be addressed as an action item in a post-mortem. The third scenario is the worst. It will usually bubble up to management and should be addressed with high priority after the incident is resolved, because it indicates a gap in your monitoring: a blind spot.
For the sake of brevity, let's assume that you were alerted by your monitoring tools. Ideally, you will have your communication channels set up so that each monitor alerts the correct group of people to handle the situation. This is usually your SRE team, the development team that "owns" that service, or both. Depending on the service tier and the importance of the service or the particular monitor, the alert could page, email, make a phone call or send an SMS, post a message in Slack, MS Teams, Mattermost, or whatever messaging tool your org uses, or use any other means of communication that is appropriate for that monitor. The important thing is that the alert is actually acknowledged and responded to. Communication is a two-way street. An alert that nobody acknowledges is just noise and should be eliminated, lest it lead to pager fatigue, which has repercussions that are a topic for another article.
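As a rough illustration of routing by service tier, here is a minimal Python sketch. The tier numbers, channel names, team names, and the acknowledgement threshold are all hypothetical assumptions made for the example; they do not come from any particular alerting tool.

```python
from dataclasses import dataclass

# Hypothetical policy: which channels each service tier should use.
# Tiers and channel names are illustrative assumptions only.
ROUTING_POLICY = {
    1: ["page", "phone", "chat"],  # most critical services
    2: ["page", "chat"],
    3: ["chat", "email"],          # low-urgency services
}


@dataclass(frozen=True)
class Alert:
    service: str
    tier: int
    owners: tuple  # teams to notify, e.g. ("sre", "payments-dev")


def route(alert):
    """Return the (team, channel) pairs this alert should be delivered to."""
    # Unknown tiers fall back to email so the alert is never silently dropped.
    channels = ROUTING_POLICY.get(alert.tier, ["email"])
    return [(team, channel) for team in alert.owners for channel in channels]


def is_noise(times_fired, times_acknowledged, min_ack_ratio=0.5):
    """Flag a monitor whose alerts are rarely acknowledged.

    The 0.5 threshold is an arbitrary example value, not a standard.
    """
    if times_fired == 0:
        return False
    return times_acknowledged / times_fired < min_ack_ratio
```

A monitor that `is_noise` flags is a candidate for tuning or removal rather than something to delete automatically; the ratio is only a starting point for a conversation with the owning team.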
If you have any status pages set up, these should be updated. Ideally, this is done automatically, but if the initial alert came through stakeholder communication, then you will need to do this manually. If you do not have status pages, then you should send an email to a group that includes all interested parties, both internal and external. Keep it brief and to the point, as you probably don't have many details to share yet. Make sure to say when you intend to send the next update and how frequently updates will follow. If you don't, you will have lots of people asking you for updates, which just creates confusion.
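A short first notice that states the update cadence might be generated along these lines. This is only a sketch: the function name, the wording of the message, and the 30-minute default are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def initial_notice(summary, started_at, update_every_minutes=30):
    """Build a brief first notice that tells readers when to expect updates.

    `started_at` is assumed to be a UTC datetime.
    """
    next_update = started_at + timedelta(minutes=update_every_minutes)
    return (
        f"We are investigating: {summary}\n"
        f"Detected at: {started_at:%Y-%m-%d %H:%M} UTC\n"
        f"Next update by: {next_update:%H:%M} UTC, "
        f"then every {update_every_minutes} minutes until resolved."
    )


if __name__ == "__main__":
    started = datetime(2022, 2, 4, 14, 0, tzinfo=timezone.utc)
    print(initial_notice("Elevated error rate on checkout", started))
```

Stating the next update time in the very first message is what keeps stakeholders from asking for status individually.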
After the appropriate people are paged, gathering them in one place and coordinating a response becomes the next priority. If possible, have automation in place that creates a dedicated messaging channel and a video conference, and directs everyone working on the incident to both. This will be your virtual war room. It is important that only the people who are working on the incident are here. Outside stakeholders and other people who are not involved only create confusion and distractions.
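The "responders only" rule can be sketched in a few lines of Python. Nothing here calls a real chat or video API; the channel naming scheme, the `WarRoom` type, and the redirect message are hypothetical and would be replaced by whatever your tooling provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarRoom:
    channel: str
    members: frozenset


def open_war_room(incident_id, paged_responders):
    """Describe the dedicated channel and who belongs in it."""
    # The "incident-<id>" naming scheme is an assumption for illustration.
    return WarRoom(
        channel=f"incident-{incident_id}",
        members=frozenset(paged_responders),
    )


def admit(room, person):
    """Let responders in; point everyone else at the stakeholder updates."""
    if person in room.members:
        return f"{person}: join #{room.channel}"
    return f"{person}: follow stakeholder updates instead"
```

Redirecting non-responders to the stakeholder updates, rather than simply refusing them, keeps them informed without adding noise to the response.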
The next task is to assign roles. The roles that we want to focus on here are the incident commander and the communications lead. The incident commander decides what needs to be communicated outside of the war room. The communications lead is the person who sends out these communications. Another important task for the communications lead is to remind the incident commander when an update is due, according to the cadence promised in the initial communication. Sometimes there is nothing new to report, but even that needs to be communicated to stakeholders. In some cases, the communications role can be subdivided into multiple roles, such as partner comms lead, social media comms lead, and a lead for any other channel that has concerned stakeholders.
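One way to keep the promised cadence honest is to track when the last update went out. The sketch below is an assumption-laden example: the `Incident` class, its field names, and the default "no news" wording are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Incident:
    commander: str
    comms_lead: str
    cadence: timedelta      # cadence promised in the initial communication
    last_update: datetime   # when the most recent update was sent
    sent: list = field(default_factory=list)

    def update_overdue(self, now):
        """True when the comms lead should prompt the commander for an update."""
        return now - self.last_update >= self.cadence

    def send_update(self, now, text=None):
        """Record an update; with no new details, send a holding message."""
        message = text or "No new information; investigation continues."
        self.sent.append((now, message))
        self.last_update = now
        return message
```

For example, with a 30-minute cadence and a last update at 14:00, `update_overdue` returns `False` at 14:20 and `True` at 14:30, and calling `send_update` without text still produces a message so that stakeholders hear something on schedule.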
Communication does not stop once the problem is mitigated. At this point, it usually falls on SRE to perform any tasks that remain. These usually include determining a root cause, assessing impact, identifying action items, and performing a retrospective on the incident response, typically in a post-mortem meeting.
After these tasks are completed, the root cause and impact need to be sent out to stakeholders. This not only informs concerned parties about what happened, but also helps build credibility for the support organization. Messaging here is key. You may want this communication to be reviewed by management or a PR person, as it may directly affect business relationships. A rule of thumb is, once again, to keep it concise and to the point, without too many technical terms. The audience for this communication is often not very technical.
Action items need to be communicated to the product and project management personnel for the relevant application teams. These usually need to have a high priority, and will need to be worked into a sprint, or whatever your organization uses for time and resource planning. Follow-up communications also need to happen, usually on a weekly cadence, until all action items are completed.
Lastly, it helps if everyone involved in the incident response understands what came out of the retrospective part of the meeting. Those findings should be integrated into your incident response documentation, and the changes should be communicated to anyone who may find themselves on an incident response team in the future.
Careful and thorough planning and thought need to go into your incident communications in order to avoid chaos during a crisis. From before an incident happens to after its resolution, both internally and externally, a process needs to be established, followed, and improved upon whenever possible. Although this will not stop things from breaking, it will help things get fixed much more quickly and efficiently. This is as much a key to the success of any SRE team as the best technical skills.
Mike Marchese is a site reliability engineering (SRE) manager at Ford Mobility (Spin). You can learn more about him on his website.