July 21, 2021
It's time to break down the silos separating SREs from security engineers.
It's common today to talk about the "gap between security and development" or the "DevOps security disconnect." That makes good sense; there is indeed a need to de-silo security from development and DevOps processes.
What receives surprisingly little attention, however, is the disconnect between reliability engineering and security. For all that we talk about DevSecOps, we pay almost no heed to integrating security more centrally into the incident management work performed by SREs.
Here's why that is a problem, and what SREs can do to help address it.
When you're an SRE troubleshooting a reliability problem, security is probably not your first priority. Restoring service to your users as quickly as possible is your main focus. Your secondary focus is probably ensuring that the incident doesn't repeat itself.
If you think about security at all, it's probably not until well after the incident has been definitively resolved. Maybe security will come up during your postmortem, if you remember to perform one.
This tendency places SREs at odds with security engineers, whose main focus is on security, not reliability. If you're a security engineer, you want to make sure that whatever remediation SREs apply during incident management is the most secure resolution. Whether it's the fastest and most reliable remediation is not your problem (it's the SREs' problem).
The fact that reliability engineers and security engineers have different goals creates a challenge because at many organizations, these two groups don't typically work very closely with each other. SREs may interface well with developers and IT engineers, but not with the security team.
As a result, it can be a real challenge to ensure that incident management reinforces the priorities of both SREs and security teams. When these two groups don't talk to each other, you end up with SREs implementing remediations that may exacerbate security issues, while security engineers come along and make changes that could decrease reliability.
For example, imagine an SRE who needs to fix an application that has failed. To restore service, the SRE rolls back to an earlier release that is known to be stable. But because the SRE hasn't been talking to the security team, they don't know that the earlier release contains a security vulnerability. The result is a rollback that re-introduces a security problem to the production environment.
Later, the security team, which doesn't know about the application failure because it hasn't talked to the SREs, comes along and yells at IT for having an insecure release in production. So the newer, unstable release that had been rolled back gets redeployed. That fixes the security problem but re-introduces the reliability problem.
If you're trying to build a software delivery operation that is both secure and reliable, this isn't it.
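To make the rollback scenario concrete, here is a minimal sketch of what it could look like if the SRE's rollback decision drew on data from the security team. Everything in it is a hypothetical illustration: the `Release` record, the `open_advisories` field, the advisory ID and the version numbers are assumptions, not part of any real deployment tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    """Hypothetical release record shared by SRE and security teams."""
    version: str
    known_stable: bool  # reliability signal, owned by SREs
    # Advisory IDs the security team has flagged against this release.
    open_advisories: tuple[str, ...] = ()


def pick_rollback_target(history: list[Release]) -> Optional[Release]:
    """Return the most recent release that is stable AND has no open advisories.

    `history` is ordered oldest to newest. Returns None when no release
    satisfies both teams, which is the cue to escalate rather than roll back.
    """
    for release in reversed(history):
        if release.known_stable and not release.open_advisories:
            return release
    return None


# Illustrative data only.
history = [
    Release("1.4.0", known_stable=True),
    Release("1.5.0", known_stable=True, open_advisories=("ADV-001",)),
    Release("1.6.0", known_stable=False),  # the release that just failed
]

target = pick_rollback_target(history)
if target is None:
    print("No safe rollback target; escalate to the security team")
else:
    print(f"Roll back to {target.version}")
```

In this sketch, the rollback skips 1.5.0, which is stable but carries an open advisory, and lands on 1.4.0 instead. The point is not the specific check but that both signals live in one place, so neither team undoes the other's work.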
Beyond issues related to reliability and efficiency, the disconnect between SREs and security engineers can also have compliance consequences.
Among the core requirements of SOC 2, a common compliance standard for companies that provide SaaS applications and cloud services, is having procedures in place to identify and respond to security issues during incident response.
This means that failing to integrate security into incident response makes it harder to demonstrate to auditors and customers that your SRE team is prepared to manage incidents in a way that gives equal priority to security and reliability.
If you’re not sure how to stay compliant, compliance automation companies like Secureframe can help you create policies that include a security incident response plan, as well as ensure you have set up proper communication lines to receive disclosures and reports of incidents or system failures.
How do you fix this disconnect between security and reliability? Addressing the SOC 2 criteria is a starting point. But the broader answer is to extend DevSecOps to cover reliability engineering and incident management.
Traditionally, DevSecOps has focused on the idea that security engineers, developers and IT engineers need to work closely together. SREs have not been part of DevSecOps, at least not directly. At best, they were plugged in indirectly by interfacing with developers and IT.
If you make incident management part of your software delivery process, however, and you also embrace DevSecOps, it becomes much easier to ensure that SREs and security engineers work toward common goals, instead of working against each other.
Implementing DevSecOps for SREs involves a few key practices:
When teams have these resources in place, it becomes much easier for SREs to come under the umbrella of DevSecOps.
IT organizations have broken down many of their internal silos over the past decade, but not those separating SREs from security engineers. Now is the time to bridge that gap by giving SREs the communication and collaboration tools they need to work efficiently with security teams.