The Role of SREs in Observability
Although conversation about observability often ignores SREs, SREs have a central role to play in observability success.
April 29, 2021
Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.
By now, you've probably heard all about the Kubernetes "revolution" and how Kubernetes has forever changed the way developers write, and IT teams manage, microservices applications.
But what does Kubernetes mean for incident management? Which new opportunities and challenges does it create for SREs? Those questions come up far less often in conversations about Kubernetes.
Yet they're critical questions for any team that wants to use Kubernetes not just to orchestrate applications, but to orchestrate them reliably.
With that reality in mind, here's a look at what Kubernetes means for incident management teams and SREs.
Kubernetes (K8s for short) is, of course, an orchestration platform that helps deploy and manage applications -- primarily those that run inside containers, although with add-ons such as KubeVirt it can orchestrate VM-based apps as well.
Kubernetes is available in many shapes and sizes, from managed Kubernetes services that run in the public cloud, to distributions that you can set up yourself on an IaaS platform or on-prem environment of your choosing, to lightweight variants that will run on your humble laptop.
From the perspective of incident management, however, it doesn't matter much how or where you deploy Kubernetes. The points below apply to any Kubernetes environment.
You can break Kubernetes's impact on incident management into two categories: opportunities (ways in which Kubernetes makes life easier for SREs and incident response teams) and challenges (problems or complexities that Kubernetes introduces to incident management).
On the opportunities front, perhaps the biggest benefit of Kubernetes from an SRE's perspective is that Kubernetes introduces automations that help make application environments more reliable.
Granted, it would be a stretch to call Kubernetes a reliability tool. Its primary job is to orchestrate complex applications, not manage their reliability.
That said, orchestration and reliability go hand-in-hand to a certain extent. Kubernetes can automatically restart failed containers and reschedule workloads from one host server (or node, in Kubernetes-speak) onto another when the first server starts to fail. Since Kubernetes 1.8, a cluster autoscaler has also been available that automatically adjusts the number of nodes in a cluster based on application needs, at least within public clouds. (The autoscaler relies on infrastructure that can add and remove nodes on demand, which on-prem environments generally don't offer out of the box, so this is one case where your Kubernetes deployment method does affect reliability to a certain degree.)
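To make the self-healing idea concrete, here is a toy model in Python of the reconcile pattern that Kubernetes controllers follow: compare the desired number of replicas with what is actually running, and start replacements when a node disappears. ToyCluster, fail_node and reconcile are made-up names for illustration, and the least-loaded placement rule is a simplification. None of this is Kubernetes's actual code.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    # Healthy nodes only: node name -> names of the pods running on it.
    nodes: dict = field(default_factory=dict)
    # Source of unique pod suffixes (hypothetical naming scheme).
    _ids: count = field(default_factory=count, repr=False)

    def running(self, app):
        """Return the names of all running pods that belong to `app`."""
        return [p for pods in self.nodes.values() for p in pods
                if p.startswith(app + "-")]

    def fail_node(self, name):
        """Simulate a node failure: the node and its pods vanish."""
        return self.nodes.pop(name, [])


def reconcile(cluster, app, desired):
    """Start replacement pods until the observed count matches `desired`."""
    started = []
    while cluster.nodes and len(cluster.running(app)) < desired:
        # Place each new pod on the least-loaded healthy node.
        target = min(cluster.nodes, key=lambda n: len(cluster.nodes[n]))
        pod = f"{app}-{next(cluster._ids)}"
        cluster.nodes[target].append(pod)
        started.append(pod)
    return started


cluster = ToyCluster(nodes={"node-a": [], "node-b": []})
reconcile(cluster, "web", desired=3)        # initial rollout
cluster.fail_node("node-a")                 # a node goes down
replacements = reconcile(cluster, "web", desired=3)
print(len(replacements), "replacement pods started")
```

The point for incident responders is that this loop runs continuously without a human in it, so a single failed node often never becomes a page.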
You could also argue that by making it easier to deploy microservices-based apps, Kubernetes helps teams move to architectures that are inherently more reliable, at least to the extent that microservices tend to be more resilient (not to mention easier to update when you need to roll out a fix) than monoliths.
In short, then, Kubernetes injects a certain amount of reliability into application architectures and hosting environments. Kubernetes on its own is certainly no guarantor of total reliability, but it does some things to support reliability that would not be available in conventional application environments.
At the same time, however, Kubernetes can make life harder in certain ways for SREs and incident management teams.
For starters, consider the observability data that teams need to make informed decisions when responding to an incident. Kubernetes generates logs and event data, but its logging architecture is complex, to put it mildly. Collecting and analyzing data from the various logs that different Kubernetes services generate -- not to mention pulling log data from containers before the containers shut down and their logs disappear forever -- takes some doing.
This means that investigating reliability problems in Kubernetes can be more challenging than in a conventional environment. That's especially true for teams that don't have a streamlined Kubernetes observability solution in place and an easy way to share observability data across the team.
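As a small illustration of the log-collection problem, the sketch below builds (but does not run) a kubectl logs command in Python. The --previous flag asks for the logs of the prior, terminated instance of a container, which is often what you want after a crash. The helper name kubectl_logs_cmd and the pod and namespace names are hypothetical.

```python
import shlex


def kubectl_logs_cmd(pod, namespace="default", container=None,
                     previous=False, since=None):
    """Build a `kubectl logs` command as an argument list (nothing is executed)."""
    cmd = ["kubectl", "logs", pod, "--namespace", namespace, "--timestamps"]
    if container:
        # Needed when the pod runs more than one container.
        cmd += ["--container", container]
    if previous:
        # Logs from the previous, terminated instance of the container.
        cmd.append("--previous")
    if since:
        # A relative duration such as "5m" or "1h".
        cmd.append(f"--since={since}")
    return cmd


cmd = kubectl_logs_cmd("checkout-7d9f", namespace="shop", previous=True)
print(shlex.join(cmd))
```

Keep in mind that --previous only helps while the pod object still exists. Once the pod itself is deleted, its logs are gone from the node's point of view, which is why shipping logs to a central store matters so much in Kubernetes.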
Deploying application updates to Kubernetes is also more complicated than in a traditional environment. Rather than just deploying the new version of your application directly into the environment, you need to update the manifest (typically a YAML file) for what Kubernetes calls a Deployment so that it references the new release. You may also need to make changes to networking, storage and other configurations within Kubernetes to support the update.
These changes are not necessarily too difficult to make. But they may require special Kubernetes expertise that not every engineer has.
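For a sense of what such a change involves, here is a minimal Python sketch that takes a Deployment manifest (already parsed into a dict) and returns a copy pointing at a new container image. The field layout under spec.template.spec.containers follows the apps/v1 Deployment format; the function name, the "checkout" app and the registry URL are invented for the example.

```python
import copy


def with_new_image(deployment, container_name, image):
    """Return a copy of a Deployment manifest with one container's image replaced."""
    updated = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    containers = updated["spec"]["template"]["spec"]["containers"]
    matches = [c for c in containers if c["name"] == container_name]
    if not matches:
        raise KeyError(f"no container named {container_name!r} in the pod template")
    matches[0]["image"] = image
    return updated


# Hypothetical manifest for an app called "checkout".
manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "checkout"}},
        "template": {
            "metadata": {"labels": {"app": "checkout"}},
            "spec": {
                "containers": [
                    {"name": "checkout",
                     "image": "registry.example.com/checkout:1.4.1"}
                ]
            },
        },
    },
}

patched = with_new_image(manifest, "checkout", "registry.example.com/checkout:1.4.2")
print(patched["spec"]["template"]["spec"]["containers"][0]["image"])
```

In practice the patched manifest would be written back to version control and applied to the cluster, and that is the step where Kubernetes-specific knowledge tends to be needed.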
For incident response, this means that if you need to deploy an application update (or do a rollback) on Kubernetes to fix a problem, a lack of Kubernetes expertise within the on-call team must not get in the way of smooth remediation. It's critical that team members can delegate responsibilities and collaborate with each other so that someone with the right knowledge is always within reach.
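One way to reduce that dependency on individual expertise is to write the rollback steps down ahead of time. The Python sketch below generates the kubectl rollout commands an on-call engineer might run, in order: inspect the revision history, undo, then watch the rollout settle. It only builds the commands and does not execute them. The function name and the "checkout" and "shop" names are hypothetical.

```python
import shlex


def rollback_cmds(deployment, namespace="default", to_revision=None):
    """Return the kubectl commands for a Deployment rollback, in runbook order."""
    target = f"deployment/{deployment}"
    ns = ["--namespace", namespace]
    undo = ["kubectl", "rollout", "undo", target, *ns]
    if to_revision is not None:
        # Without this flag, kubectl rolls back to the previous revision.
        undo.append(f"--to-revision={to_revision}")
    return [
        ["kubectl", "rollout", "history", target, *ns],  # what can we go back to?
        undo,                                            # perform the rollback
        ["kubectl", "rollout", "status", target, *ns],   # wait for it to complete
    ]


for cmd in rollback_cmds("checkout", namespace="shop", to_revision=7):
    print(shlex.join(cmd))
```

Putting a snippet like this in a runbook means the person holding the pager does not have to remember the exact syntax at 3 a.m.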
Finally, the sheer complexity of Kubernetes itself makes incident response more challenging. Kubernetes is not just one tool or service, but a multi-layered collection of different components (the API server, the scheduler, the kubelet on each node and so on) that do different jobs. If something goes wrong within your application environment, tracing the problem back to the specific component of Kubernetes that caused it (assuming the issue indeed lies within Kubernetes) is hard work. It's also another example of a task that may require specialized expertise.
Here again, then, ensuring that your incident response team is equipped with the knowledge necessary to troubleshoot quickly on Kubernetes -- or that it can communicate effectively with other engineers who have that knowledge -- is essential.
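Some of that troubleshooting knowledge can be captured as a first-pass heuristic. The Python sketch below looks at a pod object (the JSON you would get from kubectl get pod -o json, parsed into a dict) and guesses which layer to investigate first. The status fields and reason strings are standard Kubernetes ones, but the function name and the mapping from reasons to layers are a rough assumption for illustration, not an official diagnostic procedure.

```python
def suspect_layer(pod):
    """Guess where to look first, based on a pod's reported status."""
    status = pod.get("status", {})

    # A pod that was never scheduled points at the scheduler or cluster capacity.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return "scheduling or cluster capacity"

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting", {}).get("reason", "")
        if waiting in ("ImagePullBackOff", "ErrImagePull"):
            return "image or registry"
        if waiting == "CrashLoopBackOff":
            return "application"
        # The last terminated instance may explain repeated restarts.
        last = cs.get("lastState", {}).get("terminated", {}).get("reason", "")
        if last == "OOMKilled":
            return "memory limits"

    return "unknown"


crashing = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "checkout",
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ],
    }
}
print(suspect_layer(crashing))
```

A heuristic like this will not find the root cause on its own, but it can tell a responder whether to start with the application team or with whoever runs the cluster.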
The bottom line: Kubernetes can be both a boon and a burden to incident management. If you're part of an incident management team, your goal should be to encourage your organization to take advantage of Kubernetes in order to add inherent reliability to your application architecture and hosting environment. At the same time, however, you must also ensure that you have the Kubernetes-centric visibility and expertise on hand to remediate issues efficiently when they arise.
Rootly helps incident management teams work efficiently with Kubernetes-based environments. By automatically tracking the state of your Kubernetes workloads and cluster components, Rootly provides K8s-specific contextual data that makes it faster and easier to pinpoint the source of reliability issues in Kubernetes environments.