A look at outages and disruptions to the IT systems that power the Olympics, from 1996 to today.
Quentin Rousseau
It's time to break down the silos separating SREs from security engineers.
Rootly is on a mission to create a world where maintaining reliability is frictionless, delightful, and accessible to anyone, making resolving and learning from incidents every organization's superpower.
From chaos engineering to monitoring and beyond, SREs rely on several key types of tools to do their jobs.
Incident severity levels measure the impact an incident has on the business. Classifying the severity of an issue is critical to deciding how quickly and efficiently problems get resolved.
What are the differences between incident management and incident response? The answer varies widely depending on whom you ask.
Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. The question is: what are SLOs, and how do you determine what your SLOs should be? Once you've done that, how should you use them?
Kubernetes makes managing reliability easier in certain ways. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.
SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.
By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.