Blog

Incident management insights, guides, and product updates from Rootly

Practical Guide to SRE: Automating On-Call

Let's face it: on-call work isn't fun. But it can be better. Even if you have to be on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.

JJ Tang

May 6, 2021
8 min read
How Kubernetes Can Both Help and Hinder Incident Management Teams

Kubernetes makes reliability easier to manage in certain ways. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.

Quentin Rousseau

April 29, 2021
5 min read
Creating Chaos to Achieve Reliability

How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.

JJ Tang

April 22, 2021
5 min read
Should You Be an SRE or a DevOps Engineer?

SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.

Quentin Rousseau

April 15, 2021
5 min read
How Would an SRE Conduct a Postmortem on the Suez Canal Incident?

The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.

JJ Tang

April 7, 2021
8 min read
How SREs Can React to COVID-19's Impact on Incident Management

By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.

Quentin Rousseau

April 2, 2021
5 min read
Beginners Guide to Incident Postmortems

Successful and blameless postmortems can turn incidents into a gift of learning and prevent repeat mistakes.

Camille Hodoul

February 7, 2021
3 min read