February 18, 2022
4 min read
An overview of how SREs can use feature flags to improve reliability.
When you think of who uses feature flags, your mind most likely goes to developers. In general, feature flags are closely associated with software engineering.
But Site Reliability Engineers (SREs) can benefit from feature flags, too. SREs may not be the ones who create feature flags, but they should work closely with developers to ensure that the applications their teams support include them.
Here’s why feature flags benefit SREs, and what SREs can do to make sure their organizations take full advantage of feature flags.
A feature flag – also sometimes called a feature toggle – is a technique in software engineering that allows developers to turn specific features on or off at application runtime. In other words, feature flags let you control which features are available within your application when it’s deployed to production.
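As a rough illustration, a runtime flag check can be as small as the sketch below. The flag name `new_checkout`, the in-memory `FLAGS` dictionary, and both functions are hypothetical; a real application would more likely read flag state from a configuration service or database so that it can change without a redeploy.

```python
# Hypothetical in-memory flag store, used here only for illustration.
FLAGS = {"new_checkout": False}


def is_enabled(name: str) -> bool:
    """Return the flag's current state; unknown flags default to off."""
    return FLAGS.get(name, False)


def checkout() -> str:
    # The flag is evaluated at runtime, on every call, so flipping it
    # changes behavior without shipping new code.
    if is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Setting `FLAGS["new_checkout"] = True` switches `checkout()` to the new path the next time it is called. Setting it back to `False` restores the legacy path.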
Feature flags can also be configured so that a feature is toggled on or off for specific sets of users rather than for the entire user base. This is useful if, for example, you want to turn on an experimental new feature for users who have agreed to test it, without exposing that feature to your general user base.
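One way to picture per-user targeting is a flag that carries a set of opted-in users alongside a global switch. This is a minimal sketch under assumed names (`Flag`, `beta_users`, `is_enabled_for`), not the data model of any particular feature flag product.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    # Hypothetical flag record: on for everyone, or only for opted-in testers.
    enabled_for_all: bool = False
    beta_users: set = field(default_factory=set)


def is_enabled_for(flag: Flag, user_id: str) -> bool:
    """A user sees the feature if it is on globally or they opted in."""
    return flag.enabled_for_all or user_id in flag.beta_users
```

With `Flag(beta_users={"alice"})`, the feature is on for `"alice"` and off for everyone else until `enabled_for_all` is set to `True`.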
Usually, developers and SREs use feature flags to turn features on and off manually. But you could automate the process by, for example, writing a script that automatically shuts off a feature if your observability tools determine that the feature is causing a problem for users.
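Such an automated shutoff might look roughly like the sketch below. Everything in it is an assumption made for illustration: the 5% error-rate threshold, the `flags` dictionary, and the `get_error_rate` callable, which stands in for whatever query your observability tooling actually exposes.

```python
from typing import Callable, Dict

ERROR_RATE_THRESHOLD = 0.05  # assumed threshold: act above 5% errors


def auto_disable(
    flags: Dict[str, bool],
    flag_name: str,
    get_error_rate: Callable[[str], float],
    threshold: float = ERROR_RATE_THRESHOLD,
) -> bool:
    """Turn the flag off if its error rate exceeds the threshold.

    Returns True only if this call switched the flag off.
    """
    if not flags.get(flag_name, False):
        return False  # already off or unknown: nothing to do
    if get_error_rate(flag_name) > threshold:
        flags[flag_name] = False
        return True
    return False
```

A script like this would typically run on a schedule or in response to an alert, and it should record when it disables a flag so that the on-call engineer knows a feature was switched off automatically.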
The main reason feature flags are popular among developers is that they make it easy to implement and experiment with new features while keeping the development pipeline simple.
Without feature flags, developers would have to create a new application release every time they wanted to implement new features. If any of the features in that release turned out to cause a problem, they’d have to remove the features from the code or fix the buggy features, then deploy a new release. This approach is tedious, and it leaves teams prone to delayed releases and application rollbacks.
But with feature flags, developers can continuously add new features to an application, without worrying that bugs within those features will lead to a rollback or a delayed release. If one feature has a problem, developers can simply turn it off using a feature flag, without having to do a new release.
It’s not just developers whose lives are easier thanks to feature flags. SREs can also use feature flags to support the core SRE goal: maximizing reliability.
That’s because feature flags minimize the risk associated with fast-moving software development pipelines. Feature flags allow teams to experiment with new application functionality – which is important if they want to keep pushing out application enhancements in order to delight users – while at the same time providing a safeguard against buggy feature implementations.
Imagine, for example, that you’re an SRE and you discover that the latest release of an application you support has a severe latency problem due to an issue with a new networking feature in the app. If that feature is controlled via a feature flag, you can remediate the issue very quickly by simply switching it off. You don’t have to wait for developers to fix the underlying code issue before you can solve the problem for your users.
This doesn’t mean, of course, that feature flags are a cure-all against reliability problems.
They won’t help you if the root cause of your reliability issue lies within your infrastructure rather than your software. If an app crashes because you misconfigured your Kubernetes clusters or you ran out of server capacity, for instance, you can’t simply turn the problem off with a feature flag.
Note, too, that feature flags can only be applied to specific features. In other words, they can only control application functionality if developers deliberately created a feature flag to control that functionality before releasing the app. If you discover a bug that’s not managed by a feature flag, the only way to remediate the issue is to redeploy the application (or microservice, if the problem is limited to just one microservice).
As an SRE, you most likely don’t have direct control over the software development process. You’ll therefore need to collaborate with developers to ensure that they build feature flags into their applications.
Doing this is simple enough. Start by educating developers about feature flags and making sure they understand the benefits. Be sure, too, to identify which features are most likely to create reliability issues, so that developers can prioritize creating feature flags for them. And consider factoring feature flags into your incident response playbooks, which will ensure that incident response teams remember to take advantage of feature flags when possible to remediate issues quickly.