28 Nov Continuous Improvement and Being Anti-Fragile in DevOps
“Continuous improvement is better than delayed perfection” – Mark Twain
Let me start by saying that it becomes more apparent every year how beneficial it is for organizations to adopt DevOps practices. This year's State of DevOps Report by Puppet has some incredible statistics on how DevOps high performers compare with low performers, such as:
- 46 times more frequent code deployments
- 440 times faster lead time from commit to deploy
- 96 times faster mean time to recover from downtime
- 5 times lower change failure rate
The statistic I want to focus on here is the 96 times faster mean time to recover from downtime. Everyone knows downtime is death. So how do high-performing DevOps organizations recover so much faster? One of the driving factors is Continuous Improvement and, more importantly, being Anti-Fragile. For those of you who have never heard the term Anti-Fragile (it's not a software-specific term), the basic idea is that since failure is inevitable, it's important to set your team up to absorb the failure, recover quickly, and then learn from it. Most successful organizations will say that if you aren't failing every now and then, you aren't trying hard enough.
With DevOps, failure isn't always a negative thing; it's a learning experience. High-performing teams assume things are going to hit the fan here and there, so they build for fast detection and rapid recovery. That's the key, and it's where Continuous Improvement comes in. For an excellent deep dive into this, I suggest you read up on Netflix's Chaos Monkey. Here is a question for you: do your postmortems turn into a bunch of finger pointing? If so, you're doing it wrong. A postmortem is a crucial time to discuss, as a functioning team, where your processes fell down and how to strengthen them moving forward. As Atlassian tends to say, Continuous Improvement and failure go hand in hand.
To understand whether your Continuous Improvement efforts are actually working, you need data. One of the biggest issues with data is that because there are tools to measure just about everything, teams tend to lose focus on the data that actually matters. Simply put, just because you can measure something doesn't mean you should.
So what data should you start with? Atlassian suggests you go back to agile development and collect the basics, such as:
- How often do recurring bugs or failures happen?
- How long does it take to go from development to deployment?
- How long does it take to recover after a system failure?
- How many people are using your product right now?
- How many users did you gain/lose this week?
By starting with these measurements, you can easily tell whether your Continuous Improvement efforts are working and how Anti-Fragile your team is becoming. Once you have mastered those data points and built a solid foundation, you can dive deeper into your measurements, fine-tune your Anti-Fragile team with Continuous Improvement, and be one step closer to a high-performing DevOps organization.
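To make the basic measurements above a little more concrete, here is a minimal Python sketch of how three of them could be computed from simple records. The `Deployment` and `Incident` record types, their field names, and the helper functions are all hypothetical, invented for illustration; your own tooling will store this data differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    # Hypothetical record: one change going from commit to production.
    committed_at: datetime
    deployed_at: datetime


@dataclass
class Incident:
    # Hypothetical record: one system failure and its recovery.
    cause: str
    started_at: datetime
    resolved_at: datetime


def mean_lead_time(deployments):
    """Average time from development (commit) to deployment."""
    seconds = mean(
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    )
    return timedelta(seconds=seconds)


def mean_time_to_recover(incidents):
    """Average time from the start of a failure to its resolution."""
    seconds = mean(
        (i.resolved_at - i.started_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def recurring_causes(incidents):
    """Failure causes seen more than once, with how often each occurred."""
    counts = Counter(i.cause for i in incidents)
    return {cause: n for cause, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data, purely for demonstration.
    deployments = [
        Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),
        Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13)),
    ]
    incidents = [
        Incident("expired certificate", datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
        Incident("disk full", datetime(2024, 1, 4, 10, 0), datetime(2024, 1, 4, 11, 0)),
        Incident("expired certificate", datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 30)),
    ]
    print("Mean lead time:", mean_lead_time(deployments))
    print("Mean time to recover:", mean_time_to_recover(incidents))
    print("Recurring causes:", recurring_causes(incidents))
```

Tracking these numbers week over week, rather than looking at a single snapshot, is what tells you whether recovery is getting faster and whether the same failures keep coming back.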