In conversation with
The CEO and co-founder of Verica on using chaos engineering to navigate complex systems
There’s no business value in breaking services that your customers rely on.
I think “move fast and break things” is a phrase that sounds cool, and makes a lot of sense for disrupting old ways of doing business. But there's no business value in breaking services that your customers rely on. So we really need to uncouple that from the notion of chaos engineering—chaos engineering is not engineering chaos into the system. It's: hey, if you have a complex system, you already have chaos. How do you engineer around that? How do you navigate the complexity, instead of succumbing to it or being surprised by it?
Your prep and follow-up are the most important parts of the experiment.
The preparation you do before a chaos engineering experiment can often be the most valuable part of the exercise. Interviewing stakeholders, interviewing engineers, just going through the process of, “OK, if this happens, what do we think this system is going to do?” Often when we're asked how an organization should get started with chaos engineering, we say, “Find your lead engineer in the part of the organization you're concerned about, or in the group that you think carries most of the weight of the infrastructure.” Then we ask them what they lose sleep over. Nine times out of ten, they say, “ZooKeeper—if that goes down, I don't know what's going to happen.”
Something like that may or may not lead to a chaos game day. But going through the exercise—“OK, imagine we're doing a chaos engineering game day. We'll limit the blast radius and then take down that ZooKeeper. What do we think would happen?”—and just walking through those steps can be incredibly enlightening. You're trying to teach the people who maintain, operate, and build the infrastructure more about their infrastructure than they currently know. A lot of that can happen in the actual process of conducting the experiment, but a lot of it can happen just in the prep.
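The game-day loop described above—state a hypothesis about steady state, limit the blast radius, inject the failure, compare expectation to observation, and restore—can be sketched as a small harness. This is a hypothetical illustration only, not Verica's or Netflix's tooling; every name here is made up, and the ZooKeeper-style "node" state is a toy stand-in for a real cluster:

```python
# Hypothetical sketch of a game-day experiment loop. Real platforms
# (e.g. ChAP) automate steps like these against live traffic.

def run_experiment(steady_state, inject_failure, restore, within_blast_radius):
    """Record the hypothesis, inject the failure, and compare expectation
    to observation. Always restores the system, even on abort or error."""
    hypothesis = steady_state()          # e.g. "quorum holds"
    inject_failure()                     # e.g. stop one ZooKeeper node
    try:
        if not within_blast_radius():
            # Blast radius exceeded: stop observing and bail out.
            return {"aborted": True, "hypothesis": hypothesis}
        observed = steady_state()
        return {
            "aborted": False,
            "hypothesis": hypothesis,
            "observed": observed,
            # A surprise is the interesting outcome: the mental model was wrong.
            "surprise": observed != hypothesis,
        }
    finally:
        restore()                        # always bring the system back

# Toy stand-ins so the sketch runs end to end: a 3-node ensemble
# that keeps quorum as long as at least 2 nodes are up.
state = {"nodes": 3}
result = run_experiment(
    steady_state=lambda: state["nodes"] >= 2,      # quorum held?
    inject_failure=lambda: state.update(nodes=2),  # take down one node
    restore=lambda: state.update(nodes=3),
    within_blast_radius=lambda: state["nodes"] >= 1,
)
```

The point of the sketch is that the hypothesis is written down *before* the injection; whether the run produces a surprise or a confirmation, either result feeds the follow-up interviews.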
The follow-up portion after you go through that kind of exercise can also be incredibly valuable. We recommend facilitated individual interviews: that format works best for uncovering gaps in mental models. We want to find, “OK, this person over here thinks this system is going to behave this way, and then somebody in this organization thinks the system is going to behave that way.” We bring them together in a room and have a couple of different groups share, and that's when you get to see the light bulbs go off, and people are like, “Oh, I had no idea that XYZ. Actually, we've seen that before.” Those kinds of discussions really help elucidate the safety margin, and they also help engineers build up that intuition of the safety margin, which leads to safer systems.
Set the stage for better improvisation.
When you’re navigating complex systems, I don’t think you can reach a point where you say, “OK, now we've got a complete picture—and that's something that we could put into an artifact that will help distill reality to other people in a better way.” Software engineers tend to spend short spans of time at technology companies: if you're in software, you basically have to think of institutionalized knowledge as an ephemeral asset.
Instead of building up a base of institutionalized knowledge, the research we support suggests that you'll do better if you focus on communication and better ways of adapting in the moment—improvising. If you can set the stage for the participant actors to improvise better, that will tend to produce better safety outcomes. It's not so much discovering what reality is; it's more about making sure that people are able to discover things they didn't previously know, and then internalize that context.
A lot of teams rely on training or runbooks or documentation—and those things can be great. If it helps you to communicate by writing things down, that’s fine. But generally, they’re not going to move the needle when it comes to availability. From a business perspective, most significant outages are unique, and if you're relying on runbooks to dictate a protocol for how humans behave under a certain set of circumstances, then you're basically documenting a bug you haven't fixed yet. You should fix the system so the failure doesn't happen, or automate the protocol so a human doesn't have to intervene. Humans don't operate that way—putting a human in a position where, in the middle of an incident, they have to go look up a protocol is not a great pattern for safe systems.
People can’t scale out—but tooling can.
The maturity of chaos engineering as a practice in an organization can grow in one of two directions: toward more and more sophisticated experimentation, or toward broader adoption across the company or organization. In terms of sophistication, we’d expect to see organizations go from infrastructure experiments to application-level experiments to business-logic-level experiments. In terms of adoption, you’d typically expect to start with one small group and then broaden the program out through more parts of the software organization.
Typically what we see in terms of resource allocation, from a business perspective, is a model that tries to follow the SRE model, which I don't think is necessarily the best one. Most organizations form a chaos engineering team, staff it up, and have it try to move horizontally through the organization. I think that model suffers from the same problem SRE does: you can't scale people—it's really difficult to have their purview scale out.
Where we tend to see more success is when that team is able to build tooling, and support that tooling as a centralized resource for the rest of the organization. The tooling can scale out. At Netflix we built ChAP, the chaos automation platform that can be applied to the microservice architecture of Netflix's control plane. At the current state of the industry, that kind of platform tends to require a lot of customization, because everybody's infrastructure is so different. As infrastructure homogenizes over the years, I expect that the tooling will become more capable, and it will scale better.
Injecting failure into the system won’t make it more robust.
There’s a line about chaos engineering—that it's about breaking stuff in production—and that couldn't be further from the truth. I prefer fixing things in production. There's this notion that if you just inject failure, eventually you'll somehow make the system more resilient or robust. That also isn't true. It's an appealing idea, but it's kind of like saying that if you punch something enough times, eventually it gets strong enough to withstand the punches. We just don't have evidence to suggest that's true, or that it's a good methodology to follow.
You're trying to educate the people who maintain, operate, and build the infrastructure about what their safety margin is—that's a proven method for moving the needle on the availability or safety properties of a system. Focusing on elucidating and enumerating safety margins is a great place to start. Architecturally optimizing for reversibility can proactively improve your adaptive capacity—your ability to improvise. Good incident facilitation matters too, and here I'll call out that if you're doing root cause analysis, at best you're wasting your time. Techniques like the 5 Whys are completely arbitrary and just reinforce biases you already have, so you're probably not getting value out of those, either.
Chaos engineering is about learning to communicate.
When teams do chaos engineering experiments, ideally you want them to learn: What are other teams around them concerned about? What is the business trying to get out of their system? And, of course, I want them to go out and read the book on it. But developing better ways of communicating inside the organization is the best indicator of whether they got value out of the exercise.
At the end of the day, one of the things I like to remind engineers is the business doesn't care if the thing works the way you think it works—they just care about the system output. If the customer’s happy, the system could be on fire, but the business doesn't care. You, as somebody who has to put out the fires and support the infrastructure, you care. So that's something you should certainly take into account when you're deciding what to work on. But it's a very different perspective from modeling or testing, where you're trying to figure out, is this thing doing what I want it to do? Am I getting the result that I want? And do I really care if it's doing what I want it to do or not, as long as I'm getting the result that I want?
Casey Rosenthal is CEO and co-founder of Verica; formerly the Engineering Manager of the Chaos Engineering Team at Netflix. He has experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. His superpower is transforming misaligned teams into high-performance teams, and his personal mission is to help people see that something different, something better, is possible. For fun, he models human behavior using personality profiles in Ruby, Erlang, Elixir, and Prolog.