In conversation with
The head of product at Levels Health on how to plan for incidents before they happen
You can’t plan how to handle incidents during an incident.
When incidents get kicked off, everybody is in this heightened emotional state. The technical term for it is an amygdala hijack. Even though we’re safe at our desks in climate-controlled rooms, there's something that seems physically threatening—the possibility of losing our job makes us lose our sense of safety. It's important to call out that, during those times, all of the higher reasoning parts of our brain are completely off.
So, first and foremost, the time to do any planning and coordination for how you’re going to handle incidents is not during an incident. Ahead of time, when cooler heads prevail, discover the type of roles that people feel more comfortable playing under pressure. (Expecting people to have growth moments during an incident often sets people up for failure.) The human animal behaves a certain way in a high-stakes scenario, and it's not up to us to counter that. It's far better to embrace it and work within it.
Talk about what you value ahead of time.
What do we believe about our company and our brand? How forthcoming do we want to be around incidents? What roles do we want people to play? It’s best to get those answers hammered out, to talk about what you value, ahead of time. Company values help us make hard calls when we would otherwise be tempted to make a bad decision. Incident values do the same thing. So, for instance, we may be tempted to share a little bit less, to brush the details of our incidents under the rug. Those impulses aren’t malicious; they’re just the default unless we state our values ahead of time. Incident values are a way to acknowledge up front that the company will face tough junctures during incidents. They’re also a way to ensure that, though we may want to make the easiest and least painful decision, we’re going to make the best decision, even if it’s harder, because of what we believe about our company and our values.
Be truthful and straightforward about what’s going on.
In the absence of information, your users are definitely going to be thinking of the worst-case scenario (even if it’s highly unlikely). For that reason, I encourage frequent updates, even if there’s no new information to be had. You want to be truthful and straightforward about what’s going on without over-promising, for instance by giving an ETA down to the minute. If you deliver on it, it’s a neutral outcome. If you miss it, then people lose trust, which is one of your primary currencies during an incident. Keep in mind, trust in a brand can actually increase depending on how it navigates an incident or crisis.
The same rules for external communications apply to internal communications as well. On one side, you've got customers worried about what's going on. On the other, you've got internal people—executives or the leadership team—worried about what's going on. The only difference is that you can adjust the level of information shared during updates.
Severities are a great example of this: a single point of understanding for the entire organization, an immediate way to communicate customer impact to both a customer support representative and the CEO. And it can help guide how many people you bring into action. Are we going to turn the volume up or down on our response timings, our escalation, et cetera? And, as we talk about running drills and having documentation for different incidents, severity levels can be used as a fork in the road where the organization turns onto either a more frequent or less frequent communication path. Keep in mind, a specific incident might go through various severity levels as time goes on.
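As a rough illustration of severity acting as a fork in the road, the sketch below maps each level to a communication and escalation plan. The level names, update intervals, responder counts, and the `plan_for` helper are all hypothetical; the interview does not specify any of them, and a real organization would agree on its own values ahead of time.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means wider customer impact."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePlan:
    update_interval_minutes: int  # how often status updates go out
    responders: int               # how many people are brought into action
    notify_executives: bool       # whether leadership receives direct updates


# Assumed values for illustration only.
PLANS = {
    Severity.SEV1: ResponsePlan(update_interval_minutes=30, responders=5, notify_executives=True),
    Severity.SEV2: ResponsePlan(update_interval_minutes=60, responders=3, notify_executives=True),
    Severity.SEV3: ResponsePlan(update_interval_minutes=240, responders=1, notify_executives=False),
}


def plan_for(severity: Severity) -> ResponsePlan:
    """Look up the agreed plan for a severity level."""
    return PLANS[severity]


# An incident may move between levels over time; the plan is simply looked up again.
current = Severity.SEV3
plan = plan_for(current)
current = Severity.SEV1  # impact turned out to be wider than first thought
plan = plan_for(current)
```

Because the table is written down when cooler heads prevail, nobody has to decide the update cadence in the middle of an incident; a change in severity just selects a different row.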
People value the emotional dignity and the respect of giving them their time back.
When people are on-call—especially when they have bad nights and bad weekends that encroach on their family time, their personal time, their sense of identity outside of work—they value the emotional dignity and the respect of giving them their time back above being monetarily compensated. When we take a little trinket of dollars and pass it across the table, there's a way in which we’re valuing their family time at X dollars per hour. It’s not going to be worth as much as the feeling that someone is going above and beyond to extend that time and that consideration back to you.
For someone who is part of an on-call rotation, if it feels like it’s not going to get better, that you will always have a turbulent life, that’s very stressful. What’s worse is that you feel you have no control over it. Focusing on fixing specific stability issues and making sure that systems are not waking people up in the middle of the night bodes better for everybody.
Celebrate the wins and celebrate the missteps.
Incidents are very tough on organizations. The customer doesn’t necessarily see that a lot of your systems are running around with massive technical debt, but they do see incidents. If you’re feeling hesitant about declaring an incident, there are probably cultural forces contributing to that. So you need to lower the amount of judgment that goes into it. An incident declaration guide, where you say how bad you think the incident is and how much of the customer base is being impacted, takes the emotions out of that initial step. It stops feeling like a judgment call, and instead feels like “I just followed the steps we all agreed upon.”
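One way to picture a declaration guide is as a short, pre-agreed checklist that turns two answers into a severity. The sketch below is an assumption for illustration: the inputs, the thresholds, and the `declare_severity` name are invented here and are not taken from the interview.

```python
def declare_severity(percent_customers_affected: float, core_feature_down: bool) -> str:
    """Return a severity label from two pre-agreed questions.

    Thresholds are hypothetical; a team would settle its own ahead of time.
    """
    if not 0 <= percent_customers_affected <= 100:
        raise ValueError("percent_customers_affected must be between 0 and 100")

    # Widespread impact on a core feature: highest severity.
    if core_feature_down and percent_customers_affected >= 50:
        return "SEV1"
    # Either a core feature is down or a meaningful share of customers is affected.
    if core_feature_down or percent_customers_affected >= 10:
        return "SEV2"
    # Everything else is still declared, just at the lowest level.
    return "SEV3"
```

Note that every branch returns a severity. In this sketch there is no "do not declare" outcome, so the person on call only answers the questions and follows the result.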
In addition to preparation, culture is key. Having an emotional understanding of what it’s like to be a technical person on-call makes it viscerally clear why spending time on stability and on test coverage is worthwhile, even though on the surface they don’t seem to add customer value. I’ve been alone on call, woken up at 2 a.m., in a delirious state, trying to both fix the system and communicate with our customers. One very big opportunity that leadership teams have is to place people in the on-call rotation who aren’t on the technical side of things.
More broadly, what would it look like if incidents were celebrated, or congratulated? What can we learn from them if we ensure the process is fun and good, as opposed to just focusing on uptime numbers? Celebrate the wins and celebrate the missteps. Take it all in stride.