In conversation with
The VP of Infrastructure & Security at Algolia on navigating both availability and security incidents
Keep the temperature down.
What are the biggest challenges that teams you work with encounter related to information security incidents?
From my point of view, with security incidents, you don’t want to freak people out. Because with availability incidents—the service is down, we’re trying to understand if customers are impacted—the worst-case scenario is that it just doesn’t work. And yes, someone is potentially losing money there. But with security incidents, the security team comes in, dives into it, and they start to open one Pandora’s box after another.
With availability incidents, you’re trying to prove that the system doesn't work, but with security incidents, you’re trying to prove that nothing happened. It's very hard to prove that something didn't happen. When your normal system is working, you can prove that it doesn't have availability issues. But when your normal system is working, can you prove it doesn't have security issues? That’s difficult.
Security incidents require a different sensitivity. People might be overreacting and going down rabbit holes: “This is possible, and this is possible…” You need to rationalize the discussion in terms of what could reasonably happen based on how you’ve observed the system to behave—and you have to keep the temperature down. If you start to overheat as an incident commander, you’re going to take everybody with you, and they’ll start to lose confidence in the confidentiality of the system. I don’t want to say take it slower: the discussion just needs to be more mature. And people need to be more careful about what they're saying, because it can have a very large impact.
The incident commander needs to manage expectations.
If keeping everyone calm and preventing people from overheating is critical to running incident response, I’m curious how you actually do that. How do you navigate the team dynamics as your experiences are unfolding?
In a security incident, the incident commander becomes more valuable than ever, because they really need to suppress the overreaction. Most incident response processes are built on the idea that the incident commander is infinitely responsible, infinitely reasonable—and not deeply flawed.
What if you have an incident commander who doesn’t have that self-reflection, and just wants to hunt for the next amazing security incident? With security incidents, you really want them to be as boring as possible. Availability incidents can be exciting—we triggered some weird kernel bug, which caused the socket to close prematurely and the connections to reset. But in security, you mostly want it to be boring—yeah, there was a buffer overflow somewhere in the application, so it crashed, and that’s where it ends.
So the incident commander needs to be even more responsible, and the job becomes harder, because they need to manage expectations, and possibly unnecessary fear. They might be walking on the edge of thinking that if we don't fix this, it could be the end of the company—the difference between “we throw money at the problem” and “we don't have enough money to throw at the problem.”
It also depends on the culture inside the company: how security is respected, or how seriously it’s being taken. If a lot of incidents end up as false positives, the overall infosec environment gets very noisy—and it gets very panicky. People build resistance to the system, and they're going to question everything. But if security comes only into a situation where it's like, “OK, now everybody stop and listen, we are working on this,” then it's good, right? Then you can control the information. People are not second guessing security—and they're not second guessing the incident commander.
Share actionable information up the chain.
You touch on something so important here, which is how teams accurately calibrate and understand the various signals they're getting. What challenges do teams face around communicating internally and externally—and how do you actually weigh all of the different streams of communication coming at you?
There’s this important concept of managing up: a security manager manages the security team, and they report to the CTO, and the CTO reports to the CEO, and the CEO reports to the board. Now, in which part of the chain does the information come in? It needs to bubble up, but some of it is suppressed at every level, depending on the importance. Some of this stuff is not going to get to the board—there is no point. Some of this stuff is not going to get to the CEO—there is no point. Some of this stuff is not going to get to the CTO—there is no point.
The bubbling up puts the information into a larger and larger picture. But that's precisely why you have someone above you—until you become the CTO, and then there’s only the CEO and the board, and you probably have a red phone to dial them directly and say, “Hey, we have a problem. Let's bring in legal. We need to talk about this.”
I don't want to wake up my CTO in the middle of the night with a bunch of fractional information, or to tell him there’s potentially an issue, but maybe not. He doesn’t pay me to wake him up in the middle of the night with a range of options; I need to filter things out. So it’s important that when you reach out, there needs to be something actionable, or something informative enough to say, “OK, this is the problem, and this is what we are going to do.”
Report all kinds of incidents in a consistent way.
Different types of incidents can change the responsible parties for resolving and mitigating. How does that affect the way that the teams work together and communicate those incidents?
I’m probably biased, because I do both infrastructure and security: I have an SRE background, and on top of that, I do security. I want to inform customers about availability issues the same way I inform them about security issues. I don't see that much of an incentive to complicate the situation by splitting the two. Rather, we have a very high standard in terms of communication for security, and we have a very high standard for detections and for status pages around availability. OK, let’s merge it together—and let's keep informing about both of those in a consistent way.
Because yes, those are incidents impacting the service without any further specification—availability or performance or confidentiality, they’re just incidents impacting the service. And indeed, if you take a look at it, in most contracts, the wording around what is an incident is going to be super vague, and you actually might be in breach if you’re not informing your customers about it. So we’re trying to keep those two as aligned as possible, and not treating them too differently.
Remember: A security incident is just another incident.
You have a unique role leading teams across Infrastructure and Security at Algolia, which means your teams handle a variety of incidents spanning availability and technical incidents. How do you think about training teams and designing systems to handle many different incident types?
I think that keeping security incidents and the processes around them as close to availability incidents as possible makes them not sexy, in a helpful way. There is value in making security incidents seem ordinary and akin to availability incidents—for one, it means people aren’t gawking at some big disaster.
It also means you don’t need to retrain people; you have a systematic way of handling issues. No one can come and say, “Oh, but I don't know how to handle security incidents,” because it's the same—and vice versa. Our security engineers can handle availability incidents because they’re not that different. You don't need to overthink it. You're looking at different stuff, but the reaction is the same—and the expectation is the same.