In conversation with Liz Fong-Jones
A principal developer advocate at Honeycomb on embracing defects and treating reliability as a product feature
Don’t be afraid of defects.
The number one thing I encourage people to do is adopt progressive delivery practices: make it so fixing a bug doesn’t require a massive overreaction or response. At Honeycomb, escaped defects happen multiple times per week—it’s routine. We have this idea of testing in production: you’re never going to find all the defects in your staging environments, and it doesn’t make sense to try to catch everything before it reaches customers. You can catch some things, but overinvesting in your staging environment isn’t a good idea.
Instead, the investments we’ve made have been about making sure our delivery pipeline is able to produce artifacts reliably within 15 minutes, every time. If we catch a bug, most of the time it can be feature-flagged off. In the event that it can’t, we have a process for reverting to the previous build. And if that doesn’t work, we have the capability to just fix it within 15 minutes. Basically, we treat escaped defects as a routine thing that happens every day and that we’re prepared to fix, at least on a product level. (Reliability issues are a broader subject, and there we do have to build in greater defenses.)
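The escalation order described above (flag off, then revert, then fix forward) can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Defect` fields and the `Remediation` names are hypothetical, not Honeycomb's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Remediation(Enum):
    FLAG_OFF = "turn the feature flag off"
    REVERT = "revert to the previous build"
    FIX_FORWARD = "ship a fix through the delivery pipeline"


@dataclass
class Defect:
    # Both fields are hypothetical; a real system would derive them
    # from its flagging service and its deploy history.
    behind_feature_flag: bool
    previous_build_is_good: bool


def choose_remediation(defect: Defect) -> Remediation:
    """Pick the cheapest response that applies, in the order described above."""
    if defect.behind_feature_flag:
        return Remediation.FLAG_OFF
    if defect.previous_build_is_good:
        return Remediation.REVERT
    # Neither shortcut applies, so rely on the fast pipeline to ship a fix.
    return Remediation.FIX_FORWARD


print(choose_remediation(Defect(behind_feature_flag=False,
                                previous_build_is_good=True)).value)
```

The ordering matters more than the code: each step is only reached when the cheaper one before it is unavailable, which is what keeps an escaped defect routine.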
But overall, my message is: don’t be afraid of defects. Embrace that they’re going to happen, and instead of trying to decrease how many happen, decrease how severe they are when they do.
Treat reliability as a product feature.
When it comes to platform stability and reliability, we’re guided by our service-level objectives (SLOs). So we ask: Is this impacting our SLOs? If it is, and we’re going to burn through our SLOs within a couple of hours, that’s where we go all-hands-on-deck to fix it. Often the fix can be very quick, but if it’s not, we’re happy to divert engineering effort and call in anyone that needs to be called in.
On the flip side, if we have a series of outages and we are not meeting our SLOs anymore, that’s a sign that we need to slow or pause doing product work in order to get the system stable again. That is not an emergency response. That’s us saying, “OK, the system is stable now, but our customers are potentially not going to be super happy about the current stability. Let’s not throw more monkey wrenches into the mix—let’s focus some engineering effort on stability.”
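The two responses described above can be sketched with the common SRE notion of an error budget, meaning the number of failures an SLO allows over its window. This is a minimal sketch under assumptions: the `SloWindow` fields, the two-hour threshold as a parameter, and the returned labels are all illustrative and not taken from Honeycomb's systems.

```python
import math
from dataclasses import dataclass


@dataclass
class SloWindow:
    target: float               # e.g. 0.99 means 99% of events must succeed
    total_events: int           # events expected over the whole SLO window
    bad_events: int             # failed events so far in the window
    bad_events_per_hour: float  # recent failure rate


def remaining_budget(w: SloWindow) -> float:
    """Failures the SLO still allows before it is missed."""
    allowed = (1.0 - w.target) * w.total_events
    return allowed - w.bad_events


def hours_until_exhausted(w: SloWindow) -> float:
    """Time left at the recent failure rate; infinite if nothing is failing."""
    remaining = remaining_budget(w)
    if remaining <= 0:
        return 0.0
    if w.bad_events_per_hour <= 0:
        return math.inf
    return remaining / w.bad_events_per_hour


def respond(w: SloWindow, urgent_hours: float = 2.0) -> str:
    if remaining_budget(w) <= 0:
        # The SLO is already missed: not an emergency, but a signal to
        # shift engineering effort from product work to stability.
        return "pause product work, invest in stability"
    if hours_until_exhausted(w) <= urgent_hours:
        return "all hands on deck"
    return "continue product work"


print(respond(SloWindow(target=0.99, total_events=100_000,
                        bad_events=400, bad_events_per_hour=500.0)))
```

A real setup would probably measure the failure rate over more than one lookback period, but the split between "about to run out" and "already ran out" is the part this sketch is meant to show.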
This really helps us treat reliability as a product feature, and one that everyone at Honeycomb is on the same page about. You can ship all the features in the world, but if your customers don’t trust that they’re going to work, then your customers are not going to be delighted with them.
Make implicit SLOs explicit.
If you don’t have any SLOs, you kind of have implicit SLOs. The level of reliability that you’ve been delivering? That’s your implicit SLO. If you start performing ten times worse, your customers are probably going to complain. So you can convert that implicit SLO to an explicit SLO, and when you’re setting the target percentage, that’s where you have to think about how critical it is to the business and to your customers.
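One way to surface an implicit SLO is to compute the success ratio you have actually been delivering and use it as the starting point for an explicit target. The sketch below assumes you can export per-day `(good_events, total_events)` counts; the function names and the sample numbers are made up for illustration.

```python
from typing import Iterable, Tuple


def observed_success_ratio(history: Iterable[Tuple[int, int]]) -> float:
    """Fraction of events that succeeded across the whole history."""
    good = 0
    total = 0
    for good_events, total_events in history:
        good += good_events
        total += total_events
    if total == 0:
        raise ValueError("history contains no events")
    return good / total


def ten_times_worse(ratio: float) -> float:
    """Success ratio if the failure rate grew tenfold (floored at 0)."""
    return max(0.0, 1.0 - 10.0 * (1.0 - ratio))


# Hypothetical daily counts: (good events, total events).
history = [(9_990, 10_000), (9_995, 10_000), (10_000, 10_000)]
implicit = observed_success_ratio(history)
print(f"implicit SLO: {implicit:.4%}")
print(f"ten times worse: {ten_times_worse(implicit):.4%}")
```

The observed ratio is only a baseline. Choosing the explicit target still depends on how critical the service is to the business and to customers.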
Center customers in the discovery process.
At Honeycomb, ideas percolate for a month or two, and eventually, there’s enough signal to say, OK, we’re going to put something on the roadmap. Then we go through a discovery process with our customers, where we ask people to join a buy-a-feature session: every customer gets 1,000 tokens to spend, and some features cost 200 tokens while others cost 500. You watch which features they buy in collaboration with other customers, and that back and forth is really cool to watch. That’s how you get input into what goes on the roadmap.
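The token exercise can be tallied with a few lines of code. This sketch assumes that customers may pool tokens toward a feature and that a feature counts as bought once the pooled tokens reach its price. Those rules, the function name, and the feature and customer names are assumptions for illustration, not a description of how Honeycomb runs its sessions.

```python
from typing import Dict, List


def funded_features(prices: Dict[str, int],
                    bids: Dict[str, Dict[str, int]],
                    budget: int = 1000) -> List[str]:
    """Return the features whose pooled tokens meet or exceed their price."""
    pooled = {feature: 0 for feature in prices}
    for customer, spend in bids.items():
        if sum(spend.values()) > budget:
            raise ValueError(f"{customer} spent more than {budget} tokens")
        for feature, tokens in spend.items():
            if feature not in prices:
                raise KeyError(f"unknown feature: {feature}")
            pooled[feature] += tokens
    # Keep the order in which features were listed in `prices`.
    return [f for f, price in prices.items() if pooled[f] >= price]


prices = {"feature_a": 200, "feature_b": 500}
bids = {
    "customer_1": {"feature_a": 200, "feature_b": 300},
    "customer_2": {"feature_b": 200},
}
print(funded_features(prices, bids))
```

The tally is the least interesting output. The negotiation between customers about where to pool their tokens is the signal the session is designed to produce.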
Try not to work on mountains.
We have this idea of mountains, boulders, rocks, pebbles, and sand. Those are our sizes for how much effort something’s going to be. Sand is small things we do immediately. But in general, we try not to work on mountains; we’re always breaking down mountains into boulders, and then breaking down boulders into rocks. And eventually it gets to the scale where it can be assigned to a feature team to work on. A feature team is made up of platform and product and telemetry engineers, so you get the appropriate set of people to build a particular set of features.
Product-led growth means thinking holistically.
During our incidents, the on-call person or the incident commander is in charge, and can make whatever decisions are needed to stabilize the system. But after the system is stabilized, any follow-up issues wind up being owned by the respective team, or if it’s a broader issue, one where we need a new team, then it’s a product management concern.
That’s how a product-led organization works: you don’t have Sales butting in and saying, “We’re overriding this effort,” or, “Hey, can you build us a shiny new feature that we’ve already sold to a customer?” That’s not how product-led growth works. Ultimately, the decisions about what we staff, what we decide to build, are made by our product management team.
The product manager can say no. They can say, “You’re going to have to let that go. We have more important things to drive further product growth, rather than building an individual thing for one individual customer. That’s not worth it to us.” They can prioritize the success of the product as a cohesive whole, rather than a bag of features that salespeople promised. And to be clear, this very rarely happens at Honeycomb. We have a great relationship between our sales and product teams. But sometimes, you do need to give that feedback of, “Hey, if one customer is asking for it, let’s see if three or four customers start asking for it. Then we’re going to take action.”
Liz Fong-Jones is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.