Welcome to Allma’s Incidentally, a newsletter where we interview Engineering Leaders on how they’ve scaled their companies and the secrets and systems that they use to build and evolve reliability practices on their teams. Our goal is to be actionable enough for you to implement and high-level enough to be applicable to you; your feedback is always welcome.
How HubSpot’s Former Director of Reliability Uses First Principles and Customer-centric Philosophy to Scale Reliability
From IT to Director of Reliability at HubSpot, to Director of Product Engineering at CollegeVine, Ian's held almost every type of engineering and product development role. In this interview, Ian Marlier talks to Allma about systematizing reliability, the importance of studying the humanities in engineering, and the difficulties of scale.
Recalling his zig-zagged journey from humanities to engineering
Hi, my name is Ian Marlier. I've been in technology professionally for around 15 years. It was originally something I fell into by accident. My academic background is in humanities; I don't have a technical background in particular, but I grew up using computers and learned enough about them to get an IT job, then leveraged that into a job that involved operating production systems, and things have just sort of spiraled from there. At this point I've done just about everything you can do in the technical world: software engineering, product management, and managing teams throughout.
The focus on reliability was not something I pursued intentionally. Coming out of the operations world, it was a natural fit once I became more interested in software engineering and product focused development.
I tend to be interested in processes and philosophies within an organization as opposed to purely technical solutions, and that has lent itself well to helping grow a culture of reliability within organizations that aren’t predisposed to think that way.
Over my 15 years in tech, I’ve observed the evolution of reliability. We've gone from reliability being a minor business and organizational concern to one that is becoming an overarching focus across the entirety of an organization.
The HubSpot reliability journey
When I joined HubSpot in 2013, the Engineering team was ~50 people. At that point, the engineering team had completed a major rewrite of the entire software stack, bringing in a lot of the technologies that had been used at Performable (a company HubSpot had acquired) and migrating most of HubSpot’s existing architecture over to that technology stack.
Coming into HubSpot, one of the first things I noticed was the customer-centric ethos instilled by David Cancel and Elias Torres (co-founders of Performable, and CPO and CTO, respectively, at HubSpot).
David and Elias had done an exceptional job of establishing a customer-first, product-centric perspective in the engineering team, such that, collectively, all of the engineers, designers, researchers, and product managers on the team were operating with those principles.
When I joined, one of the issues that had arisen was that when you're constantly pushing for new features for customers, you can lose track of the stability, reliability, and performance of the product.
As much as there were exceptional practices in place around developing new software, the practices around stability and reliability were much less developed at the time. We knew we had a reliability problem, but didn't know what the solution was.
Exploring Google’s SRE model
As a proposed solution, the initial thinking was that we would establish something akin to the Google-style SRE model, which had come to prominence a year or two before via the publication of their SRE handbook. The logic was that if there's a playbook that exists, we just need to run that playbook.
That said, it became apparent pretty quickly that it wasn’t going to work for us to merely copy and paste the SRE model. We realized the SRE model did not take into account the culture of the individual organization the way we needed.
Ultimately, the handbook is a set of best practices and if your culture is not structured in such a way that those make sense, you're going to end up with resistance across the organization. It is important to ask: what is the willingness of the organization to actually accept these practices and how do those practices match with the broad goals of the organization?
At HubSpot our maniacal focus was on autonomy and speed in service of the customer. Our teams were structured as cohorts of engineers who have a thing that they own and are solely responsible for. There is no one outside of that unit. That team is responsible for deciding the code that they write, the features they create, and when things are ready to ship.
We felt a tension between this design and focus and the SRE model, which inserts an external person who is now on the hook for the reliability of your software. That person now has veto power over whether your software is ready or not. The moment you insert another person into the process, particularly someone who is standing outside and incentivized on something other than purely “Is this the right thing for the customer?”, it changes things. As soon as that externality and that incentive are there, they insert friction into the process that can be significant.
Creating the reliability model for HubSpot
We ended up creating a model that aligned our engineering principles with a design that supported our values and ways of operating. Teams were responsible for their own metrics (technical and business), while our Reliability team provided the expertise and guidance they would need to implement and achieve those goals.
Example metrics included:
- Testing the API endpoint once a minute and measuring against our bar for acceptable downtime. Particularly for endpoints that were core within our infrastructure, we measured the performance of those.
- How often teams got paged, alongside response time. Frequency of getting paged was actually probably the most meaningful signal that a team was about to be in a reliability hole.
- Understanding and calculating the dollar value cost to the business of software not working. This came from our philosophy that writing code isn’t powerful unless it’s the right code. If no one is willing to give you money for the thing that you wrote, regardless of why, you have not made the business more valuable; you've made the business less valuable. The idea was to have something that we could look at and say that we are about to incur costs to the business. Let's stop things before we get to the point of incurring that cost.
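As a rough illustration of how metrics like these might be computed, here is a minimal Python sketch. It is not HubSpot's tooling; every name in it (`ProbeResult`, `availability`, `downtime_minutes`, `pages_per_week`, `estimated_cost`) and every number in the usage section (the downtime bar, the revenue-per-minute figure) is a hypothetical chosen for the example. It assumes one probe result per minute per endpoint and a simple list of page timestamps per team.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ProbeResult:
    """One synthetic check against an endpoint (hypothetical record shape)."""
    timestamp: datetime
    ok: bool


def availability(probes):
    """Fraction of probes that succeeded."""
    if not probes:
        raise ValueError("no probe results to measure")
    return sum(p.ok for p in probes) / len(probes)


def downtime_minutes(probes, interval_minutes=1.0):
    """Approximate downtime: each failed probe counts as one probe interval."""
    return sum(not p.ok for p in probes) * interval_minutes


def pages_per_week(page_times, now, weeks=4):
    """Average number of pages per week over the trailing window."""
    cutoff = now - timedelta(weeks=weeks)
    recent = [t for t in page_times if cutoff < t <= now]
    return len(recent) / weeks


def estimated_cost(downtime_min, revenue_per_minute):
    """Very rough dollar cost of downtime (assumes revenue is lost linearly)."""
    return downtime_min * revenue_per_minute


if __name__ == "__main__":
    # Hypothetical data: 1,000 one-minute probes, one failure every 100 minutes.
    start = datetime(2015, 1, 1)
    probes = [
        ProbeResult(start + timedelta(minutes=i), ok=(i % 100 != 0))
        for i in range(1000)
    ]
    allowed_downtime = 5.0  # hypothetical bar for acceptable downtime, in minutes

    down = downtime_minutes(probes)
    print("availability:", availability(probes))
    print("downtime (min):", down, "over the bar:", down > allowed_downtime)
    print("estimated cost ($):", estimated_cost(down, revenue_per_minute=50.0))

    now = datetime(2015, 3, 1)
    pages = [now - timedelta(days=d) for d in (1, 3, 10, 40)]
    print("pages per week:", pages_per_week(pages, now))
```

The point of keeping each metric as a small, separate function is that a team can own and inspect its own numbers, while a central reliability group only needs to agree on the thresholds they are compared against.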
My role as Director of Reliability became almost like a safety officer in a manufacturing plant, with a giant red button that can stop the entire assembly line. I sort of had that power for the Engineering team. You know, it's an incredibly costly thing to stop the line, even for a single engineering team, even for a single set of features. The idea was to have a set of metrics that allowed us to see when something was going to go wrong so we could stop the line before it went wrong instead of after.
The final element was education, helping the organization understand that the reliability of their software did have an influence on whether their software was valuable or not. And once that perspective started to take root, once people started to sort of bake that into their calculations, it stopped being difficult to get people to focus on reliability.
It actually became pretty easy. Rarely did the team push back because there was an innate understanding that if our software is not reliable, we are no longer delivering the value to customers that we committed to deliver to them. That understanding is what makes it possible to prioritize reliability and performance and to allocate your effort appropriately.
Lessons learned from scale
For a while the model hummed. There was enough understanding of the value of focusing on reliability within the organization that I could sort of just multiply the time and the effort that I was spending to educate people and keep track of how things were going across the organization.
But that eventually broke down, around the end of 2015. I knew because, for the first time, we had an outage and, when we were doing a post-mortem, I realized that I had never met any of the people on the team involved. The engineering team had gotten big enough and turnover had gotten frequent enough that it was possible for an entire engineering team to come into existence without our interacting.
Key Man Risk:
One of the issues that arose was an excessive reliance on me, and the impossibility of multiplying and scaling myself sufficiently to meet the need.
Focus on Metrics from Day One:
It took us at least 6 months to wrap our heads around the metrics that mattered. And it took us closer to a year before we were in a place where we were consistently measuring the things that actually mattered across the organization. Looking back, I would have loved to start doing that on day one. In retrospect, I think that might have had more impact than anything else, because it would have meant the conversations around reliability would always have been anchored in something observable and tangible.
At the Center of Everything are Humans.
When learning from incidents, often through formal post-mortems, it is important to recognize and tricky to navigate the human element.
What I mean is, philosophically, I am biased towards thinking that blameless post-mortems are a good idea. At the same time, I think you miss a lot. Because the reality is people do screw up, and I have been in organizations where the notion of the blameless post-mortem is taken to such an extreme that it's essentially off the table to say a human screwed up. Yet more often than not, what happens is someone writes code that doesn't work the way that they expected, or we don't fully understand the problem that our customers are solving and we make choices that are ultimately harmful and not helpful to them.
Being able to change your mind in service of creating a thing that is as valuable as possible to the consumer is ultimately what matters most. It’s important to acknowledge and understand the human element in order to learn and evolve.
The case for humanities in pursuit of engineering
Humanities degrees are powerful for learning how to be an engineer in the real world. Rather than assuming you need to go down the path of a Computer Science degree, I would encourage aspiring engineers to start taking classes like English, History, Foreign Language. Go write a bunch of stuff, learn how to think, learn how to analyze and speak in many different languages.
I think correspondence across language (written language, spoken language, and computer language) and the ability to synthesize and refine information are underrated. Alongside that, empathy, the ability to share in someone else’s perspective, is crucial to product development. Learning to speak and think and write in many languages and subjects is a great way to build that empathy.
Two book recommendations that are on completely and utterly opposite sides of the spectrum.
The first book is a technical book most people will have heard of, called The Phoenix Project. There are certainly ways in which it's simplistic. That said, from my perspective I would consider it to be the one book that every person in product development absolutely has to read. I go back and reread it at least once a year. The folks who work for me and my teams, if they've never read it, I buy them copies and drop them off at their houses. In many ways it's sort of my Bible for how to actually operate an engineering team.
On the opposite end of the spectrum, and I will throw it out there because it is one of my favorite books: A River Runs Through It by Norman Maclean.
A lot of people know the movie; relatively few people have read the book. It's very short. It's absolutely gorgeous, and it gets back to that idea of empathy, the idea of being able to simply and clearly state a larger vision and narrow it down to something that's comprehensible.
I really do believe the ability to use language well is essential to being good at product development, whether on the technical side or the business side or whatever else. And I cannot think of a better example of a writer using language well.
Continue the conversation: join the incident collaboration Slack community
Allma is a tool built with incident best practices baked in, designed for everyone in your organization to collaborate on incidents.