Allma

Incidentally #015

Welcome to Incidentally, Allma’s publication. We interview engineering and reliability leaders, founders, and makers on the secrets and tools they use to scale their systems and teams. We share incident collaboration stories and learnings, in solidarity and with openness. Our goal is to be actionable enough for you to implement and high-level enough to apply broadly; your feedback is always welcome.

Former Head of Netflix Operations Engineering and Playback Services on the birth of streaming, best reliability practices, and being the CTO of Datazoom - Part 2

“Never fail the same way twice”

I don’t know who coined this phrase, but I heard it for the first time from a guy I used to work with at Netflix and I love it: “never fail the same way twice”. I remember early on in my career (I think many of us have been there) when I was more lax about digging into root causes of production issues. If a problem went away on its own I might cross my fingers and hope it didn’t happen again. We’re all busy and it’s tempting.

Over time, I realized that the stuff that breaks in production is the stuff that will likely break again in production. If you keep letting these things pile up, it will be death by a thousand cuts and you will never have a reliable service.

Every outage requires a review, and every problem needs to be addressed in the short term (alleviate customer pain) and long term so that it never happens again. Ideally the long term solution is implemented as code and integrated in all relevant applications or services. It becomes a systemic fix. I think of this as a one-way ratcheting effect where you only move towards greater reliability. This philosophy works well but it is reactive. If you want to get proactive then consider embracing chaos.

The value of chaos engineering

Incidents typically happen at the worst possible time: during a surge in traffic, at night, or (if you have a global service) in the wee hours of the morning. And naturally occurring production issues can take time to investigate and resolve. This is bad for end users and engineers alike. The solution is to break things in production under close observation.

This may sound scary but, if you accept that things are going to break in production anyway, proactively inducing the same issue under controlled conditions is far better.

With appropriate defensive engineering practices, services are less likely to experience cascading failures. And chaos can be applied with escalating scope to reduce the impact on end users. For example, isolating dependencies and creating graceful degradation mechanisms is critical before you start inducing failures in production. You have to harden the dependent services first to make sure that when dependencies fail there are default responses (fallbacks) in place and thread pools are isolated so that one service dependency failure doesn’t block access to other healthy services.

You also need to know how something is going to break functionally before you understand how it's going to break at scale. We built a tool called the Fault Injection Test Framework (FIT). It allowed us to induce failures at the individual device level and then ramp up failures for a percentage of traffic. Once you’ve tested that, you can dial the percentage of traffic to 100% to see how your service handles a full production failure of a dependency. 
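The ramp-up described above, from individual devices to a percentage of traffic to 100%, can be sketched roughly as follows. This is a minimal illustration and not FIT itself: `FaultScenario`, `bucket`, and `should_inject_failure` are hypothetical names, and the hash-based bucketing is an assumption made for the example. Hashing the device ID keeps each device in a stable bucket, so devices that fail at 10% still fail at 50%.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FaultScenario:
    """Hypothetical description of one injected failure (not the real FIT API)."""
    dependency: str                                 # dependency to fail
    percent: float = 0.0                            # share of traffic to fail, 0 to 100
    device_ids: set = field(default_factory=set)    # explicitly targeted devices


def bucket(device_id: str, dependency: str) -> float:
    """Map a device to a stable value between 0.00 and 99.99."""
    digest = hashlib.sha256(f"{dependency}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000 / 100


def should_inject_failure(scenario: FaultScenario, device_id: str) -> bool:
    """Decide whether a request from this device should see the failure."""
    if device_id in scenario.device_ids:
        return True                                 # device-level targeting
    return bucket(device_id, scenario.dependency) < scenario.percent
```

A test would typically start with one device in `device_ids`, then raise `percent` in steps while watching dashboards, and only then set it to 100.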

For example, for our viewing history service, which serves up the list of movies previously seen by members, we changed the client library to return an empty list if VHS was down. Once in place we could take the entire service down without breaking the end user’s application experience.
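A client-side fallback of this kind, combined with the isolated thread pool mentioned earlier, might look something like the sketch below. It is an assumption-laden illustration rather than the actual Netflix client library: `fetch_viewing_history`, `get_viewing_history`, the pool size, and the timeout are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# One small, dedicated pool per dependency (an assumption for this sketch),
# so a slow or failing dependency can only exhaust its own threads.
_vhs_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="vhs")


def fetch_viewing_history(member_id: str) -> list:
    """Placeholder for the real remote call (hypothetical); always fails here."""
    raise ConnectionError("viewing history service is down")


def get_viewing_history(member_id, fetch=fetch_viewing_history, timeout_s=0.2):
    """Return the member's viewing history, or an empty list on any failure."""
    future = _vhs_pool.submit(fetch, member_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()     # best effort; an already running call is not interrupted
        return []           # fallback: behave as if nothing has been watched
    except Exception:
        return []           # fallback for errors raised by the dependency
```

The caller never sees an exception from this dependency, which is what makes it safe to take the whole service down during a test.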

Incident management and the OODA Loop

If you are running a service at scale, you need people focused on incident management and enough of them that you don’t burn them out. When there’s a service failure, they’re the first ones on the call, managing the incident, tracking progress, and following up to make sure the situation is remediated. They also track trends. If there are consistent issues across services, then there might be a new engineering solution that may address the problem systemically. This allows problems to be addressed holistically to achieve the ratcheting effect I mentioned earlier. Since most engineers are focused on their own domains and services, having people dedicated to incident management allows for there to be more eyes on the overall quality of the service.

One of the main philosophies that formed the basis of our approach to operations engineering and incident response is the concept of the OODA loop, created by Colonel John Boyd, a military fighter pilot and instructor.

OODA stands for Observe, Orient, Decide, Act. Applying it to dogfight scenarios, Boyd came to realize that those pilots who could rapidly iterate through the loop (observing their surroundings, identifying friend vs. foe, making a decision about what to do next, and acting on that decision) were the ones most likely to succeed and come home alive. Now apply this to the operation of a production service. The same things are true. Engineers who can rapidly observe a problem (get an alert, see a trend on a dashboard), orient through investigation, zero in on a decision, and act will drive higher availability and overall operational quality of a service. And this cycle can happen over longer periods of time in the form of observing trends, understanding the source of those trends, defining an engineering solution, and implementing it as broadly as necessary to ensure that it’s fully resolved.

Regardless of the timing, the faster you can iterate through the loop, the higher your availability will be. Automation and engineering are natural extensions of the loop because, even if your engineers are good at figuring out a problem the first time, they shouldn’t have to do so more than once. Machines are faster at the “observe” and “orient” phases, which leads to faster problem detection and enables a fast response. And, longer term, with substantial investment in statistical analysis and ML pattern matching, decision making and action can be automated as well. The fewer steps of the OODA loop that need to be performed by people the better.
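To make the idea of automating the loop concrete, here is a deliberately simple sketch of one metric being watched by code instead of a person. Everything in it is an assumption for illustration: the `ErrorRateWatcher` class, the fixed baseline, and the "three times baseline" rule stand in for whatever statistical analysis a real system would use.

```python
from collections import deque
from statistics import mean


class ErrorRateWatcher:
    """Hypothetical automated observe/orient/decide/act loop for one metric."""

    def __init__(self, baseline, tolerance=3.0, window=5, act=print):
        self.baseline = baseline              # normal error rate, e.g. 0.01 for 1%
        self.tolerance = tolerance            # multiple of baseline treated as a problem
        self.samples = deque(maxlen=window)   # most recent observations
        self.act = act                        # remediation hook, e.g. page or roll back

    def observe(self, error_rate):
        """Record a sample, then run the rest of the loop. Returns True if it acted."""
        self.samples.append(error_rate)                     # observe
        if len(self.samples) < self.samples.maxlen:
            return False                                    # not enough data yet
        current = mean(self.samples)                        # orient
        if current > self.baseline * self.tolerance:        # decide
            self.act(f"error rate {current:.2%} is above {self.tolerance}x baseline")  # act
            self.samples.clear()                            # avoid re-firing on the same window
            return True
        return False
```

Passing a rollback or paging function as `act` moves the "act" step out of human hands as well; leaving it as `print` keeps a person in the loop.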

The paved road

At Netflix, the focus is on rapid innovation and execution on common goals, so the faster engineers can develop and deploy software the better. To enable this, we applied the concept of “the paved road”, a well-integrated tool chain of the most effective tools and technologies for a maximally efficient developer experience.

Paved roads also guide engineers towards the most effective practices derived from following the OODA loop. When I led Operations Engineering, the paved road that we supported consisted of the Ubuntu OS, Java as the primary programming language (but others emerged quickly), Docker, Jenkins, Spinnaker (Netflix’s open source multi-cloud CD automation tool), etc. This is not a one-and-done process. As technologies are adopted by various teams, the centralized teams need to continuously assess what gets folded into the paved road and what does not. The lens that we applied when making these decisions was based on what investments would drive “velocity with confidence”, a phrase I picked up from another talented colleague. How do we allow engineers to develop and deploy software as quickly as possible without breaking things?

At Netflix, we believed that rapid innovation, development, and delivery of product features is how you get and stay ahead of your competition and continuously grow the business.

Reliability problems are the biggest roadblock and distraction as you try to achieve this. We were able to easily correlate our rate of change to production with outages. So the more you invest in mechanisms that protect reliability during the delivery process, the more you can free folks up to focus on innovations that drive business value. I learned this and so much more during my time at Netflix. I thought that I would apply it at larger companies post-Netflix, but I ultimately found myself at an early stage startup called Datazoom, the first video data platform.

From large company to small startup: The journey to Datazoom

By the end of 2016 I had left Netflix and was taking some time off. I first heard about Datazoom from Diane Strutner, the CEO, in the summer of 2017 when she reached out to me about an exciting idea within the streaming space. I was intrigued by the idea so we met up for a beer. Diane was previously the VP of sales and business development at a streaming video analytics company called Nice People at Work (NPAW) which sold a product called Youbora. She identified significant gaps in the offering which focused on streaming video QoE metrics calculations at the collection point (player/app) vs. raw event and metadata collection which is far more powerful and flexible. 

In addition, there’s an entire workflow of services required to deliver streaming video, from player to CDNs (content distribution networks), origin services, packaging, encoding, etc. But almost all the focus to date has been on collecting QoE metrics from the application and player. There is also a wealth of QoS data collected by third-party service providers, but those logs can be hard for content publishers to access and leverage.

Diane knew that she could enable better streaming experiences, not only by collecting raw player telemetry, but collecting relevant logs from all of the services engaged in the streaming video ecosystem.

The Datazoom vision is to collect telemetry from all of the stages across the workflow. And, this multi-sourced data can be correlated using common identifiers, which can be propagated through the workflow during streaming.

If you add in standardization of the data for each node in the workflow (players, CDNs, origins, etc.) and identifier propagation with each content object request, you can build a universal video telemetry translator which collects, standardizes, and delivers correlated data to a variety of analytics tools, enabling the richest possible analysis of end user experiences.
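As a rough illustration of standardizing and then correlating on a shared identifier, consider the sketch below. It does not reflect Datazoom's actual data dictionaries or pipeline: the field names, the two mapping tables, and the `standardize` and `correlate` functions are all invented for the example.

```python
from collections import defaultdict

# Hypothetical data dictionaries: source-specific field name -> standard name.
PLAYER_FIELDS = {"sessionId": "session_id", "reqId": "request_id", "bufferMs": "buffer_ms"}
CDN_FIELDS = {"x-request-id": "request_id", "ttfb": "time_to_first_byte_ms", "pop": "edge_location"}


def standardize(record: dict, mapping: dict) -> dict:
    """Rename known fields to their standard names and drop unknown ones."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def correlate(player_events, cdn_logs, key="request_id"):
    """Merge standardized records from both sources that share the same key."""
    merged = defaultdict(dict)
    sources = ((player_events, PLAYER_FIELDS), (cdn_logs, CDN_FIELDS))
    for records, mapping in sources:
        for record in records:
            row = standardize(record, mapping)
            if key in row:                      # records without the identifier are skipped
                merged[row[key]].update(row)
    return dict(merged)
```

The join only works because the same identifier travels with each content request; a record that arrives without it cannot be matched to anything, which is why propagation matters as much as standardization.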

Today, we have a player data dictionary that standardizes all the telemetry across players and, more recently, a CDN data dictionary, which leverages CDN log streaming mechanisms. We’re working our way up the stack from there. Datazoom allows content publishers to see how their CDN is performing and, if they want, they could even build their own CDN switching algorithms based on our data.

When I first met Diane in 2017, I loved the idea behind Datazoom, but I wasn’t ready to join an early startup. I hadn’t been a hands-on engineer or architect for many years so I thought it was a bad fit. I was looking for a company that was farther along that needed organizational and principle-based leadership skills more than hands-on technical skills. However, I did join as an advisor and angel investor within six months after meeting Diane because I believed in the foundational concept and I saw Diane’s potential as an emerging industry leader. Then at the end of 2019, just as Diane was working on Datazoom’s first round of seed funding, the CTO at the time had to step down for personal reasons. At first, Diane and I discussed an acting CTO role to get the company through fundraising. I would then help Diane find another CTO. However, I quickly realized that investors needed to see a committed CTO or the funding would never happen. So I joined Datazoom. It was like falling into a gravity well. There was no escaping it.

Datazoom and continued learning 

It’s important to continue learning and Datazoom has given me that opportunity at this later stage in my career. For example, something I've changed my mind about since coming to Datazoom is the value of service-level load testing. At Netflix, since everything happened at scale, we invented ways to throttle traffic and launch incrementally and there was very little investment into simulated service-level load tests. On the other hand, as an early startup, not yet at massive scale, we have to over-provision capacity for stateful services like Kafka to deal with traffic spikes which makes it hard to figure out your baseline cost of goods for service pricing or where your performance bottlenecks are. 

We’ve discovered that, at this stage of our business, just like chaos testing, we have to do load testing to understand bottlenecks before we encounter them with live traffic. For example, even if we have additional Kafka capacity, streaming video is famous for unpredictable traffic spikes due to content launches, breaking news, etc. We started doing load tests in production about 6 months ago and we’ve learned a ton. It has been a big eye-opener for the team, providing meaningful insight into service vulnerabilities and opportunities to improve efficiency.
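A stepped load test of the kind described here can be sketched in a few lines. This is a generic illustration, not Datazoom's tooling: `send` is a hypothetical function that submits one request or event to the service under test, and the step sizes and latency limit are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(send, payload):
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    send(payload)
    return time.perf_counter() - start


def run_step(send, requests, concurrency):
    """Send `requests` calls using `concurrency` workers; report p50 and p99 latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda i: timed_call(send, i), range(requests)))
    cuts = quantiles(latencies, n=100)          # 99 cut points; needs at least 2 samples
    return {"concurrency": concurrency, "p50": cuts[49], "p99": cuts[98]}


def ramp(send, steps=(1, 2, 4, 8), requests_per_step=50, p99_limit_s=0.5):
    """Raise concurrency step by step and stop once p99 latency exceeds the limit."""
    results = []
    for concurrency in steps:
        result = run_step(send, requests_per_step, concurrency)
        results.append(result)
        if result["p99"] > p99_limit_s:
            break                               # first step that breaches the limit
    return results
```

The last entry returned by `ramp` shows the concurrency at which latency first degraded, which is a starting point for finding the bottleneck before live traffic does.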

Future excitement at Datazoom and leaving a legacy

There are many different things I’m looking forward to now that I’m at Datazoom. One of the biggest things is bringing integrated, correlated, standardized video telemetry to the streaming industry and building a successful business on that concept. It was what got me interested in Datazoom from the beginning. And, there’s nothing like the feeling when you have the wind at your back as the company starts to catch air. We just hired a sales team, we know our product is ready for the market, and the market is interested in what we’re doing. We’re in the just-add-sales phase so this is the year we’re going to get traction.

Another thing I am looking forward to is building a business with great people. I’m so lucky to work with people that I appreciate personally and respect professionally. It makes it worthwhile to get up in the morning and go to work. Datazoom is my legacy job. I essentially came out of retirement to do this and it's a final opportunity to leave my mark. I want to do this well and leave something lasting behind. I want to make these people and this company successful. Whether that means watching coworkers grow from their experience and do great things over time or for the company to live on long after I’ve moved on, I believe in this team and vision. At this point in my career, it's all about the purpose and people. 

Recommendations

I love video games and they are a regular hobby of mine. My favorite games are first-person shooters. I've been playing video games for as long as I’ve been in tech, starting with the original Castle Wolfenstein and Doom PC games. More recent favorites are the Halo series, Bioshock, Gears of War, Dead Space, and the recent Doom series reboot. Luckily, my wife has accepted the fact that, in this respect, I'm a stunted teenager at heart.

Josh Evans

Josh Evans is the CTO of Datazoom.

allma, inc © 2021