Incidentally: An Interview with Josh Evans
The former head of Netflix operations engineering and playback services on the birth of streaming, best reliability practices, and being CTO of Datazoom
By Josh Evans
Allma is proud to share this Incidentally featuring the journey and advice of Josh Evans, the former Director of Operations Engineering and Playback Services at Netflix and the current CTO of Datazoom. Read on for Josh's experience getting to Netflix, taking up the helm of streaming services, and the trials and tribulations he faced on the way.
An accidental career choice
Hi, I’m Josh Evans. When I was young, despite a love of science fiction, I had no intention of working with computers or technology. My father was a mainframe programmer and frequently railed about the daily grind, largely working for insurance companies and banks. So I went to school with no intention of getting a computer science degree. Instead, pragmatist that I was at the time, I ended up following a childhood passion and got an art degree with a focus on printmaking. While I was in school, I was working part-time as a waiter and a house cleaner and I hated both jobs. My girlfriend at the time was a nanny for a guy who worked at Borland, a software company in Scotts Valley that made developer and business software, directly competing with Microsoft. I was able to get a job there as the nighttime 800-line operator and security guard. Once I graduated, I realized that I didn’t want to be a starving artist and the job market for art teachers was pretty challenging. I had been tinkering with Borland’s utility products (Sidekick personal assistant, Sprint Word Processor, Superkey, etc), so I literally walked across the hall and applied for a utilities tech support job. Within a year of doing tech support, I started coding, and that’s when I discovered how fun it was to write software and create programs.
"I was hooked and couldn’t stop myself from writing code if I wanted to. I was so lucky to be in the right place at the right time."
Borland was a great place for me to cut my teeth in tech because there were so many smart, talented people that I was able to learn from. The people I worked with in tech support were as good as many of the top engineers I know today, helping people write code over the phone and learning the tools inside and out. Without a computer science background, I felt the usual imposter syndrome but I was able to work through that pretty quickly with the support of my coworkers, a lot of reading, and a lot of coding.
After Borland, I worked at a small startup that no one has ever heard of. My boss was a former manager and friend from Borland. In ‘98 he told me he was going to this company called Netflix that delivered DVD rentals by mail.
"I can still remember thinking that it was the most incredibly stupid idea I had ever heard of. Who the hell was going to wait a week for a DVD to show up instead of going to Blockbuster?"
A year later I was sitting across a picnic table from Neil Hunt, Chief Product Officer of Netflix. Within minutes he had sold me on the new subscription model they were rolling out the following month. Once I joined I didn’t look back, and in my 17 years at Netflix, I had the pleasure of participating in the war with and demise of Blockbuster, the rise of streaming video, the shift to the cloud, all the way through the global launch of Netflix streaming.
The Netflix DVD business and the problem with monoliths
When I joined Netflix, I started in the e-commerce space as an engineer for the DVD-by-mail business. It was the right place to start. During my tenure in e-commerce I learned internet service development, infrastructure, marketing, and engineering management. It prepared me for the next step in my career, streaming video. Looking back, our tech stack was quite primitive and fragile. We were using Active Server Pages and Oracle databases. We had one large “Store” database that was a real kitchen sink directly linked to logistics and finance databases. To make matters worse, we had a single, common web application (called Javaweb) which was deeply coupled to the Store DB via PL/SQL.
"It was fragile when I joined in 1999 and it got worse from there. It was clear to everyone that the architecture was not going to scale."
It’s all too common for companies to start with a monolith because it’s fast, simple, and you don’t have to worry about modularizing your code. However, for all of the reasons it’s easy to do at the beginning, it’s painful to get out of a monolithic architecture because you end up with a spaghetti application and no clear separation of concerns within the application stack. We were only running 10-100 Mbps network links in our data center back then, so network call duration was more of a factor. Going back and forth between the ASP application layer and the Oracle database took time, and it was noticeable if we were too chatty, even within our network. We had to start clubbing the calls together to make a single PL/SQL call into the database, do as much work as we could there, and then ship all the data back. It was a heavyweight process, and it was especially hard to debug. And, of course, there was also the challenge of vertically scaling a monolithic database. Every year we had to buy more expensive hardware to get through the end-of-year holidays.
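To make the chattiness problem concrete, here is a minimal sketch of the arithmetic behind batching. The latency numbers and function names are hypothetical, not the original ASP/PL/SQL code; the point is that the per-call round trip, not the server-side work, dominates a chatty design.

```python
# Assumed, illustrative latencies for a slow data-center link.
ROUND_TRIP_MS = 5   # network round trip per database call
QUERY_MS = 2        # server-side work per item

def chatty_total_ms(n_items):
    # One round trip per item: the network cost scales with n_items.
    return n_items * (ROUND_TRIP_MS + QUERY_MS)

def batched_total_ms(n_items):
    # One round trip total: all per-item work happens server-side
    # in a single PL/SQL call, then the results ship back at once.
    return ROUND_TRIP_MS + n_items * QUERY_MS

# For 50 items: 50 * 7 = 350 ms chatty vs. 5 + 100 = 105 ms batched.
```

The trade-off the interview mentions also shows up here: the batched version is faster, but all the logic pushed into the database is heavyweight and harder to debug.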
On the application side, we experienced the quintessential problem with monoliths. We had a slow-moving memory leak in our Javaweb session initialization code that left our engineering team dead in the water for over a week. We didn’t dare deploy new code on top of an unreliable code base and it was almost impossible to isolate the change that caused the leak. Too many changes had been deployed at the same time. We eventually discovered the issue through a lot of trial and error, but so much of this pain would have been remediated with a microservice architecture.
"Not only do microservices encourage a clean separation of functions but they scale horizontally. For this reason, and others, we eventually ended up moving to AWS and rebuilding much of our architecture from scratch."
But before we started our move to the cloud, we launched Netflix’s first streaming services which came with its own set of challenges.
Pivoting to streaming
In 2009, I was managing the e-commerce team and was asked to run the streaming services team (called Electronic Delivery at the time). Switching teams was like night and day. The e-commerce team was slower moving, relatively speaking, and the stakes around service availability were lower. In fact, at one point the entire DVD delivery service was down for 2 days and customers barely noticed because we were still shipping and receiving DVDs. We spent more time than you can imagine figuring out the easiest way for people to report problems with DVDs and shipments than we did on service reliability. Streaming was the polar opposite. When the entire experience is online from selection to consumption, you need to be a utility. If we were down with an outage for even a few minutes, that's a few minutes customers couldn’t use the service they were paying for, and they noticed.
So, I was in the critical path of this new part of the business. We needed to move at light speed with new device deployments and functionality, while making the service highly available, ASAP. I was definitely in the hot seat. In my first 2 weeks, I actually had a full-blown panic attack. I quickly realized I needed 20 people and I only had 6, the schedule wasn’t going to change, and on top of that, we had serious scaling and reliability issues. I went to my boss and shared the story of my meltdown with him. I told him that I wasn't the guy for the job and that he should probably send me back to e-commerce.
"He turned to me and said that it was because I was panicking that I was the right person for the job. I cared enough to worry about the success of the team and the company and I would find a way forward."
Not what I wanted to hear at the time but a pivotal moment in my career. And I did stay.
As we launched streaming, we ran into regular technical challenges. We had many outages of our core streaming services and of the API gateway that delivered the front-end UI experience. We had tremendous problems operating and scaling our new platform, and we also were not great at building out and running data centers. All of these factors led to a decision to move to the cloud.
In 2009, the combination of data center challenges, monolithic architecture, and international aspirations created a tipping point for Netflix leadership. A decision was made to move to AWS. So, as we were aggressively expanding our device footprint and launching internationally we chose to do something no other company had ever done - move a live, large scale service to the cloud without taking downtime while redesigning every aspect of the architecture.
To give you a little more color on the decision - we knew it was time to get out of the business of building data centers when we started to scale up and start streaming on new devices. Every time we would launch on a new device with an established user base, we would have tens of thousands of people signing in on the first day. We needed to be ready at the beginning and over-provision certain services to have the scaling and elasticity during these big burst launches. Moving to the cloud was a major advance for us because it forced an overhaul of our architecture from the monolithic data center to AWS. We were lucky our leaders had the vision and the nerve to embrace AWS before any company at scale had chosen to do so. There was a lot of internal consternation about the readiness of AWS and how to deconstruct our monolithic systems, but we made it work and the rest is history. It was worth the risk and the effort because it provided a reliable, scalable foundation for our global streaming service. It was painful in the short term but a well-placed strategic bet for the long term.
The importance of God metrics
In our early days as a streaming service, the first reliability challenges we ran into were observability and actionability. I recall a conversation with a senior engineering VP in 2009. He said that he wanted a dashboard that he could look at over his morning coffee that would tell him the health of our service. When you have incredibly detailed dashboards, it's difficult to look at them from the 30,000-foot view and understand how the whole service is working. What you really want is a single metric, if one exists, that acts as an overarching signal.
So, we selected stream starts per second or, as it quickly became known, SPS. This was our God metric and probably still is at Netflix today. It was calculated by tracking DRM license challenges. The license challenge was the last HTTP request made to the control plane before the video playback started. It’s amazing how powerful that one metric was from an availability perspective. Of course, we broke it out further by tracking permutations by AWS region, geographies, device types, player versions, etc.
The smooth, predictable diurnal traffic pattern for SPS (very few people watching at 4 a.m., traffic ramping up throughout the day and peaking at around 7-8 p.m.) lends itself to programmatic analysis. We were able to apply statistical algorithms like double exponential smoothing to detect traffic spikes or drops within minutes with high accuracy. We relied on this metric for years to alert operations teams of unexpected outages. So that was the foundation that underpinned operational metrics and dashboards going forward.
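As a hedged sketch of the idea (not Netflix's actual alerting code, and the parameters here are illustrative), double exponential smoothing tracks both the level and the trend of a metric like SPS; when an observation deviates sharply from the one-step forecast, you flag a spike or drop:

```python
# Holt (double exponential) smoothing anomaly sketch.
# alpha smooths the level, beta smooths the trend; both are assumptions.

def detect_anomalies(series, alpha=0.5, beta=0.3, threshold=0.25):
    """Return indices where the observed value deviates from the
    Holt one-step forecast by more than `threshold` (fractional)."""
    level, trend = series[0], series[1] - series[0]
    anomalies = []
    for i, x in enumerate(series[2:], start=2):
        forecast = level + trend          # one-step-ahead prediction
        if forecast > 0 and abs(x - forecast) / forecast > threshold:
            anomalies.append(i)           # spike or drop detected
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return anomalies

# A steady ramp followed by a sudden drop flags only the drop:
sps = [100, 110, 120, 130, 140, 150, 160, 170, 50]
# detect_anomalies(sps) → [8]
```

The appeal for a diurnal metric like SPS is that the trend term keeps the forecast tracking the daily ramp, so normal growth doesn't alarm while a sudden deviation does.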
So—my advice to technical teams just getting started with monitoring and alerting—pick a few foundational metrics and monitor them well.
“Never fail the same way twice”
I don’t know who coined this phrase, but I heard it for the first time from a guy I used to work with at Netflix and I love it: “never fail the same way twice”. I remember early on in my career (I think many of us have been there) when I was more lax about digging into root causes of production issues. If a problem went away on its own I might cross my fingers and hope it didn’t happen again. We’re all busy and it’s tempting.
"Over time, I realized that the stuff that breaks in production is the stuff that will likely break again in production. If you keep letting these things pile up, it will be the death of 1000 cuts and you will never have a reliable service."
Every outage requires a review, and every problem needs to be addressed in the short term (alleviate customer pain) and long term so that it never happens again. Ideally the long term solution is implemented as code and integrated in all relevant applications or services. It becomes a systemic fix. I think of this as a one-way ratcheting effect where you only move towards greater reliability. This philosophy works well but it is reactive. If you want to get proactive then consider embracing chaos.
The value of chaos engineering
Incidents typically happen at the worst possible time. During a surge in traffic, at night, or (if you have a global service) in the wee hours of the morning. And, naturally occurring production issues can take time to investigate and resolve. This is bad for end users and engineers alike. The solution is to break things in production under close observation.
"This may sound scary but, if you accept that things are going to break in production anyway, proactively inducing the same issue under controlled conditions is far better."
With appropriate defensive engineering practices, services are less likely to experience cascading failures. And chaos can be applied with escalating scope to reduce impacts on end users. For example, isolating dependencies and creating graceful degradation mechanisms is critical before you start inducing failures in production. You have to harden the dependent services first to make sure that when dependencies fail there are default responses (fallbacks) in place and thread pools are isolated so that one service dependency failure doesn’t block access to other healthy services.
You also need to know how something is going to break functionally before you understand how it's going to break at scale. We built a tool called the Fault Injection Test Framework (FIT). It allowed us to induce failures at the individual device level and then ramp up failures for a percentage of traffic. Once you’ve tested that, you can dial the percentage of traffic to 100% to see how your service handles a full production failure of a dependency.
For example, for our viewing history service, which serves up the list of movies previously seen by members, we changed the client library to return an empty list if VHS was down. Once in place we could take the entire service down without breaking the end user’s application experience.
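The percentage-ramp mechanism can be sketched like this. It is an illustrative approximation in the spirit of FIT, not Netflix's actual API: hashing the request ID deterministically assigns each request to a bucket, so the same requests always fall in the failure slice and the percentage can be dialed up from a single device's worth of traffic toward 100%:

```python
import zlib

def should_inject(request_id: str, failure_percent: float) -> bool:
    """Deterministically route `failure_percent` of traffic into the
    failure bucket by hashing the request ID into 100 buckets."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return bucket < failure_percent

def fetch_viewing_history(request_id, injected_percent=0.0):
    # Injection point: simulate the dependency failing for a slice
    # of traffic; callers should hit their fallback path here.
    if should_inject(request_id, injected_percent):
        raise RuntimeError("injected failure: viewing history unavailable")
    return ["placeholder-title-1", "placeholder-title-2"]  # fake data

# Ramp: at 0% no request fails; at 100% every request fails.
```

Because the bucketing is deterministic, a request that failed at 10% keeps failing as the percentage ramps, which makes it possible to follow one affected user session through the whole experiment.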
Incident management and the OODA Loop
If you are running a service at scale, you need people focused on incident management, and enough of them that you don’t burn them out. When there’s a service failure, they’re the first ones on the call, managing the incident, tracking progress, and following up to make sure the situation is remediated. They also track trends. If there are consistent issues across services, then a new engineering solution might address the problem systemically. This allows problems to be addressed holistically to achieve the ratcheting effect I mentioned earlier. Since most engineers are focused on their own domains and services, having people dedicated to incident management puts more eyes on the overall quality of the service.
"One of the main philosophies that formed the basis of our approach to operations engineering and incident response is the concept of the OODA loop, created by Colonel John Boyd, a military fighter pilot and instructor."
The OODA Loop stands for Observe, Orient, Decide, Act. Studying dogfights, Boyd came to realize that the pilots who could rapidly iterate through the loop - observing their surroundings, identifying friend vs. foe, making a decision about what to do next, and acting on that decision - were the ones most likely to succeed and come home alive. Now apply this to the operation of a production service. The same things are true. Engineers who can rapidly observe a problem (get an alert, see a trend on a dashboard), orient through investigation, zero in on a decision, and act will drive higher availability and overall operational quality of a service. And this cycle can happen over longer periods of time in the form of observing trends, understanding the source of those trends, defining an engineering solution, and implementing it as broadly as necessary to ensure that the problem is fully resolved.
Regardless of the timing, the faster you can iterate through the loop, the higher your availability will be. Automation and engineering are natural extensions of the loop because, even if your engineers are good at figuring out a problem the first time, they shouldn’t have to do so more than once. Machines are faster at the “observe” and “orient” phases, which leads to more rapid problem detection and a faster response. And, longer term, with substantial investment in statistical analysis and ML pattern matching, decision making and action can be automated as well. The fewer steps of the OODA loop that need to be performed by people, the better.
The paved road
At Netflix, the focus is on rapid innovation and execution on common goals, so the faster engineers can develop and deploy software the better. To enable this, we applied the concept of “the paved road”, a well-integrated tool chain of the most effective tools and technologies for a maximally efficient developer experience.
Paved roads also guide engineers towards the most effective practices derived from following the OODA loop. When I led Operations Engineering, the paved road that we supported consisted of the Ubuntu OS, Java as the primary programming language (but others emerged quickly), Docker, Jenkins, Spinnaker (Netflix’s open source multi-cloud continuous delivery tool), etc. This is not a one-and-done process. As technologies are adopted by various teams, the centralized teams need to continuously assess what gets folded into the paved road and what does not. The lens that we applied when making these decisions was based on what investments would drive “velocity with confidence”, a phrase I picked up from another talented colleague. How do we allow engineers to develop and deploy software as quickly as possible without breaking things?
At Netflix, we believed that rapid innovation, development, and delivery of product features is how you get and stay ahead of your competition and continuously grow the business."
Reliability is the biggest roadblock and distraction as you try to achieve this. We were able to easily correlate our rate of change to production with outages. So the more you invest in mechanisms that protect reliability during the delivery process, the more you can free folks up to focus on innovations that drive business value. I learned this and so much more during my time at Netflix. I thought I would apply those lessons at larger companies post-Netflix, but I ultimately found myself at an early-stage startup called Datazoom, the first video data platform.
From large company to small startup at Datazoom
By the end of 2016 I had left Netflix and was taking some time off. I first heard about Datazoom from Diane Strutner, the CEO, in the summer of 2017 when she reached out to me about an exciting idea within the streaming space. I was intrigued by the idea so we met up for a beer. Diane was previously the VP of sales and business development at a streaming video analytics company called Nice People at Work (NPAW), which sold a product called Youbora. She identified significant gaps in that offering: it computed streaming video QoE metrics at the collection point (the player/app) rather than collecting raw events and metadata, which is far more powerful and flexible.
In addition, there’s an entire workflow of services required to deliver streaming video, spanning players, CDNs (content delivery networks), origin services, packaging, encoding, etc. But almost all the focus to date has been on collecting QoE metrics from the application and player. There is also a wealth of QoS data collected by third-party service providers, but those logs can be hard for content publishers to access and leverage.
Diane knew that she could enable better streaming experiences, not only by collecting raw player telemetry, but collecting relevant logs from all of the services engaged in the streaming video ecosystem.
"The Datazoom vision is to collect telemetry from all of the stages across the workflow. And, this multi-sourced data can be correlated using common identifiers, which can be propagated through the workflow during streaming."
If you add in standardization of the data for each node in the workflow (players, CDNs, origins, etc.) and identifier propagation with each content object request, you can build a universal video telemetry translator that collects, standardizes, and delivers correlated data to a variety of analytics tools, enabling the richest possible analysis of end user experiences.
Today, we have a player data dictionary that standardizes all the telemetry across players and, more recently, a CDN data dictionary, which leverages CDN log streaming mechanisms. We’re working our way up the stack from there. Datazoom allows content publishers to see how their CDN is performing and, if they want, they could even build their own CDN switching algorithms based on our data.
When I first met Diane in 2017, I loved the idea behind Datazoom, but I wasn’t ready to join an early startup. I hadn’t been a hands-on engineer or architect for many years, so I thought it was a bad fit. I was looking for a company that was farther along, one that needed organizational and principle-based leadership skills more than hands-on technical skills. However, I did join as an advisor and angel investor within six months of meeting Diane because I believed in the foundational concept and I saw Diane’s potential as an emerging industry leader. Then at the end of 2019, just as Diane was working on Datazoom’s first round of seed funding, the CTO at the time had to step down for personal reasons. At first, Diane and I discussed an acting CTO role to get the company through fundraising. I would then help Diane find another CTO. However, I quickly realized that investors needed to see a committed CTO or the funding would never happen. So I joined Datazoom. It was like falling into a gravity well. There was no escaping it.
Datazoom and continued learning
It’s important to continue learning and Datazoom has given me that opportunity at this later stage in my career. For example, something I've changed my mind about since coming to Datazoom is the value of service-level load testing. At Netflix, since everything happened at scale, we invented ways to throttle traffic and launch incrementally, and there was very little investment in simulated service-level load tests. On the other hand, as an early startup not yet at massive scale, we have to over-provision capacity for stateful services like Kafka to deal with traffic spikes, which makes it hard to figure out our baseline cost of goods for service pricing or where our performance bottlenecks are.
We’ve discovered that, at this stage of our business, just like chaos testing, we have to do load testing to understand bottlenecks before we encounter them with live traffic. For example, even if we have additional Kafka capacity, streaming video is famous for unpredictable traffic spikes due to content launches, breaking news, etc. We started doing load tests in production about 6 months ago and we’ve learned a ton. It has been a big eye-opener for the team providing meaningful insight into service vulnerabilities and opportunities to improve efficiency.
Future excitement at Datazoom and leaving a legacy
There are many different things I’m looking forward to now that I’m at Datazoom. One of the biggest is bringing integrated, correlated, standardized video telemetry to the streaming industry and building a successful business on that concept. It is what got me interested in Datazoom from the beginning. And there’s nothing like the feeling when you have the wind at your back as the company starts to catch air. We just hired a sales team, we know our product is ready for the market, and the market is interested in what we’re doing. We’re in the just-add-sales phase, so this is the year we’re going to get traction.
Another thing I am looking forward to is building a business with great people. I’m so lucky to work with people that I appreciate personally and respect professionally. It makes it worthwhile to get up in the morning and go to work. Datazoom is my legacy job. I essentially came out of retirement to do this and it's a final opportunity to leave my mark. I want to do this well and leave something lasting behind. I want to make these people and this company successful. Whether that means watching coworkers grow from their experience and do great things over time or for the company to live on long after I’ve moved on, I believe in this team and vision. At this point in my career, it's all about the purpose and people.
I love video games and they are a regular hobby of mine. My favorite games are first-person shooters. I've been playing video games for as long as I’ve been in tech, starting with the original Castle Wolfenstein and Doom PC games. More recent favorites are the Halo series, Bioshock, Gears of War, Dead Space, and the recent Doom series reboot. Luckily, my wife has accepted the fact that, in this respect, I'm a stunted teenager at heart.