
Incidentally #014

Welcome to Incidentally, Allma’s publication. We interview engineering and reliability leaders, founders, and makers on the secrets and tools they use to scale their systems and teams. We share incident collaboration stories and learnings, in solidarity and with openness. Our goal is to be actionable enough for you to implement and high-level enough to apply broadly; your feedback is always welcome.

Former Head of Netflix Operations Engineering and Playback Services on the birth of streaming, reliability best practices, and being the CTO of Datazoom - Part 1

Allma is proud to share this special 2-part Incidentally featuring the journey and advice of Josh Evans, former Director of Operations Engineering and Playback Services at Netflix and current CTO of Datazoom. Read on for Josh’s experience getting to Netflix, taking up the helm of streaming services, and the trials and tribulations he faced along the way.

An accidental career choice

Hi, I’m Josh Evans. When I was young, despite a love of science fiction, I had no intention of working with computers or technology. My father was a mainframe programmer and frequently railed about the daily grind, largely working for insurance companies and banks. So I went to school with no intention of getting a computer science degree. Instead, pragmatist that I was at the time, I ended up following a childhood passion and got an art degree with a focus on printmaking. While I was in school, I was working part-time as a waiter and a house cleaner and I hated both jobs. My girlfriend at the time was a nanny for a guy who worked at Borland, a software company in Scotts Valley that made developer and business software, directly competing with Microsoft. I was able to get a job there as the nighttime 800-line operator and security guard. Once I graduated, I realized that I didn’t want to be a starving artist and the job market for art teachers was pretty challenging. I had been tinkering with Borland’s utility products (Sidekick personal assistant, Sprint Word Processor, Superkey, etc), so I literally walked across the hall and applied for a utilities tech support job. Within a year of doing tech support, I started coding, and that’s when I discovered how fun it was to write software and create programs. 

"I was hooked and couldn’t stop myself from writing code even if I wanted to. I was so lucky to be in the right place at the right time."

Borland was a great place for me to cut my teeth in tech because there were so many smart, talented people that I was able to learn from. The people I worked with in tech support were as good as many of the top engineers I know today, helping people write code over the phone and learning the tools inside and out. Without a computer science background, I felt the usual imposter syndrome but I was able to work through that pretty quickly with the support of my coworkers, a lot of reading, and a lot of coding.

After Borland, I worked at a small startup that no one has ever heard of. My boss was a former manager and friend from Borland. In ‘98 he told me he was going to this company called Netflix that delivered DVD rentals by mail. 

"I can still remember thinking that it was the most incredibly stupid idea I had ever heard of. Who the hell was going to wait a week for a DVD to show up instead of going to Blockbuster?"

A year later I was sitting across a picnic table from Neil Hunt, Chief Product Officer of Netflix. Within minutes he had sold me on the new subscription model they were rolling out the following month. Once I joined I didn’t look back, and in my 17 years at Netflix I had the pleasure of participating in the war with, and demise of, Blockbuster, the rise of streaming video, and the shift to the cloud, all the way through the global launch of Netflix streaming.

The Netflix DVD business and the problem with monoliths

When I joined Netflix, I started in the e-commerce space as an engineer for the DVD-by-mail business. It was the right place to start. During my tenure in e-commerce I learned internet service development, infrastructure, e-commerce, marketing, and engineering management. It prepared me for the next step in my career: streaming video. Looking back, our tech stack was quite primitive and fragile. We were using Active Server Pages and Oracle databases. We had one large “Store” database that was a real kitchen sink, directly linked to logistics and finance databases. To make matters worse, we had a single, common web application (called Javaweb) that was deeply coupled to the Store DB via PL/SQL.

"It was fragile when I joined in 1999 and it got worse from there. It was clear to everyone that the architecture was not going to scale."

It’s all too common for companies to start with a monolith because it’s fast, simple, and you don’t have to worry about modularizing your code. However, for all of the reasons it’s easy to do at the beginning, it’s painful to get out of a monolithic architecture because you end up with a spaghetti application and no clear separation of concerns within the application stack. We were only running 10-100 Mbps network links in our data center back then, so network call duration was more of a factor. Going back and forth between the ASP application layer and the Oracle database took time, and it was noticeable if we were too chatty, even within our own network. We had to start batching the calls together to make a single PL/SQL call into the database, do as much work as we could there, and then ship all the data back. It was a heavyweight process, and it was especially hard to debug. And, of course, there was also the challenge of vertically scaling a monolithic database. Every year we had to buy more expensive hardware to get through the end-of-year holidays.
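
To make the trade-off concrete, here is a minimal sketch of the chatty-versus-batched pattern described above. It is illustrative only: the original stack was ASP and PL/SQL, the table, function, and procedure names (movies, get_titles_chatty, queue_pkg.build_queue_page) are hypothetical, and Python with a PEP 249-style Oracle driver such as cx_Oracle simply stands in for the application layer.

```python
# Illustrative sketch only: hypothetical schema and PL/SQL names,
# Python standing in for the original ASP + PL/SQL application layer.

def get_titles_chatty(conn, movie_ids):
    """N round trips: one query per movie, painful on a 10-100 Mbps link."""
    cur = conn.cursor()
    rows = []
    for movie_id in movie_ids:
        cur.execute(
            "SELECT title, rating FROM movies WHERE movie_id = :id",
            {"id": movie_id},
        )
        rows.append(cur.fetchone())
    cur.close()
    return rows


def get_queue_page_batched(conn, customer_id):
    """One round trip: a (hypothetical) PL/SQL function assembles the whole
    page server-side and returns it in a single call."""
    cur = conn.cursor()
    try:
        return cur.callfunc("queue_pkg.build_queue_page", str, [customer_id])
    finally:
        cur.close()
```

The batched form trades per-call visibility, part of why it was so hard to debug, for far fewer network round trips.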

On the application side, we experienced the quintessential problem with monoliths. We had a slow-moving memory leak in our Javaweb session initialization code that left our engineering team dead in the water for over a week. We didn’t dare deploy new code on top of an unreliable code base, and it was almost impossible to isolate the change that caused the leak because too many changes had been deployed at the same time. We eventually discovered the issue through a lot of trial and error, but so much of this pain would have been avoided with a microservice architecture.

"Not only do microservices encourage a clean separation of functions, but they also scale horizontally. For this reason, and others, we eventually ended up moving to AWS and rebuilding much of our architecture from scratch."

But before we started our move to the cloud, we launched Netflix’s first streaming service, which came with its own set of challenges.

Pivoting to streaming

In 2009, I was managing the e-commerce team and was asked to run the streaming services team (called Electronic Delivery at the time). Switching teams was like night and day. The e-commerce team was slower moving, relatively speaking, and the stakes around service availability were lower. In fact, at one point the entire DVD delivery service was down for 2 days and customers barely noticed because we were still shipping and receiving DVDs. We spent more time than you can imagine figuring out the easiest way for people to report problems with DVDs and shipments than we did on service reliability. Streaming was the polar opposite. When the entire experience is online, from selection to consumption, you need to be a utility. If we were down with an outage for even a few minutes, that was a few minutes customers couldn’t use the service they were paying for, and they noticed.

So, I was in the critical path of this new part of the business. We needed to move at light speed on new device deployments and functionality while making the service highly available, ASAP. I was definitely in the hot seat. In my first 2 weeks, I actually had a full-blown panic attack. I quickly realized I needed 20 people and I only had 6, the schedule wasn’t going to change, and on top of that, we had serious scaling and reliability issues. I went to my boss and shared the story of my meltdown with him. I told him that I wasn't the guy for the job and that he should probably send me back to e-commerce.

"He turned to me and said that it was because I was panicking that I was the right person for the job. I cared enough to worry about the success of the team and the company and I would find a way forward."

Not what I wanted to hear at the time, but a pivotal moment in my career. And I did stay.

As we launched streaming, we ran into regular technical challenges. We had many outages of our core streaming services and of the API gateway that delivered the front-end UI experience. We had tremendous problems operating and scaling our new platform, and we also were not great at building out and running data centers. All of these factors led to the decision to move to the cloud.

Cloud migration

In 2009, the combination of data center challenges, monolithic architecture, and international aspirations created a tipping point for Netflix leadership. A decision was made to move to AWS. So, as we were aggressively expanding our device footprint and launching internationally, we chose to do something no other company had ever done: move a live, large-scale service to the cloud without taking downtime, while redesigning every aspect of the architecture.

To give you a little more color on the decision: we knew it was time to get out of the business of building data centers when we started to scale up and stream on new devices. Every time we launched on a new device with an established user base, we would have tens of thousands of people signing in on the first day. We needed to be ready from the start and over-provision certain services to have the scale and elasticity these big burst launches demanded. Moving to the cloud was a major advance for us because it forced an overhaul of our architecture, from the monolithic data center to AWS. We were lucky our leaders had the vision and the nerve to embrace AWS before any company at scale had chosen to do so. There was a lot of internal consternation about the readiness of AWS and how to deconstruct our monolithic systems, but we made it work and the rest is history. It was worth the risk and the effort because it provided a reliable, scalable foundation for our global streaming service. It was painful in the short term but a well-placed strategic bet for the long term.

The importance of god metrics 

In our early days as a streaming service, the first reliability challenges we ran into were observability and actionability. I recall a conversation with a senior engineering VP in 2009. He said that he wanted a dashboard he could look at over his morning coffee that would tell him the health of our service. When you have incredibly detailed dashboards, it’s difficult to step back to the 30,000-foot view and understand how the whole service is working. What you really want is a single metric, if one exists, that acts as an overarching signal.

So, we selected stream starts per second or, as it quickly became known, SPS. This was our God metric and probably still is at Netflix today. It was calculated by tracking DRM license challenges: the license challenge was the last HTTP request made to the control plane before video playback started. It’s amazing how powerful that one metric was from an availability perspective. Of course, we broke it out further by tracking permutations by AWS region, geography, device type, player version, etc.
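
As a rough illustration of how an SPS-style series can be derived from request logs, here is a small sketch that counts license-challenge requests per second, broken out by a couple of dimensions. The event fields and endpoint path below are hypothetical, not Netflix’s actual schema.

```python
# Hypothetical sketch: derive an SPS-style counter from request-log events.
# Field names ("ts", "path", "region", "device") are illustrative only.
from collections import Counter


def sps_by_dimension(events):
    """Count license-challenge requests per (second, region, device).

    events: iterable of dicts such as
        {"ts": 1627500000, "path": "/license", "region": "us-east-1", "device": "ps4"}
    """
    counts = Counter()
    for event in events:
        if event["path"] == "/license":  # last control-plane call before playback
            counts[(event["ts"], event["region"], event["device"])] += 1
    return counts
```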

The smooth, predictable diurnal traffic pattern for SPS (very few people are watching at 4 am, traffic ramps up throughout the day, and it peaks at around 7-8 pm) lends itself to programmatic analysis. We were able to apply statistical algorithms like double exponential smoothing to detect traffic spikes or drops within minutes with high accuracy. We relied on this metric for years to alert operations teams to unexpected outages. That was the foundation that underpinned our operational metrics and dashboards going forward.
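
Here is a minimal sketch of that kind of alerting, assuming Holt’s double exponential smoothing with a residual-based threshold; the parameters are illustrative placeholders, not Netflix’s production values.

```python
# Minimal sketch: Holt's double exponential smoothing plus a residual-based
# alert threshold. All defaults below are illustrative, not tuned values.

def detect_sps_anomalies(series, alpha=0.5, beta=0.1, threshold=4.0, warmup=10):
    """Yield (index, value, forecast) where SPS deviates sharply from forecast."""
    level, trend = float(series[0]), 0.0
    mean, m2, n = 0.0, 0.0, 0  # running stats of forecast residuals (Welford)
    for i in range(1, len(series)):
        x = float(series[i])
        forecast = level + trend            # one-step-ahead prediction
        resid = x - forecast
        if n >= warmup:
            std = (m2 / (n - 1)) ** 0.5
            if std > 0 and abs(resid - mean) > threshold * std:
                yield i, x, forecast        # spike or drop relative to forecast
        # update running residual statistics (Welford's online algorithm)
        n += 1
        delta = resid - mean
        mean += delta / n
        m2 += delta * (resid - mean)
        # Holt's update: smooth the level, then the trend
        new_level = alpha * x + (1 - alpha) * forecast
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
```

Running something like list(detect_sps_anomalies(sps_per_minute)) over a per-minute SPS series would flag points where traffic diverges sharply from the one-step-ahead forecast.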

So, my advice to technical teams just getting started with monitoring and alerting: pick a few foundational metrics and monitor them well.

Tune back in Thursday, July 29th for the rest of Josh Evans’ reliability advice, his journey to Datazoom, and the legacy he hopes to leave in tech.

Recommendations

I love video games and they are a regular hobby of mine. My favorite games are first-person shooters. I've been playing video games for as long as I’ve been in tech, starting with the original Castle Wolfenstein and Doom PC games. More recent favorites are the Halo series, Bioshock, Gears of War, Dead Space, and the recent Doom series reboot. Luckily, my wife has accepted the fact that, in this respect, I'm a stunted teenager at heart.


Josh Evans

Josh Evans is the CTO of Datazoom.

Continue the conversation

join the Allma Discord community

