Allma


Incidentally #016

Welcome to Incidentally, Allma’s publication. We interview engineering and reliability leaders, founders, and makers on the secrets and tools they use to scale their systems and teams. We share incident collaboration stories and learnings, in solidarity and with openness. Our goal is to be actionable enough for you to implement and high-level enough to apply broadly; your feedback is always welcome.

The VP of Engineering at Justworks on security at the New York Times and solving incidents in the B2B space

From early tech to Condé Nast

Hi, I’m Yujin Kim. I was born in the States, but I grew up in Korea. I came back to the U.S. for college and, like most folks in my generation, I grew up with access to computers as I got my liberal arts degree. After I graduated, I didn’t know which direction to go in with my degree, so I ended up going to engineering school for industrial engineering. I was working as a research assistant when I got the chance to work with UNIX computers and write in C. I was primarily self-taught, and my enjoyment working with these systems led me to seek a career in coding and tech.

I was lucky because I was looking for a job in engineering before the dot-com bubble, when there was a shortage of engineers. It was a brand new thing in most industries. I started in Washington D.C. at a data company that wanted to utilize a brand new technology at the time called “Java”. They asked who would be willing to learn it, and I immediately volunteered. I got to follow the evolution of Java as it started to encompass all things throughout the decade, so I became comfortable with various areas of computer science: algorithms, data structures, etc. In 2007, I came to New York to work in the technology arm of Condé Nast when magazines were still big and glamorous and there was a newsstand on every corner. As a member of the tech team, I worked to manage the digital assets that were owned by the company, like Reddit, and to build a web and technology experience around our different publications. It was an amazing group of people to work with, and most of them have gone on to be CTOs of different publications that are still popular today. I was fortunate to be surrounded by all of these excellent mentors who helped me gain proper leadership experience throughout the process. One of my first leadership positions was running infrastructure and platform at the New York Times.

Security threats at the New York Times

At the New York Times, I was in charge of building out all of the APIs and the things below them, like our data centers, cloud, and printing systems, to make sure we could efficiently scale up the different pieces of our backend. While we were developing things that were customer-facing, like mobile applications, behind the scenes we were dealing with cloud migrations and high-risk targeted attacks. Because of the nature of the New York Times, we got targeted all the time by people who wanted to take us out. We had a CISO who had a background in the dark web and was a great hacker, so he taught me how to combat bad actors who were trying to shut us down from the inside. One time our team ended up on an 8-hour call with CloudFlare (still an up-and-coming security firm at the time), Google, and 3-4 other companies to triage an attack we were dealing with. It was amazing to see that people cared, wanted to make sure we stayed up, and were willing to jump in and help.

The information we provided to people was important and our availability affected people’s livelihoods.

I was working there when the Boston bombing happened. We experienced 200 times the normal traffic, and we had to make sure our systems could handle the scale of big events so people stayed informed.

Scaling and learning from mistakes

The best way to understand how to scale when you experience heavy traffic at somewhat random times is to always be learning from your mistakes. In any content-driven site, you learn to cache aggressively so that the cache offsets most of the load and you can handle X amount of traffic based on the traffic-to-caching ratio. This doesn’t happen overnight and is driven by past learnings from mistakes and incidents, so you need a way to understand what went wrong and why, and to go back and look at these mistakes again. We built out a robust system at the New York Times to respond to critical issues, closely watch and record them, and then resolve them.
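As a rough editorial illustration of the traffic-to-caching ratio mentioned above (not a description of the New York Times' actual systems), the arithmetic can be sketched in a few lines of Python. The function names, the baseline of 1,000 requests per second, and the hit ratios are all hypothetical; only the 200x spike comes from the interview.

```python
def origin_rps(total_rps: float, cache_hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the origin servers."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_rps * (1.0 - cache_hit_ratio)


def max_total_rps(origin_capacity_rps: float, cache_hit_ratio: float) -> float:
    """Largest total traffic the site can absorb before the origin saturates."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - cache_hit_ratio
    if miss_ratio == 0.0:
        # Everything is served from cache, so the origin is never the limit.
        return float("inf")
    return origin_capacity_rps / miss_ratio


if __name__ == "__main__":
    baseline_rps = 1_000              # hypothetical normal traffic
    spike_rps = baseline_rps * 200    # a 200x spike, as in the interview

    for hit_ratio in (0.90, 0.95, 0.99):  # hypothetical cache hit ratios
        load = origin_rps(spike_rps, hit_ratio)
        print(f"hit ratio {hit_ratio:.0%}: origin sees about {load:,.0f} rps")
```

Under these assumptions, moving the hit ratio from 95% to 99% cuts the load reaching the origin during the spike by a factor of five, which is why small improvements in cacheability matter so much for content-driven sites.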

You need an established system so you don’t leave rocks unturned and deal with the same issues over and over again with no way to learn from them.

Moving to smaller startups

One of the biggest changes for me going from a big, well-established company to earlier-stage startups like WorkMarket was building up knowledge within our team. The New York Times is well known for great talent and had many senior engineers working in the organization. When you move to a startup, you need time to build up the muscle. At a young company, you have less-experienced engineers coming in and you need to build workflows and best practices. 

It’s not about the quality of the code or the work, it’s about training, learning to act when there are incidents, and establishing an incident management system to optimize for understanding the issue.

It's one of the things you take for granted when you work in a bigger company. However, as a leader, you have the opportunity to help other people learn something based on your experience. Instead of a lecture, you go on the journey together to experiment, try different vendors and apps, and communicate to find what clicks for your team. Working up to self-sufficiency is an amazing feeling as we move forward through problems and learn, and that growth has been one of my favorite parts of working for smaller companies. The startups I worked in were also new to me because they weren't as customer-facing, which came with its own set of problems.

Working through incidents in the B2B space

One nice thing about working in B2B companies that can also be your biggest Achilles’ heel is that everything happens around business hours. You don’t have to be on 24/7, but all incidents seem to happen at the time of peak traffic. I had to adjust to this change and figure out how to manage disruptions without disrupting business. We needed to establish a seamless process to grab the necessary resources to triage issues while allowing business to continue as normally as possible. It’s a different angle to work from because you have every client calling you to figure out what is going on. It gives you a different kind of rigor. This is similar to how things operate for SaaS platforms: you don’t have the luxury of a cache layer to cover up an issue, because you need to serve the dynamic content. It’s a different vehicle or architecture to think about from the incident management perspective.

Focusing on the human element

When it comes to dealing with incidents, I’ve noticed a trend of moving away from the human element of understanding issues. Now that computing is cheap and there’s more processing power to handle issues in real time, people rely on tools to provide the source of truth and the root of the problem. Because the volume of data we can track now is so large, we have to get through more voices, signals, and streams of data before we can determine what’s happening. It takes longer and longer each time, even though there are so many new, well-intentioned machine learning tools to read through it all.

Not everything is a binary, yes/no decision, and we like to lean on these binary decisions without the contextual boundaries around them.

The context and the decision go hand in hand.

Understanding the foundations of DevOps and SRE

Just as context matters in decision-making when it comes to new tooling, my advice to those coming into SRE and DevOps is to be strong in the fundamentals. I might sound old-fashioned, but it is essential to understand the larger picture. Nowadays, we have containers, Kubernetes, AWS, and other abstracted solutions, and it’s possible to miss the foundation of what makes these things work. I used to spend days and nights hacking on the Linux kernel. I learned all the system programming, including how context switches and threading worked.

Developing contextual awareness and spending time at the lower levels of languages and systems makes it easier to build on them and grow as an engineer.

If I have an SRE who understands threat modeling and therefore can understand why something in the CPU spikes, it’s a different level of diagnosis than an engineer who relies on tooling. For those younger professionals who are new to the industry, sometimes having random nights where you end up staying up until 2 or 3 in the morning hacking on things is a great exercise that will set you up for success later on in your career.

Leading a team: the 5 tiers of success

There are 5 connected tiers of development that I've seen set teams up for success and growth. They are:

  1. Understanding. Ensure your team understands the company's objective and mission when developing strategies to solve problems. We are all different human beings, so we hear different things from the same message. Actively orienting your team around the core of the company means that you can align your collective thinking around common values.
  2. Structure. There’s no such thing as a perfect organizational structure. It is about having 1 consistent structure built around the same mission and strategy that makes sense for your company. If you overhaul your company's hierarchy every quarter, it's harder for employees to form good habits.
  3. Talent and teams. Always ask yourself if your teams are set up with the best people in leadership positions so everyone is managed effectively. 
  4. Culture. Establish the glue that brings people together. Different cultures manifest from different structures, for example being goal-driven versus being mission-driven. Have a shared awareness of what connects everyone.
  5. Technology. Finally, you add in the technology side of things. Instead of choosing a technology for the sake of technology, look through different options and find one that works best with the factors above. For example, it’s not worth using Kubernetes just because everyone uses Kubernetes. Establish the company values and fit the tech to them instead of the other way around.

Personal and Technological Leadership

When I think of leadership, I see it dividing into 2 areas: there’s the personal development of being a leader, and there’s staying up to date with modern technology. Over time, you get weaker with the second one; you’re not going to pick up an entirely new skillset. But you can always follow different platforms to understand why certain things are becoming more popular and what tech is changing the game. For the former, having people to keep close to you as your own “board of directors” and mentors has helped me as I’ve grown as an individual and a leader. It’s about being intentional about meeting with them regularly so you can ask specific questions and get advice. I have a couple of people who operate on the business side that I’ve kept in touch with and talk to all the time. They have been instrumental in my career, and there are always learning opportunities when you talk with peers. There’s also the holistic aspect of keeping in touch with the world, like knowing the new Pokémon updates, but that’s more supplemental.

Recommendations

I have a four-year-old son, so when I’m not doing work most of my time is spent playing with him. Whatever he’s playing, I’m playing. We’ve recently gotten into 50-piece puzzles because they show him how to think about problems and connections. We’ve also been showing him how to use dice to understand numbers and how to add the numbers that show up. Dice and dominoes have been great for that!

When it comes to books, my favorite is called The Alchemist by Paulo Coelho. It’s not an engineering book at all, but I’m a big fan of the proposed arguments in the book. It sounds simple, but there’s a lot of deeper human psychology behind it and a spiritual angle. I read it every 3ish years as a nice refresher and it is beautifully written. 


Yujin Kim

Yujin Kim is the VP of Engineering at Justworks.


allma, inc © 2021