Welcome to Incidentally, Allma’s publication. We interview engineering and reliability leaders, founders, and makers on the secrets and tools they use to scale their systems and teams. We share incident collaboration stories and learnings, in solidarity and with openness. Our goal is to be actionable enough for you to implement and high-level enough to apply broadly; your feedback is always welcome.
Wayfair’s Head of Commercial Product Engineering on his Journey from Banking to Amazon and Beyond
Freedom Dumlao is the new VP of Commercial Product Engineering at Wayfair, after a long history of engineering at companies like Amazon and at startups alike. Having started as a principal engineer and quickly moved up to Director and VP of Engineering, Freedom is the person to talk to about navigating rapid career development. Luckily, Allma had the chance to sit down with him to talk about his new position at Wayfair, some of the best reliability practices from Amazon, and how taking ownership of your projects can change your career.
My Origin Story
Hi, I am Freedom Dumlao. I took what some might say is an unusual path to get to where I am today. Unlike most of my peers, I don’t have a degree - I wasn’t able to afford school, so my learning was rooted in books I bought at Borders with coupons or borrowed from the library. I was fortunate to get a head start at about 5 years old when my mom won a computer (a Panasonic JR-200U) at a bar trivia night. You couldn’t really do much on it because it didn’t have a disk drive, so all I could do was program games. For my 1 hour of allotted computer time, I would spend 50 minutes of it trying to key in a game listing that came in a little handbook and only about 10 minutes actually playing. As I got better at typing these programs, I realized I could change the game to make it easier, like making myself move faster or making it so my bullets would blow up the evil robots. This was where I really started programming.
From that point on, whenever I had access to a computer, I looked for how I could program it to make it do something other than what was on the screen at that time.
I got my first role after doing a lot of contract software engineering for many different groups, from creating websites to building rap applications. At this time I was also working as a branch manager for Sovereign Bank, but I wanted to get out of that to do software engineering full time. Then a new engineering role opened up within the bank, so I interviewed and transitioned to that job within a month of applying, starting my journey as a full-time software engineer. I started out working in Progress 4GL and .NET and grew from there. After moving up to Boston to continue working within banking, I started getting calls from recruiters about how cool it was to work in other software positions for startups and such - I quickly left my bank job for a better paying, cooler startup job and then grew from there.
From Everyday Engineer to VP of Engineering
My career growth came mainly from 2 places. The first was advice I got from one of my mentors, Craig Daniel, who is now VP of Product at Drift. He told me that strong engineers are far more useful when they act as force multipliers, adding an extra 10-20% to the work of mid-level engineers, than when they try to get everything done on their own.
I was skeptical at first, but once I started doing it, I started to see the impact of letting other people take ownership of the different pieces and how we could get so much more done on a product.
The second thing that’s contributed to my career growth has been ownership. I tend to see gaps in an organization, process, or plan and then look for who owns that gap and whose job it is to fill it. If I can’t find the answer, I own it myself until a better owner comes along. I think that has led to rapid career progression, because owning something gives me the opportunity to influence the deliverable and have an impact. This has given me a lot of experience in a short amount of time, so my advice to anyone who’s trying to figure out how to level up and grow is to find those gaps and assume ownership of those areas because nobody’s going to push back on you. However, if you do decide to own something, you have to follow up; you can’t take ownership and then not deliver, because that’s worse than leaving it alone.
Reliability: Top 3 Things for a Solid Foundation
Measurement. You have to spell out and measure the key things related to what you’re building. I’ve found that it’s best to build a dashboard for each one, so that you can see at a glance how things are performing now and compare that with how they have performed historically. If you don’t measure, it’s going to be nearly impossible to troubleshoot or to see the impact of changes on the system and its behaviour.
Testing. The second most important thing is being engaged in regular, active, automated, and hands-on testing, depending on your systems. There’s a myth that engineers can’t test their own code due to bias and closeness to the code itself, but I think this is completely untrue. If I hired an engineer to build me a bridge, they would be the first one to cross it after it was constructed. The same applies to software; the engineer should be capable of testing and producing code that is (as far as is reasonable) free of bugs. Ideally, you get other testers and quality engineers involved too, but having additional eyes does not absolve the engineer from testing it themselves.
Planning. Lastly, you have to have a plan for when there are issues, because there is no system in the world that won’t have failures at some point. Both measurement and testing directly influence this, which is another reason they are so important. This is why playbooks and runbooks are excellent: anyone can see what to do and what to look at in order to figure out an adequate response to a problem. Then, once we resolve the issue, we have set “next steps” like an RCA, a post mortem, a retro, etc. This last piece is critical because even though the issue is fixed, it doesn’t mean the team is done. This is where the actual learning comes in, which is what you take along with you.
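To make the measurement point a little more concrete, here is a minimal sketch of comparing a metric's latest reading with its own history, which is the kind of at-a-glance check a dashboard provides. The function name, the 20% tolerance, and the example latency numbers are all hypothetical choices for illustration, not anything used at Amazon or Wayfair.

```python
from statistics import mean


def compare_to_history(history, current, tolerance=0.2):
    """Classify the latest reading of a metric against its historical average.

    history   -- past readings of one metric (for example, p95 latency in ms)
    current   -- the most recent reading
    tolerance -- relative increase allowed before flagging (hypothetical default)
    """
    if not history:
        # Without past measurements there is nothing to compare against.
        return "no baseline"
    baseline = mean(history)
    if baseline == 0:
        return "ok" if current == 0 else "regressed"
    change = (current - baseline) / baseline
    return "regressed" if change > tolerance else "ok"


# Example: latency after a deploy compared with the previous three readings.
status = compare_to_history([100, 110, 90], 150)
```

The sketch assumes that higher values are worse, as with latency or error counts. A metric where higher is better, such as throughput, would need the comparison flipped.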
Looking at the Little Guy: Starting Reliability at the Beginning of your Journey
The best way to implement reliability practices as a newer company is to make a hard commitment to always complete an RCA or post mortem for an event that surpasses a critical level. Even if the answer or fix seems obvious, you have to build the discipline to follow through and document it.
Having someone to own the process, designated as the Captain or Retro Manager, helps to ensure that there is a definition of what a completed post mortem process looks like and that all of those that follow meet this standard.
At Amazon, this was a very detailed, rich, and deeply methodical process that will not be right for every company, but just starting out with a basic definition of what should be in a post mortem makes all the difference. The converse is also true; sometimes you have unexpected technological wins, where you ship something and it blows you away with what it’s capable of. This is another great time to do a post mortem. It’s easy to celebrate, but it's still important to understand why it worked that well, so you can hopefully replicate it and grow from it. That wasn’t necessarily something we did at Amazon, but it’s something that I try to do whenever I can convince people to do it.
The number one thing I look for in a root cause analysis that involves any aspect of human error is whether there were any processes in place to prevent that person from making the error in the first place.
This may be quite a contrived example, but say I set up an online store and one of my customers accidentally enters another person’s credit card number and it goes through. While the human error was there, where was the process that prevented that person from making that mistake? Where was the process to validate that credit card number? So the analysis comes down to where the breakdown was and what we can do to prevent that human error from creeping in again. To me, human errors always have some process behind them unless there's malicious intent, in which case you can state that there was malicious intent involved, and then that person doesn't work there anymore.
Correction of Error and Compounding Learning
One thing that Amazon has that a lot of companies do not is the depth of its experience: it has over a decade of learning how to manage incidents and how to deal with different scenarios. The process there is called the “Correction of Error”, or the COE. The COE is a serious artifact that has to be completed from beginning to end to document all of the steps, from detecting the problem to sitting down and discussing how to prevent it from happening again. There’s no question about whether this process will be completed; it is a necessity, and if there is a significant event, a COE will result. COEs are managed in much the same way as any other project, with an expected, assigned completion date. All the COEs are also shared publicly, so there is a system where they all live, and you can see other people’s COEs.
In fact, it’s not an uncommon practice to go search for keywords in the COE system when you’re shipping something to see if anyone else has had problems with concepts x, y, and z.
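As a rough illustration of that idea, the sketch below models a COE as a small record and searches a collection of them by keyword. It is not Amazon's actual system: the class name, the field names, and both helper functions are hypothetical, chosen only to mirror the pieces described above (detection, prevention, and an assigned completion date).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectionOfError:
    # Hypothetical fields; a real COE is a much richer document.
    title: str
    detection: str      # how the problem was noticed
    root_cause: str     # what the analysis found
    prevention: str     # what will stop it from happening again
    due: date           # expected, assigned completion date
    completed: bool = False


def search_coes(coes, *keywords):
    """Return the COEs whose text mentions any keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for coe in coes:
        text = " ".join(
            [coe.title, coe.detection, coe.root_cause, coe.prevention]
        ).lower()
        if any(k in text for k in wanted):
            hits.append(coe)
    return hits


def overdue(coes, today):
    """Return unfinished COEs whose completion date has already passed."""
    return [c for c in coes if not c.completed and c.due < today]
```

Before shipping a feature that touches, say, certificates, a team could call `search_coes(all_coes, "certificate")` to see whether anyone has already written up a related failure, and `overdue(all_coes, date.today())` gives the list a review meeting would ask about.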
Amazon takes engineering and operational excellence seriously, so at regular intervals there will be a meeting with all the engineering leads to talk about how they are performing, and that meeting covers both performance and COEs. Having these meetings created group accountability for completing your COEs, because otherwise you have a whole room of engineers wondering, “hey, why didn’t you finish your COE, what’s blocking you there?” Everyone wants to learn and understand so that we don’t all have the same problems.
Managing growth: From Amazon to Wayfair
My role as Wayfair’s Head of Commercial Product Engineering is to build the software that our suppliers and partners use to interact with Wayfair: to get their catalogue uploaded and their merchandising done, and to manage pricing, costs, and buying our services.
Having empathy and empowering suppliers is such an important part of enabling growth.
Because Wayfair is growing so fast, we’re regularly getting more suppliers and more interest, so we are actively figuring out how we can improve their time working with us. My initial experience has been incredibly positive, and I’ve definitely enjoyed everyone I’ve worked with so far. It’s a company that’s eager to get outside ideas and bring them in to level up over and over again.
When it comes to the pandemic, Wayfair was preparing itself for an event like this without really knowing it. What I mean by this is that a lot of work has gone into making sure we were ready to support a lot of growth. It just so happened that this growth came all at once. At first there was some backlog, but if you look now, you won’t see much of that at all. I think we were able to handle this partially because of how well we partner with our suppliers. We were able to deal with a lot of those challenges quickly, and I was personally amazed at how frictionless it was to sort out.
This success came down to imagining what the future size of Wayfair was going to be and thinking about what this would entail ahead of time. The other thing was investing in key leadership positions to make sure that people with the right experience were there to drive things in the right direction from the beginning.
Onboarding during a Pandemic
I’m definitely a people person; I like the high fives and the watercooler moments. All of the onboarding at Wayfair has definitely been different because I used to have a lot of those initial interactions with new colleagues at the coffee machine while I was waiting to fill up my cup- you can’t do that when you’re fully remote. Instead, I started keeping a list of names and every time I heard a name, I would take a note with their name, what I’d heard, and what they were doing. Then I would reach out to them and ask if I could get 10 to 30 minutes on their calendar so I could introduce myself to see if there was any way we overlapped. This can be challenging in a more senior role because I don’t work with the same handful of people from day to day; it is constantly in flux.
If you’re starting remotely as an individual contributor, you’re probably joining a scrum team, or something along those lines, and you have your set 5 or 6 people. It takes that extra effort to make sure that you’re driving yourself to connect with everyone. Luckily, people are very patient because they understand the kind of circumstances we’re in, and I’m looking forward to the day we’ll be back in the office.
I have been playing Starcraft 2. It’s a strategy-based video game that came out in 2010, so it’s probably considered an old game now, but I still love it. Jobu is another one that is super fun and very challenging; it's like playing a puzzle and a strategy game at the same time. It's a beautiful game with a nice wooden board and pretty stones as well. I love games with a really good tactile feel, and will buy a game sight unseen if it has cool pieces.
Continue the conversation: join the Allma Discord community
Allma– UI-less Incident Collaboration. Natively in Slack.