Welcome to Allma’s Incidentally, a newsletter where we interview Engineering Leaders on how they’ve scaled their companies and the secrets and systems that they use to build and evolve reliability practices on their teams. Our goal is to be actionable enough for you to implement and high-level enough to apply to your own team; your feedback is always welcome.
What the former CTO of Artsy learned about automation on his way to principal engineer at AWS
Daniel (dB.) Doubrovkine is a force, and I’ve respected the hell out of him for many years now. dB and I worked together at Artsy (an online platform dedicated to buying, selling, and comparing fine art) where he was CTO and built and scaled the team from the ground up. Also well known on the New York startup scene, dB is now at Amazon Web Services where he helped launch AWS Data Exchange. dB took some time to sit down with Allma to discuss how he grew Artsy over the years, the importance of automation, and the beauty of open source.
My journey to Artsy
Hi, my name is Daniel (dB.) Doubrovkine. I taught myself engineering by copying source code from SVM magazine and trying to reliably compile and run it in the 90s. After studying computer science in college, I worked for a number of companies, including Microsoft and several small startups, before becoming the CTO at Artsy for eight years. I joined AWS about a year ago as a Principal Engineer.
When I joined Artsy, there was just a prototype of an MVP. It was an attempt to implement the Art Genome Project, which is a similarity search for features and characteristics of artwork, but it was implemented in a way that could never scale or stand up against mass queries. Because the initial prototype was slightly over-engineered and had reliability issues, we scrapped it and restarted with a plain Ruby on Rails app.
Automation and early decisions
From day one, we said we would never be in a position of having to do anything manually. We decided early on that any engineer should be able to deploy the system to production at any time, which meant that the state of the source code had to be excellent. This level of automation successfully supported us for a number of years. Early on, we had a demo at Fondation Beyeler in Switzerland which was “fly or die”. Important art people were invited to look at a secret project nobody had ever seen: the Artsy Art Genome Project. We pulled out big, beautiful monitors, ready to go... and then quickly realized that there was a bug in the aspect ratio for landscape artworks. It looked awful on a large monitor, which was a huge problem given that the goal of Artsy is to display the world’s art online. But, since we had everything automated, we were able to roll out the fix just as guests were starting to click on the website. Nobody at the event ever noticed, so the day-one automation paid off immediately.
The automated infrastructure lasted for a number of years as we continuously evolved it, until we eventually started peeling services off from the Rails app and automating them in the same way. Years later, we moved from Heroku to a more native AWS solution with OpsWorks and finally to Kubernetes. Over the years, we were constantly trying different systems and integrations, and changing our minds about them.
The principles that were there in the beginning always remained the same: continuous integration and continuous deployment.
The Importance of Open Source
When I started at Artsy, we found a very strong and engaged community in New York who gave us great advice and recommendations for tech. After all these people were willing to help, I wanted to give back, so I wrote a lot of open source code to engage with the rest of the community. Whenever I looked at our code base and saw a version of something that wasn’t Artsy-specific, I released it as a Ruby gem. When we hired more and more engineers, I just made open source part of their job. That eventually led to Artsy being open source by default.
I'm glad it helps so many other companies, but in full transparency, none of that was for the greater good of humanity -- I initially developed my use of open source for Artsy kind of selfishly, because what I needed was more engineers in a highly competitive environment. In the end, I say I did open source as a mechanism to hire better engineers.
Scaling Operations at Artsy
Brand aside, we always practiced DevOps as a team, so there was no such thing as an operations engineer at Artsy. Operations was always the job of every developer: everyone on the Engineering team was responsible for conceptualizing, building, testing, and deploying their software.
Because of this, we hired people who were able to think in full systems, full features, and build and operationalize them themselves.
The ideal engineers for us were typically generalists who were not afraid to roll up their sleeves and write a script for something that was not necessarily complicated engineering work, but made all our lives easier. We kept this goal for all of our engineers as we started to scale the service and operate for millions of customers, until, further along, we finally had to hire engineers dedicated to the operational systems in production.
Compounding Learning from Day One
From day one we did a few things in the genre of incident analysis.
- First, we always wrote regression tests, and they ran no matter what. We made a rule right off the bat that if a bug made it to production, the fix would include a test for that exact scenario going forward.
- Second, we would always hold a post-mortem after a production issue and debrief it to correct errors. If the site had gone down, a very detailed email went out to the entire company explaining the what, where, when, and why -- it was a real trust builder for the organization to see that we acknowledged mistakes.
- Third, we adopted Heroku early on because it took a lot of the headache out of our infrastructure when something started failing. It might not have been the best solution, but it did reduce the blast radius of a problem. On the other hand, we still had some central single points of failure in the system that were hard to undo; we took it as a trade-off.
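To make the first rule concrete, here is a minimal sketch of what "the fix includes a test for that exact scenario" can look like, using the landscape aspect-ratio bug from the demo story as the example. It is written in Python for illustration only; the `fit_within` function, its parameters, and the image sizes are all hypothetical and are not taken from Artsy's code base.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) to fit inside a box while preserving aspect ratio.

    Hypothetical helper: uses the smaller of the two scale factors so that
    neither dimension overflows the box.
    """
    scale = min(max_width / width, max_height / height)
    return round(width * scale), round(height * scale)


def test_landscape_artwork_keeps_its_aspect_ratio():
    # Regression test pinned to the scenario that failed in production:
    # a wide (landscape) image displayed inside a square box.
    assert fit_within(4000, 2000, 1000, 1000) == (1000, 500)


def test_portrait_artwork_still_works():
    # Guard the case that already worked, so the fix cannot break it.
    assert fit_within(2000, 4000, 1000, 1000) == (500, 1000)
```

A test runner such as pytest would pick up both `test_` functions automatically. Because the test ships with the fix and runs on every build, the same bug cannot quietly return.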
Reflecting on Incidents Across my Career
Thinking back, I have some regrets now that I see how groups like AWS do alerting and monitoring. At Artsy, we discovered too many incidents where somebody in the team would relay that the site was down and there was no chart showing anything wrong. We did eventually introduce better practices, but we were still late in the game.
Now that I’m at AWS, I’ve been impressed by the way we architect systems. There’s an immeasurable amount of attention put into the reliability of AWS services, which is unique for a company so massive.
AWS has spent many more years architecting and engineering reliable, secure services than the other cloud providers. Most of the AWS services were built to scale from day one. For example, I work for a service called AWS Data Exchange, and we run integration tests across many systems all the time. There's a lot of excellent operational time spent on improving pipelines and making sure that they uphold the reliability of the service when something makes it to production.
Further, there are some obvious ways that AWS reinforces and monitors infrastructure, for instance mean charts and graphs to understand what’s working, breach lines, boundaries... You need to know these things upfront and implement them in monitoring systems. All our services and APIs have strict limits, so we do capacity planning using those limits. This way, you can guarantee a certain SLA for every single customer. We all have operational goals that are highly organizational and depend on other services. So, if the service below me fails, it's on me because I have not architected the system well enough to be resilient to failures. At AWS, if a customer has a problem, it's my problem.
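As a rough illustration of planning capacity from strict limits and checking a metric against a breach line, here is a small Python sketch. It is not how AWS implements any of this; the `ApiLimit` type, the function names, the 20% safety margin, and all the numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiLimit:
    # Hypothetical per-customer limit on an API.
    customer: str
    max_requests_per_second: int


def planned_peak_load(limits):
    """Worst case: every customer calls at their limit at the same time."""
    return sum(limit.max_requests_per_second for limit in limits)


def has_headroom(limits, provisioned_rps, safety_margin=0.2):
    """True if the worst-case load fits under capacity minus a safety margin."""
    return planned_peak_load(limits) <= provisioned_rps * (1 - safety_margin)


def breaches(samples, breach_line):
    """Return the indices of metric samples that cross the breach line."""
    return [i for i, value in enumerate(samples) if value > breach_line]


limits = [ApiLimit("customer-a", 100), ApiLimit("customer-b", 300)]
print(planned_peak_load(limits))           # 400
print(has_headroom(limits, 600))           # True
print(has_headroom(limits, 450))           # False
print(breaches([10, 95, 40, 120], 90))     # [1, 3]
```

The point of the sketch is that limits make the worst case knowable in advance: once every customer has a ceiling, the total load has one too, and a breach line can be set before an incident rather than after it.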
Lessons Learned and the Surprising Discovery around (In)efficiency
At Artsy, we likely went overboard at times. For example, we wrote our own deployment system called Heroku Bartender in the early days of Artsy, and frankly that was another painful experience. Clearly, we liked operations far too much, and I ended up writing the jenkins-ansicolor plugin that's now used in just about every Jenkins installation, just because I couldn't get colorized text output in logs.
Perhaps we over-indexed on operations, but that also meant we paved the way for the new ones.
Sometimes you find a new, shiny version of a tool and you have to go delete the old versions and iterations, or you end up with a mess... So, when scoping projects around DevOps or other operations, you do benefit from doing everything. I think investing in the adoption of tools that really work is worth it.
Personally, I've needed time to understand that efficiency in how something is done is not always the most important. Inefficiency can actually have utility in certain cases.
I always thought that duplicating work was a bad idea, but I have recently changed my mind on this quite a bit. It may even involve doing something manually to make it the best version it can be, which is a horrible thing to think about. However, I’ve started to realize doing things manually and doing things efficiently are not mutually exclusive. I love the Amazon way of thinking about all these things, which is fully customer-centred. It’s all about what the customer wants and what they’re telling you to do, and you focus on working backwards from there without worrying about the inefficiencies.
I just finished reading How to Change Your Mind by Michael Pollan. It’s all about opening your mind to new ideas based on the science of psychedelics, and I think it’s great.
Continue the conversation: join the incident collaboration Slack community
Allma is a tool built with incident best practices baked in, designed for everyone in your organization to collaborate on incidents.