Welcome to Incidentally, Allma’s publication. We interview engineering and reliability leaders, founders, and makers on the secrets and tools they use to scale their systems and teams. We share incident collaboration stories and learnings, in solidarity and with openness. Our goal is to be actionable enough for you to implement and high-level enough to be broadly applicable; your feedback is always welcome.
Squad Reliability Lead at VGW on his path from Infrastructure to SRE
Starting in tech
Hi, I'm Bruce Dominguez. I started my journey in tech back in 2000 at an IT help desk for a television company in the UK, supporting desktops and telecom. After cutting my teeth on support tickets, I moved up to IT services, where I racked and cabled servers, which is very different from today's push-button deployments. In this role I learned a lot about server architecture, networking, and application deployment. I decided to come back to Australia in the early 2000s. Working in support provided a great perspective on the impact a line of code can have on a customer's experience. From then on, much of my career has been in support and working reactively to put out fires.
I’ve been living in Perth for a good while now, where I worked at a financial institution that had maintenance windows on weekend nights, which was fun for about 6 months. But after a few years of racking and patching servers on weekends I started to burn out - I wanted my weekends back! Luckily, I was able to shift to a 9-5 role in the same bank as an ITIL Change Manager, where I learned the importance of the Change Management process.
From there I moved to a Strategic Testing Manager position, where I led a team of 30 testers. Given my infrastructure background, I was able to bridge the gap between development and infrastructure teams. This was back in the day when server virtualization ran not on cloud service providers, but on physical boxes in the data center. During this time I picked up some new skills as both a manager and a tester. Then, as one does, I started a CrossFit gym. You could say it was a little bit of a change.
CrossFit and work-life balance
For context, while I was working in the testing space, I was also coaching at a CrossFit gym and really enjoying it. So I thought, why not? My wife and I opened up our own CrossFit gym close to the city and it was great. Running the gym was a new level of growth for me because I had to learn how to run a business on my own and everything that entailed, from understanding social media marketing strategies to member retention methods. This did mean quite a few long days, where I would coach in the morning or the evening and work during the day. We had the gym for around 7 years until we decided to sell. While I loved having the gym, my work-life balance had taken a massive hit trying to keep everything up and running. During this time I had taken some time off from tech to just focus on the gym. After 2 years, I jumped back into a role as a Consultant and would go to client sites and advise them on test strategies or how to improve their DevOps pipeline. After coming back from my tech break, I was astounded by how rapidly things had advanced in such a short space of time.
A new world in DevOps
Jumping back into IT with a role in DevOps was a shock to the system. The idea of working in and on “the cloud” was new to me because the bank I had been working at had only been dipping its toes into Azure and AWS. Coming from an infrastructure background, I was used to going down to the server room, putting in the racks, sliding in the server, plugging it all in, and rolling out configuration.
With the introduction of DevOps as a practice, you’re also responsible for understanding monitoring, networking, and cloud technology as well as development practices.
The path to VGW
I did not find out about VGW and get a job there by traditional means at all. I was on a client site creating a test strategy for a university, and it was one of those days where nothing was going right and I wanted to blow off some steam. On LinkedIn, I saw that the VPs of Engineering at VGW had posted that they were having some beers in the office and playing some Super Smash Brothers. It sounded like exactly what I needed. So after a quick message I hopped on a train into the city and went on a whim. The office was amazing and it was daunting because the room was filled with very cool, smart people playing different board games, poker, and of course some Smash Bros.
The best part of it was that I had never had so many different and exciting conversations with engineers. I felt like the dumbest person in the room, and I learned a lot that night!
It was such a great environment and I was keen to learn more. So, call it a happy accident, but I applied for a role as an SRE and was successful.
Transitioning into Site Reliability Engineering
The SRE role was fairly new in Perth - I had no frame of reference except for what Google was doing. But, given my diverse background, I was very much up for the challenge. I was keen to drive the SRE practice forward not just in VGW, but in Perth as a city.
SRE can mean many different things at many different companies; it all depends on the context.
I did a lot of reading and studying to develop a course of action, but I felt lucky that there was a lot of bleed over from my past jobs in testing, understanding infrastructure, incident response, and monitoring. At VGW, I’m not on call for everything but am there as an escalation point or as an Incident Commander. We’ve gotten to a point where we have a very tight and repeatable incident process, tightened up our monitoring, and killed off the last dangling bits of AWS infrastructure. We are constantly moving forward and recently improved our observability by introducing OpenTelemetry into our services.
The Kobayashi Maru game and incident management
I have read a lot of books to increase my understanding of the core tenets of SRE, the responsibilities of the role, and how I could apply them at VGW. A great book I came across was Seeking SRE, and in it there was an essay about using gameplay to improve the incident response process. When I first joined we already had engineers in the team who were on call and the competency was great, but we needed consistency. What someone would do during an incident, what the next steps were, and so on varied from person to person. This would often make going on call a stressful time. We were also looking to get more people involved in our on-call rotation and to help everyone feel comfortable doing so.
Inspired by the essay in Seeking SRE, I came up with a fun game of my own that we could all play to go over different roles, processes, and more in our incident response structure.
Each game was 5 minutes long and everyone held different cards with specific roles on them. There was a designated Incident Commander, primary on-call, and secondary on-call to model the structure we had. The game would start with a metaphorical pager duty alert, and the team could only respond using the cards they were given. There were action cards to complete a process and even specialty cards that made it easier to find solutions. For example, one of the specialty cards was “Infrastructure as Code”, which meant the team could deploy a fix quickly because they wouldn’t need to go in and make manual changes or click through the console. The first time we ran it, we ended up playing for about an hour. Since then we have run a session each month, and each time we iterate and refine it. We’ve gotten amazing feedback from players about feeling less nervous when their pager goes off, feeling more comfortable when executing the next steps, and working through an incident more efficiently. It has helped us find gaps in our processes and establish new runbooks and, overall, has pushed our team to want to help each other out during times of high stress.
I am the type of person who needs to always do something, whether that is in the gym, spending time with the family or studying new tech and reading. I’m late to the party for this book recommendation but over Christmas I read Ready Player One and I loved it! That said, I also enjoyed the movie, which I know is polarizing.
I am often asked what books someone should read as a pathway to becoming an SRE, and there are plenty out there. That said, here is what I would start with as a grounding.
- Site Reliability Engineering: How Google Runs Production Systems - Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
- Release It!: Design and Deploy Production-Ready Software - Michael T. Nygard
- Seeking SRE: Conversations About Running Production Systems at Scale - David N. Blank-Edelman
- The Field Guide to Understanding "Human Error" - Sidney Dekker
- Distributed Tracing in Practice - Austin Parker, Daniel Spoonhower, Jonathan Mace, Rebecca Isaacs
- Chaos Engineering: System Resiliency in Practice - Casey Rosenthal and Nora Jones