What is Incident Collaboration
TL;DR: Everyone in your organization plays a role in working through incidents. Incident Collaboration is about coming together as a company to route and resolve incidents.
Early in my career, I joined Artsy, at the time an early-stage tech startup in NYC, where I reported directly to the Founder and CEO. We grew quickly. While exciting, it was also scary. As often happens, incidents increased in both frequency and complexity. I found myself running around gathering and sharing updates across different teams (notably easier back in the days of open-floor offices), drafting external communications to customers and journalists, debating whether to push that marketing language or hold off, and helping to understand the impact of the incident on the business.
My biggest takeaway from that experience was that everyone in the company plays a role in working through incidents. Each time an incident occurred, all of us across the organization were working to handle different aspects of it.
Reflecting on my time at Artsy, one thing I did not expect was the challenge of figuring out how to route and resolve all kinds of organizational incidents, beyond technical ones, something that became much harder as we scaled. What used to be a simple question, "who do I send this customer bug report to?", suddenly became a complex workflow. Working through all sorts of problems together as a company became trickier.
Living through company growth opened my eyes to the fact that incidents extend beyond technical breaks and bugs across your software, services, and applications. Incidents are also marketing language snafus, pricing mix-ups, product tradeoffs, people issues, payroll complications, data losses, office dishwashers breaking...
An incident is a problem, challenge, or question in an organization that requires you to identify the right set of people and work with them to understand what's going on, mitigate and resolve the suboptimal outcomes, learn from what happened, and evolve the way you work.
Within a company, at any given moment, there are likely multiple incidents taking place (some related, some unrelated, some involving cascading failures and interdependencies), all involving people coming together to work through problems.
This wasn't always the case. Back in the day, we were siloed as a company, divided into distinct teams, each responsible for building and maintaining its own domain area of the business. For instance, technical incidents were managed by a dedicated Operations team responsible for maintaining our software and systems and for mitigating and resolving incidents.
The way we structure our organizations and collaborate across the company has changed dramatically over the past decade. For instance, the emergence of strong-ownership Engineering team models, where we now own both the build and the maintenance of our software. The explosion of work tools (the average company uses 288 tools 🤯). Distributed and hybrid companies. The prevalence of Slack, Discord, and MS Teams, which have brought us together as a company to collaborate in a meaningfully new and different way. These changes mean that our work increasingly overlaps across teams in a company, and thus our incidents do too.
No longer are we siloed teams with non-overlapping responsibilities. No longer is incident response relegated to a single set of team members. Rather, we all hold the responsibility, as a company, to work through incidents together.
The reality is that incidents are events the entire organization takes part in, across Engineering, Customer-Facing teams, Product, Marketing, Sales, Legal, Finance, People teams, Exec, and Operations.
We see this in the data around how teams use Allma. On average, 50% of the entire company uses Allma to actively mitigate and resolve technical incidents. For teams, major incidents occur as often as every other day and involve dozens of employees spanning engineering, customer success, product, and marketing, all actively participating in mitigating and resolving the incident.
To understand how each team in the company is involved in incidents today, let's walk through a technical incident as an example. Say we're experiencing an issue where payment processing isn't working and customers can't check out on our website. We can examine who is doing what across the organization to resolve the incident.
Across the company, we all contribute to incident response, and a huge part of working through an incident comes down to figuring out the right team members to work with, gathering information, and communicating with each other about who needs to do what and how.
Incident Collaboration is the recognition that everyone in the organization has a role to play in incidents, that incidents take many forms, and that a company likely experiences several org-wide incidents simultaneously. At the end of the day, incidents are opportunities for collaborative discovery — for coming together across the company, navigating problems as they arise, and consistently learning and growing.