Effective incident management using Slack
Endless configurability does not have to mean endless stress
By Pete Cheslock
Slack is a flexible communications tool that offers an endless number of ways to manage your technical incidents. Some might even call it...too flexible. The sheer number of tools and options can make incident management within your Slack workspace feel even more chaotic. At Allma, we’ve interviewed over 300 engineering leaders to learn the most effective ways to manage a technical incident within your Slack workspace. Follow these tips the next time your team is collaborating on an incident within your company.
Decide where incident management happens
When it comes to managing a technical incident within Slack, the first question to answer is where the conversation, and the record of events as they unfold, should live. We have seen teams collaborate on their chaotic and messy events in three main places:
- An existing group channel (like #ops or #warroom)
- Individual threads within a group channel
- Per-incident channels (like #incident-20210914-database-outage)
A single shared group channel
You create opportunities for serendipity when you run a technical incident within one of your existing group channels. When your entire technical team is in a single channel like #ops or #engineering, the event is visible to a much wider group of people. Because of that, someone may casually notice a log message or a line of investigation and have a critical insight that helps solve the problem or otherwise moves the event forward. The downside to this strategy is that a single channel can quickly get overwhelmed with other topics, especially if a second event happens at the same time as the first.
Individual threads within a channel
Threads are polarizing: loved for their structure, hated for their transience. They keep conversations compartmentalized and deliver updates to everyone posting within a given thread, but threads themselves are easily lost, especially within busy group channels. Based on Allma’s conversations with over 300 engineering leaders, we don’t recommend using threads for collaborating on your technical incidents. But if this is your organization’s incident management protocol, Slack’s tips for using threads effectively are a good resource for maximizing the good and minimizing the bad.
Per-incident channels
Per-incident channels allow teams to encapsulate all parts of a technical discussion, making it extremely easy to pull together the details of an incident within a centralized conversation. Allma’s research backs this up: for the engineering leaders we interviewed, per-incident channels are by far the most effective way to run a technical incident.
When using this method, pull in all the relevant folks (people you think can help with this particular issue) and make sure everyone feels comfortable leaving the channel once they can no longer assist. Keeping the list of channel members to only those actively involved will dramatically improve focus, cut down on context switching, and help prevent communication-related delays to the resolution.
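If you create per-incident channels often, a small helper can keep their names consistent. The sketch below follows the date-plus-summary pattern of the example name #incident-20210914-database-outage. The function name `incident_channel_name` is hypothetical, and the cleanup rules assume Slack's documented channel naming limits (lowercase letters, numbers, hyphens, and underscores, up to 80 characters).

```python
import re
from datetime import date

# Assumption: Slack channel names are limited to 80 characters.
MAX_CHANNEL_NAME_LENGTH = 80


def incident_channel_name(summary: str, day: date) -> str:
    """Build a channel name such as incident-20210914-database-outage."""
    # Lowercase the summary and collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    prefix = f"incident-{day:%Y%m%d}"
    name = f"{prefix}-{slug}" if slug else prefix
    # Trim to the length limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")


print(incident_channel_name("Database outage", date(2021, 9, 14)))
```

The resulting string could then be used when creating the channel by hand or through whatever automation your team already has.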
Assign roles and responsibilities to your incident management team
Running an effective incident within Slack means including only members of the team who can help out with the investigation and debugging process. (Folks who can only ask “What’s the update?” will not be helpful here.) Then, when you add members to the channel, give them specific roles and detail the responsibilities required so that there is no confusion about who is working on what. Most importantly: communicate this information clearly to everyone within the channel by using the Slack “pin” functionality to save a message that details who is working on what. This will help new team members get caught up quickly when they join the channel.
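One way to keep the pinned roles message consistent from incident to incident is to generate it from a short list of assignments. The following is only a sketch: the `Assignment` class, the role titles, and the handles are hypothetical examples, not a prescribed set of roles. Posting and pinning the text would most likely go through Slack's `chat.postMessage` and `pins.add` Web API methods, which are left out here so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    role: str            # hypothetical role title, e.g. "Incident lead"
    owner: str           # Slack handle of the person filling the role
    responsibility: str  # one-line description of what they own


def format_roles_message(assignments: list[Assignment]) -> str:
    """Render one line per role so the pinned message shows who is working on what."""
    lines = ["Incident roles and responsibilities:"]
    for a in assignments:
        lines.append(f"- {a.role}: {a.owner} ({a.responsibility})")
    return "\n".join(lines)


# Example data, invented for illustration.
assignments = [
    Assignment("Incident lead", "@alex", "coordinates the response"),
    Assignment("Investigator", "@sam", "debugs the database"),
    Assignment("Scribe", "@jo", "posts call notes in the channel"),
]
print(format_roles_message(assignments))
```

Re-posting and re-pinning the message whenever a role changes hands keeps late joiners from acting on stale assignments.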
Keep the conversation inside the channel
Tools like Zoom and Google Meet are great for high-bandwidth conversations and for getting a point across succinctly to other members of the team, but that valuable interaction is often lost the second the virtual meeting ends. When starting a new virtual meeting to debug a particular issue, assign one person to take notes and post those notes within the incident channel. Additionally, pin those key messages within your channel so that other team members, including late joiners, can refer to them. These pinned items can also be used as key timeline events for your post-incident review session. In general, try to avoid using Zoom or Google Meet's in-meeting chat, as those messages will be lost to the void after the meeting is closed.
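To illustrate how pinned items can feed a post-incident review, here is a minimal sketch that turns pinned messages into a chronological timeline. It assumes each pinned message has already been exported and reduced to a dict with a `ts` field (Slack's epoch-seconds message timestamp, stored as a string) and a `text` field. The `build_timeline` name and the sample messages are invented for illustration.

```python
from datetime import datetime, timezone


def build_timeline(pinned: list[dict]) -> list[str]:
    """Sort pinned messages by timestamp and render one timeline line per message."""
    ordered = sorted(pinned, key=lambda m: float(m["ts"]))
    timeline = []
    for message in ordered:
        when = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
        timeline.append(f"{when:%Y-%m-%d %H:%M} UTC  {message['text']}")
    return timeline


# Sample pinned messages, deliberately out of order.
pinned = [
    {"ts": "1631622600.000100", "text": "Call notes: failover to replica approved"},
    {"ts": "1631620800.000200", "text": "Roles assigned, investigation started"},
]
for line in build_timeline(pinned):
    print(line)
```

Times are rendered in UTC so that reviewers in different time zones read the same timeline.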
Ownership, accountability, and communication are key
Incident management in Slack can be either chaotic or calm depending on the level of structure you’ve built into the system. While you don’t need a heavyweight process to effectively run a technical incident in Slack, a few key strategies can provide the right amount of structure to reduce the stress and anxiety of your technical incidents. Creating per-incident channels, assigning roles with clear responsibilities, and ensuring conversations are recorded in-channel will help each member of the incident response team get the context and information they need, and will help your organization reach a successful incident resolution all the more effectively.