Allma Incident Collaboration Playbook
This playbook covers the knowledge and process needed to guide you and your team during an incident.
This playbook is based on the Incident Command System (the framework used in emergency response fields). We also draw from best practices in the field of incident management and our own experiences. You may have additional roles or unique practices that your company uses.
We’ve worked to build flexibility and configurability into Allma so that you can drive how you use the product. Your feedback is always welcome and appreciated.
Guidance for the team responding to the incident:
Aid your team in resolving the incident responsibly, thoughtfully, and swiftly.
1. Contribute to the best of your abilities.
2. Have a bias for action.
3. Know what you do not know
4. Use common sense.
5. Focus and be able to differentiate between what is important and what is not.
6. Know when to escalate and ask for help.
1. Join the dedicated Slack incident channel
1. If you plan to participate, know your capacity and form of contribution.
1. Are you by a computer?
2. Do you know enough about the incident process and about this incident to contribute fully?
3. What is your best, highest-use role?
2. Designate the Incident Commander (“IC”).
1. The IC is considered the person in charge of the incident resolution efforts.
2. Think of them as the incident leader: responsible for determining the action plan and designating who needs to do what.
3. The IC should not be directly debugging or communicating with customers.
4. The IC should be looking down from above at the process and system, directing the course of action, and delegating responsibilities.
5. Note: You may have several people on your team trained to serve as Incident Commanders. Typically, the IC is either the person who spotted the issue or, for a more complex incident, a pre-designated IC, but it can be whoever is best suited for your org.
3. Focus on understanding and assessing the problem
1. Assess and contribute directly in the Slack channel
2. Track the conversation in the channel and add information (graphs, screenshots, comments) that is explicitly relevant.
3. In service of problem identification, start by asking yourself:
1. What factors are at play?
2. Is there a known, obvious cause? (call it out, if so!)
3. Does this incident have symptoms similar to those of past incidents?
4. Is there a pattern I can recall?
4. Ask yourself before posting:
1. Will this information increase our probability and velocity of identifying the problem?
2. Are there other folks on my team who I should triangulate with given their level of subject matter expertise or general incident experience?
3. Post your knowledge and recommendations to the Incident Commander, then defer to them for next steps.
5. There are likely many graphs, screenshots, and Slack threads coming at you at once. Remember, you are trying to reduce the scope of the problem’s surface area and narrow in on the exact source of the issue.
4. Listen to your Incident Commander for instructions
1. If the IC is not yet in the Slack channel, invite them in.
2. Look to your incident commander for actions and to delegate responsibility.
5. Mitigate the Problem
1. Now that you’ve identified the problem, the goal becomes mitigation.
2. There are likely many graphs, screenshots, and Slack threads still coming at you. At this point, you are hopefully pointed in the direction of the problem. Now the focus is determining the simplest, fastest path to mitigation.
3. Do not worry about diagnosing why the problem occurred, or about any unrelated bugs or problems identified along the way.
Guidance for the Incident Commander
Lead the team in resolving the incident responsibly, thoughtfully, and swiftly.
1. Know your team’s strengths and level of expertise, and design and delegate accordingly.
2. Your responsibility is to lead the doing, not to do the work yourself.
3. Know what you do not know.
4. Think in patterns. We often think we are unique individuals experiencing things for the first time, but chances are what is occurring right now has occurred before in some form to someone. Look for patterns that can aid in swift resolution.
5. Use common sense.
6. Know when to escalate and ask for help.
1. Determine availability
1. Make sure you know who is available and what their capacity is (computer, mobile, SME?)
2. If there are significant gaps, fill them with the best resource available, ensuring you have adequate coverage.
3. Be prepared for knowledge handoffs along the way, especially for longer-lasting incidents. Have a plan for transferring knowledge.
2. Make it clear to your team you are serving as the Incident Commander
3. Designate any other relevant roles.
2. Comms Lead
4. Articulate the level of severity or the impact of the issue and act accordingly.
1. Note: the severity level can change during resolution as you uncover new pieces of information.
5. Delegate, delegate, delegate!
1. A main tenet of your role is to ensure that the right roles and responsibilities have been assigned.
2. You should be looking from above, assessing what is happening to identify and mitigate the problem and, along the way, delegating the right actions to your team to reduce surface area of the problem.
6. As soon as you have identified a visible cause contributing to the incident, call it out and assign who should do what next.
7. If you do not have a clear signal as to the cause, determine a reasonable investigation path.
1. Ask yourself whether there are any patterns or commonalities: pull in relevant past incidents, runbooks…resources
2. Delegate actions based on your team’s relevant knowledge and capabilities
3. Listen to what your team is finding and sort through their analysis and recommendations to determine the most reasonable next course of action
4. Call out if people are duplicating work or missing coverage of key areas worth investigating.
8. Ensure someone is communicating to key parties consistently (internal / external).
9. Continue to understand the scope and complexity of the problem and re-design and re-delegate your approach as needed.
1. Evaluate pulling in additional engineering resources or re-designating roles.
2. Re-consider communication strategy: who needs to be informed and how
3. Monitor and continue to adjust, as needed
4. Run any knowledge handoffs
10. Once you’ve resolved the incident:
1. Communicate to relevant people (internal and external)
2. Identify post-incident work
3. Schedule the post-incident review and assign a clearly responsible party (or parties) for creating the timeline.
Guidance for the Communications Lead
Ensure all relevant parties (internal and external) have the information they need, at the cadence they need it.
1. Be honest and transparent in communications.
2. Use common sense
3. Serve as leverage, not distraction. Know when to follow up, when to observe, and when to interject.
4. Know what you don’t know.
1. Listen to and follow instructions from the Incident Commander
2. Update relevant parties through designated channels (status page, email, Slack…) as appropriate
1. Be clear on who the communication is going to
2. Craft the communication according to the “who” it is going to
3. Triangulate with others on the communication, as is helpful, to ensure you’re getting the right, accurate information to the right set of people in the right way and at the right cadence, delivered in a style and substance that fulfills that set of people’s needs.
3. Know who knows the customer best and rely on their judgment to message appropriately (e.g., a customer-facing team member).
4. Know when to communicate and when to wait for updates.
1. Strive for a balance between clear, consistent communication and dedicated periods of silence during which the team is making progress.