Allma Post-Incident Review Playbook
tl;dr
1. Ensure everyone who was key to the incident is involved and present.
2. Put in the work to create the timeline before the Post-incident review. The quality of your Post-incident review is a direct function of the quality of the inputs (garbage in = garbage out).
3. It’s not about blame or credit; it’s about getting to the truth, learning, and evolving.
4. Be wary of stopping at the proximate cause vs. continuing to the root cause. Ensure you’ve dug deep enough to find the underlying behavior, not just the action that triggered the incident.
5. When designing the solution, ensure each action item has a clear owner with a defined deadline.
Want to get started? We created an easy-to-use template here for you.
What is a Post-incident review?
A Post-incident review is an opportunity to come together and understand what happened, why the incident occurred, and what can be learned from it, and to identify any design changes to be made going forward.
Why is it important to conduct a post-incident review?
1. Pattern recognition: by understanding why incidents occurred and compounding those learnings over time, we develop the capacity to recognize patterns around incidents (incident A is reminiscent of incident B!) which ultimately leads to problems not repeating, faster resolution of problems, and stronger incident processes and systems.
2. Learning and evolution: Post-incident review is an opportunity to come together and learn and evolve in a safe, blameless environment.
Post-incident review checklist:
1. Scheduled time / place for the Post-incident review.
2. Timeline pre-prepared: forms the narrative of “what happened” during incident mitigation (includes key events, information, timestamps…).
3. Designated person to lead the Post-incident review.
4. Right people in the room.
5. List of Suboptimal Outcome(s): Determine what happened and identify the internal/external elements that produced the suboptimal outcome(s).
6. Post-incident analysis: Dive into what happened, why, and what can be learned in a safe, blameless manner.
7. Solution: designate actions to improve systems, process, tooling, team design, with assigned owners, prioritized based on broad engineering priorities / capacity.
Principles
1. Be willing to have hard conversations and healthy, thoughtful disagreement.
2. Create a safe space for learning.
3. Be aware that everyone has biases and blind spots. Know your biases and blind spots and guardrail against them.
4. People have different communication and working styles. Ensure you create a safe environment that encourages many different forms and styles of participation.
Preparing for the Post-incident review
This is meta, but you can do a Post-incident review on the systems and software involved in the incident AND on the incident management process itself.
Timeline Creation: establishing the timeline of key events
Goal: compile an accurate narrative that tells the story of “what happened” during incident management.
Process: Create your timeline by pulling and aggregating relevant, key information from your incident communication channels (Slack, Zoom...) and organizing it by timestamp, state, and other important lifecycle events.
Guidance: Keep it simple! It’s easy to get lost in the weeds here and fall into a rabbit hole debating details of what happened. Stay at the level of important, relevant information needed to have a healthy discussion.
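As a rough illustration of the aggregation step, here is a minimal Python sketch that merges events from several channels into one timeline ordered by timestamp. The `TimelineEvent` fields, the source labels, and the sample events are all made up for the example; how you actually export messages from your tools is up to you and is not shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str  # where the information came from (hypothetical labels)
    state: str   # incident lifecycle state at that moment
    note: str    # the key information, kept short


def build_timeline(*sources):
    """Merge events from every channel into one list ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Made-up sample events, for illustration only.
slack_events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 5), "slack", "detected", "Alert posted in channel"),
    TimelineEvent(datetime(2024, 1, 1, 10, 40), "slack", "mitigated", "Rollback confirmed"),
]
zoom_events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 15), "zoom", "investigating", "Call started"),
]

timeline = build_timeline(slack_events, zoom_events)
for event in timeline:
    print(event.timestamp.isoformat(), event.state, "-", event.note)
```

Keeping each event to a timestamp, a state, and a one-line note is one way to stay out of the weeds.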
Post-incident review session logistics:
Guidance: Key participants involved in the incident should attend so you have the right people in the room to have a comprehensive discussion. You can expand beyond core, key folks and make the Post-incident review open to your entire company in service of transparency. If you decide to do so, you may want to have guidelines around participation/observance so the session is both inclusive and effective.
Post-incident analysis
Many frameworks have been developed to aid with Post-incident analysis. You may be familiar with the “5 Why’s,” Socratic Questioning, Cause and Effect Mapping... While they have differences in approach and substance, by and large, these frameworks are more similar than different in their process and goals. We suggest picking what works best for you and your team based on your culture and workflows.
Below, we’ll take you through one Post-incident review process.
1. Step 1: Establish and agree on the suboptimal outcome(s):
Before diving into the Post-incident review, the first step is to reach an agreement on what the suboptimal outcome is. You need a clearly identified and agreed-upon suboptimal outcome before you can understand the root of why that break occurred.
Goal: identify what went wrong and where
You can also start with optional ice breaker questions:
1. What did you like?
2. What was lacking?
3. What did you learn?
4. What do you long for going forward?
To identify the suboptimal outcomes, use your timeline (detailing “what happened”) and map it to your picture of “what should have happened.” You are looking to identify the deltas, i.e., where things happened differently from how they should have if everything had been working well.
1. Establish “what should have happened”: To visualize how the system should have functioned, ask: How should the system have worked? Have a clear map of how the system should work when everything is working properly.
2. Establish “what actually happened” via your documented Timeline.
3. Now compare the “what should have happened” to “what happened.” The deltas represent the breaks in the system.
What should have happened (your mental map of how your systems and process should have functioned) | What happened (timeline) | Delta |
A | A | ✅ |
B | S | B X |
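The comparison in the table above can be sketched in a few lines of Python. This is only an illustration, and it assumes the expected and actual steps can be lined up one-to-one in order, which real incidents often will not allow. The function name `find_deltas` is hypothetical.

```python
from itertools import zip_longest


def find_deltas(expected, actual):
    """Return (step number, what should have happened, what happened)
    for every step where the two differ."""
    deltas = []
    # zip_longest pads the shorter list with None, so a missing or
    # extra step also shows up as a delta.
    for step, (should, did) in enumerate(zip_longest(expected, actual), start=1):
        if should != did:
            deltas.append((step, should, did))
    return deltas


expected = ["A", "B"]  # your map of how the system should have worked
actual = ["A", "S"]    # taken from the timeline

deltas = find_deltas(expected, actual)
print(deltas)
```

With the two rows from the table, the only delta is the second step, where B was expected and S happened.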
2. Step 2: Error Categories / Groupings
1. Take the deltas. Those are your breaks.
2. Group these breaks into error categories that make sense based on relevancy, theme, etc…
For example: if we have two breaks where 1) A deploy was slow, and 2) The wrong build was deployed, these could be grouped together under “Bad Deploys.”
3. Once you’ve grouped, get clear and in agreement on what the suboptimal outcome was for each, i.e., why was it a problem that the deploy was slow? Example below:
Bad Deploys | Latency related issues | Process not followed |
1. Break #1 2. Break #2 | 1. Break #2 2. Break #7 | 1. Break #4 2. Break #6 3. Break #3 |
Suboptimal outcome(s) = | Suboptimal outcome(s) = | Suboptimal outcome(s) = |
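If you track breaks in a script or spreadsheet export, the grouping step might look like the sketch below. The category labels are assigned by hand, which mirrors the discussion in the room; the helper name `group_breaks` and the third break are hypothetical.

```python
from collections import defaultdict


def group_breaks(breaks):
    """Group (description, category) pairs into {category: [descriptions]}."""
    groups = defaultdict(list)
    for description, category in breaks:
        groups[category].append(description)
    return dict(groups)


breaks = [
    ("A deploy was slow", "Bad Deploys"),
    ("The wrong build was deployed", "Bad Deploys"),
    ("A runbook step was skipped", "Process not followed"),  # hypothetical break
]

grouped = group_breaks(breaks)
for category, items in grouped.items():
    print(category, "->", items)
```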
Never try to solve all the problems at once — make them line up for you one-by-one.
— Richard Sloma
3. Step 3: Select one error category and kick off the Post-incident analysis:
Now that you have an agreed-upon list of error categories and their suboptimal outcomes, you can analyze the breaks to understand the underlying behaviors that produced the problem.
In a complex system, there is rarely a singular underlying behavior, but rather a series of triggers and inter-dependent causes at play. The process of Post-incident analysis is valuable in its ability to unearth the multitude of these behaviors and their relationships with each other.
Most teams ask some variation of a series of questions to get to the underlying behavior, commonly referred to as the “5 Why’s.” Note that the purpose is to ask a series of substantive questions, not merely to ask “why” five times. Guidance below:
1. Start with the first suboptimal outcome within the bucket you’ve chosen and ask why it occurred.
2. Based on the prior explanation, keep asking questions underneath that initial why until you have found the underlying behavior(s).
3. You may have multiple reasons why and a set of underlying behaviors. You can note this in a decision tree format and the many why’s may result in multiple underlying behaviors.
Guidance: It can be easy to stop at the proximate cause (the action), but keep going until you get to the underlying behavior, which is the reason for the incident, not the action that triggered it.
You can keep going for what seems like forever, but you will likely end up in the details and lose sight of the goal. A reasonable gauge to stop asking questions is when asking does not produce new / helpful responses.
Be careful of “hindsight bias” - avoid judging events and outcomes after they have occurred as if you could have predicted them.
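One way to note the questioning in a decision tree format is sketched below. It assumes that every branch ends at an underlying behavior, i.e. the point where asking again produced nothing new. The `WhyNode` class and the example answers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WhyNode:
    statement: str
    causes: list = field(default_factory=list)

    def ask_why(self, answer):
        """Record one answer to 'why did this happen?' and return it
        so the questioning can continue underneath it."""
        child = WhyNode(answer)
        self.causes.append(child)
        return child


def underlying_behaviors(node):
    """Collect the leaves of the tree: the points where asking
    further questions stopped producing new answers."""
    if not node.causes:
        return [node.statement]
    found = []
    for cause in node.causes:
        found.extend(underlying_behaviors(cause))
    return found


# Hypothetical example: one suboptimal outcome with two lines of questioning.
root = WhyNode("The wrong build was deployed")
by_hand = root.ask_why("The build was selected by hand")
by_hand.ask_why("The deploy tool does not pin the release build")
unchecked = root.ask_why("Nobody double-checked the build ID")
unchecked.ask_why("The deploy checklist has no verification step")

for behavior in underlying_behaviors(root):
    print(behavior)
```

The intermediate nodes are the proximate causes; only the leaves are reported as underlying behaviors.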
4. Step 4: Stress Testing your Post-incident review + Pattern Identification:
Once you’ve gone through your Post-incident analysis framework, you can stress test your logic.
Questions to stress test:
1. If this behavior were to be done differently next time, would the suboptimal outcome still occur?
2. What could we be missing / not seeing about this incident that could lead to a different conclusion?
3. What do we not know?
4. What assumptions are we making?
In addition to stress testing your logic, you can map these new learnings to the broader context of your incident history to identify patterns through time. To determine whether the failure is a one-off or symptomatic of a larger pattern, ask:
1. Have we seen this underlying behavior(s) before? Or a similar flavor?
2. If so, can we group these underlying behavior(s) into themes and track learnings?
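If you record the themes from each review, spotting repeats across your incident history can be as simple as the sketch below. The incident names, theme labels, and the threshold of two incidents are made up for the example.

```python
from collections import Counter


def recurring_themes(history, min_incidents=2):
    """Return themes that appear in at least `min_incidents` incidents."""
    counts = Counter()
    for themes in history.values():
        counts.update(set(themes))  # count a theme once per incident
    return {theme: n for theme, n in counts.items() if n >= min_incidents}


# Hypothetical history: incident name -> underlying-behavior themes.
history = {
    "incident-1": ["missing verification step", "alert fatigue"],
    "incident-2": ["missing verification step"],
    "incident-3": ["unclear ownership"],
}

print(recurring_themes(history))
```

A theme that shows up in only one incident may be a one-off; one that keeps returning suggests a larger pattern.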
5. Step 5: Designing the Solution
Final step! Armed with your learnings, you can determine the set of actions needed to evolve your systems or incident process.
To design the solution, ask:
1. How should the system or the incident design evolve? (i.e., what needs to change?)
2. What should be done differently next time?
3. What should be preserved that worked well?
Finally, once you’ve identified the actions that need to be taken to improve your systems, software, and process, ensure you have prioritized the actions in the context of broader engineering work and assigned each action item to an owner with a deadline. This way, the actions won’t get lost in the day-to-day of work. Ask:
1. Who is best equipped to own these actions?
2. How does this action plan fit with our current priorities? How do we weigh what we should focus on next and how?
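A lightweight check like the sketch below can flag action items that leave the review without an owner or a deadline. The `ActionItem` fields and the sample items are hypothetical; adapt them to whatever tracker you use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    deadline: Optional[date] = None


def missing_details(items):
    """List every action item that lacks an owner or a deadline."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.description}: no owner")
        if item.deadline is None:
            problems.append(f"{item.description}: no deadline")
    return problems


# Hypothetical action items from a review.
items = [
    ActionItem("Add a verification step to the deploy checklist", "sam", date(2024, 2, 1)),
    ActionItem("Pin the release build in the deploy tool", owner="alex"),
]

for problem in missing_details(items):
    print(problem)
```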