Common Mistakes and Pitfalls of Post-Incident Reviews and How Your Team Can Avoid Them
The important thing is not to stop questioning. Curiosity has its own reason for existing.
- Albert Einstein
Post-incident reviews have become a widely accepted practice in incident management. They offer the opportunity to bring the team together, reflect on what happened and why, and use those learnings to make key fixes and improve the process.
Yet, post-incident reviews are tricky. They can lead to oversimplification. They can be powered by poor assumptions or faulty timelines, leading to inaccurate root causes. They can devolve into finger-pointing...
In this post, we will walk through some of the common ways we have seen post-incident reviews go awry and we’ll offer guidance on how to lead an effective post-incident review.
Watch out for:
1. Oversimplification: Proximate vs. Underlying, Root Behaviors
tl;dr Push yourself to go beyond the superficial cause to find the deeper, core set of behaviors that triggered the incident. An underlying behavior is one that, if you were to eliminate or alter it, would change the outcome.
Without realizing it, teams often stop the post-incident review at the proximate cause (I’ve been there). It can be tempting to think you’ve found the underlying cause. Often though, you’ve found the superficial action instead of the underlying behavior.
Let’s break down what is meant by proximate vs. underlying root behavior:
In many post-incident reviews, teams stop at “human error” or “bad code deployed.” These are proximate causes (the cause above the cause…). Proximate cause is the action, not the behavior. That is, if you stop at “human error,” how does your team know what to change to prevent the incident from recurring?
But if you dig further, you can uncover the underlying, root behaviors. For instance, the root under “human error” could be a poorly designed on-call schedule that lacked proper rotation mechanisms and led to a fatigued, burnt-out team, contributing to the suboptimal outcome. Now you have one underlying behavior. If you were to alter the design of the on-call schedule, it would fundamentally change the outcome. Note that there are often multiple underlying behaviors that contribute to an incident.
The difference between proximate and underlying root behaviors is the difference between glossing over the problem vs. understanding and solving the problem for the future.
2. Poor Assumptions and Faulty Timelines vs. Quality Input
tl;dr Question your assumptions and biases and put in the effort to create a quality timeline that reflects the multitude of events.
The saying “Garbage In = Garbage Out” is particularly applicable to the world of post-incident reviews. While creating a timeline can be burdensome and time-consuming, it is well worth the effort.
The quality of your post-incident review is contingent on accurately and comprehensively synthesizing the events and factors at play during the incident.
By gathering key information and data points into a timeline ahead of time, you’re ready to focus the post-incident review on understanding the why (underlying behaviors at play) vs. rehashing the what.
The better and more accurate your timeline, the higher the probability of getting to the underlying behaviors. Imagine the output of a post-incident review that relied on an inaccurate or incomplete timeline. How would you expect to understand why the incident occurred or know what needs to change across your systems, software, and processes? You’d be making decisions based on faulty, inaccurate information.
In addition to building a quality timeline, it’s also important to call out assumptions being made. As humans we all have biases and blind spots. Our memories are never as good as we like to think.
Rather than relying on a potentially flawed, memory-based account to build your timeline, we encourage you to go back and create a system of record directly from the source, whether the incident conversation took place in Slack, Zoom, or another communication channel.
Finally, as you create the timeline, it’s helpful to note where you are making assumptions and what you don’t know so that you can discuss those gaps during the post-incident review.
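For teams that like to script this step, here is a minimal sketch of one way to merge events from several sources into a single chronological timeline while keeping assumptions visibly flagged. Everything in it is hypothetical and for illustration only: the TimelineEntry type, the build_timeline and open_questions helpers, and the sample events are our own inventions, not part of any Slack or Zoom API. In practice you would fill the lists from your real exports.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One event in the incident timeline (hypothetical structure)."""
    timestamp: datetime
    source: str            # where the record came from, e.g. "slack" or "zoom"
    description: str
    is_assumption: bool = False  # True when this is a guess, not a recorded fact


def build_timeline(*sources):
    """Merge entries from any number of sources and sort them by time."""
    merged = [entry for entries in sources for entry in entries]
    return sorted(merged, key=lambda entry: entry.timestamp)


def open_questions(timeline):
    """Return the entries flagged as assumptions, to discuss in the review."""
    return [entry for entry in timeline if entry.is_assumption]


# Made-up sample data standing in for real exports.
slack_entries = [
    TimelineEntry(datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
                  "slack", "On-call engineer acknowledges the page"),
    TimelineEntry(datetime(2024, 1, 1, 10, 20, tzinfo=timezone.utc),
                  "slack", "Rollback probably started around now",
                  is_assumption=True),
]
zoom_entries = [
    TimelineEntry(datetime(2024, 1, 1, 10, 12, tzinfo=timezone.utc),
                  "zoom", "Incident call opened"),
]

timeline = build_timeline(slack_entries, zoom_entries)
for entry in timeline:
    flag = "(assumption) " if entry.is_assumption else ""
    print(f"{entry.timestamp:%H:%M} [{entry.source}] {flag}{entry.description}")
```

Keeping the assumption flag on each entry, rather than in a separate document, means the list of things to verify travels with the timeline into the review.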
3. Finger Pointing vs. Thoughtful Disagreement
The purpose of a post-incident review is to get to the truth by understanding what happened and why and then using that information to make impactful changes that will produce better outcomes.
Sometimes post-incident reviews devolve into vicious finger-pointing, leading to unhelpful blaming.
Alternatively, and also unhelpful, post-incident reviews can stay at the superficial level because people are not willing to have difficult conversations about what went wrong and why.
It’s important to reiterate the goal of a post-incident review: an opportunity for collaborative discovery to get to the truth. It is not about blame or credit; it is about understanding and holding ourselves accountable to learning and evolution. To get there, hard conversation is often required.
In order to foster thoughtful, challenging discussion, consider the following practices:
1. Create a safe place to think and disagree: everyone in the room needs to feel comfortable sharing his or her view. Prior to the start of the post-incident review, articulate your team’s norms and guidelines for operating with each other.
2. Ground the discussion in facts: this is where a quality timeline is key. You can use this artifact to drive discussion and reduce emotion and bias in the conversation. Remember to call out assumptions, known unknowns, and blind spots. And remember that there are many more you are not seeing: the unknown unknowns!
3. Encourage healthy, thoughtful disagreement: Productive disagreement is valuable. Harnessing different perspectives and digging into the uncomfortable form the basis of a productive post-incident review and are a means of actually getting to the root cause. Beyond post-incident reviews, thoughtful, hard conversation forms the basis of strong relationships based on mutual trust. Embrace constructive disagreement and know the difference between valuable and harmful conflict.
One principle we find useful is reminding the team to “seek to understand vs. seek to be understood.” To achieve this, focus on asking questions, being aware of your own biases and assumptions, and identifying the factors at play and what is known vs unknown.
Post-incident reviews are a significant and expensive undertaking. You have a room full of your engineers and often teams from across the organization. The stakes are also high: you need to learn and to figure out what needs to change across your systems, software, and workflows to improve outcomes going forward.
Given the high-cost, high-stakes nature of post-incident reviews, it’s worth investing effort in systematizing the process to ensure you’re getting the value from the practice.
If you put the time in, here are some of the key benefits companies have experienced from post-incident reviews:
1. Pattern Recognition: by understanding why incidents have occurred and compounding those learnings over time, you develop the ability to recognize patterns around incidents, leading to thoughtful and faster incident resolution.
2. Future Fires Prevented: Learning from our incidents enables us to move from running around “putting out fires” to preventing fires from occurring again.
3. Team Learning and Evolution: Post-incident review is an opportunity for teams to come together and learn in a safe, blameless environment.
By addressing some of the common ways a post-incident review can go awry and offering guidance on how to address them, we hope to help your team get the most out of post-incident reviews.
You can access our Post-Incident Review Playbook for step-by-step guidance on running your incidents.