Enova has been doing post-incident analysis since before I joined the company in October 2015. Back then they were called “post-mortems,” and the team could hold them only sporadically because we had so few staff to facilitate them.
Since then, we’ve done a tremendous amount of work to evolve our processes, drawing partly on existing industry trends and partly on many Enova-specific learnings:
- We changed the name from “Post-Mortem” to “Retrospective.” We are fortunate to work in an industry where our production events are not life or death, so we felt retrospective better reflected the nature of our incidents. It also allowed us to shift the tone of the meeting to something more positive — where the goal was to understand, learn, and improve as we move forward.
- We started holding retros for every.single.incident., no matter what.
- We began inviting both SMEs (subject matter experts) and stakeholders to these meetings to build empathy between two groups that sometimes seemed to have opposing goals.
- We started tracking as much data as we could — not just a rough timeline with seemingly random notes, but things like how we were alerted, all of the people/teams involved, the impact across our businesses, and lessons learned, just to name a few.
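As a rough illustration of the kind of record the last bullet describes, here is a minimal Python sketch. It is not our actual tooling: the class names, field names, and sample values are all hypothetical, chosen only to mirror the items listed above (how we were alerted, people and teams involved, impact across businesses, lessons learned, and a timeline).

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One timestamped note in an incident timeline."""
    at: datetime
    note: str


@dataclass
class IncidentRecord:
    """Hypothetical shape for the data tracked per incident."""
    incident_id: str
    alerted_by: str  # how we were alerted, e.g. "monitoring" or "customer report"
    people_involved: list[str] = field(default_factory=list)
    teams_involved: list[str] = field(default_factory=list)
    # Maps a business name to a short description of the impact on it.
    business_impact: dict[str, str] = field(default_factory=dict)
    lessons_learned: list[str] = field(default_factory=list)
    timeline: list[TimelineEntry] = field(default_factory=list)

    def sorted_timeline(self) -> list[TimelineEntry]:
        """Return timeline entries in chronological order."""
        return sorted(self.timeline, key=lambda entry: entry.at)


# Made-up sample data, for illustration only.
record = IncidentRecord(incident_id="INC-0001", alerted_by="monitoring")
record.teams_involved.append("TechOps")
record.timeline.append(TimelineEntry(datetime(2020, 1, 1, 10, 5), "rollback started"))
record.timeline.append(TimelineEntry(datetime(2020, 1, 1, 10, 0), "alert fired"))
ordered_notes = [entry.note for entry in record.sorted_timeline()]
```

Keeping each kind of information in its own field, rather than in free-form notes, is what makes it possible to compare incidents with one another later.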
I personally worked hard at ensuring that retrospectives were seen as a safe space where engineers would not be brought in to defend their actions, but rather show their expertise in helping us build a more resilient enterprise.
Adaptive Capacity Labs Workshop
While we are proud of the process we have, we are always looking to improve. Earlier this fall Sierra Navarro (who runs Technical Operations here) and I attended a week-long workshop in New York City on Post-Incident Analysis led by the folks at Adaptive Capacity Labs: Dr. Richard L. Cook and John Allspaw.
We spent 5 days learning not only from the facilitators but also sharing experiences and best practices with 22 other technologists from around the world specializing in the area of incident analysis in one way or another.
Dr. Cook and John Allspaw presented us with their ideal Post-Incident Analysis workflow. It evolves the idea of the “blameless post-mortem” into a multi-step process that takes multiple days and involves extensive research, interviews, and yes, a meeting (sometimes several!), though one with a slightly different purpose than the industry is used to.
Many of the topics we discussed over the course of the week were familiar, but I think they are worth highlighting for readers who may be at different points in their post-incident analysis journey.
(1) Contributing Factors vs. Root Cause
More often than not (perhaps always), incidents have more than a single root cause. For example, we may think an incident was caused by a software release, then realize that the release only uncovered a problem that existed due to architectural decisions made years ago, perhaps because we had a very tight deadline. Testing should have caught it, but our test suite did not go as far as it could have. Alerting did not fire because it had been disabled due to false positives. The people who knew what to look out for were on vacation and had left a junior engineer in charge during the release. Etc., etc., etc.
So we prefer to think of incidents as caused by contributing factors. These can be categorized as factors on either the blunt end or the sharp end of the spectrum.
- Blunt end – these refer to things such as policies, procedures, and pressures. Addressing these can help us design a system that allows people to make difficult decisions under tight time constraints. These factors are obviously more challenging to change and are more of a theme to guide us forward.
- Sharp end – people, code. Addressing these is what we’ve historically done with our action items.
Finally, in case it is not clear yet, none of us work in a vacuum, and understanding that fact can help us learn from incidents and eventually improve our systems and organizations.
(2) Themes vs. Action Items
We have done a good job with action items at Enova. We even have one of our yearly goals built around them. But action items don’t always capture all the ways we can leverage incidents. The goal of post-incident analysis is learning, and so sometimes action items can get in the way of learning by making it seem like their mere existence is solving all of our problems.
Often during a retrospective we will encounter a learning that cannot be so easily solved through an action item, usually related to the “blunt end” of contributing factors. Instead we are better off tracking those learnings, thinking through them, discussing them, and then deciding on how to act. A “theme” should be a broader topic, something that we can then link to a different incident. For example, while a root cause can be “software release” and the action item is to “improve post release checks”, a theme would be “complexity of our legacy system has led to engineers not having a complete grasp of the impact their changes are making”.
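To make the idea of linking a theme to several incidents concrete, here is a small Python sketch. It assumes each incident report has been tagged with zero or more themes. The function names, the incident IDs, and the second theme string are hypothetical and invented for illustration. The first theme is the example from the paragraph above.

```python
from collections import defaultdict


def incidents_by_theme(incident_themes):
    """Invert {incident_id: [themes]} into {theme: sorted [incident_ids]}."""
    by_theme = defaultdict(list)
    for incident_id, themes in incident_themes.items():
        for theme in themes:
            by_theme[theme].append(incident_id)
    return {theme: sorted(ids) for theme, ids in by_theme.items()}


def recurring_themes(incident_themes, min_incidents=2):
    """Keep only the themes that appear in at least `min_incidents` incidents."""
    grouped = incidents_by_theme(incident_themes)
    return {theme: ids for theme, ids in grouped.items() if len(ids) >= min_incidents}


# Hypothetical tagging of three incidents.
LEGACY = ("complexity of our legacy system has led to engineers not having "
          "a complete grasp of the impact their changes are making")
ALERTS = "alerts disabled because of false positives"  # invented example theme

tagged = {
    "INC-0001": [LEGACY],
    "INC-0002": [ALERTS],
    "INC-0003": [LEGACY, ALERTS],
}
recurring = recurring_themes(tagged)
```

A theme that keeps showing up across otherwise unrelated incidents is a signal that it deserves discussion in its own right, separate from any single action item.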
(3) The role of the facilitator
Running a retrospective may seem easy, but it takes a lot of patience and practice. The facilitator’s job is to ensure that the time spent on post-incident analysis is productive and not simply a ceremonious meeting that people begrudgingly attend because they are obligated to.
To maximize the value of each retro, the facilitator manages the room and drives the conversation without a personal agenda. As the facilitator, I don’t treat this as “my meeting”. I’m simply helping make it happen, knowing that my goal is to facilitate learning, not to create action items, get to one single root cause, or put blame on anyone. I want to create an environment where those involved in or affected by an incident are able to explain their experiences. I don’t assume things or make conjectures. I’m not afraid of tangents because that is how I find the “knowledge islands” where we can find the themes to analyze. My job as the facilitator is not to resolve disagreements but to highlight them and give clarity, in the hope that later on they are resolved by those empowered to do it.
A Few Take-Aways
- We’re doing really well.
I am usually an optimistic person by nature, and I genuinely love doing Incident Analysis, but I will admit that I often suffer from impostor syndrome. I don’t often get to meet people who do what I do, so it is hard to know where we stand.
But it was clear that we are doing very well. We’ve already been doing many of the things that were being taught at the workshop, and other attendees expressed interest in borrowing many of our ideas (such as the concept of tracking and celebrating near-misses, as well as analyzing and presenting quarterly and yearly data where we look at broader themes and offer recommendations to improve our resilience posture).
- Since it’s all about learning, it’s also all about your audience.
At Enova we have an Incident Report for every single Production Incident, and I am quite proud of that fact. However, not everyone in our organization (including new hires) knows that. Our incident reports are kept in Jira, but they’re not broadly read outside of our Resilience Engineering or TechOps groups.
Incident reports should be written to be read, not simply to be filed.
We want these reports to be living documents that people want to read. We want people to go back to them and learn from them, and even to add to them and start their own discussions. This is probably going to be the hardest thing for us, but we’re working on making these reports more compelling.
- Doing this well takes a lot of hard work.
Facilitating the level of learning we’re striving for takes a lot of behind-the-scenes work. Incident Analysis should be data-driven; we want to gather as much data about an incident as we can, including Slack history, logs, Jira tickets, emails, transcripts, etc.
We also want to know how people experienced the incident while they were in it, which we can only get from conversations and interviews. By the time we hold the post-incident meeting (i.e. “the retro”) we should have most of the information ready and should instead be spending our in-person meeting time focusing on the themes.
Doing work ahead of time gives us the chance to maximize the results of each retro.
There are a lot of things we can do to improve our post-incident analysis, and we hope to move in that direction, but we are also realistic about how expensive the process is in time and resources. While we make improvements, we will continue doing what has worked for us so far, making changes gradually, and when the opportunity is right, we’ll do a full-blown analysis that includes interviews, transcript analysis, etc.
We have been able to prove our success so far with metrics on the incidents themselves, but going forward we will know we are even more successful when we see an organizational shift in how we think about post-incident analysis:
- People will be reading incident reports (and participating in them, turning them into active documents).
- People will want to attend retros because that is where all the interesting learning happens.
- Having gone through the process, people will remember incidents more fully and share their learnings with new people.
We will never fully get rid of incidents — and that is not our goal! But we can shift our mindset to see them as “unplanned investments” in understanding our systems and organization better.