Post Mortem Process

A description of how to write a good post-mortem and what information such a document should include.

Why do we write a Post Mortem?

Every major incident must be followed up with a post-mortem. This is a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again, so that our processes keep improving. The post-mortem should also cover the incident response process.

Post Mortem Boilerplate

Timeline
Overview
What happened
Resolution
Contributing Factors
Impact
What Went Well?
What Didn't go so well?
Action Items
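
As a starting point, the section list above could be turned into an empty document skeleton. The following is only a minimal sketch: the function name `render_skeleton`, the "TODO" placeholder, and the example title are assumptions for illustration, not part of any existing tooling.

```python
# Section names copied from the boilerplate list above.
SECTIONS = [
    "Timeline",
    "Overview",
    "What happened",
    "Resolution",
    "Contributing Factors",
    "Impact",
    "What Went Well?",
    "What Didn't go so well?",
    "Action Items",
]


def render_skeleton(title: str) -> str:
    """Return an empty post-mortem with one heading per section.

    Hypothetical helper: each section gets a "TODO" placeholder
    for the author to replace.
    """
    lines = [title, ""]
    for section in SECTIONS:
        lines.extend([section, "", "TODO", ""])
    return "\n".join(lines)


if __name__ == "__main__":
    # The title is a made-up example.
    print(render_skeleton("Post Mortem: <incident name>"))
```

Starting from a generated skeleton makes it harder to forget a section when writing under time pressure.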

Timeline

List events as they happened. Be as specific as possible about times, using monitoring alerts to supplement the notes you took during the incident (you did take notes, right?). Keep each event specific, but group similar events together for better readability.
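
One way to combine your own notes with monitoring alerts is to merge both into a single list sorted by time. The sketch below assumes each event has already been reduced to a timestamp, a source, and a short text. The `Event` type and both function names are hypothetical, and exporting alerts from your monitoring system is not shown.

```python
from datetime import datetime
from typing import List, NamedTuple


class Event(NamedTuple):
    time: datetime
    source: str  # for example "notes" or "alert"
    text: str


def build_timeline(notes: List[Event], alerts: List[Event]) -> List[Event]:
    """Merge hand-written notes and monitoring alerts, ordered by time."""
    return sorted(notes + alerts, key=lambda event: event.time)


def format_timeline(events: List[Event]) -> str:
    """Render one line per event: HH:MM, source, text."""
    return "\n".join(
        f"{event.time:%H:%M} [{event.source}] {event.text}" for event in events
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    notes = [
        Event(datetime(2024, 8, 14, 9, 5), "notes", "On-call engineer acknowledges page"),
        Event(datetime(2024, 8, 14, 9, 16), "notes", "Runaway process stopped"),
    ]
    alerts = [
        Event(datetime(2024, 8, 14, 9, 2), "alert", "Database CPU above threshold"),
        Event(datetime(2024, 8, 14, 9, 17), "alert", "Database CPU back to normal"),
    ]
    print(format_timeline(build_timeline(notes, alerts)))
```

Keeping the source next to each entry shows which times come from monitoring and which from memory or notes.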

Overview

Include a sentence or two summarizing the contributing factors, the timeline, and the impact. For example: "On the morning of August 14th, we suffered a 14-minute SEV-1 due to a runaway process on our primary database machine. The resulting slowness caused roughly 3% of reports generated during this time to be completed out of SLA."

What happened

Include a short description of what happened, usually based on the timeline.

Resolution

Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution.

Contributing Factors

Include a description of any conditions that contributed to the issue. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.

Impact

Be specific here. Include numbers such as customers affected, cost to business, etc.
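
To turn raw counts into the kind of figure an impact statement needs, a small calculation is enough. This is a minimal sketch with made-up numbers (36 affected reports out of 1200, during a 14-minute incident). Both function names are hypothetical, and the real counts would come from your own data.

```python
def percent_affected(affected: int, total: int) -> float:
    """Return the share of affected items as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


def impact_summary(affected: int, total: int, duration_minutes: int) -> str:
    """Build a one-line impact statement from the raw counts."""
    pct = percent_affected(affected, total)
    return (
        f"{affected} of {total} reports ({pct:.1f}%) completed out of SLA "
        f"during a {duration_minutes}-minute incident."
    )


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real incident.
    print(impact_summary(affected=36, total=1200, duration_minutes=14))
```

Stating both the absolute count and the percentage lets readers judge the scale without looking up the totals.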

What Went Well?

List anything you think we did well and want to call out. It's okay to not list anything.

What Didn't go so well?

List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.

Action Items

Include action items such as: (1) fixes required to prevent the issue in the future, (2) preparedness tasks that could help mitigate a similar incident if it occurs again, (3) remaining post-mortem steps, such as an internal follow-up email, updating the public status page, etc.

Releasing a Post-Mortem

Before releasing a post-mortem, ensure it has been reviewed by at least three different people, including its author. Preferably, one of the three should be a member of management.

For a really good example of a post-mortem, see: https://amazeelabs.pagerduty.com/postmortems/1630b58b-f2d9-e0e1-bd24-7ef3cd78dca4