Post Mortem Process#
A description of how to write a good post-mortem and what information such a document should include.
Why do we write a Post Mortem?#
Every major incident needs to be followed up with a post-mortem. This is a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring in the future (improving our processes daily). The incident response process should also be included.
Post Mortem Boilerplate#
- Timeline
- Overview
- What happened
- Resolution
- Contributing Factors
- Impact
- What Went Well?
- What Didn't go so well?
- Action Items
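As a starting point, the boilerplate above can be turned into an empty document skeleton. The following is a minimal sketch, not an official tool: the function name `render_skeleton`, the Markdown heading levels, and the `TODO` placeholder are assumptions made for illustration; only the section names come from the list above.

```python
# Section names copied from the boilerplate list above.
SECTIONS = [
    "Timeline",
    "Overview",
    "What happened",
    "Resolution",
    "Contributing Factors",
    "Impact",
    "What Went Well?",
    "What Didn't go so well?",
    "Action Items",
]


def render_skeleton(title: str) -> str:
    """Return an empty post-mortem with one heading per boilerplate section.

    Hypothetical helper: the Markdown layout is an assumption, not a
    format prescribed by this process.
    """
    lines = [f"# {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skeleton("Post-mortem: <incident name>"))
```

Keeping the section list in one place means a new post-mortem always starts with the same structure, in the same order.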
Timeline#
List events as they happened. Be as specific as possible on times, using monitoring alerts to supplement the notes you took during the incident (you did take notes, right?). Try to be specific with events while still grouping similar events together for better readability.
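One way to combine hand-written notes with monitoring alerts is to merge them into a single list, sort by time, and collapse consecutive repeats of the same event. The sketch below assumes a hypothetical `Event` record and `build_timeline` helper; the sample events in the usage section are invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass
class Event:
    when: datetime  # timestamp taken from notes or a monitoring alert
    what: str       # short description of the event


def build_timeline(events: list[Event]) -> list[str]:
    """Sort events by time and collapse consecutive repeats into one line."""
    ordered = sorted(events, key=lambda e: e.when)
    lines = []
    # groupby only merges *adjacent* items, so repeats separated by a
    # different event stay on separate lines and the ordering is preserved.
    for what, group in groupby(ordered, key=lambda e: e.what):
        items = list(group)
        first, last = items[0].when, items[-1].when
        if len(items) == 1:
            lines.append(f"{first:%H:%M} {what}")
        else:
            lines.append(f"{first:%H:%M}-{last:%H:%M} {what} (x{len(items)})")
    return lines


if __name__ == "__main__":
    # Invented sample data, not taken from a real incident.
    sample = [
        Event(datetime(2024, 1, 1, 9, 5), "High CPU alert"),
        Event(datetime(2024, 1, 1, 9, 0), "Deploy started"),
        Event(datetime(2024, 1, 1, 9, 7), "High CPU alert"),
        Event(datetime(2024, 1, 1, 9, 20), "Rollback completed"),
    ]
    for line in build_timeline(sample):
        print(line)
```

Grouping only adjacent repeats keeps the timeline specific about when things happened while avoiding one line per alert.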
Overview#
Include a short sentence or two summarizing the contributing factors, the timeline, and the impact; for example, "On the morning of August 14th, we suffered a 14-minute SEV-1 due to a runaway process on our primary database machine. The resulting slowness caused roughly 3% of reports generated during this time to be completed out of SLA."
What happened#
Include a short description of what happened, usually based on the timeline.
Resolution#
Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution.
Contributing Factors#
Include a description of any conditions that contributed to the issue. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.
Impact#
Be specific here. Include numbers such as customers affected, cost to business, etc.
What Went Well?#
List anything you think we did well and want to call out. It's okay to not list anything.
What Didn't go so well?#
List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.
Action Items#
Include action items such as: (1) fixes required to prevent the issue in the future, (2) preparedness tasks that could help mitigate a similar incident if it came up again, (3) remaining post-mortem steps, such as an internal follow-up email, updating the public status page, etc.
Releasing a Post-Mortem#
Before releasing a post-mortem, ensure it has been reviewed by at least three different people, including its author. Preferably, one of these three would be a member of management.
For a good example of a post-mortem, see: https://amazeelabs.pagerduty.com/postmortems/1630b58b-f2d9-e0e1-bd24-7ef3cd78dca4
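The review rule in this section (at least three different people, author included) is easy to check mechanically. This is only a sketch: the function `ready_to_release` and its parameters are hypothetical, and the preference for a reviewer from management is not enforced here.

```python
def ready_to_release(author: str, reviewers: set[str], min_people: int = 3) -> bool:
    """Return True once enough distinct people have reviewed the post-mortem.

    The author counts as one of the reviewers, as the rule above states.
    Hypothetical helper; names are matched as plain strings.
    """
    people = set(reviewers) | {author}  # a set removes duplicate names
    return len(people) >= min_people


if __name__ == "__main__":
    # Author plus two other people: three distinct reviewers in total.
    print(ready_to_release("alice", {"bob", "carol"}))
    # Listing the author again as a reviewer does not add a new person.
    print(ready_to_release("alice", {"alice", "bob"}))
```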
Severity Levels#
Severity Level | Description | Typical response |
---|---|---|
SEV 1 | Critical issue that warrants public notification and liaison with executive teams. | Platform Blocker Ticket |
SEV 2 | Critical system issue actively impacting many customers' ability to use the product. | |
SEV 3 | Stability or minor customer-impacting issues that require immediate attention from service owners. | |
SEV 4 | Minor issues requiring action, but not affecting customer ability to use the product. | |
SEV 5 | Cosmetic issues or bugs, not affecting customer ability to use the product. | JIRA Ticket or Lagoon Issue |
Severity Levels: https://response.pagerduty.com/before/severity_levels/
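If severity levels need to be referenced from tooling, the table can be encoded as a small lookup. The sketch below is an assumption about how that might look: the `Severity` enum and `typical_response` function are hypothetical names, and only the two responses actually listed in the table are filled in.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Severity levels from the table; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Only the responses stated in the table. SEV 2 to SEV 4 have no
# typical response listed there, so they are deliberately left out.
TYPICAL_RESPONSE = {
    Severity.SEV1: "Platform Blocker Ticket",
    Severity.SEV5: "JIRA Ticket or Lagoon Issue",
}


def typical_response(severity: Severity) -> Optional[str]:
    """Return the typical response for a level, or None if the table lists none."""
    return TYPICAL_RESPONSE.get(severity)


if __name__ == "__main__":
    for level in Severity:
        print(level.name, "->", typical_response(level))
```

Using an `IntEnum` keeps the levels comparable, so a check such as `severity <= Severity.SEV2` selects the two most severe levels.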