Post Mortem Process#

A description of how to write a good post-mortem and what information such a document should include.

Why do we write a Post Mortem?#

Every major incident needs to be followed up with a post-mortem. This is a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again (improving our processes daily). The incident response process should also be included.

Post Mortem Boilerplate#

- Timeline
- Overview
- What happened
- Resolution
- Contributing Factors
- Impact
- What Went Well?
- What Didn't Go So Well?
- Action Items
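The section list above can be turned into an empty document skeleton so that no section is forgotten. The following is a minimal sketch, not an official tool: the function name `render_skeleton`, the underline-style headings, and the `TODO` placeholder are assumptions made for illustration.

```python
from typing import Sequence

# Section names taken from the boilerplate list, in the same order.
SECTIONS = (
    "Timeline",
    "Overview",
    "What happened",
    "Resolution",
    "Contributing Factors",
    "Impact",
    "What Went Well?",
    "What Didn't Go So Well?",
    "Action Items",
)


def render_skeleton(title: str, sections: Sequence[str] = SECTIONS) -> str:
    """Return a plain-text post-mortem skeleton (hypothetical helper)."""
    lines = [title, "=" * len(title), ""]
    for name in sections:
        # Each section gets an underlined heading and a placeholder body.
        lines += [name, "-" * len(name), "", "TODO", ""]
    return "\n".join(lines)
```

Calling `render_skeleton("Post Mortem: <incident name>")` returns text that can be pasted into a new document and filled in section by section.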

Timeline#

List events as they happened. Be as specific as possible about times, using monitoring alerts to supplement the notes you took during the incident (you did take notes, right?). Try to be specific about events while still grouping similar events together for better readability.
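One way to combine hand-written notes with monitoring alerts is to merge both into a single list sorted by time. The sketch below assumes that every timestamp is already in UTC; the `Event` type, the `source` labels, and the output format are hypothetical choices, not part of any existing tooling.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    at: datetime   # assumed to be UTC
    source: str    # e.g. "notes" or "alert" (hypothetical labels)
    text: str


def build_timeline(notes: Iterable[Event], alerts: Iterable[Event]) -> List[Event]:
    """Merge notes and alerts into one chronologically ordered list."""
    return sorted([*notes, *alerts], key=lambda event: event.at)


def render(events: Iterable[Event]) -> str:
    """Render one line per event, prefixed with its timestamp and source."""
    return "\n".join(
        f"{e.at:%Y-%m-%d %H:%M} UTC [{e.source}] {e.text}" for e in events
    )
```

Grouping similar events for readability is still a manual editing step; the merge only guarantees the ordering.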

Overview#

Include a sentence or two summarizing the contributing factors, the timeline, and the impact. For example: "On the morning of August 14th, we suffered a 14-minute SEV-1 due to a runaway process on our primary database machine. The resulting slowness caused roughly 3% of reports generated during this time to be completed out of SLA."

What happened#

Include a short description of what happened, usually based on the timeline.

Resolution#

Include a description of what solved the problem. If a temporary fix was put in place, describe it along with the long-term solution.

Contributing Factors#

Include a description of any conditions that contributed to the issue. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.

Impact#

Be specific here. Include numbers such as customers affected, cost to business, etc.

What Went Well?#

List anything you think we did well and want to call out. It's okay to not list anything.

What Didn't Go So Well?#

List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.

Action Items#

Include action items such as: (1) fixes required to prevent the issue in the future, (2) preparedness tasks that could help mitigate a similar incident if it occurs again, (3) remaining post-mortem steps, such as an internal follow-up email, updating the public status page, etc.

Releasing a Post-Mortem#

Before releasing a post-mortem, ensure it has been reviewed by at least three different people, including its author. Preferably, one of these three would be a member of management.
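The review rule can be expressed as a small check. This is only a sketch of the rule as stated: the function names and the idea of identifying people by name strings are assumptions, and the management check reflects a preference, not a hard requirement.

```python
from typing import Iterable


def ready_for_release(author: str, reviewers: Iterable[str], minimum: int = 3) -> bool:
    """True when at least `minimum` distinct people, counting the author, have reviewed."""
    people = set(reviewers) | {author}
    return len(people) >= minimum


def has_management_review(author: str, reviewers: Iterable[str], managers: Iterable[str]) -> bool:
    """True when at least one of the people involved is a member of management."""
    people = set(reviewers) | {author}
    return bool(people & set(managers))
```

Because the author counts toward the three, a post-mortem needs at least two reviewers other than its author.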

For a good example of a post-mortem, see: https://amazeelabs.pagerduty.com/postmortems/1630b58b-f2d9-e0e1-bd24-7ef3cd78dca4

Severity Levels#

Each severity level below lists its description, examples, and, where one is given, the typical response.

SEV 1

Description: Critical issue that warrants public notification and liaison with executive teams.

  • The system is in a critical state and is actively impacting a large number of customers.
  • Functionality has been severely impaired for a long time, breaking SLA.
  • DDoS or infrastructure incident impacting an entire hosting region.
  • A customer-data-exposing security vulnerability has come to our attention.

Typical response: Platform Blocker Ticket

SEV 2

Description: Critical system issue actively impacting many customers' ability to use the product.

  • Web app is unavailable or experiencing severe performance degradation for most/all users.

SEV 3

Description: Stability or minor customer-impacting issues that require immediate attention from service owners.

  • Partial loss of functionality, not affecting the majority of customers.
  • Non-working deployments for a majority of customers.

SEV 4

Description: Minor issues requiring action, but not affecting customer ability to use the product.

  • Delayed deployments.
  • Cluster host failure (e.g. single K8s node failure).

SEV 5

Description: Cosmetic issues or bugs, not affecting customer ability to use the product.

  • Bugs not impacting the immediate ability to use the system.
  • Workaround available.

Typical response: JIRA Ticket or Lagoon Issue
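For tooling that needs to reason about severity, the levels can be modelled as an ordered enum with a lookup for the typical response. This is a minimal sketch: the names `Severity`, `TYPICAL_RESPONSE`, and `typical_response` are hypothetical, and since a typical response is listed here only for SEV 1 and SEV 5, the other levels deliberately return `None` instead of a guessed value.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Lower number means more severe, matching the SEV 1 to SEV 5 naming."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Only the responses listed in the severity table; nothing is filled in for SEV 2-4.
TYPICAL_RESPONSE = {
    Severity.SEV1: "Platform Blocker Ticket",
    Severity.SEV5: "JIRA Ticket or Lagoon Issue",
}


def typical_response(level: Severity) -> Optional[str]:
    """Return the listed typical response, or None when none is listed."""
    return TYPICAL_RESPONSE.get(level)


def is_more_severe(a: Severity, b: Severity) -> bool:
    """True when `a` is a more severe level than `b`."""
    return a < b
```

Using `IntEnum` keeps comparisons such as `Severity.SEV1 < Severity.SEV3` straightforward, at the cost of the slightly counterintuitive "smaller is worse" ordering.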

Severity Levels: https://response.pagerduty.com/before/severity_levels/