Outage Handling#

Outages can be very stressful, even more so when we work in small distributed teams. Not everyone behaves the same way under stress. Our process aims to make sure we can handle an outage well without overloading people physically or mentally.

Warning

This page is still a draft and needs updating

Roles#

Similar to on-call work, we follow the "one voice" approach: one person communicates with customers and stakeholders while the rest of the engineers focus on solving the issue at hand. This approach gives the engineers a distraction-free zone so they can solve issues more quickly.

As soon as the outage starts, the roles are assigned to the engineers working on the issue.

Incident Commander#

The Incident Commander acts as the single source of truth for what is currently happening and what is going to happen during a major incident.

Responsibilities#

Establish communication channels
Decide who takes the Communication and Subject Matter Expert roles

Communication#

Responsibilities#

Running communications via our status page - status.amazee.io
Running communications towards customers in Slack, RocketChat
If the issue affects several customers, be brief and inform everyone in the same way. The long-form version is always available on our status page and in the post-mortem report
Update Statuspage regularly (at least once an hour during long-running incidents)
Set a timer for 55 minutes after each update so that you are reminded to post the next one roughly 1 hour after the last, which keeps a steady flow of updates
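
The update cadence above can be sketched in a few lines of Python. This is only an illustration of the 55-minute reminder and the hourly update rule, not a tool we actually run; the names `REMINDER_DELAY`, `UPDATE_INTERVAL`, `next_reminder` and `is_update_overdue` are hypothetical and chosen for this example.

```python
from datetime import datetime, timedelta

# Values taken from the guideline above; the constant names are hypothetical.
REMINDER_DELAY = timedelta(minutes=55)   # when the timer should go off
UPDATE_INTERVAL = timedelta(hours=1)     # maximum gap between status updates


def next_reminder(last_update: datetime) -> datetime:
    """Return the time at which the reminder should fire after an update."""
    return last_update + REMINDER_DELAY


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    """Return True once a full update interval has passed since the last update."""
    return now - last_update >= UPDATE_INTERVAL


if __name__ == "__main__":
    last_update = datetime.now()
    print("Last status update:", last_update.strftime("%H:%M"))
    print("Set reminder for:  ", next_reminder(last_update).strftime("%H:%M"))
```

The reminder fires five minutes before the hour is up, which leaves a little time to collect the current state from the Incident Commander before posting.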

Subject Matter Expert / Engineering#

Responsibilities#

Engineering works on solving the issue at hand

Etiquette#

Ask how you can support the team or whether they need anything (a coffee, a break). This becomes important during long outages.