Outages can be very stressful, even more so when we work in small distributed teams. Not everyone behaves the same way under stress. Our process aims to ensure that we handle an outage well and don't overload people physically or mentally.
This page is still a draft and needs updating.
As with on-call work, we follow the "one voice" approach: one person communicates with customers and stakeholders while the rest of the engineers focus on solving the issue at hand. This approach gives the engineers a distraction-free zone so that issues can be solved more quickly.
As soon as the outage starts, the roles are assigned to the engineers working on the issue.
The Incident Commander acts as the single source of truth for what is currently happening and what is going to happen during a major incident.
- Establish communication channels
- Decide who takes the roles of Communication and Subject Matter Expert
- Run communications via our status page (status.amazee.io)
- Run communications with customers in Slack and RocketChat
- If the issue affects several customers, be brief and inform everyone in the same way. The long-form version is always available on our status page and in the post-mortem report
- Update Statuspage regularly (in long-running incidents, at least once an hour)
- Set a timer for 55 minutes after each update so that you are reminded to post the next one roughly 1 hour after the last, which keeps a steady flow of updates
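The 55-minute timer described above can be sketched in a few lines of Python. This is only an illustration, not part of our tooling: the names `REMINDER_DELAY`, `next_reminder`, and `minutes_until_reminder` are hypothetical, and the sketch assumes you record the time of the last status page update yourself.

```python
from datetime import datetime, timedelta

# Assumption: 55 minutes mirrors the timer suggested in the list above.
REMINDER_DELAY = timedelta(minutes=55)


def next_reminder(last_update: datetime, delay: timedelta = REMINDER_DELAY) -> datetime:
    """Return the time at which the reminder for the next status update fires."""
    return last_update + delay


def minutes_until_reminder(last_update: datetime, now: datetime) -> float:
    """Return the minutes left until the reminder, or 0.0 if it is already due."""
    remaining = next_reminder(last_update) - now
    return max(remaining.total_seconds() / 60, 0.0)


if __name__ == "__main__":
    # Hypothetical example: the last update went out at 14:00.
    last = datetime(2024, 1, 1, 14, 0)
    print(next_reminder(last))
    print(minutes_until_reminder(last, datetime(2024, 1, 1, 14, 30)))
```

In practice a phone or desktop timer does the same job; the point is simply that the reminder is anchored to the time of the last update rather than to the start of the incident.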
Subject Matter Expert / Engineering
The engineering team works on solving the issue at hand.
- Ask how you can support the team or whether they need anything (a coffee, a break). This can become important during long outages.