Outages can be very stressful, even more so when we work in small remote teams. Not everyone behaves the same way under stress. Our process aims to make sure we handle an outage well without overloading people physically or mentally.
As with on-call work, we follow the "one voice" approach: one person communicates with customers and stakeholders while the rest of the engineers focus on solving the issue at hand. This approach gives the engineers a distraction-free zone so that issues can be solved more quickly.
As soon as the outage starts, roles are assigned to the engineers working on the issue.
The Incident Commander acts as the single source of truth for what is currently happening and what is going to happen during a major incident.
Establish communication channels
Decide who takes the roles of Communication and Subject Matter Expert
Run communications via our status page: status.amazee.io
Run communications with customers in Slack and RocketChat
If the issue affects several customers, be brief and inform everyone in the same way. The long-form account is always available on our status page and in the post-mortem report
Update Statuspage regularly (at least once an hour during long-running incidents)
Set a timer for 55 minutes after the last update so that you are automatically reminded to update the status page about 1 hour after the last update. This keeps a steady flow of updates
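The 55-minute timer rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of our tooling: the function names `next_reminder` and `update_overdue` are hypothetical, and the one-hour limit is taken from the "at least once hourly" guideline above.

```python
from datetime import datetime, timedelta

# Values taken from the guideline above: remind after 55 minutes,
# so the next update lands roughly 1 hour after the previous one.
REMINDER_DELAY = timedelta(minutes=55)
UPDATE_INTERVAL = timedelta(hours=1)


def next_reminder(last_update: datetime) -> datetime:
    """Return the time at which the reminder timer should fire (hypothetical helper)."""
    return last_update + REMINDER_DELAY


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """Return True if more than one hour has passed since the last update (hypothetical helper)."""
    return now - last_update > UPDATE_INTERVAL


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)
    print("Remind at:", next_reminder(last).strftime("%H:%M"))
    print("Overdue at 13:05?", update_overdue(last, datetime(2024, 1, 1, 13, 5)))
```

The five-minute gap between the reminder and the hourly mark leaves time to write the update before it is due.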
The engineering side works on solving the issue at hand
Ask how you can support the team and whether they have any needs (a coffee, a break). This can become important during long outages.