
Being On-Call#

What is On-Call?#

On-call is part of the System Engineer job at amazee.io. This means you can be contacted at any time to investigate and fix an issue that arises on the platform.

As we are a fully distributed team, your on-call responsibilities usually start at the handover meeting in the morning and end at the handover in the evening.

If a team within a timezone cannot handle a situation on its own, the standard operating procedure is to escalate to the senior engineers in another timezone.

Stressful Situation

We are fully aware that being on-call can be very stressful. If you ever feel that a situation is getting out of control, remember this point, which we cannot stress enough: never hesitate to escalate to the next on-call engineer.

Responsibilities#

Prepare#

  • Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc).
  • Make sure you have something to charge your laptop and your phone.
  • Have PagerDuty configured with your phone number; this allows PhoneDuty to route calls made to the 24/7 emergency number to you appropriately. It is also wise to verify this configuration and make sure the local PagerDuty number is not blocked by your phone's "Do Not Disturb" mode (see the sketch after this list for one way to check via the API).
  • Team alert escalation happens after 15 minutes. Set your notifications accordingly.
  • Make sure PagerDuty texts and calls can bypass your "Do Not Disturb" settings.
  • Be prepared (your environment is set up, a current working copy of the necessary repos is local and functioning, your workstation environments are configured and tested, your credentials for third-party services are current, and so on...)
  • Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc.
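
If you want to double-check from the command line that PagerDuty actually has your phone number on file, a minimal sketch against the PagerDuty REST API v2 could look like the following (the API token and user ID are placeholders, not real values):

```python
# Hypothetical sketch: list the contact methods PagerDuty has on file for you,
# so you can verify your phone number is registered correctly.
# The API token and user ID below are placeholders, not real values.
import requests

API_TOKEN = "u+XXXXXXXXXXXXXXXXXXXX"   # placeholder PagerDuty API token
USER_ID = "PABC123"                    # placeholder PagerDuty user ID

resp = requests.get(
    f"https://api.pagerduty.com/users/{USER_ID}/contact_methods",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
)
resp.raise_for_status()

# Print each contact method so you can spot a missing or outdated phone number.
for method in resp.json()["contact_methods"]:
    print(method["type"], method.get("address"))
```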

Triage#

  • Acknowledge and act on alerts whenever you can (see the first "Non-Responsibilities" point below)
  • Determine the urgency of the problem at hand:
      • Does it need immediate attention, or does it need to be escalated into a major incident? If so, escalate it.
      • Is it a non-critical issue (e.g. disk utilization, pod counts, an SSL certificate with more than 72 hours of validity)? Snooze the alert until a more suitable time (working hours); see the sketch after this list.
  • If the issue is a recurring one, file a ticket to fix it and make people aware in #amazeeio-log.
  • If the issue needs another set of eyes from another team, create a ticket and flag it (so it can be discussed in the handover meeting).
  • Check Slack for activity. Sometimes actions that could lead to the on-call being pinged will be announced there (e.g. a production deployment with downtime).
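
For non-critical alerts that can safely wait until working hours, snoozing can be done in the PagerDuty UI or mobile app; if you prefer to script it, a minimal sketch using the PagerDuty REST API v2 snooze endpoint might look like this (the token, e-mail address, incident ID, and duration are placeholders):

```python
# Hypothetical sketch: snooze a non-critical PagerDuty incident until working
# hours. The API token, e-mail address, incident ID, and duration are placeholders.
import requests

API_TOKEN = "u+XXXXXXXXXXXXXXXXXXXX"   # placeholder PagerDuty API token
ONCALL_EMAIL = "oncall@example.com"    # the "From" header the API expects; placeholder
INCIDENT_ID = "PINC0001"               # placeholder incident ID
SNOOZE_SECONDS = 8 * 60 * 60           # e.g. snooze for 8 hours

resp = requests.post(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}/snooze",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "From": ONCALL_EMAIL,
    },
    json={"duration": SNOOZE_SECONDS},
)
resp.raise_for_status()
print("snoozed incident:", resp.json().get("incident", {}).get("id"))
```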

Fix#

  • You are empowered to dive into any problem and act to fix it
  • Involve the other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or the alert is something you have not tackled before.
  • If the issue is not time-sensitive and you have other priority work, create a JIRA ticket to keep track of it.
  • Leave a trace of what you did in #amazeeio-log (use threads to document your steps; this will help others understand what happened).

Improve#

  • If an issue keeps reappearing, or alerts often but turns out to be a preventable non-issue, perhaps improving it should become a longer-term task:
      • e.g. noisy alerts, disks that keep filling up.
  • If information is difficult / impossible to find, write it down.

Support#

  • When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note (be it through a flagged ticket or a verbal handover in the handover meeting).
  • If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.
  • Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration.

Admin work#

As on-call work usually happens outside of working hours, there are some administrative tasks that need to be done:

Non-Responsibilities#

  1. No expectation to be the first to acknowledge all of the alerts during the on-call period.
  2. Commutes (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That is what we have the backup on-call and schedule for. If you can't take the alert, just escalate it.
  3. No expectation to fix all issues by yourself.
  4. No one knows every system. The whole team is here to help. There is no shame in asking and much to be learned (if you have never touched a legacy system, good for you; there might be a team member that built it and knows the ins and outs). Don't hesitate to escalate to the team; you can fix it together.
  5. As we are a relatively small team, everyone has their parts of the platform where they understand how things work. It sometimes happens that the documentation is lacking; working together to fix an issue and documenting it in the process is often the best way forward.

Suggested Practices#

Alert Notifications#

If an alert is not acknowledged within 15 minutes, it is escalated to the second person on call.

  • Monitor Slack: #amazeeio-ops gets the critical alerts
  • After 1-2 minutes: use push notifications as your first method of notification.
  • After 2 minutes, and then every 5 minutes: use phone and/or SMS notifications. If you don't pick up within 15 minutes, the alert is escalated. (A sketch of configuring this cadence via the PagerDuty API follows below.)
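
These notification rules are normally configured in your PagerDuty profile, but they can also be managed via the API. The following is a rough sketch against the PagerDuty REST API v2, approximating the cadence above; the token, user ID, and contact method IDs are placeholders, and the exact request shape should be checked against the current PagerDuty documentation:

```python
# Hypothetical sketch: recreate the notification cadence above as PagerDuty
# notification rules. Token, user ID, and contact method IDs are placeholders.
import requests

API_TOKEN = "u+XXXXXXXXXXXXXXXXXXXX"   # placeholder PagerDuty API token
USER_ID = "PABC123"                    # placeholder PagerDuty user ID
HEADERS = {
    "Authorization": f"Token token={API_TOKEN}",
    "Content-Type": "application/json",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# (delay in minutes, contact method ID, contact method type)
# Approximation of the cadence above; add more rules at later delays to repeat.
RULES = [
    (1, "PPUSH001", "push_notification_contact_method"),  # push after 1 minute
    (2, "PPHONE01", "phone_contact_method"),               # call after 2 minutes
    (5, "PSMS0001", "sms_contact_method"),                 # SMS after 5 minutes
]

for delay, method_id, method_type in RULES:
    payload = {
        "notification_rule": {
            "type": "assignment_notification_rule",
            "start_delay_in_minutes": delay,
            "urgency": "high",
            "contact_method": {"id": method_id, "type": method_type},
        }
    }
    resp = requests.post(
        f"https://api.pagerduty.com/users/{USER_ID}/notification_rules",
        headers=HEADERS,
        json=payload,
    )
    resp.raise_for_status()
    print("created rule with delay", delay, "minutes")
```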

Merging Alerts#

You can (and likely should) merge related alerts into a single event via PagerDuty. When merging events, bring them together in logical groups. Grouping by cluster is a good start. Also make sure to rename the grouped alert so that others viewing the PagerDuty overview can know what the related alerts are, without having to dig.
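
Merging is easiest from the PagerDuty UI, but it can also be scripted, for example when folding a burst of related alerts from one cluster into a single incident. Below is a minimal sketch using the PagerDuty REST API v2 merge endpoint; the token, e-mail address, and incident IDs are placeholders:

```python
# Hypothetical sketch: merge related PagerDuty incidents (for example, a burst of
# alerts from the same cluster) into a single incident. The API token, e-mail
# address, and incident IDs are placeholders.
import requests

API_TOKEN = "u+XXXXXXXXXXXXXXXXXXXX"   # placeholder PagerDuty API token
ONCALL_EMAIL = "oncall@example.com"    # the "From" header the API expects; placeholder
HEADERS = {
    "Authorization": f"Token token={API_TOKEN}",
    "Content-Type": "application/json",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "From": ONCALL_EMAIL,
}


def merge_incidents(target_id: str, source_ids: list[str]) -> dict:
    """Merge the source incidents into the target incident and return it."""
    payload = {
        "source_incidents": [
            {"id": sid, "type": "incident_reference"} for sid in source_ids
        ]
    }
    resp = requests.put(
        f"https://api.pagerduty.com/incidents/{target_id}/merge",
        headers=HEADERS,
        json=payload,
    )
    resp.raise_for_status()
    return resp.json().get("incident", {})


# Fold three alerts from the same cluster into one incident; afterwards, rename the
# surviving incident (e.g. in the PagerDuty UI) so the group is self-explanatory.
merged = merge_incidents("PTARGET1", ["PSRC0001", "PSRC0002", "PSRC0003"])
print("merged into incident:", merged.get("id"))
```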

Etiquette#

  • Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then it shouldn't be acknowledged by you. Offer your help in the operations channel if you happen to know the particular issue.
  • If you are testing something, or performing an action that you know will cause pages, it's good behavior to "take the pager" for the time you are working on the system. Just let the person on-call know that you will take the pager.
  • "Never hesitate to escalate" - Don't feel ashamed to get external help from your team if you don't know how to resolve an issue.
  • Always consider covering an hour or so of someone else's on-call time if they ask for it and you are able. Most of our on-call times are during business hours. Covering while someone goes on an extended lunch with friends or takes half a day off is greatly appreciated; one day you might also want to run out early on a Friday to enjoy the weekend ;)
  • If an issue you are paged for comes up during your on-call shift, you are responsible for resolving it, even if it takes two hours and you only have an hour left in your shift. You can hand it over to the next on-call if they agree, or even solve it together.

Acknowledgements#

Emergency Phone Number#

One part of our on-call system is automated monitoring. Another crucial part is the emergency phone number, via which customers can alert us if they need to direct our attention during hours when there is no support staff online, or if they need to warn us of something and cannot get in touch with us any other way (e.g. unavailability of other communication channels).

Process#

We use PagerDuty Live Call Routing, to which we direct all the local phone numbers.

Customers must first try to get in touch with us via written communication (e.g. creating a ticket) before contacting us via the emergency phone number.

The emergency number takes all information via voicemail, which is then sent as an escalating alert through the current on-call engineers.

As soon as the engineer receives the alert, we need to verify that the customer is who they claim to be. If a ticket has been created, or we are in touch via Slack in a shared channel, we generally trust that the customer has valid reasons to call the emergency line.

After verification, we start working with the customer on the case they opened.