Rhythms and Rituals#
This page describes the Rhythms and Rituals of the Platform Operations team. It is a living document that we update as we learn and evolve. Our team is spread across different time zones, so we have a few rituals that help us stay in sync. We strive to work as asynchronously as possible to lower the impact on our team members' personal lives.
The following list gives an overview of our rhythms and their frequency. Each one is documented in more detail below.
- Weekly Platform Notes (Friday)
- Maintenance Window Planning (Monday)
- Maintenance Window Execution (Tuesday)
Rhythms in Detail#
We run three daily handovers, one in each region. Each standup is 30 minutes long. You can find more information on the Handovers/Standups page, which outlines the meeting structure.
Monitoring Checks are done daily by the on-call engineer. The on-call engineer also looks into improvements to the current checks and takes action where needed.
Platform Operations Weekly Notes#
The weekly platform notes are a summary of the week's activities. They let us know about essential items ahead of time and keep the team updated across regions.
The document currently covers four topics:
- Load tests
Maintenance Window Planning and Execution#
Maintenance planning is done on Monday, and the maintenance window is executed on Tuesday. In the planning phase, we review the updates that currently need to be applied to the infrastructure (e.g. OS updates and security updates) and plan the maintenance window accordingly. In addition, we queue the changes requested by customers (e.g. changes to instance sizes or updates to the configuration).
Maintenance windows happen after 22:00 in each respective time zone. Each maintenance window lasts for 6 hours.
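To compare windows across regions, it can help to convert the local window into UTC. The following is a minimal sketch, not our actual tooling: it assumes the window opens at exactly 22:00 local time, and both the function name `maintenance_window` and the example time zone are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the window opens at exactly 22:00 local time.
WINDOW_START = time(22, 0)
# From the text above: the window lasts 6 hours.
WINDOW_LENGTH = timedelta(hours=6)


def maintenance_window(day: date, tz_name: str) -> tuple[datetime, datetime]:
    """Return (start, end) in UTC for the window opening on `day` in `tz_name`."""
    local_start = datetime.combine(day, WINDOW_START, tzinfo=ZoneInfo(tz_name))
    start_utc = local_start.astimezone(timezone.utc)
    # Adding the duration in UTC keeps the window at 6 elapsed hours,
    # even if a daylight saving change falls inside it.
    return start_utc, start_utc + WINDOW_LENGTH


if __name__ == "__main__":
    # Hypothetical region time zone, used for illustration only.
    start, end = maintenance_window(date(2024, 1, 9), "Europe/Zurich")
    print(start.isoformat(), end.isoformat())
```

Because the window starts late in the evening, it always ends on the following calendar day in local time, which is worth keeping in mind when announcing it to customers.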
More information can be found in our Maintenance Process documentation.
Platform OKR Planning#
Platform OKR planning happens every quarter. We use the OKR methodology to set our goals for the quarter. The OKRs are set in an asynchronous workshop with the whole team, facilitated by the team lead. We usually start with three larger Objectives that align with the general company OKRs for that quarter. After defining the three Objective areas, we come up with the Key Results. The team lead usually drafts a few of the Key Results, and the team can give input or point out things that might have been missed.
Platform OKR Updates#
Platform OKR updates usually happen weekly, and at least monthly, so that progress on the OKRs is reported regularly.
Platform TLS Cipher Review#
This is a security-related task. As we terminate TLS on our platform, either on the load balancers directly or on the amazee.io CDN, we need to make sure that we are using current, secure ciphers. This is a half-yearly task that is done in the first half of the year. Sometimes we need to do it more often, for example if there are security issues with the ciphers we are using or if customers ask for more information about them.
Our global TLS configuration is used for all customers that use the CDN. We strive for broad cipher availability, as we need to support a wide variety of clients and remain backward compatible. We have been running Mozilla's intermediate configuration since we started amazee.io, and it has proven to be compatible and secure enough for 99% of our customer base. We do not use the modern configuration, as some older clients do not support it and we do not want to break our customers' sites.
The internal documentation of the TLS configuration and the changesets can be found in the internal wiki under Fastly Maps and TLS Configuration.
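As an illustration of what one part of a cipher review can look like, the sketch below flags cipher suite names that contain markers of broadly deprecated algorithms. It is a simplified example and not our actual review procedure: the token list `WEAK_TOKENS` and the function `flag_weak_ciphers` are assumptions made for this example, and the list is neither complete nor taken from Mozilla's recommendations.

```python
import re

# Illustrative assumption: name tokens that indicate broadly deprecated
# algorithms (no encryption, export-grade, RC4, DES/3DES, MD5, anonymous
# key exchange). This is not an exhaustive or authoritative list.
WEAK_TOKENS = {
    "NULL", "EXP", "EXPORT", "RC4", "DES", "3DES", "MD5",
    "ADH", "AECDH", "ANON",
}


def flag_weak_ciphers(cipher_names):
    """Return the cipher names that contain at least one weak token.

    Names are split on '-' and '_' so both OpenSSL-style names
    (e.g. DES-CBC3-SHA) and IANA-style names are handled.
    """
    flagged = []
    for name in cipher_names:
        tokens = set(re.split(r"[-_]", name.upper()))
        if tokens & WEAK_TOKENS:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    configured = [
        "ECDHE-RSA-AES128-GCM-SHA256",
        "DES-CBC3-SHA",
        "RC4-MD5",
        "TLS_AES_128_GCM_SHA256",
    ]
    print(flag_weak_ciphers(configured))
```

A name-based check like this only catches obvious candidates. The actual review still needs to compare the configuration against current guidance and client compatibility data.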
Disaster Recovery Testing#
Certain services are tested at least annually to ensure that we can recover from a disaster. The services that are tested, with results documented internally, are:
- Kubernetes cluster objects
- Database clusters (point in time - to a specific date and time)
- Database clusters (daily backup - to the previous day)