How We Operate Our Software
Our squads are responsible for operating their own apps because handing operations to someone else removes the builders' incentive to fix operational pain caused by how the app was built.
We build 12-factor apps because the 12-factor app is the most comprehensive language-neutral paradigm to date for operating services on a PaaS.
We provision all app infrastructure through code because infrastructure defined in code can be tested, iterated on, and rebuilt quickly and repeatably.
We practice CI/CD with a bias toward automatic production deployment of the repo’s main branch because integrating and deploying frequently improves quality and agility, and because there are healthier ways to assure quality than an artificial QA environment.
We decouple deployment and release via feature gates because doing so reduces deploy-time risk to a level acceptable for fully automated CD and allows for high-assurance QA in production and for safe, incremental, instantly reversible releases.
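As a minimal sketch of how a gate separates deploying code from releasing it, assume a percentage-based rollout keyed on a stable user ID. The `FeatureGate` class, the `greeting` function, and the hashing scheme are hypothetical illustrations, not the API of any particular feature-flag product.

```python
import hashlib


class FeatureGate:
    """Hypothetical in-memory gate; a real one would read its state from a flag service."""

    def __init__(self, name: str, rollout_percent: int = 0):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 = dark, 100 = fully released

    def is_enabled(self, user_id: str) -> bool:
        # Hash the gate name and user ID into a stable bucket from 0 to 99, so
        # the same user always gets the same answer at a given percentage.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent


def greeting(user_id: str, gate: FeatureGate) -> str:
    # Both paths are deployed; the gate decides which one is released.
    if gate.is_enabled(user_id):
        return "Welcome back!"  # new behavior
    return "Hello"              # existing behavior


gate = FeatureGate("new-greeting", rollout_percent=0)  # deployed, not released
before = greeting("user-42", gate)
gate.rollout_percent = 100                             # released to everyone
after = greeting("user-42", gate)
gate.rollout_percent = 0                               # rolled back, no redeploy
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage only adds users to the released group, and setting it back to 0 reverses the release without a deployment.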
We aim never to knowingly ship code that throws exceptions at consumers, even behind feature gates, because code without defined, reasonable behavior has no place in a production codebase.
We alert loudly with the intention of being interruptive because truly exceptional events should be triaged and addressed promptly.
We alert with purpose, and only on actionable events, because when everything is loud, nothing rises above the noise floor.
We prefer to alert app teams on application behavior like errors and latency rather than systems metrics like CPU utilization because CPU utilization measures how efficiently resources are used, not whether the app is misbehaving, and high utilization may be a false positive for batch workloads.
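A rough sketch of what alerting on application behavior might look like, assuming we evaluate one window of recent requests at a time. The `Request` type, the `evaluate_window` function, and the threshold values are hypothetical and chosen for illustration, not recommended settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    is_error: bool


def p95(values: List[float]) -> float:
    """95th percentile by the nearest-rank method (integer math avoids float drift)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_window(
    requests: List[Request],
    max_error_rate: float = 0.01,   # assumed threshold
    max_p95_ms: float = 500.0,      # assumed threshold
) -> List[str]:
    """Return the names of the alerts that should fire for this window."""
    if not requests:
        return []  # no traffic: nothing user-visible to act on in this sketch
    alerts = []
    error_rate = sum(r.is_error for r in requests) / len(requests)
    if error_rate > max_error_rate:
        alerts.append("error_rate")
    if p95([r.latency_ms for r in requests]) > max_p95_ms:
        alerts.append("latency_p95")
    return alerts
```

Note that nothing in the check looks at CPU: a batch job that saturates every core but serves requests quickly and correctly produces no alert.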
We alert on software quality issues once per class of bug via tools like Sentry to deduplicate reports and focus effort on resolving the underlying issue.
We alert on issues requiring operational intervention every time they occur via tools like Datadog alerts, ticketing systems, and in-house software workflows to ensure every incident is recorded and tracked to resolution.
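The two alerting policies above can be contrasted in a small sketch. It assumes a bug's "class" is identified by its exception type plus the function that raised it. That fingerprint is a simplification for illustration and is not how Sentry groups events. The `AlertRouter` class and its callbacks are likewise hypothetical.

```python
import traceback
from typing import Callable, Dict


class AlertRouter:
    """Hypothetical router: bugs notify once per class, incidents every time."""

    def __init__(self, notify_bug: Callable[[str], None], open_ticket: Callable[[str], None]):
        self._notify_bug = notify_bug
        self._open_ticket = open_ticket
        self.bug_counts: Dict[str, int] = {}

    @staticmethod
    def fingerprint(exc: BaseException) -> str:
        # Assumed grouping rule: exception type + innermost Python frame.
        frames = traceback.extract_tb(exc.__traceback__)
        location = f"{frames[-1].filename}:{frames[-1].name}" if frames else "unknown"
        return f"{type(exc).__name__}@{location}"

    def report_bug(self, exc: BaseException) -> str:
        key = self.fingerprint(exc)
        self.bug_counts[key] = self.bug_counts.get(key, 0) + 1
        if self.bug_counts[key] == 1:
            self._notify_bug(key)  # first sighting of this class: alert once
        return key

    def report_incident(self, description: str) -> None:
        # Operational issues are never deduplicated: each occurrence gets a record.
        self._open_ticket(description)


def parse_quantity(raw: str) -> int:
    return int(raw)  # raises ValueError on bad input


bug_alerts, tickets = [], []
router = AlertRouter(bug_alerts.append, tickets.append)

for raw in ["x", "y", "z"]:  # the same bug, hit three times
    try:
        parse_quantity(raw)
    except ValueError as exc:
        router.report_bug(exc)

for _ in range(3):           # the same operational issue, three times
    router.report_incident("payment provider failover required")
```

After this runs, `bug_alerts` holds a single entry while the occurrence count for that bug is 3, and `tickets` holds three entries, one per incident.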