The Art of a Great Rollout

We live in the era of high-frequency software deployments, where mass-market software products update several times a day, sometimes delivering hundreds or even thousands of changes. Take Facebook, for example: it pushes thousands of code changes daily to billions of users around the globe (Continuous Deployment at Facebook and OANDA). The trend is persistent and likely to accelerate with the advancement of AI and LLMs.
Why the rush?
High-frequency rollouts are not just speed for speed’s sake; they are about learning faster, experimenting quickly, and adapting to ever-changing user demands. Companies that ship often can rapidly test ideas and stay competitive in a fast-evolving market.
Rolling out code or configuration at that frequency is serious business, though. Most production issues are not caused by fiber cuts or wind storms; they are caused by bugs slipping through our systems and hitting end users (How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service). And the cost of reliability breaches is high.
The Stability Myth
So what do we have? On the one hand, businesses want to deploy more frequently and adapt to fast-growing markets. On the other hand, the cost of faulty changes that hurt reliability and end users is high. How do you optimize for both? It feels like a trade-off: speed vs. stability.
A quick reality check shows the opposite, though. DORA’s research has repeatedly demonstrated that speed and stability are not a trade-off. In fact, more frequent, smaller changes are typically less risky than large, infrequent ones.
How to roll out more frequently
Whenever software rollout topics come up, you’ll often hear words like “DevOps” and “CI/CD.” The demand for better deployment pipelines is so high that entire businesses have been built around these concepts. Platforms like GitHub Actions, GitLab, and CircleCI enable millions of developers to automate builds and deployments (see: What is CI/CD?).
There’s more to it than just tooling. The core idea behind pushing more frequently is simple: stop treating deployment as a risky event, and start treating it as a repeatable, low-risk, automated routine.
To get there, both engineering practices and infrastructure need to evolve so that engineering teams can own changes end to end. After all, the engineer working on a feature is often in the best position to roll it out safely. This pattern is widely adopted by mass-market product companies such as Spotify, Meta, and Netflix.
Move fast and break things with stable infrastructure
“Move fast and break things” was Facebook’s motto until around 2014, when it shifted to “Move fast with stable infrastructure.” This reflected a deep shift in engineering culture: breaking things doesn’t actually speed you up when the operational cost is high.
The industry has come a long way since then. Today, safe rollout practices and patterns are well established. Let’s walk through the key ingredients.
Feature Ownership
Engineers are expected to own rollouts. In practice, that means three things: test your changes, monitor them, and have a rollback strategy.
How do you validate that the feature works? A great test plan has both a manual component describing the exact steps you took (clear enough that a new engineer on your team could follow them) and an automated component.
As the code rolls out, how will you monitor its health? Link to the specific dashboards you’ll be watching during the rollout, and specify them in the test plan section of the diff.
How do you know if something goes wrong? Have metrics ready and set up alerts. The key here is that the conditions are of your choosing: you decide what counts as normal and abnormal for your component, and whether a given condition warrants just opening a task or actually paging on-call.
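As a minimal sketch, owner-defined alerting might look like the snippet below. The metric names, thresholds, and the `evaluate` helper are purely illustrative; in practice this logic lives in a monitoring/alerting system rather than application code.

```python
# A minimal sketch of owner-defined alert conditions. Metric names,
# thresholds, and actions are illustrative, not a real alerting API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertRule:
    metric: str        # e.g. "checkout_error_rate"
    threshold: float   # the value you consider abnormal
    action: str        # "open_task" for minor issues, "page_oncall" for severe ones

def evaluate(rule: AlertRule, current_value: float) -> Optional[str]:
    """Return the action to take, or None if the metric looks healthy."""
    return rule.action if current_value > rule.threshold else None

# You choose the conditions: a small error bump files a task,
# a large one pages whoever is on-call.
rules = [
    AlertRule("checkout_error_rate", threshold=0.01, action="open_task"),
    AlertRule("checkout_error_rate", threshold=0.05, action="page_oncall"),
]

for rule in rules:
    action = evaluate(rule, current_value=0.02)
    if action:
        print(f"{rule.metric} > {rule.threshold}: {action}")
```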
Progressive rollout / Canary
How do we limit the blast radius of failures? Roll out new changes to a small percentage of users or servers (e.g., 1%) before expanding gradually. For example, a typical Meta product update goes through a dogfooding stage: a canary release to employees first, to catch issues before they reach the rest of the users.
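A progressive rollout can be sketched roughly as below. The stages and the `deploy_to` and `healthy` helpers are placeholders for whatever your deployment and monitoring systems provide, not any specific tool’s API.

```python
# A rough sketch of a staged (canary) rollout. deploy_to() and healthy()
# are placeholders for your deployment tooling and metrics queries.
import time

STAGES = ["employees", "1%", "10%", "50%", "100%"]  # dogfood first, then ramp up

def deploy_to(stage: str) -> None:
    print(f"deploying to {stage}")  # placeholder: call your deploy system here

def healthy(stage: str) -> bool:
    # Placeholder: check the metrics you committed to in the test plan
    # (error rate, latency, crash rate) for the population in this stage.
    return True

def progressive_rollout(bake_time_s: int = 1800) -> None:
    for stage in STAGES:
        deploy_to(stage)
        time.sleep(bake_time_s)  # let metrics accumulate before deciding
        if not healthy(stage):
            print(f"Regression detected at {stage}, rolling back")
            deploy_to("previous_version")
            return
    print("Rolled out to 100%")
```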
Feature Flagging / Gating
Traditionally, deployment and release happen together; you ship new code, and the feature goes live immediately. But as change throughput increases, this coupling doesn’t scale well. Feature flags solve this by decoupling deployment from release. You can deploy code safely, then use config switches to control who sees the feature (e.g., employees, beta users, or specific regions).
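Here is a sketch of what a flag check might look like in application code. The flag config, user fields, and bucketing scheme are illustrative; real systems use a gating service, a config platform, or a tool like LaunchDarkly.

```python
# A sketch of a feature flag check that decouples deployment from release.
# The flag config and bucketing scheme are illustrative, not a real product's API.
from dataclasses import dataclass

@dataclass
class User:
    id: int
    is_employee: bool
    country: str

# The new code is already deployed everywhere; this config decides who sees it.
FLAGS = {
    "new_checkout_flow": {
        "employees": True,     # dogfood internally first
        "countries": {"CA"},   # then a small region
        "percent": 5,          # then a small slice of everyone else
    }
}

def is_enabled(flag_name: str, user: User) -> bool:
    cfg = FLAGS.get(flag_name)
    if cfg is None:
        return False
    if cfg["employees"] and user.is_employee:
        return True
    if user.country in cfg["countries"]:
        return True
    return (user.id % 100) < cfg["percent"]  # stable percentage bucketing by user id

user = User(id=42, is_employee=False, country="US")
if is_enabled("new_checkout_flow", user):
    print("show new checkout flow")
else:
    print("show old checkout flow")
```

Turning the feature off then becomes a config flip rather than a redeploy.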
A/B Testing
How do we measure impact, not just correctness? A/B testing quantifies the impact of a change by splitting traffic into control and test groups and comparing them with statistical tests.
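For example, a two-proportion z-test on conversion rates gives a quick read on whether an observed lift is statistically significant. The numbers below are made up, and real experimentation platforms handle assignment, exposure logging, and multiple metrics for you.

```python
# A sketch of quantifying impact with a two-proportion z-test on
# conversion rates in control vs. test groups (numbers are made up).
from math import sqrt, erfc

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal distribution
    return z, p_value

# Control: 1,000 conversions out of 50,000 users; test: 1,100 out of 50,000.
z, p = two_proportion_ztest(conv_a=1000, n_a=50_000, conv_b=1100, n_b=50_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # ship only if the lift is both real and meaningful
```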
Everything is a change
Code? Change. Config? Change. ACL update? Also a change. In fact, config changes are often even riskier than code changes, since they tend to take effect immediately and receive less testing.
That means all changes, no matter the type, should follow more or less the same principles (sketched after this list):
- Canary first
- Validate with metrics
- Auto-rollback on failure
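A sketch of what that looks like when every change, code or not, goes through the same guarded path. The `Change` type and the helper functions are illustrative stubs, not a real deployment API.

```python
# A sketch of pushing every change (code, config, or ACL) through the
# same guarded path. The Change type and helpers are illustrative stubs.
from dataclasses import dataclass

@dataclass
class Change:
    kind: str     # "code", "config", or "acl": all handled the same way
    payload: str

def canary(change: Change) -> None:
    print(f"canarying {change.kind} change")  # small slice of traffic first

def metrics_look_healthy(change: Change) -> bool:
    return True  # stub: query the metrics you committed to monitoring

def rollout_everywhere(change: Change) -> None:
    print(f"rolling out {change.kind} change to 100%")

def rollback(change: Change) -> None:
    print(f"rolling back {change.kind} change")

def ship(change: Change) -> None:
    canary(change)                    # canary first
    if metrics_look_healthy(change):  # validate with metrics
        rollout_everywhere(change)
    else:
        rollback(change)              # auto-rollback on failure

ship(Change(kind="config", payload="retry_limit: 3"))
```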
The End Goal
The goal isn’t just to ship more. It’s to ship more safely. With stable infrastructure, automated checks, and strong ownership, fast iteration becomes the norm, not the exception. That’s the art of a great rollout.