The OfferZen development team attended ScaleConf this year. There were many excellent talks, but one that particularly inspired us was Michael Gorven’s talk on Continuous Delivery at Facebook. We took home a number of lessons that we plan to apply at OfferZen.
Michael is a production engineer at Facebook (on the Instagram team), and he shared how his team deploys backend code 40 times a day to millions of users.
The Instagram team run fully fledged Continuous Delivery (CD) - any commit on master gets automatically deployed to production by their build and release system.
Why do Continuous Delivery?
The Instagram team decided to implement Continuous Delivery because it:
Allows developers to move faster
They are not beholden to fixed release schedules, and can ship code as soon as it’s ready
Enables multiple iterations per day
Many problems are best solved iteratively. With CD, it’s easier and much faster to tackle a problem in small iterations.
Makes it easier to identify bad commits
Releasing small batches of commits at a time makes it trivial to identify which commit caused a problem.
Avoids having an undeployable mess of code
Bundling many commits means that a single broken commit can create an undeployable mess of code that is very hard to untangle. With CD, broken changes don’t delay other commits.
How Instagram got to full Continuous Delivery
Instagram didn’t try to implement CD overnight. Instead they took a gradual approach that improved their process with every step.
Make the manual process easier
Instagram started out with a defined, but manual, deployment process. The first step towards automation was scripting the manual process - basically just laying out the steps for a developer and asking them to confirm with “yes” at key points.
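To make the idea concrete, here is a minimal sketch of what such a confirmation-driven script could look like. It is not Instagram's actual tooling: the function name `run_deploy`, the step names and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# A step is a (name, action) pair. Both are hypothetical placeholders.
Step = Tuple[str, Callable[[], None]]


def run_deploy(steps: Iterable[Step], ask: Callable[[str], str] = input) -> List[str]:
    """Run each step in order, asking the operator to confirm first.

    Returns the names of the steps that actually ran. The script stops at
    the first step that is not confirmed with "yes".
    """
    completed: List[str] = []
    for name, action in steps:
        answer = ask(f"Run step '{name}'? (yes/no) ").strip().lower()
        if answer != "yes":
            print(f"Stopping before '{name}'.")
            break
        action()
        completed.append(name)
    return completed
```

Calling `run_deploy([("build", build), ("deploy to test server", deploy_test)])` with your own step functions would prompt before each one. The `ask` parameter exists only so the prompt can be replaced in tests.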
Add a Canary
Automated checking started with a basic “Canary” script that monitored the performance of a release on the first “test” server it was deployed to. All the Canary did (and still does) is look at HTTP response codes and make sure that they remain within hard-coded thresholds. For example, a rule could be that 95% of all requests should return 2xx.
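A threshold check like this is small enough to sketch in a few lines. The example below is our own illustration, not the Canary from the talk: the names `THRESHOLDS` and `canary_ok` are hypothetical, and treating "no traffic" as unhealthy is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical hard-coded rule: at least 95% of responses must be 2xx.
THRESHOLDS: Dict[str, float] = {"2xx": 0.95}


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 204 -> '2xx'."""
    return f"{code // 100}xx"


def canary_ok(status_codes: Iterable[int],
              thresholds: Dict[str, float] = THRESHOLDS) -> bool:
    """Return True if every status class meets its minimum share of responses."""
    counts = Counter(status_class(code) for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no observed traffic we cannot call the release healthy.
        return False
    return all(counts[cls] / total >= minimum
               for cls, minimum in thresholds.items())
```

Fed with the status codes observed on the first server, `canary_ok` returns a single yes/no answer that a deployment script can act on.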
Testing each diff
The team set up Jenkins to run the automated tests on each diff, and report the results to Sauron (their custom deployment system).
Adding rollout states and abort
They added states to Sauron for each rollout (running, done and error) and tracking to show the number of servers a release has been deployed to.
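As a rough sketch of what tracking a rollout might involve, the example below models the three states and a server counter. Sauron's real data model was not shown in the talk, so the `Rollout` class and its fields and methods are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class RolloutState(Enum):
    RUNNING = "running"
    DONE = "done"
    ERROR = "error"


@dataclass
class Rollout:
    commit: str
    total_servers: int
    deployed_servers: int = 0
    state: RolloutState = RolloutState.RUNNING

    def record_deploy(self, count: int = 1) -> None:
        """Record that `count` more servers now run this release."""
        if self.state is not RolloutState.RUNNING:
            raise RuntimeError(f"rollout is {self.state.value}, not running")
        self.deployed_servers = min(self.total_servers,
                                    self.deployed_servers + count)
        if self.deployed_servers == self.total_servers:
            self.state = RolloutState.DONE

    def fail(self) -> None:
        """Mark the rollout as failed so no further deploys are recorded."""
        self.state = RolloutState.ERROR

    def progress(self) -> str:
        """One-line summary suitable for a dashboard or chat message."""
        return (f"{self.commit}: {self.state.value}, "
                f"{self.deployed_servers}/{self.total_servers} servers")
```

Keeping the state and the server count in one place is what makes it possible to show progress to engineers and to abort a rollout cleanly.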
Once these tools were in place, they started automating things. At first they automated some of the rollout decisions. The system would only deploy commits that passed all the tests. If a release failed on more than 1% of hosts, automatic rollouts would pause and wait for human intervention.
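The two rules described here can be expressed as one small decision function. This is a sketch under our own assumptions, not the logic Sauron uses: the `Decision` enum and the `next_action` signature are hypothetical, and only the 1% figure comes from the talk.

```python
from enum import Enum

# From the talk: pause when a release fails on more than 1% of hosts.
MAX_FAILED_HOST_PERCENT = 1


class Decision(Enum):
    SKIP = "skip"          # commit did not pass its tests, never deploy it
    CONTINUE = "continue"  # keep rolling out automatically
    PAUSE = "pause"        # stop and wait for a human


def next_action(tests_passed: bool, failed_hosts: int, total_hosts: int) -> Decision:
    """Decide what the automated rollout should do next."""
    if not tests_passed:
        return Decision.SKIP
    # Integer comparison avoids floating-point surprises at exactly 1%.
    if failed_hosts * 100 > total_hosts * MAX_FAILED_HOST_PERCENT:
        return Decision.PAUSE
    return Decision.CONTINUE
```

Note that exactly 1% of hosts failing still counts as acceptable here, because the rule is "more than 1%".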
At this point the release process was basically automated. The only human interaction required was babysitting it and responding “yes” to confirm releases.
The last step was automating the “yes”. Initially engineers supervised the automated system, until they were happy that the system was reliable.
Michael recommends following a few key principles when embarking on the journey to continuous delivery.
High quality tests
- Fast - entire test suite should take less than 5 minutes
- Thorough - the entire codebase should have good test coverage
- Frequent - tests should be automatically run on every commit
An automated system that identifies the really bad commits.
Handle the normal case
Aim to have the normal case be completely automated and released quickly. Anything abnormal should be checked manually by humans.
Make people comfortable
People get worried if they aren’t sure what the automation is doing. Make the system easy to understand and provide good visibility into the automated system.
Plan for failure
Assume bad commits will still get out. Provide a simple stop mechanism for when (not if) things go wrong.
Michael’s talk was inspiring to us, because he showed that continuous delivery is achievable at any scale. You don’t need to start with a complex system on day one. Start simple and then evolve your approach gradually.
Our next steps at OfferZen will be to focus on fast rollbacks and an automated canary, the two biggest gaps in our journey to continuous delivery.