If your product gains popularity and you're caught off-guard without a scalable system, you won't be able to deal with the demand as your company grows. Building a distributed system that allows you to scale, deploy, and respond to customers fast is hard, and there are a lot of unique challenges, which I learned about at Netflix and Reddit.
In this talk, Jeremy Edberg helps audience members avoid the mistakes he made, with actionable ideas on how to implement a well-architected distributed system with the right monitoring.
[00:09] If it won’t scale, it’ll fail. That’s the philosophy I start with in whatever I do. A little bit about me: I worked for PayPal and eBay. I worked for Reddit, where I was the first employee. I worked for Netflix and was the first reliability engineer there, so that’s the quick gist. This is a diagram of Reddit’s traffic over the time that I was there. And I love this visualization, because you can see how it basically doubled every quarter. And then that’s the graph of the number of countries Netflix was in at the time I worked there, so a lot of what I was doing there was launching countries.
[00:47] So I’m going to start by talking about using the cloud. Luckily I don’t have to spend a lot of time on this anymore, but back in the day I used to have to convince a lot of people about that. I just wanted to tell a quick story about the cloud. This is from when Reddit was using the cloud, and it’s useful because that yellow line is the traffic and the blue bars are the cost. You can see we launched Reddit Gold, which was our subscription program. Our costs went way up because we had to support that program, and then we were able to optimize our usage and level off our costs. And all of that was possible because of the cloud.
[01:20] So I’ve been a cloud advocate for 10 or 11 years now, and luckily most people agree with me on that one. But one of the important things about the cloud is building your software to be cloud native, making it so that it works well in the cloud. Some people call this microservices, and the reason this is important is that microservices and micro teams work really well together, so I’m going to talk about that a bit.
[01:49] This is an example of the Netflix ecosystem. You can see there’s a bunch of devices; they hit a front-end API, which then hits a bunch of services behind it, so it’s divided up like that. Each of those services is built by a different small team. The really nice thing about doing that is that there are both technical and human advantages. Technical advantages include things like making it easier to auto scale, do capacity planning, diagnose problems, and things like that.
[02:22] And then the human aspect is that because the services are built by different teams with different owners, that API is the door between those teams. Each little team can operate however they want to operate, and it doesn’t matter, as long as they are providing an API that the other teams can use. So it makes it really easy for each team to work differently. Some teams might have very rigorous deployment models and spend a lot of time testing, and other teams will just take whatever gets pushed into master and deploy it right out. And it doesn’t matter, because as long as each team is responsible for their own service, for making it work, and the API is working, they can communicate.
[03:06] So it makes it really easy for teams to communicate, because they’re just small teams and they still work together, they still work in the same place, so they can still talk to each other and say, “I need your API to do this or this.” But by making that API the interface in the middle, it becomes easier to coordinate between these small teams. I’ve done a bunch of surveys and talked to a bunch of companies that run microservices, and on average about 25% of their engineering resources are spent on that platform. Either a quarter of everybody’s time or a quarter of their engineers are doing that.
[03:45] So that team becomes very important, because they’re building the tools that enable all the developers to move rapidly and use this model. This services, cloud, dev ops model. So talking about a dev ops model, my tenets of the dev ops model are: one is building for three. It’s this idea of redundancy. You have three of everything, so that if you lose one, you still have two, you can do writes, et cetera. As much automation as possible, and I’m going to talk about this in a little bit.
[04:19] Automation is key in everything that you do, because nothing is going to scale if you have to have a human in the loop. The moment you put a human in the loop, you’re slowing yourself down. And also, independent teams being responsible for both dev and ops. Teams being fully responsible from beginning to end, so that they are responsible when things break, and then having a team that builds the tools to enable that. I think that works best. So when people talk about the dev ops team, in my mind that means the people who are building the tools to enable developer velocity. And this is how you end up scaling your organization: by having a team that’s dedicated to making tools that make everyone else’s life easier.
[05:06] At Netflix, we had this idea of freedom and responsibility. The idea was that you, as a developer, have the freedom to do what you want to do, and then you have to be responsible for whatever happens. Everyone was really good at the freedom part, not so good at the responsibility part, but accountability ended up happening sort of naturally. And the nice thing is that by doing this, we provided a platform so the developers could change their code whenever they wanted. They could deploy whenever they wanted, they could manage their own auto scaling, and so on and so forth. And they had to fix their stuff at four in the morning when it broke.
[05:48] So, oops, there we go. And that was really helpful, because it meant that they built better software, because they were the ones who were getting woken up at four in the morning. So it was important to have the developers own the product from beginning to end. If the customers weren’t happy, then the developer wasn’t happy. And this was great for accountability, because it meant that everybody was accountable to happy customers, and that was a great way to run an engineering organization.
[06:20] Another thing that we did to really scale the organization was to get rid of policies. So at most companies, policies come about because something bad happens and somebody says, “We need to put a process in place to make sure this never happens again.” And they’re very prescriptive and inflexible and nobody ever follows them, because the new person doesn’t know the policy or they don’t have time or they don’t want to deal with the paperwork.
[06:44] So at Netflix, we did the opposite. We would remove policies. If a policy caused a problem, we would get rid of it. When I first started at Netflix, we had change control, for example: you had to file a change control ticket to make a change. And it didn’t work, because a lot of people just didn’t file the ticket. Some people filed the ticket after they made the change, just for record keeping, but it wasn’t great. The tickets were sometimes filed and the change never happened, and that was just confusing. So we just got rid of change control.
We said, “Forget it, we’re not even going to bother with change control tickets. We’re just going to make changes. Instead, we’re going to build a tool that simply reflects reality.” So we built a tool that logged all the changes in production, and anybody could look at that tool at any time. And that was far more useful, because it told you exactly what happened and when it happened, and it was automatic. That was the key: it was automatic. The person making the change didn’t have to do anything to get that change recorded.
[07:46] So let’s talk about setting up a scalable infrastructure. As I mentioned, automation is key everywhere. Infrastructure as code is a great way to manage things. Put everything in version control, like GitHub or whatever, enable continuous delivery, and so on. The more automatic things are, the faster you can deploy and the faster you can roll back. If you’re changing your infrastructure with automation, it’s easy to roll it back. And being able to roll back is super key, because then you feel much more comfortable making changes. If you believe that you can quickly roll a change back, then you feel a lot more comfortable making a more drastic change, because you know that you’ll be able to roll back quickly and it will have a small effect.
[08:36] So automate everything: application startup, code deployment, system deployment, all of that stuff, as much as absolutely possible. There are some challenges to this: you can lose track of resources, and there are individual snowflakes. But the biggest one of all is fear. A lot of people do not believe in their own automation; they fear it. They will set all this stuff up and then still run it manually, because they don’t trust their own system to be successful. That’s probably the biggest holdup to automated deployment: fear. You see it in a lot of companies, where they’re like, “Yeah, we can totally do an automated deployment, but we don’t. We have someone who has to hit enter.” “Why?” “Because we’re worried it’s not going to work.” “Why don’t you fix your systems so you don’t have to worry about that?”
[09:26] So how do we get to a point where we trust our automation? We have to have actionable metrics. These are some graphs from Reddit, actually; they’re really old graphs, probably 10 years old at this point. Looking at this graph, you can see that something bad happened. There’s HTTP status in the row on the right side, and something bad is going on. But we have no idea what; these are not actionable metrics. We can add in a bunch more metrics, and that makes them even less actionable, because now we really don’t have any idea what’s going on.
[10:01] So how do we fix this problem? Well, at Netflix we developed a monitoring system, which is open source and called Atlas. What it had was math built into the monitoring system, effectively. You can also get this commercially, by the way, from a company like SignalFx or Datadog. What it did is, you’ll see here the blue line is actual data. The red line is predicted data, using what’s called double exponential smoothing. It’s essentially predicting what the graph should look like. And the green bars are the difference between actual and predicted; that’s the alarm, effectively. So you can graph the math on the graph and then alarm on it.
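A minimal sketch of the predict-and-compare idea: double exponential smoothing (Holt's method) forecasts each point from the level and trend so far, and you alarm on the gap between actual and predicted. The alpha, beta, and threshold values here are illustrative choices, not Atlas's actual parameters.

```python
def holt_smoother(series, alpha=0.5, beta=0.3):
    """Return one-step-ahead predictions for each point in `series`."""
    level, trend = series[0], 0.0
    predictions = [series[0]]  # no history yet, so predict the first value itself
    for x in series[1:]:
        predictions.append(level + trend)  # forecast made before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return predictions

def anomalies(series, threshold=10.0):
    """Indices where actual deviates from predicted by more than `threshold`."""
    preds = holt_smoother(series)
    return [i for i, (x, p) in enumerate(zip(series, preds))
            if abs(x - p) > threshold]
```

On a steady ramp the forecast tracks closely, so a sudden spike stands out as the only anomaly; that is the "green bars" signal without any static threshold on the raw value.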
[10:49] So on this metric, for example, you have alarms on the green bars. There are no static thresholds or anything like that; it’s all about monitoring the changes, right? So it’s dynamic thresholds. And that was super key for making all of this stuff work, for making everything actionable, because things are changing so often that you need dynamic thresholds. Another type of actionable metric is user metrics. This is an example from Reddit, where we had the search bar and everybody would complain: search sucks, it’s terrible, blah, blah, blah. And we’re like, “All right, let’s get some data around how much search sucks.”
So we added that button. It just says, “Was this useful? Yes or no.” We ran it for a while and we found out 70% of the people would click yes. So we’re like, “Okay, that’s not bad. 70% of the people say they’re finding what they want.” Then we changed our search backend, and we didn’t tell anybody. All of a sudden, that number went up to like 92%. So we’re like, “Okay, that change was definitely useful, that was a good change.” And then eventually we told everybody, “Hey, we changed the backend,” and the number went up to like 95%, so some people were only convinced it was better once we told them it had changed. But we knew it was already better, because we’d already seen the change.
[12:08] So this was a good example of an actionable metric; it was super easy, and it was user feedback. The important thing with a monitoring system is self-service. One of the things we built was a system that was totally self-service: developers could put any metric they wanted into the system, they could put an alert on any metric, they could put an alert on a combination of metrics. They could put alerts on other people’s metrics. And that was super important, because sometimes a team would want to monitor their downstream or upstream service to see if it was healthy. And if they started getting alerts for an upstream service, it could potentially indicate a problem in their own system.
So the self-service was really important, making it possible for anyone to put any metric into the system. It was also very hard, because it meant that the metrics and monitoring system could randomly get a huge influx of new metrics if somebody deployed some code with metrics, so that presented its own challenges. But the nice thing was that it allowed us to put business metrics into the monitoring system, and that was super important, because as I’ll talk about in a second, monitoring business metrics is more useful than monitoring machines and hardware.
[13:24] But once you have a good monitoring system, it allows you to move up. This is from a company called Armory. They have essentially commercialized Netflix’s deployment tool, Spinnaker. And they have this chart that they like to use, which I blatantly stole from them, and it talks about the different stages that you can get to as a company. Stage three is the whole continuous delivery thing and so on and so forth. And to get beyond stage three, you need to have good automation and you need to have good metrics and monitoring, because you need the machine to know about changes automatically: to be automatic, to have automatic resolution, and so on and so forth.
[14:09] So as I said, choose business metrics, not machine metrics. That’s super critical, because at the end of the day your customers don’t care if server A is failing. All they care about is that they can’t use your website or whatever it is. And if you’re in the cloud especially, don’t even bother with monitoring individual machines; it just doesn’t matter. We had that data, but we almost never used it. The only reason we kept it was so that we could see if one machine in a cluster was not behaving the same as the rest, and then we would just delete the machine. We wouldn’t even look at it. Unless it happened repeatedly, over and over again, then we might dig into that machine to figure out why. Otherwise, we’re just like, “Meh, it’s an anomaly. We don’t even care.” As long as it doesn’t affect the user, it doesn’t even matter.
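The "spot the one misbehaving machine, then just delete it" check can be sketched as a simple outlier test across the cluster. The delete-don't-debug policy is the talk's; the median-absolute-deviation statistic and the host names below are my illustration, not any specific Netflix tool.

```python
import statistics

def outlier_instances(latency_by_host, max_dev=3.0):
    """Return hosts whose metric is more than `max_dev` MADs from the cluster median."""
    values = list(latency_by_host.values())
    med = statistics.median(values)
    # median absolute deviation; fall back to 1.0 so a perfectly uniform
    # cluster doesn't divide by zero
    mad = statistics.median(abs(v - med) for v in values) or 1.0
    return [host for host, v in latency_by_host.items()
            if abs(v - med) / mad > max_dev]
```

Anything this returns would simply be terminated and replaced; you only investigate if the same kind of instance keeps showing up.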
[14:56] Another important tip: alert on an increase in failure, not a lack of success. If you’re monitoring website traffic, you don’t want to alert just because there’s a huge drop. One of my favorite stories: we had alerts on drops in traffic, and all of a sudden in Latin America, the traffic just dropped off. We had no idea why, and we started investigating on the machines. We’re like, “Oh crap, what’s going on? What’s happening?” Eventually, I don’t even know why I looked at a news site, but I found out that there was this huge football match between Mexico and Brazil. And we’re like, “Oh, that’s why no one’s watching Netflix in South America.”
And if we had been alerting on the opposite, an increase in failure, the graphs would have just followed along with the drop, because there was no increase in failure. Now, when we’re talking about monitoring: how many people does this mean something to? Oh, awesome. All right. So this is percentiles, basically, right? Look at this graph. This is a graph of the same data at three different percentiles. The blue line at the bottom is the 50th percentile. The next one is the 90th, and the top is the 99th.
[16:08] So what can we glean from here? The average customer is having a great experience; they’re seeing a couple of milliseconds of response time. This is response time for a website. At the 90th percentile, they’re having a pretty good experience too. But at the 99th percentile, they’re having a terrible experience. And this is meaningful, because it depends on your business whether you care about the 99th percentile. Do you care that 1% of your customers are having a terrible experience? Sometimes you don’t. Sometimes you’re like, “There’s nothing we can do, and that 1% is going to have a terrible experience,” and sometimes you do care. So it’s really important when you’re looking at your metrics to think about more than just averages, because averages can hide a lot of information.
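Here is a tiny self-contained illustration of how an average hides the tail, using only the standard library (`statistics.quantiles` needs Python 3.8+). The sample latencies are made up for the example.

```python
import statistics

def latency_summary(samples_ms):
    """Mean plus a few percentiles for a batch of latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }

# 99 fast requests and one five-second straggler: the mean stays small
# while the 99th percentile shows the terrible tail experience.
samples = [10.0] * 99 + [5000.0]
summary = latency_summary(samples)
```

Looking only at `summary["mean"]` you would think everything is fine; `summary["p99"]` tells you 1% of your customers are suffering.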
[16:53] You can estimate your percentiles. Somebody fairly recently developed an algorithm, called t-digest, to estimate percentiles as you go, so you don’t need all the data at once. And the errors are actually interesting, because the error in the percentile calculations gets worse in the middle. Which is great, because usually you don’t care about the middle; you care about the extremes. So this is a good way to calculate percentiles as you go, which is a good way to do your monitoring.
[17:24] But I want to show you this real quick. This is called Anscombe’s quartet. Every one of those graphs has the same summary statistics, but as you can see, they look very different. So even though you might be monitoring, and you think you’re monitoring the right things, you have to think about how the way you display your data affects your decisions and your perception of the data. Because you can display it in very different ways to make it look really good or really bad. So you have to make sure you’re not tricking yourself into thinking everything is great just because you’re displaying it to yourself in a great way.
[18:02] Another really important thing that I like to tell everybody is to use queues as much as possible. Queues are a great way to balance out load, and they’re a great way to do monitoring, because you can monitor queue lengths. But if you’re monitoring queue lengths, there’s something important you need to know. If you have a queue, then you have a queue depth graph such as this one. And you can see something has gone wrong here: all of a sudden, the queue depth is getting deeper. Why is it getting deeper? We don’t know. So this is where I like to talk about cumulative flow diagrams.
[18:37] These are really important if you’re monitoring anything that increases linearly, like queue depths. If you look at the graph now, essentially this is the number of inputs and outputs to a queue. And now you can see exactly where the problem lies: you can see that the departures have slowed down. The input to the queue is arriving at the same rate, but it’s being processed more slowly. So now you know where to focus your efforts. You know to look at whatever is taking things off the queue as the place where something might be broken.
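A cumulative flow diagram is just cumulative arrivals and cumulative departures plotted together; the queue depth at any moment is the vertical gap between the two lines. A minimal sketch, with made-up per-interval counts:

```python
from itertools import accumulate

def cumulative_flow(arrivals, departures):
    """Per-interval in/out counts -> (cumulative in, cumulative out, queue depth)."""
    cum_in = list(accumulate(arrivals))
    cum_out = list(accumulate(departures))
    depth = [a - d for a, d in zip(cum_in, cum_out)]
    return cum_in, cum_out, depth

# Arrivals stay constant, but departures slow down halfway through,
# so the gap (queue depth) starts growing: the consumer is the problem,
# not a surge of new work.
arrivals   = [5, 5, 5, 5, 5, 5]
departures = [5, 5, 5, 2, 2, 2]
cum_in, cum_out, depth = cumulative_flow(arrivals, departures)
```

A plain depth graph only shows the gap growing; plotting `cum_in` and `cum_out` separately shows *which* line bent, which is exactly the diagnostic the talk describes.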
[19:07] So, cumulative flow diagrams: if you take nothing else away from this talk, take that. Also useful is the fact that if you know very basic queuing theory, then you know that capacity utilization increases queue lengths exponentially. What does this mean in practice? It means that if you take machines away from your queue processing and your graph does not grow exponentially, then you are over-provisioned. If it starts growing exponentially without you doing anything, then you’re under-provisioned. So knowing the shape of the curve helps you predict how much infrastructure you need to process that queue.
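The classic textbook version of this is the M/M/1 queue, where the expected number of items in the system is L = rho / (1 - rho) for utilization rho. Strictly that curve is hyperbolic rather than exponential, but the practical point is the same one the talk makes: it blows up sharply as utilization approaches 100%. A quick sketch:

```python
def mm1_queue_length(utilization):
    """Expected number in system for an M/M/1 queue at the given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)

# Going from 50% to 90% to 99% utilization: the queue goes from ~1 item
# to ~9 to ~99, so removing a little capacity near the top explodes the
# backlog, while a flat graph after removing machines means you were
# over-provisioned.
lengths = [round(mm1_queue_length(u), 1) for u in (0.5, 0.9, 0.99)]
```

This is why the shape of the queue-depth curve predicts how much headroom your workers actually have.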
[19:54] And then variability in a queue is also very important. There are two ways to design a queue: you could have a bunch of machines with one queue for each, or you could have one master queue. At Reddit, at first we had one master queue, and that sucked, because a slow request would get to the head of the queue and block the whole thing. So we divided it across multiple machines, and then you’d get a slow request at the head of every queue, and now you had a bunch of machines that were blocked. Eventually we solved this problem by creating multiple queues for different speeds of requests. We would monitor how fast each type of API call was, and put fast ones over here and slow ones over here. So the way you divide up your queues can really have an effect on what you’re doing.
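The fast/slow split can be sketched as routing each call type to a queue based on its observed latency, so one slow request can no longer block fast ones behind it. The threshold, call names, and latencies below are made up for illustration; this is not Reddit's actual implementation.

```python
from collections import deque

SLOW_THRESHOLD_MS = 100.0

# Observed average latency per API call type (hypothetical numbers).
observed_latency_ms = {"get_user": 5.0, "search": 450.0, "vote": 3.0}

fast_queue, slow_queue = deque(), deque()

def enqueue(call_name):
    """Route known-slow call types to their own queue; unknown calls go slow."""
    latency = observed_latency_ms.get(call_name, SLOW_THRESHOLD_MS)
    target = slow_queue if latency >= SLOW_THRESHOLD_MS else fast_queue
    target.append(call_name)

for call in ["get_user", "search", "vote"]:
    enqueue(call)
```

With separate worker pools draining each queue, the 450 ms `search` calls only ever delay other slow calls, and head-of-line blocking of the cheap calls disappears.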
[20:42] Then another thing, really important: chaos engineering. This is the idea of simulating things going wrong and finding out what breaks. The two most important things to test in a distributed system are instance loss and increased latency, or slowness. So I’ll talk about the Simian Army, which you may have heard of. All system designs assume that something is going to fail at some point, so we created the Simian Army. You may have heard of the chaos monkey. There were a bunch of other monkeys that my team created; the chaos gorilla and the chaos Kong killed larger and larger things.
[21:23] The chaos monkey, which you may have heard of, killed machines in production randomly. This was a great way to make sure that people were writing good software that could handle instance failures. We just turned it on and did that to them. And they would say, “Hey, my stuff’s breaking.” And we’re like, “Good, you should fix your stuff.” And it worked; people started writing better software because they knew it was coming. And the same with all the other monkeys: they knew it was coming, so they would write good software. But it kept everybody honest, because if we had turned it off, people would have gotten lazy. So we just kept it going.
[21:54] And after a while, the chaos monkey was meaningless to everybody, because it would keep destroying stuff and it didn’t affect anybody. So we had to build a bigger monkey, kill more stuff, make people get better at figuring it out. Build even bigger, kill even more things. But the really important one was the latency monkey, because detecting slow is a lot harder than detecting down. We built the latency monkey to induce slowness between services, and this really helped each service tune what “slow” means to it, because for every service, slow could be something different. One service might need five-millisecond responses and another service might be fine with five-second responses; it totally depends on the service.
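A toy version of the latency-injection idea: wrap a service call so that some fraction of invocations get an artificial delay, forcing each consumer to discover what "too slow" means for it. The probability and delay values are illustrative, not Netflix's latency monkey settings.

```python
import random
import time

def with_latency(fn, probability=0.1, delay_s=2.0, rng=random.random):
    """Wrap `fn` so calls are delayed by `delay_s` with the given probability.

    `rng` is injectable so tests can make the behavior deterministic.
    """
    def wrapped(*args, **kwargs):
        if rng() < probability:
            time.sleep(delay_s)  # simulate a slow downstream dependency
        return fn(*args, **kwargs)
    return wrapped

# Hypothetical usage: wrap the client for one downstream service so 5%
# of calls take an extra two seconds, then watch whether your timeouts,
# fallbacks, and alerts actually fire.
slow_fetch = with_latency(lambda: "response", probability=0.05, delay_s=2.0)
```

Because the delay is injected between services rather than by killing anything, every team gets to answer the tuning question the talk raises: is a five-millisecond budget or a five-second budget the right one for *this* service?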
[22:39] We had to build a bunch of other stuff too, because we were in Amazon. Like the janitor monkey, to clean up extra resources, and a howler monkey that told us when Amazon was doing stupid things that we couldn’t fix, and stuff like that. Quickly, some tips on data. Mostly I like this slide just because I met Mr. Data, so I got to put this picture up. But the quick tips are: keep multiple copies of the data, keep those copies in multiple places, and don’t put it all on one machine or in one person’s head.
[23:09] And then the last thing I want to really quickly talk about is incident reviews, and making sure that when you have an incident review at your company, it is a collaborative environment. At one place I worked, the incident review was for figuring out whose bonus was going to get docked for the outage, so of course nobody wanted to come. So then they made a rule that if you didn’t come, it was you, and then everybody would come and be quiet. And it was a terrible place to have an incident review, because nobody ever wanted to claim responsibility for anything.
[23:39] At Netflix, it was the exact opposite. People would come running into the room saying, “I know what happened, I know what went wrong and why, this is how I’m going to fix it, and I need these people’s help.” And everybody got really excited about helping people fix problems. That was the environment we really liked. That collaborative, “I know what went wrong, I know how to fix it, let’s do this together, and let’s build tools that find whole classes of problems.”
[24:04] My company right now is all built on serverless. I like serverless because it means I don’t have to admin machines, which I really don’t like to do, despite the fact that I’ve been doing it for 25 years. Machines suck and I hate them, so I’m happy to let Amazon take care of it for me. That’s how I’m scaling my company now. The other way I’m scaling my company is via remote work. So really quickly, some notes on remote work. Number one, you must have a culture of remote work. You can’t just have one remote person; everyone needs to be remote. Everything should be asynchronous as much as possible. There should be no difference between being in the office and not being in the office, if you have one. We don’t have one.
[24:49] But one of the nice things about this microservices, distributed-computing style is that it really lends itself nicely to remote work. Small teams might be clustered in one or two time zones and work together, and if you have another small team in a very different time zone, it still works, going back to the idea that APIs are the key to communication. What people have generally found is that nine time zones is the max you can handle in a company. But maybe you can stretch that to 10, which is how far South Africa is from the west coast of the United States.
[25:27] Large open source projects are a great model for how to handle all of this, because they do everything asynchronously and distributed. So look at the Linux kernel or something to see how they handle it. Quickly wrapping up, putting it all together: use the cloud, microservices, and dev ops. Empower your engineers with self-service, automate everything, monitor the right things, and use chaos testing to break things on purpose. Go serverless, so you don’t have to bother with machines. And, for me at least, a remote-first culture.