Using Infrastructure as Code to Scale Months Down to Minutes

Building server systems manually is slow and inconsistent. However, by allowing administrators to rapidly provision hardware with consistent configuration, Infrastructure as Code (IaC) can scale a six-month process down to 20 minutes. Here's how I got started with it, and how it's helped me.


Let me set the scene for you: It's November, and I've just "agreed" with my boss's boss that we would automate the build of our entire Internet Banking infrastructure and application stack by February – and by automate, he meant that he should be able to build it at the click-of-a-button, on stage, and in front of 500+ people.

With the cloud and tooling available today, this probably wouldn't be such a big deal. However, at the time, and with the resources we had available, this was not such a straightforward task.

Our application stack had two components: A web front end, and a Java 2 Enterprise Edition (J2EE) backend. We would need to deploy multiple instances of these components for High Availability (HA) in production, but also have multiple deployments for testing. We were given the freedom of choosing what software and operating system we wanted to run, so our full stack was as follows:

	Web Component	App Component
Application	HTML & JS	J2EE backend app
Software	Apache HTTP Server	Jboss
Operating System	Redhat Linux	Redhat Linux

Firstly, we needed a way to provision machines automatically: Usually it would take about six months to get a new server to host something, so we wanted to automate and remove the wait time. A side effect of this was that we were never really worried about breaking a server, because if it broke we could just provision a new one.

Once the machines were provisioned, we needed a way to configure them and install all the software required: Automatically provisioning a server is great, but you generally need to install stuff on it to make it useful.

Lastly, although automation is cool, we needed verification that our automation was giving us the desired outcome, and not breaking the environment.

This is how we addressed those three things.

Automated provisioning

At this point, every step of the server provisioning process in the organisation was done manually by an "expert" in that particular field. Whether a physical or virtual server, Windows or Linux, you would log a ticket and anywhere between three and 30 people would scurry off and work on building you a machine. The people involved could look something like this:

A person to give you an IP address
A person to then create a DNS entry for that IP address
A person to create an empty shell of a machine (if virtual)
A person to install the operating system
A person who always had to get involved even though I'm not entirely sure what they did
A person to create user accounts so that you can access your machine
A person to install any baseline software you needed (e.g. antivirus, backup software, monitoring etc.)
A person/people to install whatever software you needed
A person to sacrifice a goat to ensure all the previous people had done their piece correctly
A person to check everything was done correctly (even though most times it wasn't)
And then other people that I've probably forgotten too!

All the people in the list above were really intelligent, but they were also doing things the way they had always been done. They had never really been given the freedom to innovate or do things differently. Automating a rigid, conventional process that involved 30-odd people was definitely the hardest piece of the puzzle.

How we did it

I'd love to say we got this perfect the first time (spoiler alert: we didn't). Having said that, our solution for automatically provisioning machines did work pretty well for our use case.

In our set-up, we were building virtual machines on KVM, so we automated VM creation by writing a Python script to interface with KVM. We did this because the XML API was not the most straightforward to use. You would feed the script some parameters, like ServerName and IP Address, and it would then call virt-install remotely on KVM. This would then build a stock standard Linux VM.
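This isn't our original script, but a minimal sketch of the idea might look like the following. The hypervisor hostname, the root SSH user, the sizing defaults, and the install tree and kickstart URLs are all hypothetical placeholders, and the function only builds the virt-install command line rather than running it.

```python
import shlex


def build_virt_install_command(
    server_name,
    ip_address,
    kvm_host="kvm01.example.com",  # hypothetical hypervisor name
    memory_mb=4096,  # assumed default sizing
    vcpus=2,
    disk_gb=20,
    install_tree="http://mirror.example.com/rhel/os/",  # hypothetical URL
    kickstart_url="http://mirror.example.com/ks/base.cfg",  # hypothetical URL
):
    """Return the virt-install argument list for one stock Linux VM."""
    # Kernel arguments handed to the installer: kickstart file plus static IP.
    extra_args = f"ks={kickstart_url} ip={ip_address}"
    return [
        "virt-install",
        "--connect", f"qemu+ssh://root@{kvm_host}/system",
        "--name", server_name,
        "--memory", str(memory_mb),
        "--vcpus", str(vcpus),
        "--disk", f"size={disk_gb}",
        "--location", install_tree,
        "--extra-args", extra_args,
        "--graphics", "none",
        "--noautoconsole",
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run.
    cmd = build_virt_install_command("chop17", "10.17.93.86")
    print(shlex.join(cmd))
```

Returning a list of arguments rather than one long string keeps the command easy to test and avoids shell quoting problems if it is later handed to something like subprocess.run.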

All-in-all, the process would take roughly 15 minutes, and was limited so that we could only build one VM at a time. Like I said, not perfect. It was, however, still a big dent in the VM provisioning process, which could take anywhere from three weeks (if you knew the right backs to scratch) to six months. And, while I had never experienced it, I'd heard horror stories of it taking up to 18 months to get servers.

What we learned

Don't reinvent the wheel if you don't have to.

While Terraform was still a twinkle in Hashicorp's eye (the first stable release only came out while we were mid-way through automating all of this), there were other free open source options out there at the time. While our Python scripts worked great, they had to constantly be maintained. We also spent time solving problems that had already been solved. This time could have been better spent elsewhere.

Build a Minimum Viable Product, and then iterate on it.

One of the other teams in the organisation at the time were trying to build an enterprise solution for provisioning servers. The problem was they wanted to support every operating system under the sun (intended pun incoming...) including SUN Solaris, every flavour of Linux you can think of (Ubuntu, RHEL, SLES, OL etc.), multiple releases of Microsoft Windows and even z/OS on the mainframe. They struggled to get something usable working as their focus was too wide. We were successful as we only had to automate one version of one operating system.

Always start small and try and get one thing working. It might not be perfect, but if your users are getting value, that's ok.

If you can't solve something the conventional way, cheat!

Some things are just not easy to automate. An example of this in our case was DNS: In our environment, our DNS solution had no API or command line. The only way to create a DNS entry was to manually click around a GUI. So, we worked around it until we could find a better solution, and preallocated IPs with generic DNS names. We found that most people/teams that wanted a VM didn't really care what it was called, as long as they had a working VM. Calling it http://chop17.example.com was a lot nicer than http://10.17.93.86/.
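The preallocation trick can be sketched as a simple pool that hands out name and IP pairs whose DNS records already exist. This is an illustration only, not the tool we used: the class name, the second host entry and the team names are made up for the example.

```python
class PreallocatedPool:
    """Hands out host names whose DNS records were created in advance."""

    def __init__(self, entries):
        # entries: iterable of (dns_name, ip_address) pairs
        self._free = list(entries)
        self._in_use = {}

    def allocate(self, owner):
        """Give the next free (dns_name, ip_address) pair to an owner."""
        if not self._free:
            raise RuntimeError("no preallocated addresses left")
        name, ip = self._free.pop(0)
        self._in_use[name] = (ip, owner)
        return name, ip

    def release(self, name):
        """Return a name to the pool once its VM has been destroyed."""
        ip, _owner = self._in_use.pop(name)
        self._free.append((name, ip))


if __name__ == "__main__":
    pool = PreallocatedPool(
        [
            ("chop17.example.com", "10.17.93.86"),
            ("chop18.example.com", "10.17.93.87"),  # hypothetical entry
        ]
    )
    print(pool.allocate("web-team"))
```

Because the DNS entries never change, the only state to manage is which names are currently taken, which sidesteps the missing DNS API entirely.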

Configuration Management

Configuration Management is a complex topic, which could be 12 blog posts on its own. Our problem was that there was always back-and-forth communication between the different teams when installing software, which made it more tedious than productive.

In our case, we needed two main pieces of software to be installed, Apache HTTP Server and Jboss. These then needed additional modules and custom configuration, and our mandate was that it all had to be automated. It didn't make sense for us to have to ask someone to do it for us every time we wanted a new Jboss or Apache instance.

This unnecessarily long process would look something like:

Web Server Admin needs Apache installed.
He logs a call for the Operating System team to do it, as he does not have access.
The OS team installs Apache, but doesn't give him access to configure Apache.
He logs another call to get access.
He gets access and now configures Apache, but realises he needs a certificate to do SSL/TLS.
He logs another call with a different team to get a certificate.
He finally gets his certificate but realises he doesn't have the SSL module installed.
He logs yet another call with the OS team.
OS team installs the module but the Web Server Admin needs a different version.

I could go on, but you get the picture. Again, this isn't because the Web Server Admin didn't have the skills to install the software himself; this was simply how things had always been done in big corporates.

How we did it

We got the nice people at CHEF to help us. The alternative at the time would have been Puppet, but through some team discussion we just decided to go with CHEF. Using some cookbooks from their supermarket, and some we wrote ourselves, we could now deploy Apache, Jboss and all the other components that made up our application stack.

The benefits were that our configuration was now human readable and a lot less error prone.

We also found some awesome new tools that we added to our stack from checking what cookbooks were out there and what other people were doing. HAProxy and Sensu were two that made a huge impact on our stack: We used HAProxy to set up Blue/Green deployments, which gave us really great flexibility, and with Sensu we could easily define what monitoring we needed through code.
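For anyone new to Blue/Green deployments, the core decision logic can be sketched in a few lines. This is not HAProxy configuration and not how our cookbooks were written: the class, the colour names and the health check function are assumptions for illustration, and actually repointing the load balancer is left out.

```python
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    def __init__(self, health_check, live="blue"):
        self.health_check = health_check  # callable: colour -> bool
        self.live = live

    @property
    def idle(self):
        """The environment that is not currently serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def switch(self):
        """Promote the idle environment, but only if it is healthy."""
        candidate = self.idle
        if not self.health_check(candidate):
            raise RuntimeError(
                f"{candidate} failed its health check; staying on {self.live}"
            )
        self.live = candidate
        return self.live


if __name__ == "__main__":
    # Hypothetical health check that always passes.
    router = BlueGreenRouter(health_check=lambda colour: True)
    print(router.switch())
```

The flexibility comes from the fact that the old environment stays untouched after a switch, so rolling back is just another call to switch().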

I encourage you to look at both of those if you're looking for an awesome load balancer or a monitoring tool that doesn't suck!

What we learned

Reuse existing community cookbooks.

There are some really good cookbooks (and manifests/playbooks) out there, and if you understand them, you should use them. Later on, we wasted a lot of time moving from custom cookbooks we had written ourselves to community-maintained cookbooks. In hindsight, starting with the community cookbooks is something I would recommend from the outset.

Again, build a Minimum Viable Product and iterate.

If you have to write your own cookbook (or manifests/playbooks if you're using Puppet/Ansible) to do something, remember: MVP, MVP, MVP.

Rather get something out that people can use, and then take their feedback into account to develop it further.

In a different instance, I worked with a team that spent six months writing a flexible cookbook with lots of customisable options. This automated one component of their application, which they only ever deployed once. It would have been far more useful to put together a simpler static cookbook, work on something else, and add the customisation later if they needed it.

Use the tooling already out there to make your IaC better.

For CHEF, I use Foodcritic/cookstyle, and I wish I'd known about them when I started because it would have made my code so much better. They're also really good for picking up errors in your code, which saves time and prevents frustration.

Build verification

Why do you want to test and verify what you've built? I mean, you built it, so you should know what you've built... right?

When our servers were still built manually, we found they were often inconsistent. For example, if I asked for two hosts with Apache, server one would have two CPUs, 4GB memory and be running Apache 2.2; and server two would have one CPU, 4GB memory and be running Apache 2.4. Now, imagine what would happen if I asked for 20 servers. Chaos.

Even in our case, we had built all this automation but still wanted that assurance that our automation was working as intended. We wanted to know that the Jboss cookbook installed Jboss, that it was the right version, and that our Apache Web Server was listening on the right port.

How we did it

We found an awesome infrastructure testing framework called ServerSpec. Anytime we built and configured a server, we then ran our test scripts to make sure things we had declared in code were in fact what was on the server. Our initial tests were simple, but they did manage to catch a few bugs along the way. For example, we had tests that:

Checked that Apache 2.2 with the SSL module was installed: This was useful because if the SSL module wasn't installed, Apache would not start. Because of this test, we knew exactly what was wrong and didn't have to waste time debugging it.
Checked that an extra disk had been added to the VM for storing logs: If we didn't provision the extra disk, we would run out of log space. The application would still run, but wouldn't be able to log any errors, which made debugging application errors impossible.
Checked the machine was running the latest OS patches available to us: Keeping our hosts patched meant we wouldn't be vulnerable to exploits like Heartbleed and Shellshock.
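ServerSpec tests are written in Ruby, so the following is not ServerSpec code. It is a rough Python sketch of the same idea, using only the standard library: small named checks that compare what is on the server with what was declared. The function names and the example values in the usage block are hypothetical.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def version_matches(installed, expected_series):
    """Return True if e.g. '2.2.15' belongs to the expected '2.2' series."""
    return installed == expected_series or installed.startswith(expected_series + ".")


def run_checks(checks):
    """Run named checks and return the names of those that failed."""
    return [name for name, check in checks.items() if not check()]


if __name__ == "__main__":
    failed = run_checks(
        {
            # Hypothetical values; a real run would query the host instead.
            "apache is the 2.2 series": lambda: version_matches("2.2.15", "2.2"),
            "apache listens on 443": lambda: port_is_listening("127.0.0.1", 443),
        }
    )
    print("failed checks:", failed)
```

Giving every check a readable name is what makes a failure useful: the report tells you what is wrong before you start debugging.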

What we learned

Infrastructure testing is valuable for upgrades.

From an IaC perspective, testing didn't add that much value in the beginning. Where it was super handy, though, was for software upgrades, because we could verify whether things still worked in minutes rather than hours or days, and with little effort. For example, when going from RHEL 6 to 7, does my Apache still work? Or when going from Apache 2.2 to 2.4, is my config still valid?

Infrastructure testing is valuable for config changes.

One of the members of our team reduced the memory of our application servers from 4GB to 2GB, which prevented the application from starting. If we had tests check that the application software started, we'd have picked this issue up sooner. Instead, we spent a few hours trying to figure out why the application wouldn't start, as no one had noticed the config code that had reduced the memory.
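A cheap companion to a "did the application start?" test is a guard on the config itself. The sketch below is hypothetical: the function name is made up, and the 3072 MB minimum is an assumed figure for illustration, not a number from our environment.

```python
def check_memory_setting(configured_mb, required_mb):
    """Return an error message if the VM is too small, else None."""
    if configured_mb < required_mb:
        return (
            f"VM has {configured_mb} MB but the application needs "
            f"at least {required_mb} MB to start"
        )
    return None


if __name__ == "__main__":
    # 3072 is an assumed minimum, used only to show a failing case.
    print(check_memory_setting(configured_mb=2048, required_mb=3072))
```

Run as part of the build, a check like this fails at the moment the config change is made, rather than hours later when someone notices the application is down.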

Did we make it? Absolutely. Did we cheat a little? Sure. Was it one button? Obviously.

The three months of non-stop work culminated in condensing an otherwise six-month process into building our full production stack in under 20 minutes, at the touch of a button.

Early in February, the boss's boss stood up on stage in front of an auditorium packed full of people and gave a speech about Agile, DevOps, IaC, the 1995 Springboks (don't ask) and other stuff, while on a cinema-sized screen behind him our production stack was building. And yes, he had clicked only one button. Success!

A TL;DR checklist

Use what's out there: Don't try and write your own provisioning system/software, and use community cookbooks/manifests/playbooks rather than writing them from scratch.
Start small and automate one thing at a time (see the links below for getting started): Rather use something that is a bit rough around the edges and make it better over time than trying to get it perfect before using it.
Hack it: If something can't be done the usual way, get creative - even if it's a bit of a hack.
Infrastructure code is still code: So, use the relevant lint/syntax checkers.
Always write tests: This will save you plenty of time in the long-run.

Ready to get started? Here's your crash-course resource list:

Automated Provision:

Configuration Management:

Infrastructure Testing:

Marcus Talken is an Automation Engineer and has presented at both local and international devops conferences. He is passionate about infrastructure as code and configuration management. When not trying to automate the impossible or hanging out at a local JHB meetup, he can be found on the internet - like here.
