Zero downtime Clojure deployments

We're heavily into microservices at uSwitch, with many of them being Clojure Ring applications, and our infrastructure is hosted on Amazon AWS.

One of the advantages of microservices is that horizontal scaling, especially with EC2 hosting them, is simple: add more machines! Unfortunately the use of Clojure, or more specifically the JVM requirement and its associated poor startup performance, means that deployments can take an unreasonable amount of time. To work around this we run a remove-upgrade-add style of deployment: a host machine is removed from the corresponding ELB; the service is upgraded; then the machine is returned to the ELB.

So upgrading a service for us, at the moment, goes something like this:

Current upgrade situation

The steps of this system are:

  1. Initially running with nginx as a reverse proxy to service v1;
  2. Remove first host from the ELB;
  3. Stop service v1 on host;
  4. Update to service v2 on host;
  5. Put host back into the ELB;
  6. Remove next host from the ELB;
  7. Stop service v1 on host;
  8. Update to service v2 on host;
  9. Put host back into the ELB.

Although this works in the majority of cases, we've been unhappy with it as a solution for several reasons:

  1. If the service is on a single box then we will lose that service for the period of deployment;
  2. The remove-deploy-add deployment means that the overall deployment time is linear with respect to the number of hosts;
  3. If the newly deployed service fails to start properly we can, potentially, lose our entire service infrastructure from the ELB;
  4. Removing a host machine from an ELB may remove more than one service and hence degrade our system.

The solution we decided to investigate, as part of our recent hack day, was based on a simple decision: if a service started listening on a random port then we could run two instances, and therefore two different versions, of the service at the same time. The complications are then dealing with that randomly assigned port when nginx reverse proxies the service, and tidying up previously running versions of the service. These issues can be solved, though, by using a service registry, such as etcd, where our services can store their port and PID (process ID), and by watching for changes with a process like confd.

The hack day was about trying to create a deployment system more like this:

Zero downtime deployment situation

The steps of this are:

  1. Initially running with nginx as a reverse proxy to service v1;
  2. Service v2 starts & signals the previous v1 port and the new v2 port to etcd;
  3. confd detects port change, regenerates nginx configuration & reloads nginx, disconnecting service v1 & connecting service v2;
  4. Service v2 signals the previous v1 PID and the new v2 PID to etcd;
  5. confd detects PID change, generates & executes kill script, killing service v1.

The behaviour of an nginx reload, where the master process starts new workers and then kills the previous workers, should mean that downtime for the service is essentially zero.

The solution

The service

We're going to use Stuart Sierra’s excellent component project to manage the lifecycle of our service which, for the moment, will simply store a random number on initialisation and serve that back in response to any request. Getting Jetty to start on a random, operating-system-assigned port is simply a matter of passing zero as the desired port number. If we then communicate this port number in a way that nginx can pick up, we can run multiple instances of our service at one time and switch the reverse proxy between them.
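
A minimal sketch of such a component, assuming ring.adapter.jetty and com.stuartsierra.component; the namespace, handler and random-number choice here are illustrative rather than copied from the hack day project:

```clojure
(ns etcd-experiment.service
  (:require [com.stuartsierra.component :as component]
            [ring.adapter.jetty :refer [run-jetty]]))

(defn handler
  "Build a Ring handler that always responds with the number chosen at start-up."
  [n]
  (fn [_request]
    {:status  200
     :headers {"Content-Type" "text/plain"}
     :body    (str n)}))

(defrecord WebServer [server port]
  component/Lifecycle
  (start [this]
    (let [server (run-jetty (handler (rand-int 1000000))
                            {:port 0 :join? false}) ; port 0: the OS assigns a free port
          port   (.getLocalPort (first (.getConnectors server)))]
      (assoc this :server server :port port)))
  (stop [this]
    (when server (.stop server))
    (assoc this :server nil :port nil)))

(defn new-web-server []
  (map->WebServer {}))
```

The :port value only exists once Jetty has bound its socket, which is exactly the property the rest of the system relies on.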

Our etcd will be running on the host machine and not clustered: there is no need to communicate this service information outside of the host machine. To communicate the port number from our service to etcd we will employ etcd-clojure, and use a known key, uswitch/experiment/port/current.
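
A sketch of a component that publishes the port, assuming etcd-clojure exposes a set function along these lines and talks to the local etcd by default (the namespace alias and exact signature should be checked against the library's README):

```clojure
(ns etcd-experiment.port-registrar
  (:require [com.stuartsierra.component :as component]
            [etcd.core :as etcd])) ; namespace assumed; see etcd-clojure's README

(defrecord PortRegistrar [web-server]
  component/Lifecycle
  (start [this]
    ;; The web server is already running by the time this starts, so its
    ;; randomly assigned port is known and can be published to local etcd.
    (etcd/set "uswitch/experiment/port/current" (str (:port web-server)))
    this)
  (stop [this]
    this))

(defn new-port-registrar []
  ;; Depend on the web server so the port is only published after a successful start.
  (component/using (map->PortRegistrar {}) [:web-server]))
```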

We'll build a component that will store the PID of the service into uswitch/experiment/pid/current and ensure that it is dependent on the service component itself.
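
A sketch of that component; the PID comes from the runtime MX bean, whose name is conventionally of the form "pid@hostname" on HotSpot JVMs:

```clojure
(ns etcd-experiment.pid-registrar
  (:require [clojure.string :as str]
            [com.stuartsierra.component :as component]
            [etcd.core :as etcd]) ; namespace assumed, as above
  (:import [java.lang.management ManagementFactory]))

(defn current-pid
  "Extract this JVM's PID from the runtime MX bean name (\"pid@hostname\")."
  []
  (first (str/split (.getName (ManagementFactory/getRuntimeMXBean)) #"@")))

(defrecord PidRegistrar [web-server]
  component/Lifecycle
  (start [this]
    (etcd/set "uswitch/experiment/pid/current" (current-pid))
    this)
  (stop [this]
    this))

(defn new-pid-registrar []
  ;; Depending on the web server guarantees the PID is only published once
  ;; the service has started successfully.
  (component/using (map->PidRegistrar {}) [:web-server]))
```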

We will also retain the previous values for both of these keys in uswitch/experiment/port/previous and uswitch/experiment/pid/previous, which is supported in our code by etcd-experiment.util/etcd-swap-value.
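
The real helper lives in the hack day project; a guess at its shape, assuming an etcd get that returns nil for a missing key, would be:

```clojure
(ns etcd-experiment.util
  (:require [etcd.core :as etcd])) ; namespace assumed

(defn etcd-swap-value
  "Copy the value under <prefix>/current to <prefix>/previous, then store
   new-value under <prefix>/current."
  [prefix new-value]
  (when-let [current (etcd/get (str prefix "/current"))]
    (etcd/set (str prefix "/previous") current))
  (etcd/set (str prefix "/current") new-value))

;; The registrar components would then call, for example:
;;   (etcd-swap-value "uswitch/experiment/port" (str port))
;;   (etcd-swap-value "uswitch/experiment/pid" (current-pid))
```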

The advantage of random port assignment is not only that we can run the same service multiple times but also that the port number is only available after the service has started. Hence we only write the port and, via the component dependency, the PID information to etcd after the service has been successfully deployed and started.

The infrastructure

The use of etcd might seem like overkill, except that it allows the reaction to a newly deployed service to be separated from the service itself: we can watch the etcd keys and react to them in any way we desire, without tightly coupling that behaviour into the service. confd uses configuration files to react to etcd key changes in order to generate files and run commands, and it’s this that we'll be using.

Our service will have an nginx configuration file associated with it, written to /etc/nginx/sites-enabled/experiment.conf, to enable multiple services to run on an individual host. To keep this file in step with the information in etcd we add a configuration file, /etc/confd/conf.d/experiment-nginx.toml, that watches uswitch/experiment/port/current, regenerating our nginx configuration file and causing nginx to reload its configuration when the value changes. The template for the nginx configuration file is simple, requiring only that we set the randomly assigned port in the output file.
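
A sketch of the two files, using confd's template functions (the server_name and file names here are illustrative, and older confd releases used a slightly different template syntax):

```toml
# /etc/confd/conf.d/experiment-nginx.toml
[template]
src        = "experiment.conf.tmpl"
dest       = "/etc/nginx/sites-enabled/experiment.conf"
keys       = ["/uswitch/experiment/port/current"]
reload_cmd = "service nginx reload"
```

```
# /etc/confd/templates/experiment.conf.tmpl
server {
    listen 80;
    server_name experiment.example.com; # hypothetical name

    location / {
        proxy_pass http://127.0.0.1:{{getv "/uswitch/experiment/port/current"}};
    }
}
```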

An nginx reload causes the master process to start new workers and then kill the old ones, which means that we have zero downtime from the perspective of client applications. Because of this we do not need to remove the host machine from the ELB in order to update our service, and therefore we can drop the remove-upgrade-add deployment in favour of parallel deployment to all machines.

We can clean up the previous service by watching uswitch/experiment/pid/previous with confd and generating a script that can be executed to kill the process with the associated PID.
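
A sketch of what that might look like, again with illustrative file names; confd runs the reload_cmd after it has written the generated file:

```toml
# /etc/confd/conf.d/experiment-kill.toml
[template]
src        = "kill-experiment.sh.tmpl"
dest       = "/usr/local/bin/kill-experiment.sh"
mode       = "0755"
keys       = ["/uswitch/experiment/pid/previous"]
reload_cmd = "/usr/local/bin/kill-experiment.sh"
```

```
#!/bin/sh
# /etc/confd/templates/kill-experiment.sh.tmpl
# Kill the previously running version of the service; ignore failures if
# the process has already gone away.
kill {{getv "/uswitch/experiment/pid/previous"}} || true
```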

With all of this in place, and with confd checking etcd periodically, we can start our service for the first time and see the nginx configuration generated and nginx reloaded. If the service is started a second time the nginx configuration is regenerated & nginx reloaded; the previous service is killed; and, far more importantly, the number that the service returns changes!

If you're interested in trying this on your own machine there are instructions included in the hack day project.

Conclusion

Hopefully this fairly in-depth walkthrough of the system has convinced you that we have:

  • effectively zero downtime for a deployment, where we used to have a reduction in availability;
  • the ability to deploy across multiple machines in parallel meaning we have a near constant deploy time, where we used to have a linear one;
  • improved reliability as services are replaced only after successfully starting, where before we would have to roll back;
  • isolation of services so that they are unaffected by deployments on the same machine, where before we would degrade more than the service being deployed.

The next real step in this, and one that is at the core of the microservices architecture, would be to cluster etcd and remove nginx completely: if client applications used the registry to locate the service then none of this would be necessary. In fact, we would also look to drop etcd for a full service registry, such as consul or zookeeper, the latter already being employed in some of our other projects. This, however, requires much more effort from our many client applications, so it’s a way off!

At the moment this remains a hack day piece of code: it works but it is yet to be truly battle tested. Given that we have many services running across many hosts, and we deploy regularly, this solution would save us a considerable portion of our time and may end up being used in our production systems.