Evolving our infrastructure

At the start of 2010 uSwitch was a monolithic .NET application running on physical infrastructure controlled by centralised IT operations and DBA teams. Releases required careful coordination across release managers, developers, testers and product owners; if something broke, everything broke.

Following its acquisition by Forward later that same year, we gradually started to replace systems. AWS was the perfect place for new services: scalable, self-service, on-demand infrastructure (and in stark contrast to the existing system). New services were built and operated directly by teams on AWS. Another team of people had the thankless task of looking after the legacy services until they were strangled away.

Self-sufficient teams emerged: as development teams owned more of their services, it became less necessary to coordinate day-to-day work across the whole company. The organisation restructured around different verticals (reflecting the different products we compare for our users), and within each vertical teams organised around products and/or services. This team-based self-sufficiency has historically served us well, bringing agility and resilience. But self-sufficiency has also come at a cost.

How many ways can we design a hammer?

Over time our verticals have diverged in how they deliver their services. Although they all had the same goal, we ended up with Capistrano deployments, Debian packages, Puppet-managed machines, Terraform-managed ECS clusters, and more. While the majority of teams now package into containers and deploy into ECS clusters, we still have a wide array of duplicated supporting tooling. At one point, we had six different CLI tools for talking to ECS, with large overlaps in functionality.

Depending on the team, we also use several different tools to operate our services: at least four different ways of collecting metrics, and a couple of different logging aggregators.

Teams are having to invest a lot of time in building and maintaining these different ways of delivering and operating their services. Problems found and solved by one team often aren't easily portable to another, and our inter-team communication channels haven't traditionally been great.

Explosions in complexity

Back in 2010, the AWS offering was a lot smaller than it is today: EC2, ELB, S3 and RDS. Today, Amazon releases thousands of new features each year, and we seem intent on using all of them. The amount of knowledge required to build, run and operate infrastructure on AWS has rocketed.

We did some analysis of our Amazon CloudTrail data (which we'd been collecting since 2014) to see whether the data bore this out. We've seen roughly a 3x increase in the median number of services each person uses, in just two years. The plot below covers data from the end of 2014 to the start of 2017.

Chart showing increase in the median number of services each person uses.
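For the curious, the analysis itself is simple to sketch. Here's a minimal illustration in Python, assuming the CloudTrail events have been flattened into a CSV with eventTime, userIdentityArn and eventSource columns (the file and column names are illustrative, not our actual pipeline):

```python
# Minimal sketch: given CloudTrail events flattened into a CSV, compute the
# median number of distinct AWS services each person touches per month.
import pandas as pd

events = pd.read_csv("cloudtrail_events.csv", parse_dates=["eventTime"])

# eventSource looks like "ec2.amazonaws.com"; strip the suffix to get the service name.
events["service"] = events["eventSource"].str.replace(".amazonaws.com", "", regex=False)
events["month"] = events["eventTime"].dt.to_period("M")

# Distinct services per person per month, then the median across people.
per_person = (
    events.groupby(["month", "userIdentityArn"])["service"]
    .nunique()
    .reset_index(name="distinct_services")
)
print(per_person.groupby("month")["distinct_services"].median())
```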

Although this does confirm that people are using more parts of AWS, it doesn't control for Amazon releasing new services. Ideally you would hope to see growth in higher-level services like Redshift and ECS, and a reduction, or at least limited growth, in lower-level services like EC2, ELB and IAM.

We categorised the services shown in CloudTrail and looked at the activity for the lower-level plumbing-type services. This shows enormous growth: from a few hundred events in 2015 to hundreds of thousands in 2017.

Chart showing growth in low-level service usage.
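A rough sketch of that categorisation, continuing from the illustrative data frame above (the set of low-level services here is not the full mapping we used):

```python
# Bucket events into low-level "plumbing" services and count that activity
# per year. The categorisation below is illustrative only.
LOW_LEVEL = {"ec2", "elasticloadbalancing", "iam", "autoscaling", "cloudformation"}

events["year"] = events["eventTime"].dt.year
low_level = events[events["service"].isin(LOW_LEVEL)]
print(low_level.groupby("year").size())
```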

This growth in the use of lower-level services is compounded by the different ways teams use them; our security groups, for example, now look a little like a dense jungle.

When infrastructure comes last

Our teams are focused on delivering value to our users, but this often means doing just enough infrastructure work to survive.

Take the most basic example of how infrastructure work gets rushed: choosing an instance type. The vast majority of the services we run are memory-bound rather than CPU-bound, but we used to operate mostly on compute-optimised instances. For the same number of cores, an r4.2xlarge is about 3x cheaper per GiB of memory than a c4.2xlarge (and 3.3x cheaper if you reserve).
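A rough back-of-the-envelope check, using approximate us-east-1 on-demand prices from around that time (exact figures vary by region and change over time):

```python
# Compare the per-GiB-of-memory cost of the two instance types.
c4_price, c4_mem_gib = 0.398, 15   # c4.2xlarge: 8 vCPUs, 15 GiB (approx. $/hour)
r4_price, r4_mem_gib = 0.532, 61   # r4.2xlarge: 8 vCPUs, 61 GiB (approx. $/hour)

c4_per_gib = c4_price / c4_mem_gib   # ~$0.027 per GiB-hour
r4_per_gib = r4_price / r4_mem_gib   # ~$0.009 per GiB-hour
print(f"c4.2xlarge costs {c4_per_gib / r4_per_gib:.1f}x more per GiB of memory")
```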

Our inefficiency goes beyond choosing instances poorly. Self-sufficiency and conservatism have resulted in a grossly over-provisioned and underutilised infrastructure. We did some analysis to control for growth in products or features and even then our costs were increasing; we’d been finding more expensive (and arguably more complicated) ways of achieving the same result.

When a team decides on a new direction for their infrastructure, it is rare for their focus to stay on the migration through to completion. This has left behind a trail of old and rarely touched infrastructure, with knowledge of how it is set up and how it works slowly disappearing. This is a big burden, as it tends to be the more complicated and business-critical services that get left behind.

How are we evolving?

We've formed a team to help set the direction of our infrastructure, developing common patterns with teams and working with them to get there. We've identified the following areas we would like to improve upon:

  • Problems being solved repeatedly across uSwitch;
  • Teams having to continually invest more time and develop deeper specialisms to run their infrastructure;
  • Inefficient use of the infrastructure we have.

In order to reduce the duplication, we are moving to a consistent way of deploying and running services across all of our teams. Since we want to reduce how much our verticals engage with the lower-level services on AWS, we looked at systems that could provide a better level of abstraction. We've ended up with Kubernetes at the core, as it seemed the most stable, understandable and healthy system to build upon. This consistency doesn't stop at how we deploy and run our code: we are also rolling out a unified CLI tool that aggregates the functionality of the various tools we had before.
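As a purely hypothetical illustration of what aggregating that tooling could look like, here's a tiny CLI sketch that wraps a couple of common kubectl operations behind consistent subcommands (the tool name, commands and labels are made up for the example, and it assumes kubectl is installed and configured):

```python
# Hypothetical unified CLI: one entry point for day-to-day operations,
# delegating to kubectl under the hood.
import argparse
import subprocess

def deploy(args):
    # Apply a service's manifests to the cluster.
    subprocess.run(["kubectl", "apply", "-f", args.manifest], check=True)

def logs(args):
    # Tail recent logs from every pod carrying the service's label.
    subprocess.run(["kubectl", "logs", "-l", f"app={args.service}", "--tail=100"], check=True)

parser = argparse.ArgumentParser(prog="infra")
subcommands = parser.add_subparsers(dest="command", required=True)

deploy_cmd = subcommands.add_parser("deploy", help="Deploy a service's manifests")
deploy_cmd.add_argument("manifest")
deploy_cmd.set_defaults(func=deploy)

logs_cmd = subcommands.add_parser("logs", help="Tail a service's logs")
logs_cmd.add_argument("service")
logs_cmd.set_defaults(func=logs)

args = parser.parse_args()
args.func(args)
```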

Making better use of our infrastructure will be easier once we have consistency. By reducing the number of distinct clusters and machines to a handful, we can increase the utilisation of each machine. There is also better tooling for Kubernetes to scale clusters based on demand, so we don't have to carry the capacity for nightly job execution as part of our usual overhead.

While we think having a team pushing for consistency and a better standard of infrastructure is important for uSwitch, we never want to block other teams. Our infrastructure is owned by all of us, not just by one team. Through consistency we can build tools and processes that speed up teams and reduce their burden, but we will do this in collaboration, not through dictation.