Maintenance like an airplane, not a car

If a tree falls in the forest and there is no one there to hear it, does it make a sound?

If a load balancer detects a failed health check, sends the next request to a different machine, and the autoscaling group replaces the failed one, did we have an outage?

If maintenance activities happen on our infrastructure, but no one has to perform them, did we actually do it?

When machines lived in data centers, building a brand-new machine was not an easy task unless you were a high performer. When we moved workloads to the cloud, new machines became an API call away. Yet I often work with teams that still talk about maintenance windows, after-hours work, patching, KTLO (keeping the lights on), and O&M (operations and maintenance). These are holdovers from the days when a new machine was a chore.

One of the first things I did when joining companies that operated infrastructure at scale (this has happened a number of times) was to make building a new machine a trivial exercise. Yes, even in the data center. When this is easy, we stop thinking about machines as things that will run for a long time (pets) and start thinking of them as things that serve a purpose only for as long as necessary (cattle). We can start thinking about the life cycle of a machine:

  • How long do we expect this to live?
  • What purpose does it serve? (Pro tip: a machine should serve only one purpose. Running multiple services on one machine is a holdover from when provisioning a machine was hard.)
  • What do we do if there is a problem?
  • How hard would it be to get an exact replica?
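
To make this concrete, here is a minimal sketch, assuming AWS and boto3, of what treating machines as cattle can look like: rather than patching an instance in place, find the newest image we own, launch a fresh replacement from it, and retire the old one. The image name prefix and instance type are illustrative assumptions, not a prescription.

    import boto3

    ec2 = boto3.client("ec2")

    def latest_image_id(name_prefix: str = "web-base-") -> str:
        """Return the most recently created AMI we own matching the prefix."""
        images = ec2.describe_images(
            Owners=["self"],
            Filters=[{"Name": "name", "Values": [f"{name_prefix}*"]}],
        )["Images"]
        return max(images, key=lambda img: img["CreationDate"])["ImageId"]

    def replace_instance(old_instance_id: str) -> str:
        """Launch a fresh instance from the latest image, then terminate the old one."""
        new = ec2.run_instances(
            ImageId=latest_image_id(),
            InstanceType="t3.micro",  # assumed size; one machine, one purpose
            MinCount=1,
            MaxCount=1,
        )["Instances"][0]
        ec2.terminate_instances(InstanceIds=[old_instance_id])
        return new["InstanceId"]

An autoscaling group pointed at a launch template that references the newest image gets you the same behavior with no hand-rolled code; the point is that “replace” becomes cheaper than “repair.”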

In this way, a machine is no longer something we build and then maintain; it is built, destroyed, and built again, over and over. Each new version of the machine can be “better” than the one before.

I’ve seen countless companies where the job of the Security team is to build a list of machines that are “out of compliance” with the latest patches and then throw that work over the wall for Ops to go patch all of the infrastructure. Surely we did not invent the DevOps movement to stop Developers from throwing work over the wall to Ops, only to have the Security folks do it instead!

If we have well-defined tests and acceptance criteria to measure an OS image the same way we measure other software, then it doesn’t matter who has input into that new image. If a new release of the software can be baked into the image, then that image becomes a unit of deployment that is consistent from development all the way into production. Most of us just call that a container. If the Security team supplies patches that are incorporated into the next image/container/etc. to be deployed, and it passes all the tests, then you’ve just performed maintenance.
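
As a sketch of what those acceptance criteria might look like, assume a pytest-style suite pointed at a throwaway instance booted from the candidate image: the checks below assert that the image has no pending package upgrades and that the baked-in service answers its health check. The host name, port, and specific commands are illustrative assumptions.

    import subprocess

    CANDIDATE = "ubuntu@image-candidate.internal"  # assumed throwaway test instance

    def ssh(command: str) -> str:
        """Run a command on the candidate instance and return its output."""
        return subprocess.run(
            ["ssh", CANDIDATE, command],
            capture_output=True, text=True, check=True,
        ).stdout

    def test_no_pending_package_upgrades():
        # On Debian/Ubuntu, a fully patched image simulates zero upgrades.
        assert "0 upgraded, 0 newly installed" in ssh("apt-get -s upgrade")

    def test_service_answers_health_check():
        # The application baked into the image should come up on boot.
        assert "200" in ssh(
            "curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health"
        )

If the candidate image passes, it is promoted and rolled out like any other release; if it fails, it never ships, and nobody logs in to anything.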

This time, however, nobody scheduled a maintenance window or threw work over a wall. This time, it’s part of the life cycle of our machines. This time, operating on the latest, secure, performant infrastructure is how we normally do business, not an exceptional event that requires the input of project managers and Change Advisory Boards and client relationship managers.

Is that maintenance? I don’t know. But it sure doesn’t make a sound.