Automation is the answer, right?
I began my career as a developer. When I became a SysAdmin and needed to manage records for 50,000+ domain names, I reached for code to solve the problem. Defects dropped precipitously. Automation is a great way to solve many problems, and it is one of the foundations of the DevOps movement. But misunderstanding automation can multiply problems exponentially.
There are two beliefs I often see when talking to executives of portfolio companies:
- Automation can do what humans can do, only faster
- Automation will replace humans, thereby saving me money
I was working with a health insurance company where the executives told me how they were fully bought into automation. Being a big fan of automation, I thought this was a great idea. They’d had a very manual process for pushing out new releases of their code and were going to replace it with the Rundeck software by PagerDuty (for the record, I think Rundeck is great and know the founders personally). They’d spent months getting all the various steps of releasing new code into Rundeck, but now every release was a very long series of stop-and-go failures. This directory didn’t exist; that permission was wrong. Each time, they needed an operations person to get on the host and correct the incorrect assumption about the machine.
“How could this be happening?” they asked. “We invested in automation!” It was true: they had invested in automation, but on top of what? This client was building on top of a house of cards.
They thought they’d taken all the work that had been done manually before and automated it. In reality, they’d taken the work as documented, not the work as done. The operators had always fixed those little issues along the way, so the automation effort had not really solved anything. Merely pushing buttons faster is not automation.
One of the first goals of any operations team needs to be the ability to build any host, from scratch, at any time. For this client to truly benefit from automation, they needed to develop that ability and then use Rundeck to deploy the software on top of a known good system. Alternatively, they could have built the hosts from scratch with the software already deployed on them and swapped them into the load balancer.
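To make "known good system" a little more concrete, here is a minimal Python sketch of declaring the state a release step depends on and enforcing it, rather than assuming it. Everything in it is hypothetical: `DirectorySpec`, `ensure_directory`, `converge`, and the `releases` path are illustrative names, not part of Rundeck or of this client's setup, and it assumes a POSIX host. In practice a team would more likely reach for a configuration management tool or an image build than hand-rolled code like this.

```python
import os
import stat
import tempfile
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DirectorySpec:
    """Hypothetical declaration of a directory a release step relies on."""
    path: str
    mode: int  # permission bits, e.g. 0o750


def ensure_directory(spec: DirectorySpec) -> List[str]:
    """Bring one directory to its declared state and return the changes made."""
    changes = []
    if not os.path.isdir(spec.path):
        os.makedirs(spec.path)
        changes.append(f"created {spec.path}")
    current = stat.S_IMODE(os.stat(spec.path).st_mode)
    if current != spec.mode:
        os.chmod(spec.path, spec.mode)
        changes.append(f"chmod {spec.path}: {oct(current)} -> {oct(spec.mode)}")
    return changes


def converge(specs: Iterable[DirectorySpec]) -> List[str]:
    """Apply every spec; an empty result means the host already matched."""
    changes = []
    for spec in specs:
        changes.extend(ensure_directory(spec))
    return changes


if __name__ == "__main__":
    # A temporary directory stands in for a real host's filesystem.
    root = tempfile.mkdtemp()
    specs = [DirectorySpec(os.path.join(root, "releases"), 0o750)]
    print(converge(specs))  # first run: reports what it had to fix
    print(converge(specs))  # second run: nothing left to change
```

The point of the sketch is the second run: once the fixes the operators used to make by hand are written down as declared state, running the step again changes nothing, and the deploy tool can rely on that state instead of on someone logging in to repair it.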
Automation allows for high performance because it reduces toil: the repetitive, low-value work that is necessary for the operation of production systems. Automation frees up people to work on hard problems instead of toil. People are active, creative, pattern-matching problem solvers. They are not a problem to be solved. Despite what your friend’s AI startup says, computers are not good at these things. But computers are good at toil. We don’t want to pay our humans to work on easy problems. We want our humans to be engaged, strategic differentiators for our businesses.
Finding the balance between what computers should do and what humans should do is difficult. Google SREs have metrics to help them plan their work when their toil ratio is too high. Computers are supposed to help humans as assistants, not as replacements. This is a big challenge. It’s what differentiates the merely good “DevOps tools” from the great ones.
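As a rough illustration of what a toil metric can look like, the following Python sketch computes the share of logged hours spent on toil and flags when it crosses a budget. The names (`WorkEntry`, `toil_ratio`, `over_toil_budget`) and the sample entries are invented for the example. The 0.5 threshold is an assumption modeled on the goal, described in Google's SRE book, of keeping toil below half of each engineer's time; your own budget may well differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkEntry:
    """Hypothetical record of time spent on one piece of work."""
    description: str
    hours: float
    is_toil: bool


# Assumed budget: at most half of logged time spent on toil.
TOIL_THRESHOLD = 0.5


def toil_ratio(entries: Iterable[WorkEntry]) -> float:
    """Fraction of logged hours that were toil (0.0 if nothing was logged)."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def over_toil_budget(entries: Iterable[WorkEntry],
                     threshold: float = TOIL_THRESHOLD) -> bool:
    """True when the toil share exceeds the budget."""
    return toil_ratio(entries) > threshold


if __name__ == "__main__":
    week = [
        WorkEntry("fix permissions on hosts by hand", 6.0, True),
        WorkEntry("build host provisioning from scratch", 4.0, False),
    ]
    print(f"toil ratio: {toil_ratio(week):.0%}")
    print("over budget" if over_toil_budget(week) else "within budget")
```

A number like this is only a planning aid. It tells a team when to stop absorbing manual fixes and invest in removing them; deciding which work to remove is still a human judgment.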
If you find your engineers following instructions from a computer, along the lines of “follow this next step in the runbook,” they are probably working for the computers. If you find that your engineers are able to quickly and effectively deploy software to production, diagnose problems, forecast costs, and so on, the computers are probably their able assistants.
There is no question that automation already plays a huge role in what makes high-performing engineering organizations high performing. I’ve also argued that it is actually impossible to run a secure modern infrastructure without Infrastructure as Code. Automating the way that we build and run our production systems, with the computers there to assist the humans and not merely replace them, is surely yielding better outcomes for all.