Yes Felicia, even with Serverless, someone has to do the DevOps

I’ve watched a lot of trends play out in the industry over the years. When AWS launched, it was said we wouldn’t need operations folks anymore; developers would just call APIs. It didn’t work out that way. Just like in the data center, cloud customers are exposed to failures, and developers can’t do everything. Then Infrastructure as Code tools like Puppet, Chef, and eventually Terraform became popular. It was said that once developers wrote the infrastructure code, there would be no more operations work! It didn’t work out that way either.

Now there is Serverless, and soon “no code”. Serverless means no more operations, right?

Recently we were assessing a company’s Service Delivery capabilities. Lots of cloud functions, database as a service, observability providers, managed CI/CD services: utopia wrapped up in services. Part of their philosophy was to delay hiring operations folks for as long as possible. They had done a great job. By outsourcing so much of the work, they were following many best practices by default. The service providers know the proper patterns to encourage when providing those specialized services.

We turned to resilience engineering:

Me: “What if there is an outage in your cloud region?”

A: “We’d stand up in another region.”

Me: “Have you done that before?”

A: “No.”

Without ever practicing building the essential components in another region, without even simulated failures, is there a possibility those components would be stood up in a timely manner? Possibly. In this case, the client was receiving a constant stream of real-time data from their customers. Would the regional outage affect receiving that data? Manipulating that data? Storing that data? Maybe. Could they ride out the outage without breaking a cardinal rule of SaaS operations: don’t lose any customer data? In practical terms, impossible.

We recommended they provision dormant infrastructure in another region, updated with their code releases as part of normal operation. The incremental cost would be marginal, and in an extreme outage they would be a DNS TTL away from storing customer data once more.
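
To make that recommendation concrete, here is a minimal sketch in Python of the two habits involved: every release goes to the dormant region as well as the active one, and failover is nothing more than repointing a DNS record. Everything in it is hypothetical and in-memory. The Region and DnsRecord classes, the deploy_release and fail_over functions, and the example.com hostnames are stand-ins I made up for illustration, not the client's setup and not any cloud provider's API. A real version would call your deployment pipeline and your DNS provider instead.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for one region's deployed stack."""
    name: str
    endpoint: str
    deployed_version: str = ""
    healthy: bool = True


@dataclass
class DnsRecord:
    """Hypothetical stand-in for the public record customers send data to."""
    name: str
    target: str
    ttl_seconds: int = 60  # assumed value; a short TTL keeps failover quick


def deploy_release(version: str, regions: list[Region]) -> None:
    """Deploy the same release to every region, active or dormant."""
    for region in regions:
        region.deployed_version = version


def fail_over(record: DnsRecord, primary: Region, standby: Region) -> bool:
    """Repoint the record at the standby when the primary is unhealthy.

    Returns True if the record was changed. Refuses to fail over to a
    standby that is running a different release than the primary.
    """
    if primary.healthy:
        return False
    if standby.deployed_version != primary.deployed_version:
        raise RuntimeError("standby region is not on the current release")
    record.target = standby.endpoint
    return True


if __name__ == "__main__":
    primary = Region("primary", "ingest.primary.example.com")
    standby = Region("standby", "ingest.standby.example.com")
    record = DnsRecord("ingest.example.com", primary.endpoint)

    deploy_release("2024.1", [primary, standby])  # normal operation
    primary.healthy = False                       # simulated regional outage
    if fail_over(record, primary, standby):
        print(f"{record.name} now points at {record.target}")
```

The version check is the design choice worth noticing: a standby that has drifted from the current release is exactly the untested "we'd stand up in another region" plan, so the sketch treats it as an error rather than failing over quietly.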

In this case, with so many outsourced services, they were already extremely resilient to failures, because the cloud provider could adjust as long as its own systems were functioning properly and not caught in a Strange Loop.

The benefits of outsourcing to services were plain to see in this environment. The team made very smart choices. We found some additional minor issues with single points of failure and unnecessary security exposures, but where serverless is the right choice, it is very effective. However, even with serverless, someone has to do the DevOps.