That's MTBF Thinking

“This would be so much easier if we could just use VMware vMotion.”

The words hung in the air after I heard them. I was on a call with a number of engineering leaders at a private equity portfolio company that was having stability issues in its production environment. They and their customers were frustrated by the number of incidents they’d been experiencing recently, and we hadn’t even gotten into the issue of how to scale the platform by 10x. Without changing the way they approached the problem, they would never really solve either the stability or the scalability challenges. I recognized it immediately: it was MTBF thinking.

MTBF vs. TTR

I was brought up in the industry on MTBF (Mean Time Between Failures) thinking. I got pretty good at Veritas Cluster Server. I’d built cold standbys and tested their failover. I worked for companies with mature disaster recovery and business continuity plans. These were a colossal waste of capital. I also got pretty good at VMware, even coding against its Perl SDK. By that time, however, I’d embraced TTR (Time to Recover) thinking, where the goal is to minimize recovery time.

MTBF is great for disk drives. Not for production SaaS services. In service delivery, especially with complex distributed systems, we need to be comfortable with the fact that failures happen. Once we are comfortable with that fact, we can begin to operate as if those failures are expected, because they are. Artificially trying to suppress entropy is folly. Jez Humble explained it well back in 2013 when quoting Nassim Taleb’s book Antifragile:

“the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks… These artificially constrained systems become prone to Black Swans. Such environments eventually experience massive blowups… catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state” (p105)

Without dwelling on whether antifragility is appropriate for distributed systems, the Black Swan events are real. Teams that try to artificially suppress entropy will be unprepared to deal with events as they happen. This is why Netflix built Chaos Monkey. This is why I teach portfolio company teams to practice Gamedays and GoLive (Production Readiness Review) exercises. We need to prepare for failures, because they will happen.

VMware vMotion is MTBF thinking. It promotes the idea that we should strive to keep systems running as long as possible, suppressing volatility and creating the illusion of stability. This is bad for security, stability, and scalability.
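The trade-off between the two mindsets can be put in rough numbers using the standard steady-state availability formula, MTBF / (MTBF + TTR). The sketch below is illustrative only: the failure interval and recovery times are hypothetical figures, not measurements from any real system.

```python
def availability(mtbf_hours: float, ttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + ttr_hours)


# Hypothetical baseline: one failure every 500 hours, 4 hours to recover.
baseline = availability(500, 4)

# MTBF thinking: spend heavily to double the time between failures.
longer_mtbf = availability(1000, 4)

# TTR thinking: accept the same failure rate, but recover in 5 minutes.
faster_recovery = availability(500, 5 / 60)

for label, value in [
    ("baseline", baseline),
    ("doubled MTBF", longer_mtbf),
    ("5-minute recovery", faster_recovery),
]:
    print(f"{label:>18}: {value:.4%}")
```

Under these assumed numbers, doubling MTBF buys exactly what halving recovery time would, while cutting recovery from hours to minutes buys far more. Recovery time is also usually the variable a team can attack directly with automation and practice.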

DIE model

How should we operate instead? Sounil Yu popularized the idea of the DIE model when talking about cybersecurity, but it is a great way to think about modern architectures beyond that specific realm.

DIE stands for Distributed, Immutable, Ephemeral. This is the opposite of MTBF thinking. We are not trying to keep everything running all the time. We are not trying to use VMware vMotion to keep the machine image “alive” even as the hardware underneath changes. Instead, we build systems to be distributed to minimize the blast radius of any one individual failure. Our systems are immutable: instead of modifying a system over and over and trying to perpetuate its existence, when we need a new one, we build a new one, and that process is easy. Finally, considering our systems to be ephemeral emphasizes quick replacement of lost resources. Entropy is real, and the best way to ensure availability is to recover quickly rather than futilely trying to prevent failure. It’s Cattle vs. Pets.
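One way to picture immutable and ephemeral in practice is a reconciliation loop that never repairs an instance, only replaces it. The following is a minimal in-memory sketch; `Instance`, `launch`, and `reconcile` are hypothetical names invented for illustration and do not correspond to any particular cloud or orchestrator API.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an instance is never modified in place
class Instance:
    instance_id: str
    image: str
    healthy: bool = True


_ids = itertools.count(1)


def launch(image: str) -> Instance:
    """Stand-in for building a fresh instance from a versioned image."""
    return Instance(instance_id=f"i-{next(_ids)}", image=image)


def reconcile(fleet: list[Instance], image: str, desired_count: int) -> list[Instance]:
    """Return a fleet of desired_count healthy instances running image.

    Unhealthy or out-of-date instances are dropped, never patched,
    and fresh ones are launched to fill the gap.
    """
    survivors = [i for i in fleet if i.healthy and i.image == image]
    survivors = survivors[:desired_count]
    missing = desired_count - len(survivors)
    return survivors + [launch(image) for _ in range(missing)]


fleet = [launch("app-v1") for _ in range(3)]
# Rolling out a new image is the same operation as recovering from failure.
fleet = reconcile(fleet, image="app-v2", desired_count=3)
print([(i.instance_id, i.image) for i in fleet])
```

In this sketch a deploy and a failure recovery go through the same code path, which is the point: replacement gets exercised constantly, so it stays easy.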

So how does this address the main problems of our portfolio company above?

Stability: No longer are we worried about failure or preventing failure. Failure will happen. But instead of a failure being an all-hands-on-deck firefight that celebrates heroic efforts, failure is a run-of-the-mill activity that is either remediated by automated systems or, if the team is not yet at that maturity level, results in a nonchalant, simple replacement of the failed resource by a human operator.

Scalability: Along with great techniques I teach like feature flagging and dark launching, scaling a platform is done through a series of experiments to understand how a system scales, where its bottlenecks are, etc. By following the DIE model, we naturally create a collection of smaller horizontally scalable services, while also giving ourselves the ability to quickly and safely iterate on hypotheses and determine how to move bottlenecks to more scalable parts of the system.
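As a rough illustration of how feature flagging and dark launching support those experiments, the sketch below sends a deterministic percentage of requests through a new code path in the shadows while always serving the existing result. Every name here (`in_rollout`, `handle_request`, the flag name) is hypothetical, and a real setup would more likely use a flag service than a hard-coded percentage.

```python
import hashlib
from typing import Any, Callable


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for a flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def handle_request(
    user_id: str,
    payload: Any,
    current_path: Callable[[Any], Any],
    new_path: Callable[[Any], Any],
    dark_launch_percent: int,
    mismatches: list,
) -> Any:
    """Serve from the current path; exercise the new path in the dark."""
    result = current_path(payload)
    if in_rollout("new-path-dark-launch", user_id, dark_launch_percent):
        try:
            shadow = new_path(payload)
            if shadow != result:
                mismatches.append((payload, result, shadow))
        except Exception as exc:
            # A failure in the dark path must never reach the user.
            mismatches.append((payload, result, exc))
    return result


mismatches: list = []
answer = handle_request("user-42", 21, lambda x: x * 2, lambda x: x + x, 10, mismatches)
print(answer, mismatches)
```

Raising `dark_launch_percent` step by step turns a scaling hypothesis into a controlled experiment: the new path takes real load, and the customer only ever sees the proven one.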

Go DIE

Failure is a learning experience in life, and it’s one of the ways that we come to form more complete mental models of our systems. Once you come to recognize the artificial suppression of failure (MTBF thinking), it jumps out at you like a big red flag. Moving away from MTBF thinking has given us robust distributed file systems, Kubernetes, S3, and a host of other things that power the modern world.

Learn to embrace TTR thinking. Design your systems for DIE. Give your teams the ability to deliver with security, stability, and scalability. Recognize MTBF thinking, and help it die.