The Problem with Maintenance Windows and Change Freezes

Different companies experience increased traffic at different times of year. For a retail site, it may be in December. For an accounting site, perhaps during tax season. Often companies will put special rules in place during those times, like change freezes. I’ve seen estimates that say that more than seventy-five percent of all incidents occur at a change boundary, like a release. Many companies only release during maintenance windows, because it’s supposed to be “safer”. By the same logic, change freezes should be a good idea. Or are they?

Maintenance Windows

Back in 2001, I was working at a dot-com startup. Jerome, the QA manager, and I would do releases at 1 o’clock in the morning. If there was a problem with the release (there was almost always a problem with the release), we would roll back to the previous release and try again another time. This meant that releasing software almost always took multiple attempts (in the middle of the night) until all the problems were resolved.

We thought we were doing it the right way. It turns out that we were wrong. Maintenance windows encourage large-batch delivery, and large batch sizes are an anti-pattern for stability. Take a bunch of unrelated changes, batch them up, and roll them all out at once, and you’ve created an interesting puzzle: figuring out what broke and why. These releases also tend to happen in the middle of the night or on weekends, which matters to the work I do trying to make engineers’ lives better. I remember an engineer lamenting in Slack that they’d never finish their SCUBA certification if they could never get back in the water on a weekend.
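To put rough numbers on the batch-size puzzle, here is a minimal sketch in Python. It assumes every change carries the same small, independent chance of breaking production, which is a simplification, and the 2% figure is purely illustrative rather than a measurement. The function names are mine, not from any tool.

```python
import math


def batch_failure_probability(per_change_risk: float, batch_size: int) -> float:
    """Chance that at least one change in the batch breaks something.

    Assumes each change fails independently with the same probability.
    """
    return 1.0 - (1.0 - per_change_risk) ** batch_size


def worst_case_bisect_steps(batch_size: int) -> int:
    """Redeploys needed to isolate one bad change by repeatedly halving the batch."""
    return math.ceil(math.log2(batch_size)) if batch_size > 1 else 0


if __name__ == "__main__":
    risk = 0.02  # illustrative assumption: 2% chance that any single change breaks prod
    for size in (1, 5, 20, 50):
        p = batch_failure_probability(risk, size)
        steps = worst_case_bisect_steps(size)
        print(f"batch of {size:>2}: P(failure) = {p:.2f}, suspects = {size}, bisect steps = {steps}")
```

With a batch of one there is exactly one suspect and nothing to bisect. Every change added to the batch raises both the odds of a failed release and the amount of detective work needed at 1 o’clock in the morning.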

Moving to a “follow-the-sun” model does not solve this, despite the more advantageous time of day for the deploying engineer. It is still a large batch, so it is still more likely to have unaccounted-for failures.

Additionally, if there is a problem in the middle of the night, people are not at their best when trying to resolve it. Netflix is famous for running Chaos Monkey, which “randomly” destroyed infrastructure inside their production architecture. Many don’t realize they only ran it during business hours so that people would be at their desks if there was an issue.

Change Freezes

Change freezes are really just another kind of maintenance window. They encourage a big batch release as soon as the freeze is over. The fact that this release can happen during the day (like the follow-the-sun example above) does not change the fact that small, regular, well-tested changes are safer and easier to test and reason about.

How do we have a change freeze, but not create these big batches? We can work on things that the customer doesn’t see (yet), using techniques like dark launching, and still continue to exercise our deploy pipelines. We can use this opportunity to do research on new algorithmic or architecture improvements in production behind feature flags or in pre-prod environments. We can improve our ability to test in production. We can ship things besides features, like work on defects, risks, and debt. We can write documentation to make onboarding of new engineers easier.
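As one way to picture dark launching behind a feature flag, here is a small Python sketch. Everything in it is hypothetical: the flag name, the environment-variable lookup, and the two pricing functions are stand-ins I made up for illustration, and a real system would likely use a proper flag service. The idea is that the new code path ships and runs in production, but customers only ever see the result of the old one.

```python
import os


def flag_enabled(name: str) -> bool:
    """Hypothetical flag lookup: reads FEATURE_<NAME> from the environment."""
    return os.environ.get(f"FEATURE_{name.upper()}", "off") == "on"


def legacy_total_cents(cents: int) -> int:
    """Existing behaviour: add 10%, rounding the surcharge down."""
    return cents + cents // 10


def new_total_cents(cents: int) -> int:
    """Candidate behaviour under evaluation: add 10%, rounding to nearest."""
    return cents + round(cents / 10)


def total_cents(cents: int, mismatches: list) -> int:
    """Always return the legacy answer; run the new path in the dark if flagged on."""
    old = legacy_total_cents(cents)
    if flag_enabled("new_pricing_dark"):
        try:
            candidate = new_total_cents(cents)
            if candidate != old:
                # Record the difference for later review instead of showing it to the customer.
                mismatches.append((cents, old, candidate))
        except Exception as exc:
            # A failure in the dark path must never affect the customer-facing result.
            mismatches.append((cents, old, repr(exc)))
    return old


if __name__ == "__main__":
    os.environ["FEATURE_NEW_PRICING_DARK"] = "on"
    seen = []
    for amount in (100, 115, 250):
        print(amount, "->", total_cents(amount, seen))
    print("mismatches recorded:", seen)
```

Because the flag can be turned off without a deploy, code like this can keep flowing through the pipeline in small batches during a freeze, while the customer-visible switch waits until the freeze is over.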

Maintenance windows were fraught in 2001, and they are no better now. Get good at delivering software.