Check lists, not check off lists
While watching the 2.5 Hour Symposium by IT Revolution, a number of things jumped out at me. One of them was the importance of checklists. A checklist is something we’re all familiar with. Some of you may have read The Checklist Manifesto, and I’m not arguing that checklists don’t make us safer. What I found interesting was the framing around the checklist.
The panel was discussing aircraft pilot checklists in the context of Deming and the idea of variance. The point wasn’t simply to get through the checklist and make sure everything was checked off. The point was that it is a list of things to check, to see whether anything is outside of expected tolerance and, if so, whether it needs to be corrected in order to have a safe flight. In essence, they were discussing a list of things to check, not a list of things to check off.
I thought this was a really important point. If you’ve seen my SRECon talk, you’ll remember the story of the team that was supposed to edit a file called /etc/bar. Instead, they could only find a file called /etc/foo/bar. Within that file they needed to edit a section called baz, and they wound up editing something “close enough”. They did this because they felt they needed to get through their checklist (i.e. check off list) and move on to the next host. The team probably did not feel empowered to stop in the face of evidence that was outside of tolerance. From a safety perspective, we would not want our surgeons operating on something that was close enough to a kidney or gallbladder.
The beauty of working from a list of things to check is that it can help build shared mental models inside our teams. I was working with a client who was migrating products and services from the data center to Azure Cloud. As I do with many clients, I was teaching them to do destructive testing on their service. Many people call this GoLive testing, or a production readiness assessment. The idea is to come up with a list of failures to test and, for each one, record the expected outcome, the observed outcome, and any resolution. In essence, the scientific method. The great thing about this checklist is that it highlights things that are outside expected tolerances. Are we supposed to throw 500 errors to customers when the database is unavailable for 5 minutes?
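To make that concrete, here is a minimal sketch in Python of what such a list might look like. The class name, the field names, and the example entries are all made up for illustration; they are not taken from the client engagement, and comparing outcomes as plain strings is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureTest:
    """One row of the list: a failure to inject and what we learned from it."""
    failure: str                      # what we break on purpose
    expected: str                     # what we believe should happen
    observed: Optional[str] = None    # what actually happened (None = not run yet)
    resolution: Optional[str] = None  # what we decided to do about a surprise

    def outside_tolerance(self) -> bool:
        # A row is a surprise when we ran it and reality disagreed with us.
        return self.observed is not None and self.observed != self.expected


def needs_attention(checklist: List[FailureTest]) -> List[FailureTest]:
    """Rows that are not done: never run, or surprising with no resolution."""
    return [
        row for row in checklist
        if row.observed is None
        or (row.outside_tolerance() and row.resolution is None)
    ]


# Hypothetical entries, for illustration only.
checklist = [
    FailureTest(
        failure="database unavailable for 5 minutes",
        expected="degraded responses, no 500 errors",
        observed="500 errors returned to customers",
    ),
    FailureTest(
        failure="one cache node restarted",
        expected="brief drop in cache hit rate",
        observed="brief drop in cache hit rate",
    ),
]

for row in needs_attention(checklist):
    print(f"Stop and discuss: {row.failure}: "
          f"expected {row.expected!r}, observed {row.observed!r}")
```

The design choice worth noticing is that a row cannot be "checked off" just by running it. A row only leaves the list when what was observed matches what was expected, or when the team has written down a resolution for the surprise.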
As the team moves through the exercise, they get a better mental model of how the system behaves. This not only prepares them for the failures they test, but also helps prepare them for unexpected failures they may encounter down the road. “Remember how the cache hit rate really dropped unexpectedly during testing when there was a problem with garbage collection? Could this be related?”
The idea of a checklist, instead of a check off list, is incredibly powerful. The goal is not to get through the list. The goal is to recognize that by understanding the acceptable variance in the system, we can make better decisions about our incidents and situations.