Thursday August 02, 2012

DevOpsDays 2012: "Event Detection" Open Space

A few weeks ago at DevOpsDays we were given the opportunity to propose topics to be discussed in the afternoon "open spaces". I was lucky enough to have my proposals chosen, on the condition that someone write a blog post to detail what was discussed during the session. This is the second of those posts...

The second discussion I facilitated was one on event detection (also known as anomaly detection). This is a project that had been started many times at Tagged, and, as with logging in the previous open space, there were no rock-solid recommended answers from the community. The discussion broke along three different lines: thresholding, complex event correlation, and more advanced signal processing/analysis.

The overriding theme of the discussion was looking for ways to realize state changes as events. These events could trigger alerting systems and be shown as annotations in Graphite, Ganglia, etc.

Thresholding

The discussion started off around the idea of automated thresholding as a first step toward event detection. There are many examples of this, and the first one mentioned was the auto-baselining Etsy does with Holt-Winters in Graphite. For alerting, with Nagios as an example, the consensus was that individual plugins could do the thresholding themselves and even pull the data from RRD or Graphite, or be instrumented with something like ERMA. It was proposed that monitoring derivatives (rates of change) was superior to alerting on simple absolute values. Something like Nagios (and presumably other systems) could also accept passive checks from a CEP in addition to the thresholded alerts.
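
To make the derivative idea concrete, here is a rough sketch (not anything recommended in the session) of what such a check could look like: a small Nagios-style plugin that pulls recent datapoints from Graphite's render API and thresholds the rate of change rather than the absolute value. The Graphite host, metric name, and thresholds below are hypothetical placeholders.

    #!/usr/bin/env python
    """Sketch of a Nagios-style check that thresholds the derivative of a
    Graphite metric instead of its absolute value. Host, metric, and
    thresholds are made-up placeholders."""
    import json
    import sys
    import urllib.request

    GRAPHITE = "http://graphite.example.com"    # hypothetical host
    METRIC = "servers.web01.loadavg.midterm"    # hypothetical metric
    WARN, CRIT = 0.5, 1.0                       # rate of change per step

    def fetch_datapoints():
        url = "%s/render?target=%s&from=-10min&format=json" % (GRAPHITE, METRIC)
        with urllib.request.urlopen(url) as resp:
            series = json.load(resp)
        # Graphite returns [[value, timestamp], ...]; drop gaps (None values)
        return [v for v, _ in series[0]["datapoints"] if v is not None]

    def main():
        points = fetch_datapoints()
        if len(points) < 2:
            print("UNKNOWN - not enough datapoints")
            return 3
        # First derivative: how fast the metric is moving, not where it sits
        rate = points[-1] - points[-2]
        if rate >= CRIT:
            print("CRITICAL - %s rising at %.2f/step" % (METRIC, rate))
            return 2
        if rate >= WARN:
            print("WARNING - %s rising at %.2f/step" % (METRIC, rate))
            return 1
        print("OK - %s rate of change %.2f/step" % (METRIC, rate))
        return 0

    if __name__ == "__main__":
        sys.exit(main())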

There are a number of other programs, like Reconnoiter and Edgar, that have built-in facilities for trending, prediction, and forecasting instead of relying on individual checks to implement this themselves. This seems to be a more common feature in the industry now, with forecasting even making it into the latest versions of Ganglia-web. With these types of systems, you can run a forecast over a given set of time series data.
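
As a rough illustration of the forecasting idea (not how any of these tools actually implement it), here is a minimal double exponential smoothing (Holt) forecast; the smoothing parameters and sample data are made up.

    """Toy trend-based forecast on a time series using Holt's double
    exponential smoothing. Parameters and data are illustrative only."""

    def holt_forecast(series, alpha=0.5, beta=0.3, steps_ahead=5):
        level, trend = series[0], series[1] - series[0]
        for value in series[1:]:
            last_level = level
            level = alpha * value + (1 - alpha) * (level + trend)
            trend = beta * (level - last_level) + (1 - beta) * trend
        # Project the fitted level and trend forward
        return [level + (i + 1) * trend for i in range(steps_ahead)]

    if __name__ == "__main__":
        disk_used_gb = [120, 124, 129, 133, 140, 146, 151]   # made-up samples
        # e.g. alert if a forecasted value crosses the disk's capacity
        print(holt_forecast(disk_used_gb))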

People were also very high on Riemann, which is written in Clojure. The constructs of the Clojure language make it ideally suited to performing the kinds of functions (combining, filtering, alerting) you would want in this kind of monitoring and alerting system. The big question people had about this system was how well it would scale vs. other approaches.

Complex Event Processing (CEP)

When discussing complex event processors, the discussion immediately fell to Esper. The idea behind Esper is that unlike a traditional database where you run your queries against stored data, with a stream processor, you run your data against your stored queries. You can define windows of time over which you would like to look for specific events. It can be run "in process" or run as its own instance. Many people were in favor of the latter approach so you did not need to restart your application when changing rules. It was also suggested that you run Esper in an active-active configuration so that when restarting, you don't lose visibility into your environment.
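
To illustrate the "run your data against stored queries" model, here is a toy sketch of the concept; this is not Esper's API, and the window length, event shape, and rule are invented for the example. A stored query is just a condition evaluated over a sliding time window as each event arrives.

    """Toy illustration of the stream-processing idea behind a CEP like
    Esper: queries are registered up front, and data flows past them."""
    import time
    from collections import deque

    class WindowedRule:
        """A stored query: fire when a condition holds over a time window."""
        def __init__(self, window_seconds, condition, action):
            self.window = window_seconds
            self.condition = condition
            self.action = action
            self.events = deque()

        def push(self, event):
            now = event["ts"]
            self.events.append(event)
            # Expire events that have fallen out of the window
            while self.events and now - self.events[0]["ts"] > self.window:
                self.events.popleft()
            if self.condition(self.events):
                self.action(self.events)

    # Example rule: more than 5 HTTP 500s in any 60-second window
    rule = WindowedRule(
        window_seconds=60,
        condition=lambda evs: sum(1 for e in evs if e["status"] == 500) > 5,
        action=lambda evs: print("ALERT: too many server errors in the last minute"),
    )

    # Events stream in; the stored query is evaluated as each one arrives
    for status in [200, 500, 500, 200, 500, 500, 500, 500]:
        rule.push({"ts": time.time(), "status": status})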

We spent some time discussing the history of CEPs, and a few people pointed out that most of the innovation here was driven by the financial industry and a language called K. There have also been a few other CEPs in the past, like SEC (the Simple Event Correlator), but as it was implemented in Perl, it had scaling problems and couldn't handle any significant load.

We talked about the idea that a CEP could actually take many forms. Instead of having fancy algorithms for determining error states, it could be as simple as codifying tribal knowledge so that a human does not need to sit and watch the state of the system.

We arrived at the conclusion that many had expressed at the beginning, which was that complex event processing is hard. There were only a few people in the group who had made a serious stab at it with something like Esper, and there were many that were looking for answers.

Signal Analysis

The discussion then turned to other ways of being able to detect events. Because this was a DevOps-heavy crowd accustomed to bridging gaps between disciplines, people started looking for answers elsewhere and began to question whether this was simply a digital signal processing (DSP) problem. Should we be involving data scientists, signal processors, or mathematicians? Who could help us look at a pattern over the long term to be able to detect a memory leak?

One idea was to take a baseline and apply filters to it in order to find deviations. Someone asked for a blog post describing how you could apply a filter to a stream of data to surface a failure.
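
Nobody wrote that post during the session, but a very rough sketch of the idea might look like this: smooth the stream with a moving average over a recent window and flag points that deviate too far from that baseline. The window size, deviation threshold, and sample data here are invented.

    """Minimal sketch of the filtering idea: compare each point to the
    mean and standard deviation of a recent window of the stream."""
    from collections import deque

    def find_deviations(stream, window=10, n_sigma=3.0):
        recent = deque(maxlen=window)
        anomalies = []
        for i, value in enumerate(stream):
            if len(recent) == window:
                mean = sum(recent) / window
                var = sum((x - mean) ** 2 for x in recent) / window
                std = var ** 0.5
                # Flag the point if it sits far outside the recent baseline
                if std > 0 and abs(value - mean) > n_sigma * std:
                    anomalies.append((i, value))
            recent.append(value)
        return anomalies

    if __name__ == "__main__":
        series = [10.1, 10.3, 9.9, 10.0, 10.2, 10.1, 9.8, 10.0, 10.2, 10.1,
                  10.0, 25.7, 10.1]   # one obvious spike
        print(find_deviations(series))   # -> [(11, 25.7)]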

Then John Bergman proposed an interesting idea: what if we were to make a dataset of time series data with known failures available to the academic community, a little bit like a Human Genome Project for operations failures? Would that be enough to attract interest from academia? The hope would be that if this data were made available, data scientists and statisticians would have a large corpus on which to test and develop the filters and analysis tools that the DevOps community needs, which could then be incorporated into our analysis systems, much like basic auto-thresholding and trending/prediction are being incorporated now. In effect, this would be an extension of that same effort. We felt that in order to get a project like this off the ground, we would need a partner in the academic community who could curate this collection of data so it could gain a critical mass of scholarly adoption.

Conclusion

Ultimately, this open space was really a confirmation of many of the concerns and fears that many who participated had already felt. Complex event processing and correlation is a hard problem, and nobody is doing it extremely well, yet. By coming at the problem from a variety of different approaches, we are getting closer all the time to something workable.

The notion of an Operations Failure Database was some great "outside the box" thinking that could bring two different yet overlapping communities together for a common purpose. If we could get enough people to contribute their data in the proper form, and enough interest from those who would like to analyze that data, there could quickly be some major advances in our currently primitive tooling for this purpose.

Posted by Dave Mangot in General
