DevOpsDays 2012: "Logging" Open Space
A few weeks ago at DevOpsDays we were given the opportunity to propose topics to be discussed in the afternoon "open spaces". I was lucky enough to have my proposals chosen, on the condition that someone write a blog post to detail what was discussed during the session. This is one of those posts...
We started the discussion with my giving a short history of my experiences with logging over the years. It boiled down to this: there used to be a website and mailing list associated with LogAnalysis.Org (run by Tina Bird and Marcus Ranum). I remember reading there, on the website or the mailing list, I don't recall which, that when you go looking for log analysis tools you find logsurfer and swatch, eventually come to the realization that there are no good open source tools for this purpose, and finally conclude that all that is left is Splunk. Unfortunately, I can't find that piece even in the Internet Wayback Machine.
The topic of Splunk came up many, many times during the course of our discussion. It even came up while I was researching this blog post, when I found the mailing list post by Tina Bird announcing that Splunk had graciously accepted the role of maintaining both the loganalysis mailing list and the loganalysis.org domain. Sadly, or curiously, after Splunk took over management of the domain and mailing list, both disappeared, and the loganalysis.org domain has since been taken over by a bunch of squatters. The consensus around Splunk was that it is great, really great, at mining the data, but that the space needs more competitors than just one. Splunk's pricing model actually punishes you for being successful with logging, and thus discourages people from doing lots of logging. This seemed wrong.
So what are the alternatives? We were lucky enough to have Jordan Sissel, the author of Logstash, join us as part of the discussion (he also made a pitch for this same open space). He began by talking about open source alternatives to Splunk like Logstash, ELSA, and Graylog2. For more ideas, you can check out this Delicious stack. He also described the problem space as breaking down into two main areas, as he sees it: the Transport Problem and the Unstructured Data Problem. The group spent the rest of the time discussing each of these, as well as a third which I'll call the Presentation Problem.
The Transport Problem
This aspect focused on the idea that it would be great to both transport and process logging data in a common structured format like JSON. In fact, many projects already do this, sending their logs over Scribe or Flume. The nice part is that you can still grep through the logs even after JSON fields have been added or changed, because those changes do not fundamentally alter the log structure; basically, they will not break your fragile regexes. The logs that are sent also have to make sense and have value; there is no point in sending logs over the wire for no purpose. To ensure this, many companies have built standardized logging functions into their code so that each developer is not creating their own. This gives the data at least some structure while it is being transported, so that it is easier to handle when it reaches its destination.
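As a sketch of what one of those standardized logging functions might look like, here is a hypothetical Python helper; the field names and the JSON-lines-on-stdout convention are illustrative assumptions, not any particular company's standard:

```python
import json
import socket
import sys
import time


def log_event(level, message, **fields):
    """Emit one structured log event as a single JSON line.

    Funneling every log call through one shared helper keeps field
    names consistent across a codebase, so downstream consumers can
    parse events without fragile regexes.
    """
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "@source_host": socket.gethostname(),
        "level": level,
        "message": message,
    }
    event.update(fields)  # arbitrary structured context, e.g. request ids
    sys.stdout.write(json.dumps(event) + "\n")


# Adding a field later does not break existing consumers:
log_event("ERROR", "payment failed", order_id=1234, latency_ms=87)
```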
The Unstructured Data Problem
"Logs are messages from developers to themselves"
A topic that was brought up repeatedly revolved around the question of why each company was doing this itself. Why are there no standards for what is logged and in what format? Is there a potential to standardize some of these things? If so, how? Whose standard should we adopt? Should we choose some ITIL nomenclature? The purpose would be that if someone logged something at a level of ERROR or WARNING or INFO, everyone would actually know what that means. The problem is that it is hard to get everyone to agree on the same standard. You can call it a style guide problem, or a people problem, but it all comes down to the fact that we are currently dealing with completely unstructured data.
With all that unstructured data to be handled, you come to realize that "logging is fundamentally a data mining problem", as one of our participants commented. Even if you're able to store the data, where do you put the secondary indices? Assuming you are indexing on time, if that is even a safe assumption, what's next? Application? Log source? "What do you do with apps you don't control?" How are you going to get their data into your structured log database?
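As a toy illustration of that indexing question, assuming events have already been structured, here is what "primary index on time, secondary indices on application and source" might look like if you rolled it yourself in Python (in practice this is your log store's job):

```python
from collections import defaultdict

events = []                         # primary order: arrival time
by_application = defaultdict(list)  # secondary index: app -> positions
by_source = defaultdict(list)       # secondary index: host -> positions


def ingest(event):
    """Store one structured event and maintain the secondary indices."""
    pos = len(events)
    events.append(event)
    by_application[event.get("application", "unknown")].append(pos)
    by_source[event.get("source_host", "unknown")].append(pos)


ingest({"application": "webapp", "source_host": "web01", "message": "boot"})
print([events[i] for i in by_application["webapp"]])
```

Apps you don't control rarely emit anything with an "application" field, which is exactly the problem.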
Once the data is stored, how do we know what is actionable? Project managers only know one severity: URGENT!
The Presentation Problem
"Sending a CS person Postfix logs is actively hostile"
Once you've figured out how to transport the logs and store them, the final problem is presentation. How do you create something that is consumable by different end users? The folks at Etsy have come up with ways to try to make the data they are mining more meaningful. They have a standard format that allows for traceability, much like Google's Dapper or Twitter's Zipkin. Getting logs into these kinds of formats is useful for more than just developers. There was consensus that there needs to be feedback from Ops to the developers as well. Ops needs to have ways to know what is really an error. Having first-hand knowledge of a situation where the logs were filled with errors and we were supposed to memorize which ones were real and which could be ignored, I can safely say it was not ideal. Ops also needs to be able to specify what THEY want in the logs for an app (latencies?).
"Holt Winters and standard deviation are your friends"
The final part of the presentation problem focused on what to do with the data. Etsy contributed Holt-Winters forecasting to the Graphite project because they felt it was so important to be able to make sense of the data you had collected. There were also suggestions to alert on rates over time, not on individual events. With all the disjointed tools out there, and the lack of any consensus on what form logs should take, presenting the data poses even more of a challenge.
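As a back-of-the-envelope sketch of the "rates over time" idea (deliberately simpler than Graphite's Holt-Winters functions), here is a Python snippet that flags any minute whose event count sits several standard deviations away from the recent mean:

```python
import statistics


def rate_alerts(counts_per_minute, window=60, nstdev=3.0):
    """Flag minutes whose rate deviates sharply from recent history."""
    alerts = []
    for i in range(window, len(counts_per_minute)):
        history = counts_per_minute[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0  # guard zero variance
        if abs(counts_per_minute[i] - mean) > nstdev * stdev:
            alerts.append(i)
    return alerts


# A sudden burst in minute 61 of otherwise quiet data trips the alert,
# while any single quiet minute would not:
quiet = [2, 3, 2, 1, 2] * 12      # 60 minutes of baseline
print(rate_alerts(quiet + [40]))  # -> [60]
```

Alerting on the aggregate rate like this, rather than paging on every individual ERROR line, is what keeps the noise down.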
The Future
There seemed to be a fundamental feeling within the group that the tools we have now for log transport, collection, and analysis are just not sufficient, unless you are willing to buy Splunk. Also, as you can tell, the discussion raised many more questions than it did answers. But despite that general tone to the space, it was not all dour or dire. Jordan made a really big pitch for his vision of Logstash in the future. Luckily, he's reiterated that same sentiment in a recent gist, so you don't have to hear it from me!
Logstash actually tackles a number of these problem areas, so the future is potentially not as dark as it seems.
- The Transport Problem
- Logstash provides the logstash log shipper, which is basically Logstash run with a special config file. Alternatively, there is the same idea implemented in Python by @lusis.
- The Unstructured Data Problem
- This is the main problem that Logstash fixes. Logstash recognizes many common logfile formats and can translate them into the appropriate JSON; if it doesn't recognize yours, you can write your own pattern. It can take many types of unstructured inputs and send the now-structured data to many different types of outputs. You can think of it like a neuron, where the dendrites take input from multiple axons and the axon can send the data to multiple dendrites across the synaptic cleft. (A toy sketch of this unstructured-to-JSON translation follows this list.)
- The Presentation Problem
- Most of the time, you will send your log data into Elasticsearch (ES). Once in Elasticsearch, it can be queried using standard ES methods (e.g. its REST API). There is a great FOSS interface to ES called Kibana which allows you to search, graph, score, and stream your Logstash/Elasticsearch data.
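To make the unstructured-to-JSON idea concrete, here is a toy Python stand-in for what a Logstash filter does; the regex and field names are illustrative assumptions, not Logstash's actual grok patterns:

```python
import json
import re

# A rough pattern for classic syslog lines, e.g. from Postfix.
SYSLOG_RE = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[\w/.-]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)"
)


def parse_syslog_line(line):
    """Turn one raw syslog line into a structured event (a dict)."""
    match = SYSLOG_RE.match(line)
    if not match:
        # Mirror Logstash's habit of tagging lines it cannot parse.
        return {"message": line, "tags": ["_parsefailure"]}
    return match.groupdict()


line = "Jul 17 09:26:13 web01 postfix/smtpd[5310]: connect from unknown"
print(json.dumps(parse_syslog_line(line), indent=2))
```

Once events look like this, the dreaded Postfix logs from the quote above can be filtered, summarized, and presented per audience instead of being dumped raw on a CS person.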
The community is potentially at a turning point: accept the juggernaut that is Splunk and live with the currently lacking status quo, or get together and change it. Which path will we choose?
Quotes in this blog post are unattributed statements made during the discussion
Posted by Dave Mangot in General at 20120717