Jonah Kowall

A member of the Gartner Blog Network

Jonah Kowall
Research Vice President
2 years with Gartner
18 years IT industry

Jonah Kowall is a research Vice President in Gartner's IT Operations Research group. He focuses on application performance monitoring (APM), event correlation and analysis (ECA), network management systems (NMS), network performance management (NPM), network configuration and change management (NCCM), and general system and infrastructure monitoring technologies.

Monitoring software sucks so I use Nagios, what’s a better approach?

by Jonah Kowall  |  February 6, 2014  |  43 Comments

When I speak with our clients, and with people at the conferences and industry events I attend, Nagios is always top of mind. This battle has been covered many times: many people want to use Nagios, or to reduce their use of it. The question that always comes up is, what else is good and free? The answer depends on how much expertise you have in managing infrastructure, and on what level of monitoring you’d like to do. Open source monitoring requires configuration management tools (Chef, Puppet, Salt) to scale and to keep configurations consistent, and that requires some level of expertise.

Most users of Nagios use it for basic health monitoring of servers and applications, and I’ve spoken about other low-cost tools which build on the open nature of Nagios and leverage its massive and vibrant community. There are plenty of great open source alternatives out there which work. Here are a few options:

Quick and easy:

  • PandoraFMS – This project out of Spain is growing in popularity among Gartner clients, with a product that is easy to implement and configure. The solution is open source and free, but commercial support options are available if desired. The UI is modern and fresh, and both agent-based and agentless monitoring are supported.
  • Icinga – Most often compared with Nagios, the product shares many open source components with it, but also includes a more advanced web interface, search capabilities, and better enterprise integration for permissioning and authentication. It’s a bit more complex when it comes to getting reporting and other capabilities working, but this is free software, so some work is required. The product ships as software or as a virtual appliance, and it’s worth checking out.
  • Spiceworks – A Windows-only product, but this freeware provides good basic monitoring functionality, which should serve the needs of many for monitoring servers, network devices, and other components. The product can’t scale very high, but for SMBs this is a good option.
  • Zabbix – This popular server monitoring product is also free, with commercial support options. The product has more legacy components due to its age, but is under active development. This is an improvement over Nagios, but there are better options available.

Needs more time in the oven:

  • Naemon – If you like the Nagios model and configuration (I have no idea why people like it…) then Naemon is the next generation. It’s a new project and needs time before it’s mature enough. The offering will include an enhanced GUI (Thruk), removal of legacy components, and a highly scalable engine for the future. OP5 (a Swedish company with a greatly enhanced commercial version based on Nagios) is behind this project and is funding much of the development. UPDATE: OP5 lets employees work on many open source projects on company time, so the sponsorship is not as direct as it may sound. The most important contributor will be Andreas Ericsson, who works for OP5 and wrote over 69% of Nagios code in the last 12 months. This project is one to watch!
  • Munin – This open source product has promise, but needs a bit more development effort to catch up with those above. The advantages of the product are a fully functional and easy agentless implementation.

Cutting Edge:

If you are operating a web-scale infrastructure, monitoring large numbers of devices, and want a fully extensible monitoring system that collects not only system metrics but also custom application metrics, I would suggest the following technologies:

  • StatsD – Generic metric collector (can easily collect application metrics or even real user monitoring metrics directly)
  • Collectd – System metric collector
  • Graphite – Backend for metric storage
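
To give a feel for how little is needed to feed this stack, here is a minimal Python sketch of the two wire formats involved: the StatsD line format (`name:value|type`, normally sent over UDP to port 8125) and Graphite's plaintext format (`path value timestamp`). The metric names, the host, and the helper function names are hypothetical and chosen only for illustration; real deployments typically use an existing client library instead.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str = "c") -> str:
    """Format one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample in Graphite's plaintext format: <path> <value> <timestamp>."""
    return f"{path} {value} {timestamp}\n"


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one StatsD line over UDP (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names; "c" is a counter, "ms" is a timer.
    print(statsd_line("app.login.errors", 1, "c"))
    print(statsd_line("app.login.latency", 320, "ms"))
    print(graphite_line("servers.web01.load", 0.42, 1391644800), end="")
```

Calling `send_statsd(statsd_line("app.login.errors", 1))` from application code would increment a counter on a locally running StatsD daemon, which is how custom application metrics end up next to the system metrics gathered by Collectd.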

Some of my favorite visualizers for this data:

  • Descartes
  • Graphsky
  • Graphene
  • Giraffe
  • Orion
  • Tasseo

Please leave comments or chat on Twitter.

Category: Analytics, APM, Big Data, DevOps, ECA, IT Operations, Monitoring

43 responses so far ↓

  • 1 Monitoring software sucks so I use Nagios, what's a better approach? | All that Cuteness   February 6, 2014 at 8:01 pm

    [...] By Jonah Kowall [...]

  • 2 Milos Gajdos   February 7, 2014 at 4:10 pm

    Try sensu man -> http://sensuapp.org/ Sensu is awesome for not only Cloud monitoring!

  • 3 Jonah Kowall   February 7, 2014 at 5:50 pm

    I’ve heard of it, but I have yet to speak with anyone using it in a production environment. If you have someone I can speak to about that please let me know, I’m interested to learn more and also test it in my lab.

  • 4 Jason Dixon   February 8, 2014 at 10:46 am

    We should grab a beer sometime and brain dump on Monitoring. Any chance you’re heading out for http://monitorama.com/ this May?

  • 5 Jonah Kowall   February 9, 2014 at 11:04 pm

    I would have gone, but I learned of it in January (too late). My calendar books up 6 months in advance and I am already committed. If I can possibly know in advance I would block off the 2015 dates…

  • 6 Jason Dixon   February 10, 2014 at 2:15 am

    Sorry, we don’t have any events planned beyond PDX right now. Is there some other monitoring event happening that week that I’m unaware of?

  • 7 Jonah Kowall   February 10, 2014 at 2:19 am

    Nope, I have other client commitments which require travel, and it will not be in that area.

  • 8 GP   February 10, 2014 at 5:03 am

    Jonah – thanks for this nice collection of open source monitoring software. I wonder though if we really need both monitoring and logging? Given the recent advent of stream processing I wonder if real-time event collection and processing = monitoring? Do we really need both monitoring and logging? These add a tremendous cost and complexity. See blog post below…
    http://openopsiq.com/2014/01/07/does-real-time-log-data-collection-and-analysis-monitoring/

  • 9 Andreas Ericsson   February 10, 2014 at 8:46 am

    Heya Jonah.

    Nice of you to mention Naemon as a project to watch :-)

    There are a few fact faults with the article though, so if you could correct that, it’d be awesome.

    op5 is and isn’t behind the Naemon project. They sponsor it by letting me and my colleagues work on it on company time (just as we work on other opensource projects on company time), but Naemon would have kicked off even without op5’s support.

    The twitter account linked to isn’t mine. I haven’t got one yet and probably never will.

    The name of the user interface is Thruk (not Thurk). I suppose typos happen even to professionals.

    In terms of commits, I wrote more than 96% of the changes that took Nagios from v3 to v4, and not 69%. I have no idea how much that turns out to be in terms of actual code though, so perhaps I’m wrong in correcting that.

    On behalf of the 4-man strong Naemon team, I hope we won’t disappoint you :-)

  • 10 Jonah Kowall   February 10, 2014 at 1:13 pm

    I don’t agree with some of the concepts in that blog post, but yes, you do need monitoring and logging. With monitoring agents you can parse and handle major exceptions in logs. A better approach is using centralized collection and analysis for real-time alerting and troubleshooting, combined with lightweight monitoring (no agents). You still need the logs regardless of the complexity involved; in many ways handling a heavy monitoring agent is more overhead than logs.

  • 11 Jonah Kowall   February 10, 2014 at 1:17 pm

    No problem at all, I have clarified the op5 item. Of course we understand many of the Nagios based products will eventually switch to Naemon.

    I have fixed/removed the twitter link, and the typo.

    The 69% is the current commit count based on the stats, of course you haven’t been working post Nagios v4 and others have been working on the v4 fixes.

  • 12 GP   February 10, 2014 at 9:49 pm

    Jonah – thanks for your comments back related to logging and monitoring. Just curious on what specific points make you disagree – here is the way I understand it -

    1. A log is a localized persistence mechanism for collecting events.

    2. If a real-time event collection mechanism that is light weight and efficient is implemented, then there should be no need to have a separate logging construct.

    3. The event capture and processor, will determine whether the event requires to be persisted into a HDFS or like store.

    The LinkedIn engineering team did a good job of explaining the concept of the log-based data pipeline construct implemented using Kafka and SAMZA, a stream processor.

    Overall, I agree with your assertions – 1) the monitoring construct needs to be lightweight 2) there is a need for logging — however currently most IT organizations implement 2 separate solutions. But the overall point is that you don’t have to implement two separate systems – one will do it.

    You can read more…
    http://openopsiq.com/2013/12/29/the-log-what-every-software-engineer-should-know-about-real-time-datas-unifying-abstraction/

    also
    http://peter.gillardmoss.me.uk/blog/2013/05/28/monitor-dont-log/

    Thanks again,

  • 13 Jonah Kowall   February 10, 2014 at 9:59 pm

    Here are the basics:

    Log != event
    Logs can contain many non-event based data points which are useful in the future, or may become useful in the future.

    Engineering your own log collection and analysis system covers the top .5% of users who need that technology. Most clients I speak with cannot engineer their own systems, hence they rely on log analysis products which are purchased versus developed. You are also assuming that users have developers writing the apps which are logging, and that’s very often not the case.

    The reason why monitoring and logging are separate in most cases is the monitoring tools don’t do the type of log analysis people want today, they do the log/event analysis people wanted in 1995.

  • 14 GP   February 11, 2014 at 3:39 am

    That is well said. The current crop of monitoring software have some serious challenges hence the interest.

  • 15 Open Source Software for Monitoring – Nagios and other options | OpenOpsIQ   February 12, 2014 at 3:45 am

    [...] Gartner Analyst Jonah Kowall has shared an interesting list of monitoring software alternatives to Nagios. Read more by clicking here…http://blogs.gartner.com/jonah-kowall/2014/02/06/monitoring-software-sucks-so-i-use-nagios-whats-a-b… [...]

  • 16 Felix Egli   February 22, 2014 at 10:33 am

    Icinga and Naemon are no alternatives to Nagios. They are Nagios with a new web-gui, and some other stuff added. The problem with Nagios is not only the web-interface. The major problem with Nagios is the so called Nagios Core, which is still present in Icinga and Naemon.
    It’s a bad approach to add things to something which is broken. It’s much better, to replace the broken part, which is Nagios Core.

  • 17 Jonah Kowall   February 23, 2014 at 11:04 pm

    Incorrect Felix, have a look at the core changes to the Nagios engine in both projects. I would also read the op5 blog on Naemon as well.

  • 18 Felix Egli   March 1, 2014 at 10:57 am

    Jonah, at least with Naemon I’m right for sure. Maybe you’re right with Icinga, but to me it looks like the changes are mainly in the GUI and not in the core.

    This is the Naemon changelog, and the only big change is that the CGIs are replaced with Thruk:

    Changelog
    0.8 – 14 Feb 2014

    Based on nagios 4.0.2
    Rename a lot of things, replace build system, etc.
    The CGIs are gone – use Thruk instead.
    Remove the upstream version check – use your package manager instead.
    New NEB callback, NEBATTRCHECKALERT, when a check generates an alert.
    Allow contactgroups without members but having contactgroup_members.
    No longer spam Naemon log when checks time out.
    All positive values for ACKNOWLEDGE_{HOST,CHECK} means TRUE.
    Check output parsing rewritten.
    Fixes crashes, bugs, and improves performance.
    Log rotation is done by logrotate instead of in-core log rotation.
    Fix misc crashes, speed up misc areas, and other bug fixes.

  • 19 Jonah Kowall   March 2, 2014 at 1:20 pm

    Felix, give it some time… new project.

  • 20 Michael Friedrich   March 6, 2014 at 10:27 pm

    Regarding the broken part of Nagios Core – there are many different opinions in the outside world how to handle events / check results / performance data / etc, and also move away from the state based alert model towards something new.

    In the end, you’ll decide for the 2 sides of the “let’s do something new” story: 1) throw away old crap and introduce new stuff 2) stay compatible with your product line & users

    Icinga 1.x Core couldn’t throw away much old crap without breaking compatibility. Rewriting the code inherited from Nagios as a fork – well, that code base is a glory mess. Andreas is a hero for rewriting THAT ;) In terms of better UI (even Classic UI with multiple commands and live search … ) and usability it’s imho much better than Nagios ever was. You may read here, what’s different between Nagios and Icinga (beware – I wrote it): https://wiki.icinga.org/display/Dev/Bug+and+Feature+Comparison

    Still, it’s not satisfying in terms of improvements, and also throwing away old crap. In terms of “old crap” I’d just say: Try to setup notifications for a service for a user with 3 different types of notification methods: mail, sms, jabber with different notification options & additional escalations.
    It’s just one of those examples, and it’s not only the configuration syntax which sucks (well, it’s easy to write/parse, but hard to fix imho) but rather the handling inside the core too.

    I’m not saying that Icinga 2 Core will solve all the problems we encountered in the past 5 years after forking Nagios, but still someone gotta do it and start from scratch. I find C++ and Boost very convenient to focus on a real architecture rather than cutting my fingers with bloody C, but that’s a religious discussion.
    Users won’t love the new configuration format, or they will. The native cluster stack targets large scale environments, but also introduces capabilities for a better protocol (SSL, IPv4/6, JSON-RPC) among checkers and agents.

    Still, the state change alert notification model is implemented as such. That’s the matter of compatibility and supporting interfaces like status file, DB IDO and livestatus. And not to forget – the plugin API. Without that one Nagios would have never gained so much attraction in the first place – plugins need to be written, run & developed.

    In the end, the community will decide. Choose whatever fits best. And if your Icinga 2 Core writes natively to Graphite, Puppet & Foreman generate the configuration, Logstash triggers additional alerts, and Sensu is running there too, why not? It’s all open source, and the better the competition we keep, the more the community will benefit from it. Grab some beers at conferences (OSMC 2014!) and have a chat about your systems – you’ll truly find new friends & ideas :)

  • 21 Jonah Kowall   March 7, 2014 at 2:16 am

    Good stuff here. Sorry, I’m not going to be at OSMC, but I will be at OSCON and Velocity this year… Feel free to email or tweet to get in touch.

  • 22 gilgamezh   March 15, 2014 at 9:22 pm

    Hi! great post and great comments. :)

    One question. What tool would you use to generate alerts from graphite data?
    For example, you have a graph with info from the login page of an application (successes and errors), and you want to fire an alert (in your monitoring panel, by mail, etc.) when the number of errors goes over a defined threshold.

  • 23 Jonah Kowall   March 15, 2014 at 9:31 pm

    Cabot is the most common tool for that purpose.

    Other options:
    http://blog.gingerlime.com/2013/graphite-alerts-with-monit/
    http://riemann.io/
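
For illustration only (this is not how Cabot, Monit, or Riemann implement it), the threshold check asked about above can be sketched in a few lines of Python. The sketch assumes the series has already been fetched from Graphite's render API with `format=json`, which returns a list of series, each with a `target` and a list of `[value, timestamp]` datapoints where the value may be null. The function name, metric names, and threshold are hypothetical.

```python
import json
from typing import List


def breaches_threshold(render_json: str, threshold: float) -> List[str]:
    """Return the targets whose most recent non-null value exceeds threshold."""
    alerts = []
    for series in json.loads(render_json):
        # Graphite pads empty intervals with null, so skip those datapoints.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values and values[-1] > threshold:
            alerts.append(series["target"])
    return alerts


if __name__ == "__main__":
    # Hypothetical response body for two login-page metrics.
    sample = (
        '[{"target": "app.login.errors", "datapoints": '
        '[[2, 1394918400], [7, 1394918460], [null, 1394918520]]}, '
        '{"target": "app.login.success", "datapoints": [[3, 1394918400]]}]'
    )
    for target in breaches_threshold(sample, threshold=5):
        print(f"ALERT: {target} is over threshold")
```

A real alerting tool adds what this sketch leaves out: scheduling, deduplication of repeated alerts, and delivery by mail or other channels.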

  • 24 Got Nagios? Get rid of it.   March 16, 2014 at 4:00 am

    [...] Monitoring software sucks so I use Nagios, what’s a better approach? [...]

  • 25 George   March 17, 2014 at 6:06 pm

    Has anybody tried/evaluated (or even better installed and used on a regular basis) PandoraFMS? If yes, what’s your opinion about it? I’m seriously considering it for our company instead of the typical Nagios/Nagios-based solutions (Icinga, Opsview, etc.) plus addons (Graphite, PNP4Nagios, etc.) or Zabbix, Zenoss, OpenNMS, etc.

  • 26 Jonah Kowall   March 18, 2014 at 4:40 am

    Yes, I’ve spoken to several Gartner clients using PandoraFMS. It’s a newer product and hence has lower adoption, but the feedback I have gotten is positive generally speaking. The company behind it is rather small, and being based in Spain there are some language issues with the product and documentation. I would suggest evaluating the product as I have, but I also suggest looking at OpsView, OP5, Zenoss. I’m not the biggest fan of OpenNMS for various reasons, but I’ve spoken to happy clients using the product.

  • 27 George   March 18, 2014 at 4:24 pm

    Hello Jonah. Thanks for the answer. Good to know that generally there is some positive feedback from other IT colleagues with regard to PandoraFMS. I know that the Pandora team is located in Spain, but for me this is not a problem (I’m located in Europe as well :p). Knowing this, I expected things to be much worse when it comes to the language barrier, etc., but it doesn’t seem to be an issue after all. I’m really keen on evaluating the product by setting up some kind of Proof of Concept.

    Of the rest of the names, I contacted the Opsview and the Zabbix people (yes, I know Zabbix is open source, but I preferred to have an actual meeting with the Zabbix people; we’ve taken the monitoring infrastructure really seriously in my team). OpsView is really expensive for our case and our needs, so I would consider it out of the picture. I guess that OP5 and Zenoss are not very different in their pricing policies, but they are being considered as well. Let’s see where the ball will land in the end :p

  • 28 Jonah Kowall   March 21, 2014 at 12:16 am

    George, cool. I’m surprised they are too expensive considering the pricing is quite reasonable, I guess you’re hoping for something free. Be warned that you’ll have to do a lot of lifting with config mgmt (chef, puppet, salt, etc) to get the other tools to work properly at scale.

  • 29 Marius Polished   March 24, 2014 at 12:23 pm

    Hi everyone.

    Congratulations Jonah for the post.
    I spent some time searching the net for an easy and comfortable tool to monitor my hosts. This blog is the first place where I came across Pandora FMS. I downloaded it from their website, and in the 2 weeks that I have been testing it so far I’ve managed to configure everything I needed without any headaches. I recommend it. My previous experience had been with Nagios: all ICMP checks showed as critical when the machines were OK, and the use of “agents” in Nagios is very difficult, so it did not convince me. Then I tried Zabbix and the experience was better, but the solution did not offer me everything I needed. Now, with Pandora FMS, the experience has proved to be the right and final one. We are thinking of implementing the Enterprise option to get its additional features, as some of them could be useful to us and the price is really good.

  • 30 Tom Kahl   March 28, 2014 at 4:04 pm

    Hi Jonah,

    Have you heard of and/or used LogicMonitor? If yes, what did you think of their stuff?

  • 31 Jonah Kowall   March 29, 2014 at 1:28 pm

    Tom, we’ve included them in other research. If you are a client I can go much deeper via a written inquiry or a phone inquiry.

  • 32 George   April 1, 2014 at 10:13 pm

    Hi Jonah,

    Regarding the price issue: it doesn’t necessarily need to be free as in beer. In any case you need to invest time and effort, so in the end you need to balance what you want. And apparently I’m not saying anything new here. Anyway, I wouldn’t like to comment any further in this open forum on why we considered the price offered by the vendor I mentioned in my previous post too expensive.

    Coming to PandoraFMS now: I’ve set it up as a demo and I have the same feelings as expressed by Marius above. I think that even the open source version would suffice in many cases. If one needs only one tool to do the job, then the enterprise version may be worth paying for because of some of its features (e.g., VMware, NetFlow monitoring, etc.).

    At the end of the day, you can achieve more or less the same results with any of the popular monitoring tools. I read the religious wars about Nagios (and forks, e.g. Icinga, Fully Automated Nagios, etc.) here and on other sites, with some people referring to it as the holy grail, but I’m not into it. There are other good tools out there, open source or not. Fortunately, we are not in the 90s :)

  • 33 Timir Karia   April 8, 2014 at 8:45 am

    Anybody have any experience with Monit? We’re trying it out now and it’s fairly simple and seems stable but we have yet to put it through its paces. Any feedback appreciated.

  • 34 Jonah Kowall   April 8, 2014 at 12:57 pm

    Timir, I would suggest reading this.. http://blogs.gartner.com/jonah-kowall/2013/11/12/unified-monitoring-note-presentation-and-client-interest/

  • 35 Michael Rojek   April 15, 2014 at 2:18 pm

    Give NetCrunch from AdRem Software a try. It’s fully automated, and will identify, configure and begin monitoring your network out of the box. It’s all-in-one, with no separate modules for network performance monitoring, server and app monitoring, NetFlow, etc. It’s agentless, and has an embedded SQL database. The goal first and foremost is ease-of-use.

  • 36 Jonah Kowall   April 15, 2014 at 3:30 pm

    It’s on my list for lab eval already. Might be another month or so… Thanks Michael.

  • 37 Patrick   April 28, 2014 at 1:43 am

    Thanks for this great post, Jonah!

    May I know how they compare in terms of multi-tenancy support, say, running under different projects/tenants of AWS/OpenStack etc. and allowing individual projects to monitor only their own resources under a centralized monitoring system?

  • 38 Jonah Kowall   April 28, 2014 at 3:15 pm

    Thanks for posting Patrick, I can’t provide custom advice on my blog. That’s what clients pay us for :)

    There are tools out there which have multi-tenant support specifically for CSPs and MSPs.

  • 39 Gerry Johnson   May 14, 2014 at 10:33 am

    Sorry – seems as though you are conducting a vendetta against Nagios

  • 40 Jonah Kowall   May 15, 2014 at 2:39 pm

    Gerry, see my other reply. I do believe Nagios needs to die in its current form, as do many others!

    This is another great presentation on the topic from this year:
    http://www.slideshare.net/superdupersheep/stop-using-nagios-so-it-can-die-peacefully

  • 41 OpenOpsIQ – Most popular links on logging, monitoring, cybersecurity | OpenOpsIQ - Intelligent Cloud Ops   June 8, 2014 at 10:10 pm

    [...] blogs.gartner.com/jonah-kowall/2014/02/06/monitoring-software-sucks-so-i-use-nagios-whats-a-better-a… [...]

  • 42 Pablo Huiza   June 27, 2014 at 7:08 pm

    Hi, We are evaluating some monitoring tools and we prefer Groundwork. I would like to know if somebody from this forum is using it or know something about it ..Thanks

  • 43 Jonah Kowall   June 30, 2014 at 1:16 am

    Pablo, we would be pleased to speak with you about GroundWork and the Unified Monitoring space if you are a client. If you are looking to get suggestions from GroundWork users I’d suggest asking them for a reference or you can look at the GroundWork forums here : http://www.gwos.com/forums/ or on LinkedIn.