Gartner Blog Network

Nagios : Let the religious wars continue

by Jonah Kowall  |  July 10, 2013  |  14 Comments

I posted a write-up of “getting rid of” Nagios previously, and it’s generated a staggering number of responses from those who are attached to Nagios and have used it effectively. This has ignited a discussion similar to the Mac vs PC or *nix vs Windows debates pervasive across the internet (complete with personal attacks, I thought we were all adults here…). Those who know me know I’ve been there and done it in my past, and having spoken with thousands of users of monitoring tools during my time at Gartner, I can share some common threads.

Many of the readers who commented on the story were managing hundreds of servers. This is typically not the size of the enterprises I speak with (although sometimes we do speak with those groups). We are often speaking with an enterprise or IT architect who is trying to homogenize the management of the IT landscape. This is not just servers, but applications, network devices, storage systems, and other components. If you do not treat your IT infrastructure with uniformity, the notions of industrialization and standardization are elusive.

The typical Gartner client is a medium to large company that has grown over the years both organically and by acquiring other businesses. This makes its IT landscape quite diverse, and often managed in isolated groups. These isolated groups, left to their own devices, have selected a wide array of tools to manage their infrastructure and applications, and those tools overlap with similar or dissimilar tools used by other business units in the same organization. For example, a client I spoke with last week had over 65 monitoring tools in place across an environment of 75,000 servers, not counting a vast array of other components that were out of scope for phase 1. As you can imagine, there was everything from Icinga, Nagios, Zabbix, Graphite, etc. (on the open source front) to what we call the big-4 (IBM, CA, HP, BMC), and even islands of VMware vCenter Operations Suite, Oracle Enterprise Manager, and Microsoft System Center Operations Manager. I tried to offer some prescriptive advice, but the task at hand was daunting to say the least. The CIO who created this project was responding to the massive cost of licenses and people the business was essentially wasting by not managing this centrally and not leveraging its scale.

Nagios plays a big role in these organizations, often being implemented several different ways in a single enterprise. Trying to standardize the implementation is a challenge, especially with most Nagios users selecting various components from the open source and Nagios communities to build their personal preferred implementation. Vendors who leverage Nagios and other aspects of open source, but build standardized ways of implementing and managing the footprint, are likely to produce a more successful implementation at scale (e.g., Centerity, Centreon, Groundwork, OP5, Opsview).

Additionally, a common issue that is pervasive across monitoring is the overextension of the platform to do more than the core tenets of monitoring: the capture of metrics, the notification of issues, and the analysis and correlation of metrics to determine root cause for problem isolation. Monitoring should not run jobs needed to operate and correct infrastructure, yet we see this happening consistently regardless of the platform. This creates lock-in to the monitoring tool regardless of the vendor or technology used. When the business is forced to transform, for example due to acquisition, new business demands, data center relocation, consolidation, or newer technology that enables easier management, it is unable to detach the monitoring from the infrastructure or applications without extensive reverse engineering, which is often not possible.
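
One way to picture that separation is a rough Python sketch in which the monitoring side only evaluates metrics and publishes alerts, while corrective jobs live in a separate component that subscribes to those alerts. Every name here (`Alert`, `AlertBus`, `RunbookAutomation`, the `rotate_logs` runbook) is hypothetical and not taken from Nagios or any other product; it is an illustration of the idea under those assumptions, not a recommended implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    status: str  # "OK", "WARNING" or "CRITICAL"
    value: float


def evaluate(host: str, check: str, value: float, warn: float, crit: float) -> Alert:
    """Monitoring side: turn a captured metric into a status, with no side effects."""
    if value >= crit:
        status = "CRITICAL"
    elif value >= warn:
        status = "WARNING"
    else:
        status = "OK"
    return Alert(host, check, status, value)


class AlertBus:
    """Hand-off point: the monitor publishes, other systems subscribe."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Alert], None]] = []

    def subscribe(self, handler: Callable[[Alert], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, alert: Alert) -> None:
        for handler in self._subscribers:
            handler(alert)


class RunbookAutomation:
    """Hypothetical corrective component that lives outside the monitoring tool."""

    def __init__(self, runbooks: Dict[str, str]) -> None:
        self.runbooks = runbooks
        self.executed: List[str] = []

    def handle(self, alert: Alert) -> None:
        # Only record what would be run; a real system would call an orchestrator.
        if alert.status == "CRITICAL" and alert.check in self.runbooks:
            self.executed.append(f"{self.runbooks[alert.check]} on {alert.host}")


def notify(alert: Alert) -> None:
    """Notification is a core monitoring duty; here it just collects non-OK alerts."""
    if alert.status != "OK":
        notifications.append(alert)


notifications: List[Alert] = []
bus = AlertBus()
automation = RunbookAutomation({"disk_used_pct": "rotate_logs"})
bus.subscribe(notify)
bus.subscribe(automation.handle)

bus.publish(evaluate("web01", "disk_used_pct", 97.0, warn=80.0, crit=95.0))
bus.publish(evaluate("web02", "disk_used_pct", 42.0, warn=80.0, crit=95.0))
print(len(notifications), automation.executed)
```

Because the corrective logic only depends on the `Alert` shape, the monitoring tool that produces those alerts could in principle be swapped without reverse engineering the runbooks.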

So I leave you with a question: how do you avoid these typical scenarios regardless of the tool in place?

Category: analytics  eca  it-operations  monitoring  

Jonah Kowall
Research Vice President
3.5 years with Gartner
20 years IT industry

Jonah Kowall is a research Vice President in Gartner's IT Operations Research group. He focuses on application performance monitoring (APM), Unified Monitoring, Network Performance Monitoring and Diagnostics (NPMD), Infrastructure Performance Monitoring (IPM), IT Operations Analytics (ITOA), and general application and infrastructure availability and performance monitoring technologies.

Thoughts on Nagios : Let the religious wars continue

  1. NagiosUser says:

    I agree with you. But is there any better tool than Nagios XI at this price range???

  2. Arie says:

    Good Question.

    It is not possible imho to avoid the situation you have mentioned.

    When one company takes over another one, and another one, etc., monitoring is not the question at that time. The IT landscape can be completely different and fully incompatible.

    Getting a homogeneous IT landscape can require a complete redesign of that landscape and its monitoring. To avoid this (for monitoring) as much as possible, one has to look for a system that can grow with the landscape if needed (Opsview can do this), but does one know in advance how far this growth will go? We do not. Facebook started on one server; a few years later they need thousands to run their business.

    And why should infrastructure monitoring not take care of the correction of problems, as long as we are notified about those actions? In SCOM, for example, it is not only about monitoring your IT, but also about controlling it these days.

  3. Jason says:

    I’ll toss out three simple things that are required to get any monitoring solution to work LONG term:

    1) Automation.
    2) Standard configurations.
    3) Community support

    Automation with chef/puppet/etc. is going to be a MUST if you want a large-scale solution that scales. That includes the automation of various servers and services on those servers. So when “Tomcat” gets deployed to a server, the automation piece automatically picks it up and adds monitoring and alerting.

    Standards are required as well. A simple example is that if your “production” servers aren’t “tagged” as such in some way, it’s hard to automate and manage those servers. You can use a central management system, but often your automation solution starts taking ownership of that. For example, you’d tag a server as a “production” server, and then in your automation “all production servers get X monitoring for hardware”. But if you have “this is a one-off server that’s not really production”, that’s where things get nasty. And here’s the kicker – ANY monitoring solution has to have an API or an easy-to-update configuration option.

    A last piece – COMMUNITY. I’ve seen a lot of closed source solutions that have HORRIBLE community support. And instead of making monitoring easy, these teams often will set up complicated monitoring configurations that end up not keeping up with the various product releases. This is the problem with the closed source monitoring solutions I’ve seen.

    An example here – MongoDB. The monitoring tools for mongo available from the community are all built around the various open source solutions (Zabbix/Nagios/Check_MK or OMD/etc.). But it’s extremely difficult to find monitoring of this database in closed source solutions, because there’s no community around those solutions. If your core business runs on open source systems, look to open source technologies and community support. If ALL of your infrastructure is proprietary then it’s possible a closed source solution MAY work. But I’ve yet to see an entirely closed source shop. Which means that an open source solution of some sort will almost always be easier to maintain than a closed source solution.
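
The tag-driven automation described in this comment could be sketched roughly as follows in Python: an inventory of tagged servers is turned into Nagios-style `define host` and `define service` objects, and servers with no recognized tags are reported instead of silently skipped. The inventory format, the `CHECKS_BY_TAG` mapping, and the command names `check_hardware` and `check_tomcat_http` are assumptions made up for illustration, and the `generic-host` and `generic-service` templates are assumed to exist in the target configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from a tag to the checks that tag implies.
CHECKS_BY_TAG: Dict[str, List[Dict[str, str]]] = {
    "production": [{"description": "Hardware health", "command": "check_hardware"}],
    "tomcat": [{"description": "Tomcat HTTP", "command": "check_tomcat_http"}],
}


def render_host(name: str, address: str) -> str:
    """Render one Nagios-style host object (assumes a generic-host template)."""
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_service(host_name: str, check: Dict[str, str]) -> str:
    """Render one Nagios-style service object (assumes a generic-service template)."""
    return (
        "define service {\n"
        "    use                  generic-service\n"
        f"    host_name            {host_name}\n"
        f"    service_description  {check['description']}\n"
        f"    check_command        {check['command']}\n"
        "}\n"
    )


def generate(inventory: List[Dict]) -> Tuple[str, List[str]]:
    """Return the rendered config and the names of servers that matched no tag."""
    blocks: List[str] = []
    untagged: List[str] = []
    for server in inventory:
        blocks.append(render_host(server["name"], server["address"]))
        checks = [c for tag in server["tags"] for c in CHECKS_BY_TAG.get(tag, [])]
        if not checks:
            # The "one-off server" case: flag it so someone has to decide.
            untagged.append(server["name"])
        for check in checks:
            blocks.append(render_service(server["name"], check))
    return "\n".join(blocks), untagged


inventory = [
    {"name": "app01", "address": "10.0.0.11", "tags": ["production", "tomcat"]},
    {"name": "legacy07", "address": "10.0.0.99", "tags": []},
]
config, untagged = generate(inventory)
print(config)
print("Servers without a recognized tag:", untagged)
```

In practice the inventory would come from whatever the automation tool already knows about each server, which is why the comment stresses that the monitoring solution needs an API or an easy-to-update configuration format.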

  4. Paul Karman says:

    Hi Jonah,
    The answer seems to be in your story already. Standardization.

    I know standardization to be a difficult art. I know about this “beauty of standardization because you can pick any standard you like”. So for me warnings are in place, I do not take standardization lightly.

    To me standardization is the art of knowing what *not* to standardize. Sometimes I see a department that has matured enough to understand we need standards, but sometimes those standards do more harm than good. Instead of increasing productivity they put hard brakes on it.

    Is it practical to standardize everything up to the last bit if the end result is that no one has ever enough time or mental power to study all the standards before adding a metric in the correct way?

    Is it practical to standardize on a worldly level or even at company level when the standards are intended to improve cooperation between a team of 5 people?

    Are there not too many and not too few standards between different departments?

    Whatever monitoring tools one adopts, it seems to me that the right level of standardization determines success more than what bells and whistles are included in the software.

    Kind regards,
    Paul Karman

  5. Bruno says:


    I think what you are seeing in these companies is not just a technology problem. Although I am not well versed in all the monitoring technologies out there I can abstract a solution by asking a few questions.

    Question 1

    What single tool is currently available on the market that could monitor all 75,000 servers that currently are being monitored by the 65 monitoring tools?

    Question 2

    How much money is that company willing to spend to resolve this problem?

    Question 3

    Are people willing to outsource monitoring from their teams into a central team?

    These three questions tackle technology, finance and people all of which are part of the problem, of which I think the last 2 are the biggest problem.

    My solution is as follows:

    > I would go for an open source solution and save half my budget.

    > I would hire a genius with the experience and expert know-how to set up a single monitoring system, maybe with a distributed architecture, and provide whatever output people want to see.

    > I would appoint someone with great people skills to manage the outsourcing of the monitoring from the other teams and to manage business expectations whenever further functionality is needed.

    I am not sure what the above might cost, but what I am sure of is that it isn’t a high price to pay, especially when you take into account the licensing costs and the staff time spent setting up and maintaining those systems (new checks, upgrades, etc.).

    The above does not, however, factor in the costs of the consultants which my company will hire because they don’t trust their staff to do my job, as a consequence of not being able to assess how technically able any of them are.

    Centralize it, standardise it and revolutionise it!

  7. Matt says:

    Nagios/Icinga is perfect for alerting.

    If you want to graph stuff, use graphite.

    If you want to alert on your graphs, use Skyline.

    Nagios is not a “one size fits all” and I have had far too much experience of systems which promise to roll up alerting, trending and visualisation into a “one size fits all” solution to know that it’s usually “one size fits nothing”

  8. NagiosAbuser says:

    The problem is not that you have an opinion on the state of Nagios vs other options, it’s that you’re presenting your opinions as facts.

    1.) Nothing works in a vacuum. Most entities running nagios are running it with other software to accomplish various things that nagios wasn’t designed to do. ( Puppet/chef for automation, Munin for a front end, etc.)

    2.) The “your way sucks” mentality is very easy to have, and really not a good fit for the open source community. There are thousands of ways to do everything, and OFTEN things are done in a way to maximize the current environment. People that shout “this is always the way it’s been done” are in my top 5 classes of annoyances, but slightly behind the camp that says “this way is new, so we should do that simply because newer is better,” and way behind the people that boast, “*my* way would totally solve ALL your problems.” (Because it won’t, and arrogance is ugly.)

    If we wanted to work in a one-way-fits-none, but-we’re-gonna-do-it-anyway world, we probably wouldn’t be working in IT.

    That being said, in my experience, here are three ways to look at solutions:

    A.) How would I do it if money were no object? (Dream way..)
    B.) How would I do it if I were in my garage on my own servers? (Cheap way OR most fun way OR which way would I learn more…)
    C.) How would I do it if my job depended on it? (What’s the cheapest way I can do it *right*.)

    Since we don’t always live in the dream state of life, B and C are WAY more common…and the way I’d do something if I had my druthers is almost NEVER the right way to do it in my enterprise.

    Fun thought to keep in mind.

    • Jonah Kowall says:

      Agreed on most of these counts. I’m just reflecting in my writing and research what I’ve heard more than enough times from clients. By adding lots of components and complexity to solve a basic problem (is my server healthy?), we are over-engineering stuff which matters less to the actual function of the applications on that server. Simplify availability monitoring and focus on performance monitoring.

  9. Fletch says:

    The real problem is that everyone wants something easy and ready to use out of the box. Windows, and the software that runs on it, does this very well. But with that, you have to put up with a lot of bloat from all the unnecessary add-ons and applications, as well as the parts that you might have needed which are missing. People like me who are old enough know that nothing is easy or straightforward about computers, and you can’t have it all with any system or software you write.
    Nagios is a base. That’s what you are missing. Kind of like Linux itself. You build onto it with the applications and resources that fit your needs. I’ve seen enough MSP software to tell you that nobody makes a product that will satisfy everyone, and getting one that will satisfy you without doing it yourself won’t happen unless you have some VERY deep pockets. My advice to anyone reading this thread: learn *nix. Get familiar with its infrastructure and open source applications. Contribute to the open source community when you think you’ve found something that nobody else has thought of yet. It will save you and your clients money in the long run, and that is a good thing for everyone.
    I for one refuse to give the money I earn supporting my clients, to anyone else, just so I can sit on my ass and not have to do any work building my business. The way I provide support is proprietary, so my software solutions should be too. Throwing money at a problem until it goes away, is just lazy and wasteful.

  10. Antony says:

    Surprisingly, people still react to this post even though the guy who created it has left.

    There isn’t any rocket science here.

    To pay and forget about it, go with IBM; that’s how the DoD works, but you really need deep pockets.

    Don’t want to spend a penny on it? Go for the freeware that most people are using. It may not work now, but it will work later. At the beginning, Apache did not support an event-driven model like Nginx, but the community made it happen.

    The out-of-the-box solution would be: hire some good guys and pay them well to keep them in the company. Who needs monitoring if the servers never have problems?

  11. Thank you for your great article

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.