Jonah Kowall

A member of the Gartner Blog Network

Jonah Kowall
Research Vice President
3.5 years with Gartner
20 years IT industry

Jonah Kowall is a research Vice President in Gartner's IT Operations Research group. He focuses on application performance monitoring (APM), Unified Monitoring, Network Performance Monitoring and Diagnostics (NPMD), Infrastructure Performance Monitoring (IPM), IT Operations Analytics (ITOA), and general application and infrastructure availability and performance monitoring technologies.

Nagios : Let the religious wars continue

by Jonah Kowall  |  July 10, 2013  |  11 Comments

I posted a write-up previously about “getting rid of” Nagios, and it generated a staggering number of responses from those who are attached to Nagios and have used it effectively. This has ignited a discussion similar to the Mac vs PC or *nix vs Windows debates pervasive across the internet (complete with personal attacks; I thought we were all adults here…). Those who know me know I’ve been there and done it in my past, and from speaking with thousands of users of monitoring tools during my time at Gartner I can share some common threads.

Many of the readers who commented on the story were managing hundreds of servers; this is typically not the size of enterprise I speak with (although sometimes we do speak with those groups). We are more often speaking with an enterprise or IT architect who is trying to homogenize the management of the IT landscape. This means not just servers, but applications, network devices, storage systems, and other components. If you do not treat your IT infrastructure with uniformity, the notions of industrialization and standardization remain elusive.

The typical Gartner client is a medium to large company that has grown over the years both organically and through acquisition of other businesses. This makes the IT landscape quite diverse, and it is often managed by isolated groups. These isolated organizations, left to their own devices, have selected a wide array of tools to manage their infrastructure and applications. These tools overlap with similar or dissimilar tools used by other business units within the organization. For example, a client I spoke with last week had over 65 monitoring tools in place for an environment of 75,000 servers, not counting a vast array of other components that were out of scope for phase 1. As you can imagine, there was everything from Icinga, Nagios, Zabbix, and Graphite (on the open source front) to what we call the big-4 (IBM, CA, HP, BMC), and even islands of VMware vCenter Operations Suite, Oracle Enterprise Manager, and Microsoft System Center Operations Manager. I tried to offer some prescriptive advice, but the task at hand was daunting to say the least. The CIO who created this project was responding to the massive cost in licenses and people the business was wasting by not managing monitoring centrally and not leveraging its scale.

Nagios plays a big role in these organizations, often being implemented several different ways within a single enterprise. Standardizing the implementation is a challenge, especially since most Nagios users select various components from the open source and Nagios communities to build their personally preferred implementation. Vendors who leverage Nagios but build standardized ways of implementing and managing the footprint (e.g. Centerity, Centreon, Groundwork, OP5, Opsview) are likely to create more successful implementations at scale.

Additionally, a common issue pervasive across monitoring is the over-extension of the platform beyond the core tenets of monitoring: the capture of metrics, notification of issues, and the analysis and correlation of metrics to determine root cause for problem isolation. Monitoring should not run the jobs needed to operate and correct infrastructure, yet we see this happening consistently regardless of the platform. This creates lock-in to the monitoring tool regardless of the vendor or technology used. When the business is forced to transform, for example due to acquisition, new business demands, data center relocation, consolidation, or newer technology that enables easier management, it is unable to detach the monitoring from the infrastructure or applications without extensive reverse engineering, which often is not possible.
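Those core tenets map cleanly onto the Nagios plugin convention itself: a check computes state, emits one status line, and exits with a code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) — nothing more. A minimal sketch in Python, where the disk check and threshold values are purely illustrative:

```python
"""Minimal disk-usage check following the Nagios plugin exit-code
convention: the check only reports state; remediation is deliberately
out of scope."""
import shutil

# Illustrative thresholds; a real plugin would take these as -w/-c arguments.
WARN_PCT = 80
CRIT_PCT = 90

def check_disk(path="/"):
    """Return (exit_code, status_line) for disk usage at `path`."""
    usage = shutil.disk_usage(path)
    pct = usage.used * 100 // usage.total
    if pct >= CRIT_PCT:
        return 2, f"CRITICAL - {path} is {pct}% full"
    if pct >= WARN_PCT:
        return 1, f"WARNING - {path} is {pct}% full"
    return 0, f"OK - {path} is {pct}% full"

# A real plugin would print the status line and sys.exit() with the code.
# Note what is absent: nothing here deletes files or restarts services;
# corrective jobs belong to a separate tool, which keeps the monitoring
# layer detachable from the infrastructure it watches.
```

Keeping checks to this report-only contract is what lets an organization later swap the monitoring platform without reverse engineering embedded remediation logic.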

So I leave you with a question: how do you avoid these typical scenarios, regardless of the tool in place?

Category: Analytics, ECA, IT Operations, Monitoring

11 responses so far ↓

  • 1 NagiosUser   September 6, 2013 at 12:00 pm

    I agree with you. But is there any better tool than Nagios XI at this price range???

  • 2 OMDUser   September 12, 2013 at 8:57 am

    Yes, here it is: The “Open Monitoring Distribution”
    http://www.consol.de/open-source-monitoring/open-monitoring-distribution-omd/

  • 3 Arie   September 27, 2013 at 6:54 pm

    Good Question.

    It is not possible, imho, to avoid the situation you have mentioned.

    When one company takes over another one, and another one, etc., monitoring is not the question at that time. The IT landscape can be completely different and fully incompatible.

    Getting a homogeneous IT landscape can require a complete redesign of that landscape, and of monitoring. To avoid this as much as possible, one has to look for a monitoring system that can grow with the landscape if needed (Opsview can do this), but… does one know in advance how far this growth will go? We do not. Facebook started on one server; a few years further on, they need thousands to run their business.

    And why should infrastructure monitoring not take care of correcting problems, as long as we are notified about those actions? SCOM, for example, is these days not only about monitoring your IT, but also about controlling it.

  • 4 Jason   September 30, 2013 at 9:16 pm

    I’ll toss out three simple methods that are required to get any monitoring solution to work LONG term:

    1) Automation.
    2) Standard configurations.
    3) Community support

    Automation with Chef/Puppet/etc. is going to be a MUST if you want a large-scale solution. That includes the automation of the various servers and the services on those servers. So when “Tomcat” gets deployed to a server, the automation piece automatically picks it up and adds monitoring and alerting.

    Standards are required as well. A simple example: if your “production” servers aren’t “tagged” as such in some way, it’s hard to automate and manage those servers. You can use a central management system, but often your automation solution starts taking ownership of that. For example, you’d tag a server as a “production” server, and then your automation says “all production servers get X monitoring for hardware”. But if you have “this is a one-off server that’s not really production”, that’s where things get nasty. And here’s the kicker: ANY monitoring solution has to have an API or an easily updatable configuration.

    A last piece: COMMUNITY. I’ve seen a lot of closed source solutions that have HORRIBLE community support. Instead of making monitoring easy, these teams often set up complicated monitoring configurations that end up not keeping up with the various product releases. This is the problem with the closed source monitoring solutions I’ve seen.

    An example here: MongoDB. Community monitoring tools for Mongo are all around for the various open source solutions (Zabbix/Nagios/Check_MK or OMD/etc.), but it’s extremely difficult to find monitoring of this database in closed source solutions, because there’s no community around them. If your core business runs on open source systems, look to open source technologies and community support. If ALL of your infrastructure is proprietary, then it’s possible a closed source solution MAY work. But I’ve yet to see an entirely closed source shop. Which means that an open source solution of some sort will almost always be easier to maintain than a closed source one.
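The automation-plus-tagging approach described in the comment above can be sketched in a few lines: hosts carry tags, and rules map tags to checks, so monitoring follows deployment automatically. The inventory, tags, and check names below are hypothetical:

```python
"""Sketch of tag-driven monitoring assignment: hosts carry tags (e.g.
exported by Chef/Puppet facts or a CMDB), and rules map each tag to a
set of checks. All names here are illustrative."""

# Hypothetical host inventory with tags.
INVENTORY = {
    "web01": {"tags": {"production", "tomcat"}},
    "db01":  {"tags": {"production"}},
    "lab07": {"tags": {"one-off"}},  # the untagged/odd hosts are where it gets nasty
}

# Rules: any host carrying the tag gets the listed checks.
RULES = {
    "production": ["check_hardware", "check_ping"],
    "tomcat":     ["check_tomcat_http", "check_jvm_heap"],
}

def checks_for(host):
    """Return the sorted list of checks the rules assign to a host."""
    assigned = set()
    for tag in INVENTORY[host]["tags"]:
        assigned.update(RULES.get(tag, []))
    return sorted(assigned)
```

With this shape, deploying Tomcat to a host is just adding the `tomcat` tag, and the monitoring configuration regenerates itself; the one-off host with no matching rules gets nothing, which makes the exceptions visible rather than hand-maintained.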

  • 5 Paul Karman   October 28, 2013 at 8:39 am

    Hi Jonah,
    The answer seems to be in your story already. Standardization.

    I know standardization to be a difficult art. I know the saying about the “beauty of standardization: you can pick any standard you like”. So for me, caution is warranted; I do not take standardization lightly.

    To me, standardization is the art of knowing what *not* to standardize. Sometimes I see a department that has matured enough to understand it needs standards, but sometimes those standards do more harm than good. Instead of increasing productivity, they put hard brakes on it.

    Is it practical to standardize everything down to the last bit if the end result is that no one ever has enough time or mental power to study all the standards before adding a metric in the correct way?

    Is it practical to standardize at a global level, or even at company level, when the standards are intended to improve cooperation within a team of 5 people?

    Are there neither too many nor too few standards between different departments?

    Whatever monitoring tools one adopts, it seems to me that the right level of standardization determines success more than whatever bells and whistles are included in the software.

    Kind regards,
    Paul Karman

  • 6 Bruno   November 2, 2013 at 2:06 pm

    Jonah,

    I think what you are seeing in these companies is not just a technology problem. Although I am not well versed in all the monitoring technologies out there I can abstract a solution by asking a few questions.

    Question 1

    What single tool is currently available on the market that could monitor all 75,000 servers that currently are being monitored by the 65 monitoring tools?

    Question 2

    How much money is that company willing to spend to resolve this problem?

    Question 3

    Are people willing to outsource monitoring from their teams into a central team?

    These three questions tackle technology, finance, and people, all of which are part of the problem; I think the last two are the biggest.

    My solution is as follows:

    > I would go for an open source solution and save half my budget.

    > I would hire a genius with the experience and expert know how to setup a single monitoring system with maybe a distributed architecture and provide whatever output people want to see.

    > I would appoint someone with great people skills to manage the outsourcing of the monitoring from the other teams and to manage business expectations whenever further functionality is needed.

    I am not sure what the above might cost, but what I am sure of is that it isn’t a high price to pay, especially when you take into account the licensing costs and the staff time spent setting up and maintaining those systems (new checks, upgrades, etc.).

    Although the above does not factor in the cost of the consultants my company will hire because they don’t trust their staff to do my job, as a consequence of not being able to assess how technically able any of them are.

    Centralize it, standardise it and revolutionise it!

  • 7 Got Nagios? Get rid of it.   November 12, 2013 at 7:43 pm

    [...] Nagios : Let the religious wars continue [...]

  • 8 Matt   March 21, 2014 at 1:54 pm

    Nagios/Icinga is perfect for alerting.

    If you want to graph stuff, use graphite.

    If you want to alert on your graphs, use Skyline.

    Nagios is not “one size fits all”, and I have had far too much experience of systems which promise to roll up alerting, trending, and visualisation into a “one size fits all” solution to know that it’s usually “one size fits nothing”.

  • 9 Gartner: Got Nagios? Get Rid Of It!   August 20, 2014 at 2:29 am

    [...] Wrote Jonah, [...]

  • 10 NagiosAbuser   September 4, 2014 at 5:28 pm

    The problem is not that you have an opinion on the state of Nagios vs other options; it’s that you’re presenting your opinions as facts.

    1.) Nothing works in a vacuum. Most entities running Nagios are running it with other software to accomplish various things that Nagios wasn’t designed to do (Puppet/Chef for automation, Munin for a front end, etc.).

    2.) The “your way sucks” mentality is very easy to have, and really not a good fit for the open source community. There are thousands of ways to do everything, and OFTEN things are done in a way to maximize the current environment. People that shout “this is always the way it’s been done” are in my top 5 classes of annoyances, but slightly behind the camp that says “this way is new, so we should do that simply because newer is better,” and way behind the people that boast “*my* way would totally solve ALL your problems” (because it won’t, and arrogance is ugly).

    If we wanted to work in a one-way-fits-none, but-we’re-gonna-do-it-anyway world, we probably wouldn’t be working in IT.

    That being said, in my experience, here’s three ways to look at solutions:

    A.) How would I do it if money were no object? (Dream way..)
    B.) How would I do it if I were in my garage on my own servers? (Cheap way OR most fun way OR which way would I learn more…)
    C.) How would I do it if my job depended on it? (What’s the cheapest way I can do it *right*.)

    Since we don’t always live in the dream state of life, B and C are WAY more common… and the way I’d do something if I had my druthers is almost NEVER the right way to do it in my enterprise.

    Fun thought to keep in mind.

  • 11 Jonah Kowall   September 4, 2014 at 7:36 pm

    Agreed on most of these counts; I’m just reflecting in my writing and research what I’ve heard more than enough times from clients. By adding lots of components and complexity to solve a basic problem (is my server healthy?), we are over-engineering things that matter less to the actual function of the applications on that server. Simplify availability monitoring and focus on performance monitoring.
