I previously posted a write-up on “getting rid of” Nagios, and it has generated a staggering number of responses from those who are attached to Nagios and have used it effectively. It has ignited a discussion similar to the Mac vs. PC or *nix vs. Windows debates pervasive across the internet (complete with personal attacks; I thought we were all adults here…). Those who know me know I’ve been there and done it in my past, and after speaking with thousands of users of monitoring tools during my time at Gartner, I can share some common threads.
Many of the readers who commented on the story were managing hundreds of servers. This is typically not the size of the enterprises I speak with (although sometimes we do speak with those groups). We are often speaking with an enterprise or IT architect who is trying to homogenize the management of the IT landscape. This means not just servers, but applications, network devices, storage systems, and other components. If you do not treat your IT infrastructure with uniformity, the notions of industrialization and standardization are elusive.
The typical Gartner client is a medium to large company that has grown through the years, both organically and by acquiring other businesses. This makes its IT landscape quite diverse, and often managed in isolated groups. In my discussions with these clients, I find that the isolated groups, left to their own devices, have selected a wide array of tools to manage their infrastructure and applications. These tools overlap with those of other business units within the organization, which use similar or dissimilar tools. For example, a client I spoke with last week had over 65 monitoring tools in place. The environment was 75,000 servers, not counting a vast array of other components that were out of scope for phase 1. As you can imagine, there was everything from Icinga, Nagios, Zabbix, Graphite, etc. (on the open source front) to what we call the big four (IBM, CA, HP, BMC), and even islands of VMware vCenter Operations Suite, Oracle Enterprise Manager, and Microsoft System Center Operations Manager. I tried to offer some prescriptive advice, but the task at hand was daunting to say the least. The CIO who created this project was responding to the massive cost of licenses and people the business was essentially wasting by not managing this centrally and not leveraging its scale.
Nagios plays a big role in these organizations, often being implemented several different ways in a single enterprise. Standardizing the implementation is a challenge, especially with most Nagios users selecting various components from the open source and Nagios communities to build their personally preferred implementation. The vendors who leverage Nagios and aspects of open source, but build standardized ways of implementing and managing the footprint, are likely to create a more successful implementation at scale (e.g., Centerity, Centreon, Groundwork, OP5, Opsview).
Additionally, a common issue that is pervasive across monitoring is the overextension of the platform to do more than the core tenets of monitoring: the capture of metrics, the notification of issues, and the analysis and correlation of metrics to determine root cause for problem isolation. Monitoring should not run the jobs needed to operate and correct infrastructure, yet we see this happening consistently regardless of the platform. This creates lock-in to the monitoring tool regardless of the vendor or technology used. When the business is forced to transform, for example due to acquisition, new business demands, data center relocation, consolidation, or newer technology that enables easier management, it is unable to detach the monitoring from the infrastructure or applications without extensive reverse engineering, which is often not possible.
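To make the separation concrete, here is a minimal sketch of what keeping corrective jobs out of the monitoring tool could look like. It is only an illustration under my own assumptions, not how Nagios or any particular product works: the `Monitor`, `Alert`, and `RemediationService` names, the hosts, and the thresholds are all hypothetical. The monitor captures a metric and publishes an alert; anything that acts on the alert lives outside of it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record published by the monitor."""
    host: str
    metric: str
    value: float
    threshold: float


class Monitor:
    """Captures metrics and publishes alerts; it never runs corrective jobs."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Alert], None]] = []

    def subscribe(self, handler: Callable[[Alert], None]) -> None:
        # Consumers register themselves; the monitor knows nothing about them.
        self._subscribers.append(handler)

    def record(self, host: str, metric: str, value: float, threshold: float) -> None:
        # Publish an alert only when the value exceeds the threshold.
        if value > threshold:
            alert = Alert(host, metric, value, threshold)
            for handler in self._subscribers:
                handler(alert)


class RemediationService:
    """Hypothetical automation system, owned and deployed separately."""

    def __init__(self) -> None:
        self.jobs_run: List[str] = []

    def handle(self, alert: Alert) -> None:
        # A real system would launch a job here; this sketch only records it.
        self.jobs_run.append(f"cleanup:{alert.host}")


notifications: List[Alert] = []
monitor = Monitor()
remediation = RemediationService()

monitor.subscribe(notifications.append)  # notification path
monitor.subscribe(remediation.handle)    # automation path, external to the monitor

monitor.record("web01", "disk_used_pct", 93.0, 90.0)  # breaches the threshold
monitor.record("web02", "disk_used_pct", 40.0, 90.0)  # healthy, no alert
```

With this shape, replacing the monitoring tool means re-pointing the subscriptions, not reverse engineering the jobs buried inside it.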
So I leave you with a question: how do you avoid these typical scenarios, regardless of the tool in place?