I’m doing my second greenfield OpenNMS deployment, and so far it’s going a lot more smoothly than last time. This time it’s server-centric, which is a whole new can of worms to figure out. I don’t yet know how feasible it will be to monitor services like Exchange or SQL Server.
My last setup never really got off the runway because I didn’t have enough random-write I/O bandwidth to do massive graphing on all of my switch ports, and I left before I could beg for new hardware. Since then I’ve learned there are just flat-out physical limits to disks: they can only do so many operations per second, and even SSDs have a hard ceiling. The way to win is to persist everything to a giant ramdisk, then copy it all down to your real disks every X minutes, where X is however much data you feel safe losing. All of a sudden those thousands of random 8 KB writes to JRB files every second turn into one big sequential write, and the copy goes surprisingly quickly. It’s not that far-fetched either: you can easily put 64 GB of RAM in a server for less than the cost of a new 16-drive array, and you’re trading a little safety for hella performance.
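To make that concrete, here’s a minimal sketch of the kind of thing I mean, assuming the JRB tree lives at /var/opennms/rrd (the path, the tmpfs size, and the flush interval are all placeholders to adjust for your own setup, and you need something at boot that copies the on-disk copy back into the ramdisk before OpenNMS starts):

# /etc/fstab: keep the JRB tree on a ramdisk (size it comfortably larger than your data)
tmpfs   /var/opennms/rrd   tmpfs   size=16g,mode=0755   0 0

# /etc/crontab: every 15 minutes, flush the ramdisk down to real disk.
# rsync reads thousands of little JRB files but writes them out as one long sequential stream.
*/15 * * * *   root   rsync -a /var/opennms/rrd/ /var/opennms/rrd-persistent/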
Anyways, this time I’m going from 1.8.3 to 1.9.92, and it’s considerably better! Not only were two of my long-standing gripes fixed (adding relativetime to graphs, and TableTracker writing to stderr), along with a slew of other bugs, but it just feels like a much nicer piece of software now. I kept track of the release notes over the summer and it’s obvious there were a lot of good updates going in and real progress being made. I’m sure plenty of things were fixed that I never even noticed were broken. Props to all those guys. One of the big things I wanted back then was IPv6 support, but now I don’t have a v6 network to monitor. :(
A few things I’ve learned, either through experience or because things that were obvious to other people finally made sense to me:
– You’ll be happier and scale larger if you use ramdisks
– The services (e.g. ICMP, HTTP, SNMP) listed in a node’s ‘Availability’ table are completely independent of data collection (i.e. graphing). You can graph things that aren’t listed there, and you can monitor availability without graphing it. I’ve known this for a while, but it seems like it’s still lost on a lot of people.
– Maybe I’m stupid, but I was always led to believe there were two separate types of categories in use: one in categories.xml (the home page) and another in surveillance views. I think this was because of poor examples. The default categories configuration refers to things as rules instead of filters, and there’s nothing in there that leads you to believe you can do "categoryName == 'Servers'" or even "nodelabel LIKE '%city%'" (there’s an example of what I mean after this list).
– If you want to ping/monitor external sites: create a new provisioning group, strip the detectors down to just ICMP and HTTP, and add a policy to categorize everything as ‘external-monitors’. Then set up a poller package that enables StrafePing for "<filter><![CDATA[(IPADDR != '0.0.0.0') & (categoryName == 'external-monitors') ]]></filter>" (a sketch of the whole package is after this list). You’re a good citizen for not repeatedly over-discovery-scanning some poor website, and your NOC peeps will understand from the category why amazon.com is in the node list.
– Poller packages are additive; it doesn’t have to be all-or-nothing. If an IP address or category matches more than one poller package, they all get applied. This was another fascinating realization (by accident, of course), because one of my big peeves was figuring out an easy way to get really custom polling without cut-n-pasting huge swaths of configuration.
– Automatically adding categories to new nodes via a foreign source policy is the way to go. If you have different provisioning groups per city/site, you can rig it up so everything gets tagged with a site category by default, without having to pick from a drop-down, and then, based on the hostname/node label, add more categories to indicate its role and so on (see the policy sketch after this list). The OpenNMS Provisioning PDF is an outstanding reference for policies; it’s much clearer and has better examples than the provisiond wiki page.
– Tip: when creating a foreign source policy and you want to use a regex, prepend the expression with a ~ (tilde), e.g. "~.*" or "~^hostname\d+.*". I don’t think this is explained in any of the documentation; I happened to stumble upon it on a random mailing list. Otherwise your stuff will silently not match and you won’t know why!
– Integrating the web UI to authenticate/authorize against Active Directory via LDAP looks hairy and makes me want to drink more before trying it. There are a few examples thrown into the wiki, but who knows if they work; they certainly don’t look simple or easy. A basic "hello world" example just to prove the thing works at all would be nice.
– When all else fails, think how Java would think.
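A few of those deserve concrete examples. First, the categories.xml thing: here’s a hand-built availability category driven entirely by a filter rule rather than a hand-picked node list. The label, thresholds, and service are placeholders; the rule is the interesting part, and it assumes provisiond has already tagged the nodes with a ‘Servers’ category.

<category>
  <label><![CDATA[City X Servers]]></label>
  <comment>Anything tagged Servers whose node label mentions the city</comment>
  <normal>99.99</normal>
  <warning>97</warning>
  <service>ICMP</service>
  <!-- the "rule" is really just a filter expression -->
  <rule><![CDATA[(IPADDR != '0.0.0.0') & (categoryName == 'Servers') & (nodelabel LIKE '%city%')]]></rule>
</category>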
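Second, a sketch of the external-monitors poller package. The StrafePing parameters are lifted from the stock example on the wiki rather than anything I’ve tuned, and the rrd-repository path depends on your install, so treat it all as a starting point. Since packages are additive, this just layers StrafePing on top of whatever the default package already does for those nodes.

<package name="external-monitors">
  <filter><![CDATA[(IPADDR != '0.0.0.0') & (categoryName == 'external-monitors')]]></filter>
  <rrd step="300">
    <rra>RRA:AVERAGE:0.5:1:2016</rra>
    <rra>RRA:AVERAGE:0.5:12:1488</rra>
    <rra>RRA:AVERAGE:0.5:288:366</rra>
    <rra>RRA:MAX:0.5:288:366</rra>
    <rra>RRA:MIN:0.5:288:366</rra>
  </rrd>
  <service name="StrafePing" interval="300000" user-defined="false" status="on">
    <parameter key="retry" value="0"/>
    <parameter key="timeout" value="3000"/>
    <parameter key="ping-count" value="20"/>
    <parameter key="failure-ping-count" value="20"/>
    <parameter key="wait-interval" value="50"/>
    <parameter key="rrd-repository" value="/var/opennms/rrd/response"/>
    <parameter key="rrd-base-name" value="strafeping"/>
    <parameter key="ds-name" value="strafeping"/>
  </service>
  <downtime interval="30000" begin="0" end="300000"/>
  <downtime interval="300000" begin="300000"/>
</package>

<!-- down at the bottom of poller-configuration.xml, alongside the other monitors -->
<monitor service="StrafePing" class-name="org.opennms.netmgt.poller.monitors.StrafePingMonitor"/>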
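And third, the provisioning policies. You’d normally click these together in the provisioning groups web UI, but this is roughly what ends up in the foreign source definition underneath; the category names and hostname pattern are obviously placeholders. Note the tilde on the label parameter telling provisiond that the value is a regex.

<policies>
  <!-- tag every node in this provisioning group with its site -->
  <policy name="tagSite" class="org.opennms.netmgt.provision.persist.policies.NodeCategorySettingPolicy">
    <parameter key="category" value="CityX"/>
    <parameter key="matchBehavior" value="NO_PARAMETERS"/>
  </policy>
  <!-- tag nodes whose label looks like webNN as Webservers -->
  <policy name="tagWebservers" class="org.opennms.netmgt.provision.persist.policies.NodeCategorySettingPolicy">
    <parameter key="category" value="Webservers"/>
    <parameter key="matchBehavior" value="ALL_PARAMETERS"/>
    <parameter key="label" value="~^web\d+.*"/>
  </policy>
</policies>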
All that being said, realize this is definitely not an application you casually install with "yum install opennms" and walk away from. It takes quite a bit of knowledge to get it working, plus real thought about how you want to organize things to serve both operational and business-level needs. Even then, it seems to take several iterations of "I want to do this," then "the OpenNMS way makes that impossible, or it has to be done this other way," and then "okay, redesign the requirement with that in mind and try again." Otherwise you’ll end up with hacky, complicated things that get out of hand and become impossible to manage.
My new big gripe is that OpenNMS as a concept has serious shortcomings when it comes to monitoring disk and storage resources. Maybe the big kids are monitoring 50,000 cable modems (which it works really well for) plus a token web farm, and there just isn’t much demand for an extensive storage interface. I can certainly graph space and disk I/O, but I would really love to look at a node’s page and see, at a glance, a short numerical list of space used/free on all my partitions and the status of each RAID array. Some of us have 50,000 hard disks and exibibabblebytes to monitor instead! Same with BGP peering: I’d love to have been able to look at a router’s node page and see at a glance how many of my BGP sessions were up or down (the peering routers easily had hundreds). Server load balancing too, now that I think about it. You can improvise custom availability checks for a few one-offs, but they’re usually hard-coded to one device’s configuration. Even if I were motivated to take a crack at writing my own, there don’t seem to be any hooks for doing something custom.
I don’t think OpenNMS itself is the problem; it’s that all the tools for getting data about storage and RAID suck balls. Just this week I learned that net-snmp 5.3 on CentOS 5.7 carries a patch that lets it report filesystems > 4 TB, but that patch is not in net-snmp 5.4 on OpenIndiana. SNMP access to PERC status? ha ha. ZFS metrics via SNMP? ha ha ha. An NFS service poller? Doesn’t exist. iSCSI or FCoE ping? what what
The only way around this is to rig up some middleware to do custom aggregation of this information and then present it to OpenNMS in a standardized way. Even then, there still needs to be an interface concept (like IP addresses/interfaces) to hang it off of.
The other long-standing gripe is the need for an easy way to aggregate performance data from multiple nodes into one large graph. Suppose I have a datacenter full of power strips and I want a graph that shows total power consumption, or total traffic across multiple transit links to the same provider, or total disk throughput across many disks. KSC reports can’t do this (that I know of), and the alternative is rigging up a virtual node, symlinking things all over the place, and then creating a custom graph report. Blah.
The last time I looked at thresholding, alarms, and notifications, it was so complicated and dreary it made me want to drink. It’s most certainly not as easy as throwing a 15-line file into /etc/nagios/objects. In fact, I’m keeping Nagios around for exactly that reason. Here’s to hoping it makes more sense this time!