I got inspired this week to fiddle with OpenNMS. Reading the website makes me think it does * and has great potential. Looking at the modular design and the event system, it just “feels right”. It is Java-based with all XML-based configs, which really doesn’t bother me. I like the idea that it has built-in syslog and SNMP trap collectors that automatically associate all active/passive events with the host/node.
Getting started was deceptively simple. There’s a step-by-step guide to installing PostgreSQL and the OpenNMS RPMs. I think I had it running within 30 minutes. I gave it my home subnet to scan, and it did a great job of providing sensible resource monitoring (with clear, attractive graphs) of discovered things out of the box, including OIDs I didn’t know existed. I don’t like network-wide scanning, but I really love the node-based service discovery mechanism. It tries to find every monitorable OID in a list based on the sysObjectId. It also figured out when a system had multiple L3 interfaces and did ping/SSH response time monitoring against all of its public IP addresses. Every single thing it monitored, it graphed automatically; the lack of that is a huge, annoying shortcoming in Nagios and other monitoring systems. I’ve recently been evaluating Science Logic’s EM7 and Nimsoft’s solutions, and neither lured me in to get real things done like OpenNMS did.
I’m really disappointed it doesn’t have any IPv6 support. I figured since it’s supposed to be a grown-up, modern product that this would already be there. I’m not even sure I could hack in simple IPv6 ping monitoring if I wanted to. There aren’t that many people talking about support, but looking at the changelogs + single wiki page, support is gradually being added. Now that I’m more familiar with the product, I see that v4 addresses are all over the place and full support will take real work to pull off.
Immediately after it was installed, it felt like they gave me enough rope to hang myself. It was definitely working, but the web interface seemed like it wasn’t the whole story (it’s not). Coming from a Nagios environment, the normal sort of things I’m used to looking at weren’t there or weren’t immediately obvious. How do I know when a down device will be polled next? How can I force a re-check? How can I ignore a check? How can I see all checks on one screen? I basically had to spend five hours one night forcing myself to read over all the configuration how-tos, even though a lot of the concepts were foreign to me. Only then was I able to start constructing a mental model.
Eventually I got comfortable enough with the design and configuration, and started looking for individual things to expand so that it mirrored my Nagios functionality. I started taking new things one at a time, Googling for each one. I was pleasantly surprised to find detailed answers to everything I’ve wanted so far (most were on their wiki):
- How do I monitor every interface (L2 & L3) on a system? Answered on the FAQ.
- How do I add an Uptime graph to my PIX/ASA firewall? Add a sysUpTime mibObj under cisco-pix in datacollection-config.xml and add a section to snmp-graph.properties.xml. That was it. It just started working.
- How do I monitor a running MySQL process? Here’s a how-to. I kinda wish it would connect directly to the MySQL daemon, but rigging up a web-based stat page is probably a good idea.
- How do I add UDP/TCP connection rate monitoring to my PIX/ASA firewalls? Easy as copying from another similar item in datacollection-config.xml and snmp-graph.properties.xml.
- How do I automatically populate the host asset database via SNMP? Here’s another how-to. This is going to take some work, but doesn’t look bad.
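For the Uptime item above, here is a rough Python sketch of the kind of mibObj element that goes into datacollection-config.xml. The attribute names (oid, instance, alias, type) and the timeticks type are how I remember the OpenNMS config format, so treat them as assumptions and compare against a neighbouring entry in your own file. The make_mib_obj helper is made up for illustration, and the 19-character alias cap is the datasource-name limit I know from RRDtool, assumed here to apply to JRobin as well.

```python
import xml.etree.ElementTree as ET

# Assumed limit: RRDtool datasource names may be at most 19 characters.
MAX_ALIAS_LEN = 19


def make_mib_obj(oid, instance, alias, obj_type):
    """Return a <mibObj .../> element as a string, rejecting overlong aliases."""
    if len(alias) > MAX_ALIAS_LEN:
        raise ValueError(
            f"alias {alias!r} is longer than {MAX_ALIAS_LEN} characters"
        )
    elem = ET.Element(
        "mibObj",
        {"oid": oid, "instance": instance, "alias": alias, "type": obj_type},
    )
    return ET.tostring(elem, encoding="unicode")


# sysUpTime lives at .1.3.6.1.2.1.1.3 with instance 0 (standard MIB-II).
SYS_UPTIME = make_mib_obj(".1.3.6.1.2.1.1.3", "0", "sysUpTime", "timeticks")

if __name__ == "__main__":
    print(SYS_UPTIME)
```

The printed element is what you would paste into the relevant group of the collection config; the graph definition in the properties file is a separate step that this sketch does not cover.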
There are still several things to do before it mirrors what I’ve already implemented in Nagios. One thing I do is change monitoring based on the OS of the device. Some things like connection rates on PIX/ASA aren’t available until 7.x code. I realized last night that it’s OK to have an OID in a template that’s not supported by the device, such as cufwConnRate1udp on PIX 6.3(5). OpenNMS will figure out the OID doesn’t exist on re-scan and not try to poll that OID ever again. Another thing I want to do is change connection limit thresholds based on the type of PIX/ASA license, but I don’t know how I’d do this yet.
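The "unsupported OIDs just drop out" behaviour is easy to picture with a toy model. This is only a sketch of the idea, not how OpenNMS implements it: prune_unsupported, the FAKE_PIX_63 dictionary, and the probe callback are all hypothetical stand-ins, and a real probe would be an SNMP GET against the device.

```python
def prune_unsupported(template, probe):
    """Split template object names into (supported, unsupported).

    probe(name) should return a value if the device answers for that
    object and None if it does not.
    """
    supported, unsupported = [], []
    for name in template:
        if probe(name) is not None:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


# Hypothetical stand-in for a PIX 6.3(5): it answers sysUpTime only.
FAKE_PIX_63 = {"sysUpTime": 123456}

# One shared template that includes an object only newer code supports.
TEMPLATE = ["sysUpTime", "cufwConnRate1udp"]

supported, unsupported = prune_unsupported(TEMPLATE, FAKE_PIX_63.get)

if __name__ == "__main__":
    print("poll:", supported)
    print("skip:", unsupported)
```

The point is that one template can be shared across OS versions, with the per-device poll list worked out at scan time.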
Troubleshooting Java is annoying. My first time installing it, I had pages and pages of exceptions and JDBC errors saying it couldn’t connect to the database. It turns out localhost in /etc/hosts was pointing to ::1, so either JDBC didn’t support IPv6 or there was some missing argument. I pointed it back to 127.0.0.1 and things started working. That one was truly annoying to figure out. Last night I was trying to add my ASA connection rate stuff and had 593 lines (wtf!?) of runtime exceptions. The “can’t find datasource” error gave no obvious hint that my datasource name was too many characters long. I wanted to dump the jRobin RRD files to check them, but the jRobin inspector is X-based and I didn’t immediately find a way to run it from a shell.
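In hindsight the localhost problem is quick to check for. Below is a minimal Python sketch that reads the text of a hosts file and reports which addresses a name maps to. The function names and the SAMPLE contents are made up for illustration; on a real box you would pass in the contents of /etc/hosts instead.

```python
import ipaddress


def addresses_for(hosts_text, name="localhost"):
    """Return the addresses a hosts-file body maps `name` to, in file order."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            found.append(address)
    return found


def ipv6_only(addresses):
    """True if at least one address is listed and none of them is IPv4."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).version == 6 for a in addresses
    )


# Hypothetical hosts file resembling the broken setup described above.
SAMPLE = """\
# hypothetical /etc/hosts
::1   localhost localhost.localdomain
"""

if __name__ == "__main__":
    addrs = addresses_for(SAMPLE)
    print(addrs, "IPv6 only:", ipv6_only(addrs))
```

If ipv6_only comes back true for localhost, anything that only speaks IPv4 to the database is going to have a bad time.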
The RANCID integration is pretty hot. I like having device configuration details automatically available from the OpenNMS web interface. It took me longer to configure a sample RANCID + viewc (similar to cvsweb) than it did to actually turn on the support in OpenNMS. The notification patch for RANCID is a nice touch; with it, RANCID will tell OpenNMS when a config refresh has happened. From the how-to, I’m led to believe that somehow it can manage the .cloginrc password file, but I don’t see how/where this happens. It may not be implemented yet.
So far I really enjoy working with it. It’s taken literally seconds to add new things to check, things that would’ve taken hours of baling wire and perl to add to my Nagios install. It’s such a great relief that interface/service discovery just works. My RTG targetmaker setup is atrocious, takes hours to run, and is not the way to do things. I could easily see this replacing the NMS I hacked together. I haven’t dived into notifications yet, so I’m not sure what’s involved there.
Offhand, here’s my to-do and to-figure-out list:
- Graphing of packet loss when doing interface pinging. It sounds like StrafePing monitoring is what I want, which is sort of like Smokeping. The docs warn of extra load; I haven’t looked into this yet.
- Asset field population from SNMP, including modules/PICs from things like Cisco 6500s, Juniper systems, Cisco CSSes
- ifStatus monitoring of up/down interfaces
- BGP peer state monitoring
- Configure e-mail notifications
- A more sensible graph report structure, i.e. let me click to drill down to more detail about interfaces, not show me octets/packets/errors/discards together (this falls apart quickly on switches)
- Fetching data from SSH (e.g. log in, run a command, parse counters out of a result, store & monitor the data) or even XML. I need to be able to gather tech-support statistics from Cisco CSMs, which are only available via CLI.
- SLB serverfarm/virtualserver stats gathering. Some support for CSS already, but need CSM, Netscaler, and ZXTM too.
It’s interesting that there’s not a lot of blog chatter like there is for Nagios. The bulk of the good information is on the OpenNMS wiki, and there’s not much good random info out there. I’m guessing OpenNMS is overkill for a lot of people, and enterprises do their own in-house development to plug into the product. Since it’s all modular and Java, I can easily see it being deeply integrated with other systems and/or doing some advanced things. Seeing as how writing my own NMS made me very fluent in perl, working with OpenNMS may finally motivate me to dive in and learn Java.