OpenNMS has a sharp learning curve. While very useful in specific cases, the OpenNMS wiki lacks continuity and, in some cases, authoritative answers. It seems like every article is a place where somebody worked out the details of one specific problem and dumped everything into a wiki entry. The in-between bits aren’t spelled out and take a good amount of detective work to figure out. I’ve seen several articles written in the form “well, it looks like OpenNMS does X, I guess they intended for it to do foo”, as if there are no maintainers around to clarify.
The really useful information I found was in the few whitepapers, case studies and OpenNMS Conference presentations. Even when a paper is in German, the included screenshots and configuration examples are still in English and still provide useful ideas.
These are the things on my wish list/todo list and what I’ve figured out so far:
- There are two completely different processes that use SNMP to talk to a host. One just checks whether SNMP is alive by fetching sysObjectID. The other is the actual data collection that winds up written to disk. (A config sketch of both follows this list.)
- Measuring packet loss to a device: use StrafePing (a poller config sketch follows this list).
- Interface status monitoring (ifAdmin/ifOper): use the “SNMP Service Poller”.
- BGP peer state monitoring: the approach in “BGP Session Monitor” is annoying, and there’s no better way to do it given the system design. You have to manually configure a capsd + poller entry for each and every BGP session you have. I deal with hundreds of peering sessions and I want to know when they go down, so this gets unwieldy fast and requires a separate external script to build the configuration. Fortunately I already have a script that generates the BGP configuration for Nagios, and it’s easily adapted. Ideally I think there should be a capsd process that finds all of these automatically, the way it does for physical interfaces. (A per-peer config sketch follows this list.)
- Cisco module monitoring: same story as BGP monitoring; you need an external script to figure out what modules are installed and generate the appropriate capsd + poller entries. It may just be more useful to focus attention on catching traps instead of trying to both poll status and deal with traps.
- Having OpenNMS run an external command (e.g. SSH) to gather data: don’t. I figured out that this really is possible, but the Right Way is to abstract the exotic data collection so that OpenNMS just fetches a web page containing the data to be monitored and doesn’t care about how those data were collected. This abstraction will keep paying off even if you later switch to another graphing package such as Cacti. (An HTTP collector sketch follows this list.)
- Recording serial numbers from devices: “SNMP asset adapter”. You’ll need to add configuration to map data from, say, the Cisco entPhysical table, to the various asset fields like “description” or “serialNumber”. Unfortunately there’s no good way to cleanly document serial numbers and hardware information for every installed module in, say, a Cisco 6500, a Cisco CSS or a Juniper router. (An adapter config sketch follows this list.)
- Automatic L2 topology discovery with Linkd: this blew my mind with how well it worked out of the box (it’s disabled by default). By default it collects CDP, bridge tables and route tables from hosts, then provides a hypertext list on the web interface of what’s connected to what. I’m sticking to CDP collection only on my switches; with hundreds of VLANs and thousands of routes I’m afraid of what full bridge and route table walks would do to switch CPU. (A config sketch follows this list.)
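
To make the first point concrete, here’s roughly what the two halves look like. This is a sketch based on the stock 1.x config files, so the class names and parameter keys should be double-checked against whatever version you’re running:

```xml
<!-- poller-configuration.xml: the "is SNMP alive" check; SnmpMonitor polls a
     single OID (by default sysObjectID) and marks the SNMP service up or down -->
<service name="SNMP" interval="300000" user-defined="false" status="on">
  <parameter key="oid" value=".1.3.6.1.2.1.1.2.0"/>
  <parameter key="retry" value="2"/>
  <parameter key="timeout" value="3000"/>
</service>
<monitor service="SNMP" class-name="org.opennms.netmgt.poller.monitors.SnmpMonitor"/>

<!-- collectd-configuration.xml: the actual data collection that ends up on disk,
     driven by the named collection group in datacollection-config.xml -->
<service name="SNMP" interval="300000" user-defined="false" status="on">
  <parameter key="collection" value="default"/>
</service>
<collector service="SNMP" class-name="org.opennms.netmgt.collectd.SnmpCollector"/>
```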
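StrafePing is just another poller service. A minimal sketch for poller-configuration.xml; the counts, intervals and RRD path here are guesses modeled on the shipped example, so adjust to your install:

```xml
<!-- inside a <package> in poller-configuration.xml -->
<service name="StrafePing" interval="300000" user-defined="false" status="on">
  <parameter key="retry" value="0"/>
  <parameter key="timeout" value="3000"/>
  <parameter key="ping-count" value="20"/>
  <parameter key="failure-ping-count" value="20"/>
  <parameter key="wait-interval" value="50"/>
  <parameter key="rrd-repository" value="/var/lib/opennms/rrd/response"/>
  <parameter key="rrd-base-name" value="strafeping"/>
</service>
<monitor service="StrafePing" class-name="org.opennms.netmgt.poller.monitors.StrafePingMonitor"/>
```

The result is a SmokePing-style latency and loss graph under the node’s response time graphs.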
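For the BGP item, this is roughly what the per-peer boilerplate looks like, and why it has to be generated by a script. The plugin/monitor class names and the bgpPeerIp parameter are what I took from the wiki article, and the peer address and service names are obviously placeholders:

```xml
<!-- capsd-configuration.xml: one protocol-plugin per peer so the service gets discovered -->
<protocol-plugin protocol="BGP-Peer-192.0.2.1" scan="on" user-defined="false"
                 class-name="org.opennms.netmgt.capsd.plugins.BgpSessionPlugin">
  <property key="bgpPeerIp" value="192.0.2.1"/>
  <property key="retry" value="2"/>
  <property key="timeout" value="3000"/>
</protocol-plugin>

<!-- poller-configuration.xml: a matching service + monitor per peer -->
<service name="BGP-Peer-192.0.2.1" interval="300000" user-defined="false" status="on">
  <parameter key="bgpPeerIp" value="192.0.2.1"/>
  <parameter key="retry" value="2"/>
  <parameter key="timeout" value="3000"/>
</service>
<monitor service="BGP-Peer-192.0.2.1" class-name="org.opennms.netmgt.poller.monitors.BgpSessionMonitor"/>
```

Multiply that by a few hundred peers and it’s clear why the configuration has to come out of a script.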
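For the “just fetch a web page” approach, the HTTP collector can pull a number out of a page with a regex. A minimal sketch of an http-datacollection-config.xml entry; the path, regex and attribute names are made up for illustration, and it gets paired with an HTTP collection service in collectd-configuration.xml (class org.opennms.netmgt.collectd.HttpCollector):

```xml
<http-collection name="script-output">
  <rrd step="300">
    <rra>RRA:AVERAGE:0.5:1:2016</rra>
    <rra>RRA:AVERAGE:0.5:12:1488</rra>
  </rrd>
  <uris>
    <uri name="queue-depth">
      <!-- whatever your cron job / CGI publishes; the first capture group is the value -->
      <url path="/metrics/queue.txt"
           matches="(?s).*queue_depth: ([0-9]+).*"
           response-range="100-399"/>
      <attributes>
        <attrib alias="queueDepth" match-group="1" type="gauge32"/>
      </attributes>
    </uri>
  </uris>
</http-collection>
```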
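The asset adapter mapping lives in snmp-asset-adapter-configuration.xml. A sketch that pulls the chassis entry of the ENTITY-MIB into the serialNumber and description fields; the package name is mine, the sysoid mask matches Cisco, and hard-coding instance .1 for “the chassis” is exactly the limitation mentioned above:

```xml
<snmp-asset-adapter-configuration>
  <package name="cisco-chassis">
    <sysoidMask>.1.3.6.1.4.1.9.</sysoidMask>
    <assetField name="serialNumber" formatString="${chassisSerial}">
      <mibObjs>
        <!-- entPhysicalSerialNum, instance 1 (the chassis on most single-chassis boxes) -->
        <mibObj oid=".1.3.6.1.2.1.47.1.1.1.1.11.1" alias="chassisSerial"/>
      </mibObjs>
    </assetField>
    <assetField name="description" formatString="${chassisModel}">
      <mibObjs>
        <!-- entPhysicalModelName, instance 1 -->
        <mibObj oid=".1.3.6.1.2.1.47.1.1.1.1.13.1" alias="chassisModel"/>
      </mibObjs>
    </assetField>
  </package>
</snmp-asset-adapter-configuration>
```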
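And for Linkd: enabling it means uncommenting its entry in service-configuration.xml, and the per-protocol toggles live in linkd-configuration.xml. Here’s a CDP-only sketch along the lines of what I described; the attribute names and whether they sit on the package or the top-level element vary between versions, so treat this as a rough outline and check the example file that ships with your install:

```xml
<linkd-configuration threads="3"
                     initial_sleep_time="60000"
                     snmp_poll_interval="3600000"
                     discovery_link_interval="3600000"
                     package="cdp-only">
  <package name="cdp-only"
           useCdpDiscovery="true"
           useBridgeDiscovery="false"
           useIpRouteDiscovery="false"
           saveRouteTable="false"
           saveStpNodeTable="false"
           saveStpInterfaceTable="false">
    <filter>IPADDR != '0.0.0.0'</filter>
    <include-range begin="10.0.0.0" end="10.255.255.255"/>
  </package>
</linkd-configuration>
```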
Tips & ideas:
- Interface status: sometimes we just don’t care about interface status, or we need temporary exclusions (e.g. new turn-ups, testing, laptop ports) and don’t want to be constantly annoyed by alerts. Edit snmp-interface-poller-configuration.xml and change the interface criteria to include “and snmpifalias not like '%IgnoreStatus%'” (attentive users will notice this is really SQL). Now when OpenNMS sees “IgnoreStatus” in the interface description, it’ll ignore it.
- Interface status #2: sometimes we only care about port-channels and uplinks on switches, and don’t care about ports facing servers. Using the same logic as above, you can configure the poller to only monitor interfaces with “Uplink” or “Port-channel” in the description, ignoring everything else. (A sketch covering both of these tips follows this list.)
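Here’s a sketch of what both tips look like inside snmp-interface-poller-configuration.xml; the package name and address range are placeholders, and the criteria strings are the SQL fragments described above:

```xml
<package name="switches">
  <filter>IPADDR != '0.0.0.0'</filter>
  <include-range begin="10.0.0.0" end="10.255.255.255"/>
  <!-- Tip 1: poll Ethernet ports, but skip anything tagged IgnoreStatus in its ifAlias -->
  <interface name="Ethernet" interval="300000" user-defined="false" status="on"
             criteria="snmpiftype = 6 and snmpifalias not like '%IgnoreStatus%'"/>
  <!-- Tip 2: alternatively, only watch interfaces explicitly marked as uplinks or port-channels -->
  <interface name="Uplinks" interval="300000" user-defined="false" status="on"
             criteria="snmpifalias like '%Uplink%' or snmpifalias like '%Port-channel%'"/>
</package>
```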
Unanswered questions:
- Does SNMP4J or OpenNMS try to aggregate collection of OIDs? Is it possible to exclude OIDs? In my SNMP data collection config I’m trying to fetch hrSystemUptime, hrSystemNumUsers and hrSystemProcesses. Either OpenNMS or SNMP4J (I think the latter) tries to be smart and ends up fetching the entire hrSystem table. This is a problem because I have some systems where walking the whole hrSystem (1.3.6.1.2.1.25.1) tree causes the agent to stop responding to requests, generating repeated SNMP timeouts. This is very reproducible; masking the extra OIDs in Net-SNMP doesn’t seem to help, and upgrading Net-SNMP hasn’t helped either. While Net-SNMP is probably the root cause, I can’t see any easy way to make OpenNMS stop trying to collect too much data. Surely there are enough buggy agents out there that people have solved this. (The collection group in question is sketched after this list.)
- Why does OpenNMS generate duplicate interface graphs on new nodes? Upon addition of a new node, it creates directories for RRD/JRB files using the interface name, such as “Gi3/13”. Later, after a service scan happens, it apparently figures out the MAC address and creates a new directory using the interface name plus MAC address, such as “Gi3/13-0e63cafe153”, orphaning the old interface data. Why can’t it do this first?
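For reference on the first question, the collection group involved is only three scalars. In datacollection-config.xml it looks something like this (the group name and types are my best guess at reproducing it here), and yet the agent still gets walked through the whole hrSystem subtree:

```xml
<group name="host-resources-system" ifType="ignore">
  <!-- three scalar objects out of HOST-RESOURCES-MIB::hrSystem -->
  <mibObj oid=".1.3.6.1.2.1.25.1.1" instance="0" alias="hrSystemUptime"    type="timeticks"/>
  <mibObj oid=".1.3.6.1.2.1.25.1.5" instance="0" alias="hrSystemNumUsers"  type="gauge"/>
  <mibObj oid=".1.3.6.1.2.1.25.1.6" instance="0" alias="hrSystemProcesses" type="gauge"/>
</group>
```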
I’ve been neck-deep fiddling with this stuff, and it almost feels like I could write a book about it.
You should probably open issues against the version you are running, with detailed logs, if you have not already done so. I doubt that just blogging about it will get the issues addressed.