Despite my strong hatred for the cold, wet, dark-at-4-PM PNW winter, and contrary to everything I’ve been saying, I unexpectedly took a new and better job that’s going to keep me here for the time being (and at least another winter). I took the opportunity of a week off between jobs and headed down to California to run around the Bay and Death Valley.

Hilary was in town for the weekend, so I left after lunch on Sunday to head to California. The drive went quickly as usual, and I got to Berkeley around 3-4 AM. One interesting tidbit I discovered along the way is that the 45th parallel north sits just outside Salem, meaning that latitude is halfway between the North Pole and the equator.

About a week before leaving, I won a prize from the UC Botanical Garden in Berkeley. For being nearly the 2000th person to “Like” their page, I won a pair of passes and a lovely poster. I left before receiving the passes but went to the garden anyways on Monday afternoon to check it out.

The Botanical Garden was really nice! They had several representative collections of plant life from around the world and worked to make them grow in the Bay Area climate. I was really impressed by the detail and sheer variety of plants in each collection. I’ve been to a few other gardens in Oklahoma, Texas and Iowa, but this one was by far the largest (34 acres). I only had a couple of hours before closing so I took in what I could, but didn’t make it to the California collection.

The one thing I wish I could take away from the place (or adequately describe) is the smell. It was in the upper 50s, sunny, with a light breeze. The air was fresh and crisp, smelling of foliage and hints of the few plants that had started to flower. It was completely seductive and would make for a wonderful Saturday afternoon of wandering. Right at closing the frogs in the Japanese Pool started croaking, adding another nice touch to the atmosphere. I met up with an employee who was excited to meet one of her Facebook fans, and I claimed my poster+passes. I will definitely return to stay longer and recommend this to others!

The weather in the Bay was absolutely wonderful; this was a great week to have gone. There’s a remarkable difference between 54 F in Seattle and 54 F there: it’s way more pleasant and sunny. All the recent rains had made everything super green and lush. This was especially obvious when driving down I-5 to go to Death Valley. I’ve only been down the 5 in August/September, and by then things are usually pretty dry and brown. This time it was rolling fields of green and lots of trees beginning to blossom white on the massive farms. All of this made me want to stay outside as much as possible to take it in.

After leaving the garden, I wandered around Berkeley and UC taking photos as the night fell. Despite students running around, it was pretty tranquil and quiet. I then wandered over to Oakland. I had read a lot about Lake Merritt and Jack London Square, and I wanted to see what they were like at night. For as much as I hear they’re popular areas, I was somewhat surprised to see both were pretty dead at night.  JLS looked sketchy and dark, at the edge of the industrial zone. Lake Merritt was nice, with a string of lights surrounding its entire perimeter, and a little bit of pedestrian traffic.

Tuesday morning I wandered back over to Oakland, then drove over to SF. Some major piece of the new Bay Bridge was being lifted into place this week. On Treasure Island I ran into a guy who was painting the construction work. It looked really cool (maybe a bit Impressionist-y?); I wish I had it hanging on my wall, or at least a photo of it.

I drove around aimlessly in SF for a while and discovered Ocean Beach on the west side of the peninsula. It was awesome to see a sunny, clean beach with lots of surf. The sand was warm, the wind was cool and the water was frigid! Following the coast around, I dropped by Lincoln Park and the Presidio to take touristy photos. Later that night I was trying to decide where to eat that would let me sit outside and people watch; Yelp led me to a brewpub (21st Amendment) that was busy with lots of 30-something tech nerds.

One thing I noticed on this trip was the lack of traffic in San Francisco. Maybe my memories of it are from a particularly busy time, or I was more off the beaten path this time, but it seems like it’s way easier to get around now.

Thought

While driving around today at lunch enjoying the sun, my mind was surprisingly at ease and I had a profound realization: Everything I have is because of thought. At the purest level, I sit around, people give me problems to solve, I think through them and present a solution.  That’s it.  I don’t do any physical labor, I don’t build things, I don’t even really tell people what to do nor push papers around. Simply using my brain provides me with a roof over my head and food on the table.

By extension, this is how I have friends. Friendship is some sort of external manifestation of feeling, which comes from thought. I really can’t describe this concept, but that’s what it feels like.

It makes me wonder: if thought alone is this powerful, what else am I capable of?

Fuel pump replacement

2003 Chevrolet Silverado in-tank fuel pumps

Ever since my fuel pump whined loudly all the way through Arizona for Burning Man 2009, I’ve been saying it’s going to die and strand me someday and that I should really get it replaced before then.

This Friday was the night it decided to die.  I was driving across the I-90 bridge going into Seattle when the engine started missing and I lost a lot of power. So much that I was down to 35 m.p.h. going up the last hill.  I turned off on the first exit, Rainier Ave, and pulled into a side street parking spot. As soon as I stopped, that’s all she wrote and the engine died.  At first I wasn’t sure what it was, but concluded it was the fuel pump after the engine would clearly crank for a second or two before dying.

I called AAA and they sent a tow truck to me within an hour. I had it towed back to Alex’s shop, since it would’ve been impossible to take it to the apartment parking garage. I decided it was the fuel pump, or at the very least that it needed to be replaced anyways. The next day I studied the Haynes manual for all the parts and tools I’d need, then walked over to NAPA to buy everything.

The factory specs say there should be 55-62 p.s.i. at the fuel injector rail. I repeatedly measured 19-20 p.s.i., so it was clear that wasn’t right. I had always dreaded replacing the fuel pump since it seemed like a giant hassle and a mechanic would easily want $1k for the job. Several people on the internets advocated either removing the bed or taking a Sawzall to it to cut an access hole, neither of which were really options.

It wasn’t too terrible of a job. I did it completely by myself in about 12 hours of work over two days. By far the longest part was just pumping all possible gas out of the tank; I got it down to ‘E’ with the fuel warning on. Then it was a matter of unhooking the EVAP canister, disconnecting the two fuel+return lines, putting the tank on stands, and undoing two straps. It turns out the plastic gas tank wasn’t heavy at all, maybe 20 pounds even with the little gas left over. Putting it back in was a hassle; this is where a second set of hands to align the tank while jacking it up would’ve been handy.

After that, I was getting 55 p.s.i. at the rail again and the truck has been running fine!

Food!

While in Redmond, I’ve discovered two delicious sources of food. One is the medium especial salsa and restaurant-style tortilla chips from Trader Joe’s; it’s a great mix of heat and salt, mmm. The other is the coconut curry from Tandoori Fire just down the street. I’ve been trying new Indian places lately and have re-discovered their curry. It’s so rich and creamy, and is amaaaazing with garlic basil naan.

I learned a hard lesson in PostgreSQL administration, in particular what happens when tables aren’t vacuumed. I come from a MySQL background and haven’t fully learned what it takes to keep Postgres humming along. I had recently moved the OpenNMS database to another host in an attempt to throw more spindles at a UI slowness issue; most everything had been working well except the events page, which would time out while trying to render. Moving Postgres to another box helped until it started getting slow again over the holidays.

Looking at my resource graphs for the database server over the previous week, the amount of CPU time spent in I/O wait grew considerably until it eventually stayed pegged, doing nothing but read operations from the disks. This didn’t seem “right” since I hadn’t added any new switches and the configuration was the same. Since this was some random box with 2 GB of RAM and I had been really generous with the memory allocations to PostgreSQL, I thought maybe I was exhausting the Linux buffer cache, as it was hovering around 6700 KB. But tuning down Postgres’ memory footprint didn’t affect the buffer cache, and I was still doing a ton of read ops off disk.

I fixed the pgsql stats collector and went hunting for what OpenNMS was doing to generate so many reads. It turns out that an UPDATE to the snmpinterface table was consistently taking several seconds to execute. This is where I learned that PostgreSQL tables need to be vacuumed frequently, especially ones that see a lot of UPDATEs. Even after stopping OpenNMS and letting the box do its thing, it took several hours for a ‘VACUUM VERBOSE snmpinterface’ to finish. And even though I had autovacuum turned on, it wasn’t actually working because “stats_row_level” wasn’t set to on. Therefore, the tables were never vacuumed!
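
For reference, here’s roughly what I ended up checking and running. This is a minimal sketch assuming a PostgreSQL 8.x-era setup where autovacuum depends on the stats collector; the parameter names below are the stock ones, not copied from my actual config:

    -- Autovacuum silently does nothing unless row-level stats are being collected.
    SHOW autovacuum;              -- needs to be 'on'
    SHOW stats_start_collector;   -- needs to be 'on'
    SHOW stats_row_level;         -- this was the one that was off for me

    -- One-off cleanup of the bloated table (run with OpenNMS stopped; mine took hours)
    VACUUM VERBOSE snmpinterface;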

While that was running, I read the event maintenance documentation. I checked out the events table and discovered it too had become bloated. I update the provisioning group hourly, and I had hundreds of thousands of ‘nodeScanCompleted’ and ‘reinitializePrimarySnmpInterface’ UEIs in my events table. Since I really don’t care about these events, I deleted them and added a statement to vacuumd-configuration.xml to purge them after 2 days. This made a gigantic improvement in displaying events in the web UI: namely, it actually worked again.
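
The purge statement itself is just SQL that vacuumd runs periodically. Something along these lines; a sketch, with the internal UEIs abbreviated to LIKE patterns rather than spelled out in full:

    -- goes inside a <statement> element in vacuumd-configuration.xml
    DELETE FROM events
     WHERE (eventuei LIKE '%nodeScanCompleted%'
            OR eventuei LIKE '%reinitializePrimarySnmpInterface%')
       AND eventtime < now() - interval '2 days';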

Between cleaning up both tables and fixing vacuuming, the excessive read operations completely vanished. From what I can tell, because of the way PostgreSQL’s MVCC/transaction machinery works, every update had to wade through all the accumulated dead row versions on disk first, and over time it became a vicious feedback loop.

Lessons learned:

  • sar -d 1 300 is your friend. If “rd_sec/s” stays around zero, that means your OS filesystem cache is adequate and the box isn’t having to go to disk frequently for hot blocks. Finally seeing how VM management and the buffer cache work in practice on Linux was a great experience. This is something one of my old bosses tried to teach me a long time ago and it never really sunk in until now.
  • PostgreSQL’s stats collection tries to connect to a random high-numbered port on 127.0.0.1. If you have deny-all, permit-specific rules like I do, make sure you accommodate this with “-A INPUT -i lo -j ACCEPT” or something. Otherwise, the pg_stat tables are empty and there’s no indication why.
  • The I/O characteristics between the database and RRD storage are complete opposites.  The RRD filesystem deals with tons of tiny write operations. The database server does a modest amount of reading large chunks of data off disk (table scans?) with relatively few writes.
  • Keep the events table trimmed; apparently there’s a lot of event correlation going on in the background and extra rows can really slow it down.
  • Check pg_log/postgres-X.log to make sure auto vacuuming is actually working.

Now to see what the next learning experience is!  If I can figure out what’s causing all the “storeResult: org.opennms.netmgt.snmp.SnmpResult” to be written to my output.log, I’ll be happy. (changing collectd logging to FATAL hasn’t seemed to help)

California New Years

Bay Bridge at night from Treasure Island

For New Years, I had Thursday and Friday off work. I haven’t been on a good road trip since moving, and I was itching to go explore and see some sun. I had the loose idea of driving to SF, then following the 101 back up the coast, since I’d only ever covered half of it long ago. Thursday morning at 10:30 I headed south.

It was a cold but really sunny drive through Oregon. There were several valleys topped with fog that I drove down into; inside them, everything was covered in frost and the trees had snow on their branches. The drive went amazingly fast since I was excited about going to the Bay. I arrived in Berkeley around 10:40, almost exactly 12 hours later. I kept on driving across the Bay Bridge and went to the Golden Gate Bridge (after momentarily getting lost downtown trying to follow the 101).

The sky over the bay was nearly perfectly clear and the stars were amazing. I don’t know if I’ve ever been to the GG park at night, much less on a clear night. I stayed out until 2 or 3 AM taking long-exposure star photos. I was having a good time, but I was absolutely freezing. Even wearing two of my mountaineering coats and a hat, I resorted to using my iPhone as a hand warmer by firing up the GPS app and letting it burn CPU cycles.

I drove back to the East Bay to find a hotel for the night, landing at one in Berkeley around 4:30 AM. The clerk was nice enough to count my late arrival as an early arrival for Friday, essentially giving me a free day. Yay! I was having an absolute blast driving around, so I ditched my plan of leaving to drive up the coast.

East of Telegraph Ave in Berkeley

I spent little time in my hotel, instead spending a lot of time walking around Berkeley and driving around the Bay. The weather was similar to Seattle’s but much nicer and slightly warmer. The times it was cloudy, it was actually dry; I forgot what dry ground and pavement were like! There were still periods of rain, but they were generally over quickly. Most everyone I talked to said it was odd that the weather was so cold. Fine by me!

I didn’t know the Bay gets all crazy with fireworks for New Years. Had I actually thought to research it, I would’ve planned differently. Instead I was driving back from Oakland, tired of the cold and drizzle. I got back to my room just in time to hear fireworks going off everywhere at midnight, but I couldn’t see them from anywhere. After a few lulls I finally decided to drive up to Tilden to see if I could see anything. Unfortunately it was still rainy, the overlooks by the park were really too far away to see the small stuff going off, and I got lost. Apparently the right thing was to go to the Berkeley Marina to watch the fireworks. Next time, maybe.

There was no plan at all. I visited some friends, ate lots of food, drove all around the Bay, took lots of photos, and sat on the internet at Peet’s. I was pretty excited and glad to be back in the Bay. It seemed like there was an energy I was picking up on, and it was really nice. I definitely want to go back and spend more time there. I want to visit the UC Botanical Garden (the foothills were super pretty) and should spend more time in SF proper.

I was disappointed to leave, but it was time to go back to work and I was tired of living in a hotel. The drive back went fast; I only had to stop three times for 10-minute cat naps.

PNW darkness

I gotta move. The winter solstice was yesterday and the sun just set behind the trees at 3:47 PM. It doesn’t help that Redmond is in a valley, so it sets early. I made the 405 – 90 – Seattle – 520 – Redmond loop this afternoon to soak in the sun while it was out. The darkness and constant work from home are driving me crazy. I’ve been out every single night for the past five days, just to be elsewhere. I leave for Oklahoma in the morning and I’m already pondering going to the U District this evening to wander.

I haven’t been on any good road trips since moving, and I just discovered how cheap it is to fly down the coast. I need to make some trips to the SF bay and LA to go exploring.

More OpenNMS

OpenNMS has a steep learning curve. While very useful in specific cases, the OpenNMS wiki lacks continuity and, in some cases, authoritative answers. It seems like every article exists because somebody worked out the details of one specific problem and dumped everything into a wiki entry. The in-between bits aren’t clear and take a good amount of detective work to figure out. I’ve seen several articles written in the form “well, it looks like OpenNMS does X, I guess they intended for it to do foo,” as if there are no maintainers around to clarify.

The really useful information I found was in the few whitepapers, case studies and OpenNMS Conference presentations.  Even if some of the papers are in German, the included screen shots and configuration examples are still in English and still provide useful ideas.

These are the things on my wish list/todo list and what I’ve figured out so far:

  • There are two completely different processes that use SNMP to talk to a host. One just checks whether SNMP is alive by fetching sysObjectId. The other is the actual data collection that winds up written to disk.
  • Measuring packet loss to a device: use StrafePing
  • Interface status monitoring (ifAdmin/ifOper): use the “SNMP Service Poller”.
  • BGP peer state monitoring: The approach in “BGP Session Monitor” is annoying, and there’s no better way to do it because of the system design. You have to manually configure a capsd + poller item for each and every BGP session you have (a rough sketch of one such poller entry follows this list). I deal with hundreds of peering sessions and want to know when they go down, and this gets unwieldy; it requires a separate external script to build the configuration. Fortunately I already have a script to generate the BGP configuration for Nagios that’s easily adapted. Ideally I think there should be a capsd process that finds all of these the same way it finds physical interfaces.
  • Cisco module monitoring: Same thing as BGP monitoring; you have to have an external script to figure out what modules are installed and generate the appropriate capsd + poller entries. It may just be more useful to focus on catching traps instead of trying to poll status and deal with traps.
  • Having OpenNMS run an external command (e.g. SSH) to gather data: don’t. I figured out this really is possible, but the Right Way is to abstract the exotic data collection so that OpenNMS just fetches a web page with the data to be monitored and doesn’t care about the details of how those data were collected. This abstraction will help even if you change to another graphing package such as Cacti.
  • Recording serial numbers from devices: “SNMP asset adapter”. You’ll need to add configuration to map data from say, Cisco entPhysical, to the various asset fields like “description” or “serialNumber”.  Unfortunately there’s no good way to clearly document serial numbers/hardware information of installed modules in say a Cisco 6500, Cisco CSS or a Juniper router.
  • Automatic L2 topology discovery with Linkd. This blew my mind at how well it worked out of the box (it’s disabled by default). By default it collects CDP, bridge information and route tables from hosts, then provides a hypertext list of what’s connected to what on the web interface. I’m sticking to only using CDP collection on my switches, as I have hundreds of VLANs and thousands of routes, and I’m afraid of what this will do to switch CPU.
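
For the BGP item above, each peer gets its own poller entry, roughly like this. This is a sketch based on the wiki’s “BGP Session Monitor” approach; the peer address is a placeholder and the attribute set is approximated from memory, not copied from a working config:

    <!-- poller-configuration.xml: one service entry per BGP peer (hence the external script) -->
    <service name="BGP-Peer-192.0.2.1" interval="300000" user-defined="false" status="on">
      <parameter key="retry" value="2"/>
      <parameter key="timeout" value="3000"/>
      <parameter key="port" value="161"/>
      <parameter key="bgpPeerIp" value="192.0.2.1"/>
    </service>

    <monitor service="BGP-Peer-192.0.2.1"
             class-name="org.opennms.netmgt.poller.monitors.BgpSessionMonitor"/>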

Tips & ideas:

  • Interface status: Sometimes we just don’t care about interface status, or we need temporary exclusions (e.g. new turn-ups, testing, laptop ports) and don’t want to be constantly annoyed by alerts. Edit snmp-interface-poller-configuration.xml and change the interface criteria to include “and snmpifalias not like ‘%IgnoreStatus%’” (attentive users will notice this is really SQL); there’s a rough sketch after these tips. Now when OpenNMS sees “IgnoreStatus” in the description, it’ll ignore that interface.
  • Interface status #2: Sometimes we only care about port-channels and uplinks on switches, and don’t care about ports facing servers. Using the same logic as above, you can configure the poller to only monitor physical interfaces whose description matches “Uplink” or “Port-channel”, ignoring everything else.
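
Here’s roughly what the first tip looks like in the config. A sketch only; the interface attributes are approximated from the stock snmp-interface-poller-configuration.xml, and the criteria string is the interesting part:

    <!-- snmp-interface-poller-configuration.xml: only poll Ethernet interfaces whose
         ifAlias doesn't contain the magic IgnoreStatus keyword -->
    <interface name="Ethernet"
               criteria="snmpiftype = 6 and snmpifalias not like '%IgnoreStatus%'"
               interval="300000" status="on"/>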

Unanswered questions:

  • Does SNMP4J or OpenNMS try to aggregate collection of OIDs? Is it possible to exclude OIDs? In my SNMP data collection config, I’m trying to fetch hrSystemUptime, hrSystemUsers and hrSystemProcesses. Either OpenNMS or SNMP4J (I think the latter) tries to be smart, and that causes it to fetch the entire hrSystem table. This is a problem because I have some systems where fetching the whole hrSystem (1.3.6.1.2.1.25.1) tree causes the agent to stop responding to requests, consequently generating repeated SNMP timeouts. This is very reproducible, and masking the extra OIDs in Net-SNMP doesn’t seem to help; upgrading Net-SNMP hasn’t helped either. While Net-SNMP is probably the root cause, there’s no easy way that I can see to make OpenNMS stop trying to collect so much data. Surely there are enough buggy agents out there that people have solved this.
  • Why does OpenNMS generate duplicate interface graphs on new nodes?  Upon addition of a new node, it creates directories for RRD/JRD files using the interface name such as “Gi3/13”. Later, after a service scan happens, it apparently figures out the MAC address and creates a new directory using the interface name + MAC address such as “Gi3/13-0e63cafe153”, orphaning the old interface data. Why can’t it do this first?

I’ve been neck deep fiddling with this stuff and it almost feels like I could write a book about it.

Redmond weather

The changing season and weather here have been a trip. For the first month or two here, it was partly cloudy, but there was still quite a bit of sun. Even when it was sunny, though, being outside always felt weird. It took me a while to realize that because the sun is so low on the horizon, long shadows were being cast off of everything, even at noon. This made it feel like it was much later than it really was.

After daylight saving time ended, the long darkness started setting in. The sun goes down behind the trees around 4:30 PM. The past week and a half has been very cloudy, with thick clouds at a low altitude. I realize now why people say the sun comes out at the end of the day: it’s finally low enough on the horizon to shine under the cloud layer. This usually lasts 15-30 minutes before it sets. There’s still a month of shorter days ahead. I went out to Target and bought several more lamps to keep the apartment bright.

I had the obvious, expected reaction to the clouds: I started feeling gloomy when I couldn’t even tell where the sun was for a couple of days. What I wasn’t expecting was how giddy I got when I could see the sun again. The same goes for seeing the moon + stars. It’s fairly dark in Redmond, there are no excessive streetlamps, and the clouds blanket the sky completely. The other night I walked to Trader Joe’s and noticed the sky was clear. I could see the moon and quite a few stars. The sky felt like a dark version of Montana’s big sky, just so expansive!

Today it finally snowed! I was kind of disappointed it started off sunny, then got cloudy. I wasn’t expecting it to snow, but it did, and it made me terribly excited. I drew all the blinds and sat in my office with my feet propped up on the windowsill watching it snow. After a while the flakes got bigger, so I used TJ’s as an excuse to go walk around in it. That was about 30 minutes ago; it’s since stopped snowing. :(

I need to plan my sun retreats. I learned today that the passes may be closed to non-4x4s and vehicles without chains, so that limits my options to north/south.

I got inspired this week to fiddle with OpenNMS. Reading the website makes me think it does * and has great potential. Looking at the modular design and the event system, it just “feels right”. It’s Java-based with all-XML configs, which really doesn’t bother me. I like that it has built-in syslog and SNMP trap collectors that automatically associate all active/passive events with the host/node.

Getting started was deceptively simple. There’s a step-by-step guide to installing PostgreSQL and the OpenNMS RPMs; I think I had it running within 30 minutes. I gave it my home subnet to scan, and it did a great job of providing sensible resource monitoring (with clear, attractive graphs) of discovered things out of the box, including OIDs I didn’t know existed. I don’t like network-wide scanning, but I really love the node-based service discovery mechanism: it tries to find every monitorable OID in a list based on the sysObjectId. It also figured out when a system had multiple L3 interfaces and did ping/SSH response time monitoring to all of its public IP addresses. Every single thing it monitored, it graphed automatically, which is something Nagios and other monitoring systems annoyingly fail to do. I’ve recently been evaluating Science Logic’s EM7 and Nimsoft’s solutions, and neither lured me in to get real things done like OpenNMS did.

I’m really disappointed it doesn’t have any IPv6 support. I figured that since it’s supposed to be a grown-up, modern product, this would already be there. I’m not even sure I could hack in simple IPv6 ping monitoring if I wanted to. There aren’t many people talking about support, but looking at the changelogs plus the single wiki page, support is gradually being added. Now that I’m more familiar with the product, I see that v4 addresses are all over the place and full support will take real work to pull off.

Immediately after it was installed, it felt like they gave me enough rope to hang myself. It was definitely working, but the web interface seemed like it wasn’t the whole story (it’s not). Coming from a Nagios environment, the normal sorts of things I’m used to looking at weren’t there or weren’t immediately obvious. How do I know when a down device will be polled next? How can I force a re-check? How can I ignore a check? How can I see all checks on one screen? I basically had to spend five hours one night forcing myself to read over all the configuration how-tos, even though a lot of the concepts were foreign to me; only then was I able to start constructing a mental model.

Eventually I got comfortable enough with the design and configuration and started looking for individual things to expand so that it mirrored my Nagios functionality. I started taking new things one at a time, Googling for each one. I was pleasantly surprised to find detailed answers for everything I’ve wanted so far (most of them on their wiki):

  • How do I monitor every interface (L2 & L3) on a system? Answered on the FAQ.
  • How do I add an Uptime graph to my PIX/ASA firewall? Add a sysUpTime mibObj under cisco-pix in datacollection-config.xml and add a section to snmp-graph.properties.xml (a rough sketch of the mibObj follows this list). That was it. It just started working.
  • How do I monitor a running MySQL process?  Here’s a how-to. I kinda wish it would connect directly to the MySQL daemon, but rigging up a web-based stat page is probably a good idea.
  • How do I add UDP/TCP connection rate monitoring to my PIX/ASA firewalls?  Easy as copying from another similar item in datacollection-config.xml and snmp-graph.properties.xml.
  • How do I automatically populate the host asset database via SNMP? Here’s another how-to. This is going to take some work, but doesn’t look bad.
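
As a rough idea of what the PIX/ASA uptime change amounted to, the mibObj half is a one-liner. This is a sketch from memory of the stock datacollection-config.xml format rather than my exact entry, and the matching graph definition in snmp-graph.properties.xml is omitted:

    <!-- datacollection-config.xml, inside the group collected for cisco-pix -->
    <mibObj oid=".1.3.6.1.2.1.1.3" instance="0" alias="sysUpTime" type="timeticks"/>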

There are still several things to do before it mirrors what I’ve already implemented in Nagios. One thing I do is change monitoring based on the OS of the device; some things like connection rates on PIX/ASA aren’t available until 7.x code. I realized last night that it’s OK to have an OID in a template that’s not supported by the device, such as cufwConnRate1udp on PIX 6.3(5): OpenNMS will figure out on re-scan that the OID doesn’t exist and never try to poll it again. Another thing I want to do is change connection limit thresholds based on the type of PIX/ASA license, but I don’t know how I’d do that yet.

Troubleshooting Java is annoying. My first time installing it, I had pages and pages of exceptions and JDBC errors saying it couldn’t connect to the database. It turns out localhost in /etc/hosts was pointing to ::1, so either JDBC didn’t support IPv6 or there was some missing argument. Pointing it back at 127.0.0.1 got things working, but that one was truly annoying to figure out. Last night I was trying to add my ASA connection rate stuff and got 593 lines (wtf!?) of runtime exceptions; the “can’t find datasource” error gave no hint that my datasource name was simply too many characters long. I wanted to dump the jRobin RRD files to check them, but the jRobin inspector is X-based and I didn’t immediately find a way to do it from a shell.

The RANCID integration is pretty hot. I like having device configuration details automatically available from the OpenNMS web interface. It took me longer to configure a sample RANCID + viewvc setup (similar to cvsweb) than it did to actually turn on the support in OpenNMS. The notification patch for rancid is a nice touch; with it, rancid will tell OpenNMS when a config refresh has happened. From the howto, I’m led to believe that somehow it can manage the .cloginrc password file, but I don’t see how/where this happens. It may not be implemented yet.

So far I really enjoy working with it. It’s taken literally seconds to add new things to check, things that would’ve taken hours of baling wire and perl to add to my Nagios install. It’s such a great relief that interface/service discovery just works; my RTG targetmaker setup is atrocious, takes hours to run, and is not the way to do things. I could easily see this replacing the NMS I hacked together. I haven’t dived into notifications yet, so I’m not sure what’s involved there.

Offhand, here’s my to-do and to-figure-out list:

  • Graphing of packet loss when doing interface pinging. It sounds like StrafePing monitoring is what I want, which is sort of like Smokeping. The docs warn of extra load; I haven’t checked into this.
  • Asset field population from SNMP, including modules/PICs from things like Cisco 6500s, Juniper systems, Cisco CSSes
  • ifStatus monitoring of up/down interfaces
  • BGP peer state monitoring
  • Configure e-mail notifications
  • A more sensible graph report structure, i.e. let me click to drill down to more detail about interfaces, rather than showing me octets/packets/errors/discards all together (this quickly becomes unmanageable on switches)
  • Fetching data over SSH (e.g. log in, run a command, parse counters out of the result, store & monitor the data) or even XML. I need to be able to gather tech-support statistics from Cisco CSMs, which are only available via the CLI.
  • SLB serverfarm/virtualserver stats gathering. Some support for CSS already, but need CSM, Netscaler, and ZXTM too.

It’s interesting that there’s not a lot of blog chatter about it like there is for Nagios. The bulk of the good information is on the OpenNMS wiki; there isn’t much good random info out there. I’m guessing OpenNMS is overkill for a lot of people, and enterprises do their own in-house development to plug into the product. Since it’s all modular and Java, I can easily see it being deeply integrated with other systems and/or doing some advanced things. Seeing as how writing my own NMS made me very fluent in perl, working with OpenNMS may finally motivate me to dive in and learn Java.
