OpenNMS Experience Redux

I’m doing my second greenfield OpenNMS deployment, and so far it’s going a lot smoother than last time. This time it’s server-centric which is a whole new can of worms to figure out. I don’t yet know how feasible it will be to monitor services like Exchange or SQL Server.

My last setup never really got off the runway because I didn’t have enough random write I/O bandwidth to do massive graphing on all of my switch ports and I left before I could beg for new hardware. Since then I’ve learned there are just flat out physical limits to disks: you can only do so many things per second. Even SSDs have absolute limits on how many operations/second they can do. The way to win is to persist everything to a giant ramdisk, then copy all that to your real disks every X minutes, where X is however much data you feel safe living without. All of a sudden those thousands of random 8k writes to JRB files every second turn into one giant sequential write to disk. It copies surprisingly quickly. It’s not that far-fetched: you can easily put 64 GB of RAM in a server for less money than buying a new 16 drive system, and it’ll give you hella performance over safety.
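Here is a rough sketch of that flush step in Python. It is not how OpenNMS does anything; the function name and both directory arguments are made up for illustration, and it assumes the ramdisk is already mounted and the destination sits on a single real filesystem.

```python
import os
import shutil
import tempfile


def flush_snapshot(ram_dir, disk_dir):
    """Copy the whole tree under ram_dir to disk_dir in one bulk pass.

    Hypothetical helper: the copy goes to a staging directory first, so a
    failed copy never leaves a half-written tree at disk_dir.
    """
    disk_dir = os.path.abspath(disk_dir)
    parent = os.path.dirname(disk_dir)
    os.makedirs(parent, exist_ok=True)

    # Stage next to the destination so the renames stay on one filesystem.
    staging = tempfile.mkdtemp(prefix=".flush-", dir=parent)
    try:
        fresh = os.path.join(staging, "fresh")
        shutil.copytree(ram_dir, fresh)  # the one big, mostly sequential write

        # Swap with two renames. There is a brief window between them where
        # disk_dir does not exist, which is fine for a sketch.
        if os.path.exists(disk_dir):
            os.rename(disk_dir, os.path.join(staging, "previous"))
        os.rename(fresh, disk_dir)
    finally:
        # Removes the previous snapshot and any partial copy.
        shutil.rmtree(staging, ignore_errors=True)
```

Run it from cron or a loop every X minutes. Whatever was written to the ramdisk since the last flush is what you lose if the box dies.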

Anyways, this time I’m going from 1.8.3 to 1.9.92, and it’s considerably better! Not only were two of my long standing gripes (adding relativetime to graphs and TableTracker writing to stderr) fixed, and a slew of bugs fixed, but it just feels like a much nicer piece of software now. I kept track of the release notes over the summer and it’s obvious there were a lot of good updates going in and progress being made. I’m sure there are many things that are fixed that weren’t completely obvious to me as being broken. Props to all those guys. One of the big things I wanted then was IPv6 support, but now I don’t have a v6 network to monitor. :(

A few things I’ve learned through experience (or that were obvious to other people and finally made sense to me):

– You’ll be happier and scale larger if you use ramdisks

– The services (e.g. ICMP, HTTP, SNMP) listed in the node ‘Availability’ table are polled completely independently of the data collection process (i.e. graphing). You can graph things not listed there, and you can monitor availability without graphing it. I’ve known this for a while but it seems like it’s lost on a lot of people still.

– Maybe I’m stupid but I was always led to believe there were two types of categories being used, one in categories.xml (the home page) and another in surveillance views. I think this was because of poor examples. The default categories configuration refers to things as rules instead of filters, and there’s nothing in there that leads you to believe you can do “categoryName == 'Servers'” and/or even “nodelabel like '%city%'”.

– If you want to ping/monitor external sites: create a new provisioning group, strip all the collectors down to ICMP and HTTP, and add a policy to categorize everything as ‘external-site-monitor’. Then set up a poller package to enable StrafePing for “<filter><![CDATA[(IPADDR != '0.0.0.0') & (categoryName == 'external-site-monitor') ]]></filter>”. You’re a good citizen for not repeatedly over-discovery-scanning some poor website and your NOC peeps will understand from the category why amazon.com is in the node list.

– Poller packages are additive, it doesn’t have to be all-or-nothing. If an IP address or category matches more than one poller package, both get applied. This was another fascinating realization (by accident of course) because one of my big peeves was figuring out an easy way to get really custom polling without cut-n-pasting huge swaths of configuration.

– Automatically adding categories to new nodes via a foreign source policy is the way to go. You can rig it up so that if you have different provisioning groups per city/site, everything can be tagged with a site category by default without having to select from a drop-down. Then, based on the hostname/node label, add more categories to indicate its role, etc. The OpenNMS Provisioning PDF is an outstanding reference for doing policies, it is much clearer and has better examples than the provisiond wiki page.

– Tip: when creating a foreign source policy and you want to use a regex, prepend the expression with a ~ (tilde), e.g. “~.*” or “~^hostname\d+.*”. I don’t think this was explained in any documentation and I happened to stumble upon it on a random mailing list. Otherwise your stuff will not match and you won’t know why!

– Integrating the web UI to auth/authorize off of Active Directory via LDAP looks hairy and makes me want to drink more before trying it. It looks like there were a few examples thrown into the wiki but who knows if they work, they certainly don’t look simple and easy. A simple “hello world” example just to prove the thing works in a basic capacity would be nice.

– When all else fails, think how Java would think.
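Two of the points above (the tilde prefix and additive poller packages) are easier to see as a toy model. This is a Python sketch of the behavior as I understand it, not OpenNMS code: the function names and the package dictionary are invented, and it assumes a tilde expression has to match the whole value while anything without a tilde is compared literally.

```python
import re


def policy_matches(expression, value):
    """Model of a policy match: a leading ~ means 'treat the rest as a regex'."""
    if expression.startswith("~"):
        # Assumption: the regex must cover the whole value, which is why the
        # examples above end in ".*".
        return re.fullmatch(expression[1:], value) is not None
    # No tilde: plain string comparison, so regex syntax silently never matches.
    return expression == value


def packages_for(node_categories, packages):
    """Return every package whose categories overlap the node's (additive)."""
    node_categories = set(node_categories)
    return [name for name, wanted in packages.items() if wanted & node_categories]


# Hypothetical poller packages keyed by name, each listing the categories it selects.
PACKAGES = {
    "default": {"Servers", "external-site-monitor"},
    "strafeping": {"external-site-monitor"},
}

print(policy_matches("~^hostname\\d+.*", "hostname12.example.com"))
print(policy_matches("^hostname\\d+.*", "hostname12.example.com"))
print(packages_for({"external-site-monitor"}, PACKAGES))
```

The second call is the “won’t match and you won’t know why” case: the same expression without the tilde is just a string that no hostname equals.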

All this being said, realize it’s definitely not an application you casually install with “yum install opennms” and walk away. It takes quite a bit of knowledge to get it working and thought on how you want to organize things to benefit both operational and business-level needs. Even then, it seems to take several iterations of “I want to do this” to “the OpenNMS way is impossible or has to be done this way” and then “okay, redesigned the need with that in mind, try it this way”. Otherwise you’ll be doing hacky complicated things that get out of hand and are impossible to manage.

My new big gripe is that OpenNMS as a concept has serious shortcomings when it comes to monitoring disk and storage resources. Maybe the big kids are monitoring 50,000 cable modems (which it works really well for) plus a token web farm and there just isn’t much need for an extensive storage interface. I mean I can certainly graph space and disk I/O, but I would really love to look at a node’s page and be able to see at a glance a short numerical list of space used/free on all my partitions and individual RAID statuses. Some of us have 50,000 hard disks and exibibabblebytes to monitor instead! Same with BGP peering, I’d love to have been able to look at the router’s node page and see at a glance how many of my BGP sessions were up/down (peering routers easily had hundreds). Server load balancing too, now that I think about it more. You can improvise custom Availability checks for a few one-offs, but they’re usually static to one device’s configuration. Even if I was motivated to take a crack at writing my own, it doesn’t seem like a place with any hooks to do something custom.

I don’t think that OpenNMS itself is the problem, but just that all the tools for getting data about storage and RAIDs suck balls. Just this week I learned that there’s a patch in net-snmp 5.3 on CentOS 5.7 that lets it support filesystems > 4 TB, but that patch is not in net-snmp 5.4 on OpenIndiana. SNMP access to PERC status? ha ha. ZFS metrics via SNMP? ha ha ha. NFS service poller? doesn’t exist. iSCSI or FCoE ping? what what

The only way to get around this is to rig up some middleware solution to do custom aggregation of this information and then present it in a standardized way to OpenNMS. Even then, there still needs to be an interface (like IP addresses/interfaces) to show it.

The other long standing gripe is the need for an easy way to aggregate a bunch of performance data from multiple nodes into one large graph. Suppose I have a datacenter full of power strips and I want to have a graph that shows total power consumption, or multiple transit links to the same provider, or total disk throughput across many disks. KSC reports can’t do this (that I know of) and the alternative is rigging up a virtual node and symlinking things all over the place then creating a custom graph report. blah.
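The aggregation itself is not the hard part, which is what makes this annoying. A minimal Python sketch, assuming you have already pulled each node’s samples out as (timestamp, value) pairs; the function name and the input shape are made up, and nothing here reads JRB files or talks to OpenNMS.

```python
from collections import defaultdict


def sum_series(series_by_node):
    """Add up per-node samples into one total series.

    series_by_node maps a node name to a list of (timestamp, value) pairs.
    Only timestamps reported by every node are kept, so one late power strip
    does not show up as a fake dip in the total.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for samples in series_by_node.values():
        for timestamp, value in samples:
            totals[timestamp] += value
            counts[timestamp] += 1

    node_count = len(series_by_node)
    return [(ts, totals[ts]) for ts in sorted(totals) if counts[ts] == node_count]
```

Dropping incomplete timestamps is a design choice; interpolating or carrying the last value forward would be the other way to go.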

The last time I looked at thresholding, alarms and notifications it was so complicated and dreary it made me want to drink. It’s most certainly not as easy as throwing a 15 line file in /etc/nagios/objects. In fact, I’m keeping Nagios around for exactly that reason. Here’s to hoping it makes more sense this time!

# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool    136G  32.0G   104G    23%  1.00x  ONLINE  -
tank    47.1T  45.9T  1.26T    97%  1.00x  ONLINE  -
tank10  48.9T  48.1T   870G    98%  1.00x  ONLINE  -
tank11  48.9T  48.2T   783G    98%  1.00x  ONLINE  -
tank12  48.9T  48.0T   922G    98%  1.00x  ONLINE  -
tank13  48.9T  47.8T  1.12T    97%  1.00x  ONLINE  -
tank14  48.9T  47.8T  1.15T    97%  1.00x  ONLINE  -
tank15  48.9T  47.8T  1.15T    97%  1.00x  ONLINE  -
tank16  48.9T  48.2T   783G    98%  1.00x  ONLINE  -
tank17  48.9T  47.9T  1.01T    97%  1.00x  ONLINE  -
tank18  48.9T  48.2T   783G    98%  1.00x  ONLINE  -
tank19  48.9T  48.2T   783G    98%  1.00x  ONLINE  -
tank2   48.9T  47.7T  1.22T    97%  1.00x  ONLINE  -
tank20  48.9T  47.8T  1.10T    97%  1.00x  ONLINE  -
tank21  48.9T  46.8T  2.13T    95%  1.00x  ONLINE  -
tank22  48.9T   320K  48.9T     0%  1.00x  ONLINE  -
tank3   48.9T  47.6T  1.38T    97%  1.00x  ONLINE  -
tank4   48.9T  47.9T  1.07T    97%  1.00x  ONLINE  -
tank5   38.1T  37.4T   697G    98%  1.00x  ONLINE  -
tank6   48.9T  48.2T   783G    98%  1.00x  ONLINE  -
tank7   48.9T  47.3T  1.61T    96%  1.00x  ONLINE  -
tank8   48.9T  45.6T  3.32T    93%  1.00x  ONLINE  -
tank9   55.9T  54.6T  1.38T    97%  1.00x  ONLINE  -
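A listing like this is the kind of thing a middleware layer would have to chew on before handing anything to OpenNMS. Here is a small Python sketch that turns the text into records and picks out the pools that are nearly full. The function names and the 97% threshold are my own, and it assumes the column layout shown above with binary (1024-based) size suffixes.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40, "P": 2**50}


def parse_size(text):
    """Convert a size like '48.9T' or '320K' to bytes (assumes binary units)."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def parse_zpool_list(output):
    """Parse 'zpool list' text into a list of dicts, one per pool."""
    # Skip blank lines and a pasted shell prompt line such as '# zpool list'.
    lines = [l for l in output.splitlines() if l.strip() and not l.startswith("#")]
    header = lines[0].split()
    pools = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        pools.append({
            "name": row["NAME"],
            "size_bytes": parse_size(row["SIZE"]),
            "free_bytes": parse_size(row["FREE"]),
            "cap_percent": int(row["CAP"].rstrip("%")),
            "health": row["HEALTH"],
        })
    return pools


def nearly_full(pools, threshold=97):
    """Names of pools at or above the given capacity percentage."""
    return [p["name"] for p in pools if p["cap_percent"] >= threshold]
```

The sizes in this output are already rounded for humans, so the byte counts are approximate.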

LIFE magazine

Sunday I was wandering around Pike Place Market and wound up in the paper store. I was flipping through their racks of LIFE magazine and was fascinated by the headlines/covers of the issues from the 60s. There were several issues about space, Vietnam, McNamara, Nixon, and Johnson. It’s one thing to read so much about those, but even more awesome to actually hold something in hand. It was also odd to realize this was one of the primary ways that news and entertainment were spread then; now it’s all on the web.

Time Capsule

I just dug my old Dell desktop out of the closet to see if there’s anything on it I want to keep before tossing it.

Looks like it was last fired up in September 2006 and I was in ham mode, messing with MixW. It’s running Windows 2000, Firefox 1.5.0.7, QuickTime 7.0.4, Yahoo! Messenger 7.0.0.120, Flight Simulator 2002. Pentium III 1 GHz, 512 MB RAM, 40 GB hard drive, fun times!

I remember when Windows 2000 was cool, its soft Verdana font soothed my wounds from Windows 98. Now it looks all sorts of jagged and unrefined.

Felix

I have a new kitty in the family, 2 months old. His shelter name is Shaun, but I’m not sure if I’m keeping that. (Update 9/17: I’m calling him Felix) He’s very vocal, spent all night meowing like murder was happening. He’s a little better today, but still gets very lonely. I lay on the office floor with him until 2-3 AM to quiet him down. I forgot how much energy kittens have, when he plays he just goes all out plowing into everything. I was also surprised at how much to feed a kitten; 1 can per pound body weight per day, so that works out to basically 1 can, 3 times a day.

Patch is somewhat concerned by the noises coming from the office, but otherwise seems indifferent. Today I gave them a little introduction, holding Shaun in the same room while Patch looked on.

In the process of trying to get up to keep Shaun from running out of the office, I stubbed my little toe really hard on the bookcase. It smarted for a while and started swelling up. After dinner it was purple and blue on both sides. Today the swelling and discoloration spread farther, enough to make me go to urgent care. They took some x-rays and decided it wasn’t broken, just sprained. The doctor showed me how to splint it with tape to my other toes and said that’s all that should be done with it.

Last month I finally decided to re-join the modern age and get cable. I’ve been waiting for FiOS from Frontier ever since I moved in, and my apartment building was finally wired up for it. I was flabbergasted when they told me the install charge for TV+internet over FiOS was $500, but if I just wanted internet, there was no install fee. It turns out, according to the internet, Frontier is losing money left and right on channels and desperately wants to get out of the cable TV business.

I flipped over to comcast.com. I didn’t have a high opinion of them because of the data caps, but just wanted cable. I was able to put in my order within a few minutes without having to call. When I got my DVR/cable box and hooked it up, it didn’t work. I spent a while troubleshooting and got nowhere. I dreaded having to call and wait around for a tech to be dispatched or something (and I was traveling a lot), so I put it off. I put it off for a couple of weeks.

When I finally called and navigated through the IVR tree, it was a surprisingly quick “oh let me send a reset signal” and that’s all it took to make it come to life. Less than three minutes on the phone. After a couple of hours I still wasn’t getting all of my channels. I needed to call again, but was still dreading hassle and put it off.

When I called again at 1 AM this morning, I got a woman, apparently in Seattle, who was very friendly and sincere. She reset my cable box and walked me through verifying channel subscriptions. Then she credited me a whole month of service because it had never worked since install. Less than ten minutes too, not too bad!

You win this time Comcast, but I’m keeping my eye on you.

Scumbag PNW weather

This weekend it was 86-88 F outside (and inside), perfectly clear and sunny. Absolutely beautiful day, I felt fantastic! Seriously, I haven’t felt that happy and blissful in a while. Clear, warm night too, could see the stars from my balcony.

Today, it instantly turned .. October. Low hanging grey clouds, and everything is wet. It never really rained, but it’s just been a constant drizzle that’s soaked everything. At least it’s warm enough that I can have my door+windows open, but this weather is making me anxious and already dreading winter.

Stars

It’s awesome to sit on my balcony and be able to easily see the Big Dipper.

TIL about Tourette’s

While reading the chapter Witty Ticcy Ray about Tourette’s in Oliver Sacks’ book The Man Who Mistook His Wife for a Hat, I discovered an amazing list of things I didn’t know:

  • Tourette was a student of Jean-Martin Charcot, the founder of modern neurology
  • Freud was also a student of Charcot
  • Freud named one of his children Jean-Martin, after Charcot
  • Charcot was the first to describe ALS and multiple sclerosis
  • Oliver Sacks also wrote the book Awakenings, which a movie was based on
  • In the movie, Robert DeNiro played Leonard (I had forgotten about this)
  • There has been research into the grammar and linguistic structure of tics
  • L-Dopa is a dopamine-precursor, and Haldol is a dopamine antagonist

California again

On vacation this week in the Bay Area! Weather is FANTASTIC. I had to travel 320 miles south on the 5 before I reached clear skies on Monday. I realized on the way down that I had forgotten what Northern California looked like during the day. The last two times I came down (this year), it was winter so it was dark by the time I passed through. The only time I saw it in daylight was 2006 or 2007, when I was driving back from Seattle during the great epic AUS-KCI-BM-SEA-SF-LA-AUS roadtrip. It reminds me a lot of West Texas, especially how the sky gets layers of violet and blue at sunset.

Today was lunch in Golden Gate Park, then wandering around the de Young Museum. Picasso was on expo again, and it looked like there were some pieces there that I hadn’t seen in Seattle. Lots of cool stuff in the general exhibit halls, I especially like the cathedral made out of gun parts and bullets and the trompe l’oeil painting. I’m always amazed that not only did some man/woman imagine a piece of art, they sat down and actually created the thing.

I will have to buy some sunscreen for my outdoor wanderings. I’m not as tan as I was in Texas, so casual exposure can really burn me now. :(
