
I learned a hard lesson in PostgreSQL administration, in particular about what happens when tables aren’t vacuumed. I come from a MySQL background and hadn’t fully learned what it takes to keep Postgres humming along. I had recently moved the OpenNMS database to another host in an attempt to throw more spindles at a UI slowness issue: almost everything worked well except the events page, which would time out while trying to render. The move helped for a while, but over the holidays the database started getting slow again.

Looking at my resource graphs for the database server over the previous week, I saw that the amount of CPU time spent in I/O wait had grown considerably until the box eventually stayed pegged, doing nothing but read operations from the disks. This didn’t seem “right” since I hadn’t added any new switches and the configuration was the same. Since this was some random box with 2 GB of RAM and I had been rather generous with the memory allocations to PostgreSQL, I thought maybe I was exhausting the Linux buffer cache, which was hovering around 6700 KB. Tuning down Postgres’ memory footprint didn’t affect the buffer cache, though, and I was still doing a ton of read ops off disk.

I fixed the pgsql stats collector (more on that in the lessons learned below) and went hunting for what OpenNMS was doing to generate so many reads. It turned out that an UPDATE to the snmpinterface table was consistently taking several seconds to execute. This is where I learned that PostgreSQL tables need to be vacuumed frequently, especially tables that see a lot of UPDATEs. Even after stopping OpenNMS and letting the box do its thing, it took several hours for a ‘VACUUM VERBOSE snmpinterface’ to finish. And even though I had autovacuum turned on, it wasn’t actually doing anything because “stats_row_level” wasn’t set to on, so the tables were never being vacuumed!
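
For reference, here’s a rough sketch of the relevant postgresql.conf settings on the 8.x release I was running (parameter names and defaults vary between versions, so treat this as an illustration rather than a recipe):

    # postgresql.conf -- autovacuum depends on row-level statistics being collected
    stats_start_collector = on   # run the statistics collector process
    stats_row_level = on         # collect per-row stats; without this, autovacuum has nothing to go on
    autovacuum = on              # let the autovacuum daemon clean up dead rows

After changing these, the server has to reread its configuration, and any tables that have already bloated still need a manual VACUUM like the one above.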

While that was running, I read the event maintenance documentation and checked out the events table, which had become bloated too. I update the provisioning group hourly, and I had hundreds of thousands of ‘nodeScanCompleted’ and ‘reinitializePrimarySnmpInterface’ UEIs sitting in my events table. Since I really don’t care about these events, I deleted them and added a statement to vacuumd-configuration.xml to purge them after 2 days (sketched below). This made a gigantic improvement in displaying events in the web UI; namely, it actually worked again.
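
Roughly what that looks like (the DELETE below is a sketch: the exact UEI strings, column comparisons, and retention window should be checked against your own events table and the existing entries in vacuumd-configuration.xml):

    <!-- vacuumd-configuration.xml: purge noisy provisioning events older than 2 days -->
    <statement>
      DELETE FROM events
       WHERE (eventuei LIKE '%nodeScanCompleted'
              OR eventuei LIKE '%reinitializePrimarySnmpInterface')
         AND eventtime &lt; now() - interval '2 days';
    </statement>

Note the &lt; escaping on the comparison, since the statement lives inside an XML document.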

Between cleaning up both tables and fixing vacuuming, the excessive read operations completely vanished. From what I can tell, because of the way PostgreSQL’s MVCC works, every UPDATE leaves the old row version behind as a dead tuple, so each new update had to wade through all of those dead rows on disk first, and over time it became a vicious feedback loop.
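
A quick way to eyeball that kind of bloat is to compare a table’s page count against its estimated live row count in pg_class (reltuples is only an estimate, refreshed by VACUUM and ANALYZE, so this is a rough indicator rather than an exact measurement):

    -- a page count wildly out of proportion to the row count suggests dead-tuple bloat
    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname IN ('snmpinterface', 'events');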

Lessons learned:

  • sar -d 1 300 is your friend.  If “rd_sec/s” stays around zero, your OS filesystem cache is adequate and the box isn’t having to go to disk frequently for hot blocks. Finally seeing how Linux VM management and the buffer cache work in practice was a great experience; it’s something one of my old bosses tried to teach me a long time ago, and it never really sank in until now.
  • PostgreSQL’s stats collection talks to a random high-numbered port on 127.0.0.1.  If you have deny-all, permit-specific firewall rules like I do, make sure you accommodate this with “-A INPUT -i lo -j ACCEPT” or something similar. Otherwise, the pg_stat tables are simply empty and there’s no indication why (see the query sketched after this list for a quick way to confirm they’re being populated).
  • The I/O characteristics of the database and the RRD storage are complete opposites.  The RRD filesystem deals with tons of tiny write operations, while the database server does a modest amount of reading of large chunks of data off disk (table scans?) with relatively few writes.
  • Keep the events table trimmed; apparently there’s a lot of event correlation going on in the background, and extra rows can really slow it down.
  • Check pg_log/postgres-X.log to make sure autovacuum is actually working.
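
Tying the last two points together, a minimal sanity check from psql might look like this; if the stats collector is being blocked by the firewall, these counters sit at zero, and if vacuuming is keeping up, the update-heavy tables show steady activity (column names are from the 8.x catalogs I was running, so adjust for your version):

    -- per-table activity as seen by the stats collector; all zeros usually means nothing is being collected
    SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
      FROM pg_stat_user_tables
     WHERE relname IN ('snmpinterface', 'events');

I believe 8.2 and later also expose last_vacuum and last_autovacuum columns in pg_stat_user_tables, which show directly when each table was last cleaned up.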

Now to see what the next learning experience is! If I can figure out what’s causing all the “storeResult: org.opennms.netmgt.snmp.SnmpResult” lines to be written to my output.log, I’ll be happy. (Changing collectd logging to FATAL hasn’t seemed to help.)

2 Responses to “Hard lessons learned about OpenNMS database maintenance”

  1. pcdbh says:

    After reading your post I ran a few queries on my OpenNMS and found out the following:
    eventuei                                                           |   count
    -------------------------------------------------------------------+---------
    uei.opennms.org/vendor/Cisco/traps/ccmGatewayFailed                | 2457394
    uei.opennms.org/IETF/ENTITY/traps/entConfigChange                  |  860872
    uei.opennms.org/translator/traps/SNMP_Link_Down                    |  273819
    uei.opennms.org/translator/traps/SNMP_Link_Up                      |  273727
    uei.opennms.org/IETF/DialControl/traps/dialCtlPeerCallSetup        |  223817
    uei.opennms.org/IETF/DialControl/traps/dialCtlPeerCallInformation  |  223802

  2. IP Gnome says:

    Having a similar problem with OpenNMS. My stats are custom and no longer showing. Currently running a vacuum on the db to see if that helps.

    Thank you for the tip.
