
March 22, 2011


Fixed: SNMP stops responding after policy update

It’s been a while since I’ve been truly excited about a service pack, but I definitely am when it comes to Service Pack 1 for Windows Server 2008 R2! For literally years now, I’ve watched SNMP misbehave on our Windows servers. Originally, we used ipMonitor (before SolarWinds purchased it), and then last year we moved up to SolarWinds Orion NPM. Love the graphs. Love the traffic stats. I get really frustrated, though, when servers just flake out and stop answering SNMP’s calls…

So, quite ironically, I finally opened a case with SolarWinds last week (a.k.a. five days before we deployed SP1 network-wide). Nothing popped out at us, so we started capturing traffic with Wireshark and Microsoft Network Monitor at various points. Then on Sunday, we pushed SP1, which, unbeknownst to us, includes the hotfix described in KB980259.

Yesterday, merely 12 hours after installing it, several servers started flagging in NPM as not responding to SNMP, and I decided to dig into the event logs, hoping to see something I might have missed before. The event below was there twice, and it coincided perfectly with the moment SNMP stopped responding. On other servers it showed up once or not at all, but the failing nodes had it twice…

Log Name: System
Source: SNMP
Date: 3/21/2011 1:41:22 PM
Event ID: 1500
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: node.domain.com
Description: The SNMP Service encountered an error while accessing the registry key SYSTEM\CurrentControlSet\Services\SNMP\Parameters\ExtensionAgents.
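
If you have a pile of exported events and want to pick out this one, a minimal Python sketch along these lines may help. It assumes each event is plain "Field: value" text like the one above; the names `parse_event` and `is_snmp_registry_error` are made up for illustration and are not part of any Microsoft or SolarWinds tool.

```python
import re

# Sample text copied from the event above (backslashes escaped for Python).
SAMPLE_EVENT = """Log Name: System
Source: SNMP
Date: 3/21/2011 1:41:22 PM
Event ID: 1500
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: node.domain.com
Description: The SNMP Service encountered an error while accessing the registry key SYSTEM\\CurrentControlSet\\Services\\SNMP\\Parameters\\ExtensionAgents.
"""


def parse_event(text):
    """Turn 'Field: value' lines into a dict, ignoring padding after the colon."""
    fields = {}
    for line in text.splitlines():
        # The field name is letters and spaces up to the first colon.
        match = re.match(r"^([A-Za-z ]+?):\s*(.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def is_snmp_registry_error(event):
    """True for the SNMP error event with ID 1500 described in this post."""
    return (
        event.get("Source") == "SNMP"
        and event.get("Event ID") == "1500"
        and event.get("Level") == "Error"
    )


if __name__ == "__main__":
    event = parse_event(SAMPLE_EVENT)
    if is_snmp_registry_error(event):
        print(event["Computer"], event["Date"])
```

Counting how many times this matches per server would reproduce the "twice on the failing nodes" observation, assuming your exports use the same layout.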

Interesting… After firing off this discovery to SW support before heading home for the day, they do some searching and come across Microsoft KBs 980259 and 972840, which cover a situation where SNMP stops responding after a Group Policy refresh. Our SNMP configuration is pushed via Group Policy and apparently after some refreshes, it can fail to find the registry keys that the policy deletes and recreates. And of course once it fails, it never tries again.
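
To make the "fails once, never tries again" problem concrete, here is a toy Python model of reading a key that is briefly missing. This is only an illustration of try-once versus retry; I have no idea how the SNMP service or the hotfix is actually implemented, and `read_with_retry` and `make_flaky_reader` are hypothetical names.

```python
import time


def read_with_retry(read_key, attempts=5, delay_seconds=0.5, sleep=time.sleep):
    """Call read_key() until it succeeds or the attempts run out.

    read_key is a hypothetical callable that raises FileNotFoundError
    while the key is missing (for example, in the middle of a refresh).
    """
    for attempt in range(1, attempts + 1):
        try:
            return read_key()
        except FileNotFoundError:
            if attempt == attempts:
                raise  # out of attempts: give up and surface the error
            sleep(delay_seconds)


def make_flaky_reader(failures_before_success):
    """Simulate a key that is missing for the first N reads, then present."""
    state = {"calls": 0}

    def read_key():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise FileNotFoundError("ExtensionAgents key not present yet")
        return ["agent-1"]  # placeholder value, not real SNMP config

    return read_key


if __name__ == "__main__":
    # A single attempt hits the gap and fails for good.
    try:
        read_with_retry(make_flaky_reader(2), attempts=1)
    except FileNotFoundError as error:
        print("single attempt failed:", error)

    # Retrying rides out the gap.
    print(read_with_retry(make_flaky_reader(2), attempts=5, delay_seconds=0))
```

With `attempts=1` the reader is left with nothing, which mirrors the servers that stayed silent until someone intervened; with a few retries the same simulated gap is harmless.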

Upon looking at the hotfixes, my coworker and I found that the R2 version was included in SP1, which explains why only non-R2 servers are breaking this week. The non-R2 fix will come in an as-yet-unannounced SP3 for Win2k8 and Vista (I’m not holding my breath).

Anyways, I just figured I’d share this with the world at large in case you, too, are having issues with SNMP and are either too busy to dig deeply (like we were) or simply haven’t come across the fixes. Enjoy!

4 Comments
  1. Nick Marchini
    Jun 8 2011

    We’ve just started to implement SNMP settings via Group Policy as well, so these KBs will now get pushed out as part of the next patching round.

    I’d be interested to know how you deleted the registry keys via Group Policy and then recreated them. We have a basic ADM file that works on 2008 (we still haven’t moved to a 2008 AD domain). If you can share the method here, or let me know where to go to study it, it would be appreciated.

    • Chris
      Jun 9 2011

      Hey Nick,

      The actual process of those reg keys getting deleted and recreated happens automatically when Group Policy (GP) applies the community and manager settings. GP sees that it has SNMP settings to apply, deletes the current keys (if present), and then creates new ones with the config in the policy. Since the default behavior (pre-hotfix) is to try only once, when it fails to apply (create the key), the service is left effectively unconfigured.

      Another thing I just realized in looking at the event I posted is that we have a Kerberos race condition that happens at 1:41pm on our network and can break network communication (like SNMP) between devices. You’ll notice that the above event happened at 1:41pm. Microsoft is aware of the issue (it happens on their network, too, just at a different time), and we’re working with them to find a solution to the Kerberos race. Anyways, just a little bonus info :). Enjoy!

  2. Robert
    Jun 6 2012

    Chris,

    We are having this issue as well, and the hotfixes don’t seem to be resolving it, so we are still troubleshooting. I am interested to know how you identified the Kerberos race issue and how you fixed it.

  3. Chris
    Jun 6 2012

    Hey Robert,

    Are you running W2k8R2 or R1/W2k3? Just checking since we never messed with the hotfixes on those non-R2 versions (it was our motivation to upgrade and in the meantime we left monitoring to SCOM/WMI).

    As for the Kerberos race condition, we noticed it because our Remote Desktop (RDP) sessions to servers would drop every day at 1:41pm. As we dug deeper, we saw other evidence of comms interruptions and were able to trace it back to AD/Kerberos (I’d have to dig back a ways to find the exact events, etc.). If you’re seeing those issues at a consistent time of day, let us know and we’ll relay our MS case info. We have a workaround in place that mostly mitigates the issue and moves it to a more convenient time of day (pre-business hours).

    Thanks,
    Chris
