Community string corruption during topology poll

Started by rgkordia, April 22, 2015, 10:13:12 PM

rgkordia

Hi,

I've been using 1.2.17 for a number of months now - fantastic product and generally working very well.

One issue - I notice that we get a lot of SNMP AUTH events from our Cisco 6509 switches every so often. Data collects fine via SNMP for these devices, so I did a packet capture and found that some SNMP get-next-requests for OID 1.3.6.1.2.1.17.4.3.1.1 are using an incorrect community. NetXMS is sending our normal community, but suffixed with some extra characters.

eg: If our community were "public" the above mentioned requests would use a community of "public@301" on some requests, "public@302" on others.

It seems this happens on a schedule and can also be triggered by doing a topology poll. The Layer 2 switch forwarding database for these devices is blank in NetXMS.

This only happens on our 6509 switches. Our other Cisco switches are fine and show valid Layer 2 FDBs.

A bug or config issue?

Thanks,
Richard

Victor Kirhenshtein

Hi!

Yes, it is part of the topology poll. It's actually a non-standard Cisco feature to provide FDBs per VLAN - you should pass the community string suffixed by the @ character and the VLAN ID to get the FDB for that VLAN (or set the context to "vlan-<id>" for SNMP v3) (Cisco example for reading FDB: http://www.cisco.com/c/en/us/support/docs/ip/simple-network-management-protocol-snmp/44800-mactoport44800.html). I'm not completely sure which Cisco devices require this to get the full FDB and which do not. It seems that we'll have to create a separate driver for the 6500 series - can you figure out what the correct way is to read the full FDB from them?
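As a rough illustration of the convention described above (not NetXMS code), the per-VLAN community and SNMP v3 context can be built as in the Python sketch below. The function names are hypothetical and chosen only for this example; the OID is the one from the packet capture (dot1dTpFdbAddress in the BRIDGE-MIB).

```python
# Sketch of Cisco per-VLAN FDB addressing as described in this thread.
# Function names are hypothetical; this is not how NetXMS implements it.

# OID seen in the packet capture (dot1dTpFdbAddress, BRIDGE-MIB)
FDB_ADDRESS_OID = "1.3.6.1.2.1.17.4.3.1.1"


def vlan_community(base_community: str, vlan_id: int) -> str:
    """SNMP v1/v2c: base community suffixed with '@' and the VLAN ID."""
    return f"{base_community}@{vlan_id}"


def vlan_context(vlan_id: int) -> str:
    """SNMP v3: context name of the form 'vlan-<id>'."""
    return f"vlan-{vlan_id}"


def fdb_walk_plan(base_community: str, vlan_ids: list[int]) -> list[tuple[str, str]]:
    """List the (community, OID) pairs a per-VLAN FDB read would use."""
    return [(vlan_community(base_community, v), FDB_ADDRESS_OID) for v in vlan_ids]


if __name__ == "__main__":
    for community, oid in fdb_walk_plan("public", [301, 302]):
        print(community, oid)
```

With a base community of "public" and VLANs 301 and 302, this yields the same "public@301" and "public@302" strings that showed up in the capture.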

Best regards,
Victor

rgkordia

Hi Victor,

After some investigation I suspect the issue is that my community already contains an @ sign. I'm trying to test this theory, and I've changed the SNMP community on the device and in NetXMS. It seems, however, that the old community string is persisting in NetXMS. The GUI will show the new string for a while, then revert back to the old one. My feeling is that the topology poll is somehow caching the old community and overwriting the new community (although I've not thoroughly validated this). My new community works fine, but when I do a topology poll I see (from Wireshark) that it uses the old community, and the device properties then show the old community again.
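To show why an @ sign in the base community could be a problem, here is a small Python sketch of the ambiguity. It is purely illustrative: the example community is made up, and how the 6509 actually splits the string has not been confirmed in this thread.

```python
# Illustration only: a base community that already contains '@' makes the
# "community@vlan" form ambiguous. Whether the device splits at the first
# or the last '@' is an assumption, not something verified here.

def split_at_first(indexed: str) -> tuple[str, str]:
    """Treat everything after the first '@' as the VLAN suffix."""
    base, _, suffix = indexed.partition("@")
    return base, suffix


def split_at_last(indexed: str) -> tuple[str, str]:
    """Treat everything after the last '@' as the VLAN suffix."""
    base, _, suffix = indexed.rpartition("@")
    return base, suffix


if __name__ == "__main__":
    indexed = "my@community" + "@301"  # hypothetical base community plus VLAN 301
    print(split_at_first(indexed))  # base would be read as "my"
    print(split_at_last(indexed))   # base would be read as "my@community"
```

If the device splits at the first @, it would compare only "my" against its configured communities and reject the request, which would match the auth failures seen during the topology poll.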

I've restarted the server and agent, but the problem still persists.  How can I reliably change the community for this device without having to recreate the node?

Rich

Victor Kirhenshtein

Hi,

It actually sounds really weird - NetXMS should not cache a non-working community. How did you change it - in the communication properties of the node, or by changing the list of default communities? Can it be that the device responds to the old community as well as the new one, and you have the old community in the default community list?

Best regards,
Victor

rgkordia

Hi Victor,

Yes, I changed it in communication properties. The default community was left alone (I have a number of other nodes using the default and any new ones will use the default). I disabled all the old communities on the device and changed to the new one. In fact, when NetXMS attempts to use the old one it gets no response and the DCIs all move to an inactive state. The sequence of events seems to be as follows:

1. Change community on router
2. Change community in communication settings
3. Everything works fine
4. After a while (30+ minutes maybe), the communications properties show the old community, but polling still works fine

At this point, I tried doing some polls:

5a. Topology poll partially works. The VLAN list is retrieved but the next query fails. I see auth errors, so I assume the old community is partially being used
5b. Interface names poll works. After I did this, the communications properties showed the new community (topology poll still fails though)
5c. Configuration poll succeeds.
5d. Status poll succeeds.

I then left it for another while (30+ mins) and the old community is showing in the communications properties.

It may be that the right community is stored in the DB, but I opened the properties at some point and clicked OK, saving the old community and breaking things.  So somewhere the old community is being stored.

Interestingly, I made two community changes to a device and the same thing occurred, except it kept reverting to the previous value.  This at least showed that it wasn't reverting to the default community, but in fact the previous value of community.

Rich


paul

I was wondering what the outcome of this is as I somehow seem to have a similar problem.

Network discovery = off.
Add device if trap received = on

About 300 Cisco routers added - SNMP is working.

Every 30 minutes (topology polling interval) - auth failure trap received.

Disable topology polling - alerts go away.

I have not touched these devices, but my list of communities contains at least one that does not work. Why would topology polling not be working, and when it does not receive a correct response, why does it not trigger an event for topology polling failure?

More importantly though - having not set anything on these auto discovered devices - how did I break it and how do I fix it?