After upgrade to 2.2.11 - Status of DCI changed to UNSUPPORTED

Started by Michal Hanajik, January 10, 2019, 03:23:35 PM


Michal Hanajik

Hello,

we recently upgraded to version 2.2.11, and since then we have been getting a massive number of warnings at night (around 1:30 AM) (see attached picture).
The devices are mainly UPSes, but there are servers as well: different hardware with different versions of Windows.

The weird thing is, we have three identical servers (same hardware and software), but this alarm comes from only one of them. We tried updating the agents to the most recent version, but that did not help.
All database upgrades, checks, and fixes were applied.

Any tips or hints on where to look for the problem or a solution?

Thank you!

Victor Kirhenshtein

Hi,

do those DCIs recover by themselves afterwards? Could you set the server debug level to 7 around that time and check for lines containing the text GetItemFromSNMP? Keep in mind that debug level 7 produces a huge amount of logging, so depending on your system size it may not be a good idea. Could it be that the housekeeper start time is set around 1:30?

Best regards,
Victor

Michal Hanajik

Hello, and sorry for the somewhat late answer. We have been trying to figure this out over the last few days.

It definitely has something to do with the housekeeper. It was set to 2 AM; when we changed it to a different time, all those warnings appeared around the new time.

After that I was able to easily back up the logs :)
Here are some of those SNMP errors from the log. If it would be more helpful, I can attach the whole log file.
We even updated to version 2.2.12, but this is still happening. If you need more information, let me know.


2019.01.18 11:02:43.357 *D* Node::connectToAgent(TLA-RPi-agent [1283]): already connected
2019.01.18 11:02:43.358 *D* [agent.conn.113     ] Sending message CMD_DATA_COLLECTION_CONFIG (1304) to agent at 87.197.117.89
2019.01.18 11:02:43.359 *D* [db.cpool           ] Handle 0x7f9bfae45e60 released
2019.01.18 11:02:43.359 *D* [db.cpool           ] Handle 0x7f9bfae46220 acquired (call from dbwrite.cpp:250)
2019.01.18 11:02:43.359 *D* [db.cpool           ] Handle 0x7f9bfae46040 released
2019.01.18 11:02:43.359 *D* [event.proc         ] Event 9218196 with code 53 passed event processing policy
2019.01.18 11:02:43.359 *D* [event.corr         ] CorrelateEvent: event SYS_DCI_UNSUPPORTED id 9218197 source GIM-CCTV-UPS1 [1365]
2019.01.18 11:02:43.359 *D* [event.corr         ] CorrelateEvent: finished, rootId=0
2019.01.18 11:02:43.359 *D* [event.proc         ] EVENT SYS_DCI_UNSUPPORTED [53] (ID:9218197 F:0x0001 S:2 TAG:"") FROM GIM-CCTV-UPS1: Status of DCI 9009 (SNMP: .1.3.6.1.4.1.935.1.1.1.3.2.1.0) changed to UNSUPPORTED
2019.01.18 11:02:43.359 *D* [event.policy       ] EPP: processing event 9218197
2019.01.18 11:02:43.359 *D* [event.policy       ] Event 9218197 match EPP rule 22
2019.01.18 11:02:43.359 *D* AlarmManager: adding new active alarm, current alarm count 18



2019.01.18 11:02:43.330 *D* [agent.conn.174     ] Sending message CMD_SNMP_REQUEST (43037) to agent at 212.55.237.190
2019.01.18 11:02:43.335 *D* [db.cpool           ] Handle 0x7f9bfae45aa0 released
2019.01.18 11:02:43.335 *D* [db.cpool           ] Handle 0x7f9bfae45e60 acquired (call from dbwrite.cpp:250)
2019.01.18 11:02:43.335 *D* [db.cpool           ] Handle 0x7f9bfae45c80 released
2019.01.18 11:02:43.335 *D* [event.proc         ] Event 9218195 with code 53 passed event processing policy
2019.01.18 11:02:43.335 *D* [event.corr         ] CorrelateEvent: event SYS_DCI_UNSUPPORTED id 9218196 source GIM-GXSW26-IVZ-EKON RIADITEL [1393]
2019.01.18 11:02:43.335 *D* [event.corr         ] CorrelateEvent: finished, rootId=0
2019.01.18 11:02:43.335 *D* [event.proc         ] EVENT SYS_DCI_UNSUPPORTED [53] (ID:9218196 F:0x0001 S:2 TAG:"") FROM GIM-GXSW26-IVZ-EKON RIADITEL: Status of DCI 11388 (SNMP: .1.3.6.1.4.1.25506.2.6.1.1.1.1.12.12) changed to UNSUPPORTED
2019.01.18 11:02:43.336 *D* [event.policy       ] EPP: processing event 9218196
2019.01.18 11:02:43.336 *D* [event.policy       ] Event 9218196 match EPP rule 22
2019.01.18 11:02:43.336 *D* AlarmManager: adding new active alarm, current alarm count 17


Thank you,
Michal

pzandvoort

We have the exact same issue.

Every time the housekeeper runs (2:00am by default) it seems to re-apply the templates. Since not every DCI in the template is supported by all nodes, "SYS_DCI_UNSUPPORTED" fires and we get the "Status of DCI changed to UNSUPPORTED" alarm. This makes perfect sense the first time the template gets applied, since it's a new DCI for that node and it just figured out that the DCI isn't supported. But it shouldn't happen if the node already has that DCI and that DCI is already known to be unsupported or disabled.

2.2.10 did this correctly. The logic seems broken in 2.2.11 and up.

Did you figure out a workaround to this? We can obviously suppress the alarm by changing the response to SYS_DCI_UNSUPPORTED, but that seems wrong.

Peter

Michal Hanajik

Hey Peter,

No, we haven't found any solution. We are still waiting for Victor or someone with much deeper insight to advise or check whether there is a bug.
For now, all we can do after each cleanup is delete those UNSUPPORTED alarms and carry on as usual.

Victor Kirhenshtein

This is caused by the housekeeper re-applying templates. This was added to automatically fix issues where not all DCIs were applied or updated correctly, or were accidentally deleted (we had those issues in a few big deployments). The problem is that the template re-apply also resets DCI status to "active", which on the next data collection run changes back to "unsupported" and causes SYS_DCI_UNSUPPORTED event generation. The correct approach would be to leave unsupported DCIs in the unsupported state. We will fix it before the next release.
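To illustrate the logic, here is a simplified sketch of the idea only, not the actual server code (the status names and functions below are made up for clarity):

#include <cstdio>

// Simplified DCI status values (names shortened for this sketch)
enum DCIStatus { ACTIVE, DISABLED, NOT_SUPPORTED };

struct DCI
{
   int id;
   DCIStatus status;
};

// Data collection run: SYS_DCI_UNSUPPORTED fires only on the
// ACTIVE -> NOT_SUPPORTED transition, so a DCI already known to be
// unsupported stays quiet on subsequent polls.
static void pollUnsupportedParameter(DCI &dci)
{
   if (dci.status == ACTIVE)
   {
      dci.status = NOT_SUPPORTED;
      printf("EVENT SYS_DCI_UNSUPPORTED for DCI %d\n", dci.id);
   }
}

// 2.2.11 behavior: nightly template re-apply unconditionally resets
// status to ACTIVE, so the next poll re-fires the event.
static void reapplyTemplateCurrent(DCI &dci)
{
   dci.status = ACTIVE;
}

// Planned fix: leave NOT_SUPPORTED (and DISABLED) DCIs in their
// current state during re-apply.
static void reapplyTemplateFixed(DCI &dci)
{
   if ((dci.status != NOT_SUPPORTED) && (dci.status != DISABLED))
      dci.status = ACTIVE;
}

int main()
{
   DCI dci = { 9009, ACTIVE };
   pollUnsupportedParameter(dci);  // first event - expected
   reapplyTemplateCurrent(dci);    // housekeeper runs at night
   pollUnsupportedParameter(dci);  // event fires again - the problem
   reapplyTemplateFixed(dci);      // with the fix, status stays NOT_SUPPORTED
   pollUnsupportedParameter(dci);  // no further event
   return 0;
}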

Best regards,
Victor

pzandvoort

Victor,

That makes perfect sense and matches exactly what we're seeing. For now, we've disabled alarm generation on SYS_DCI_UNSUPPORTED to suppress these alarms, but if you can make it work the way you describe, that'd be awesome! Looking forward to the next release.
Thanks!

Peter

Michal Hanajik

Hello,

has this been fixed? I am on 2.2.13 and the problem still persists.

Michal Hanajik

Hello, this problem persists for us on a bigger scale after upgrading to NetXMS 3.

Do you have any suggestions?


Minor Outstanding Status of DCI 442 (Internal: PingTime) changed to UNSUPPORTED     1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 409 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 842 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 1043 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 846 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 834 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 829 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 826 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 831 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 550 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09
Minor Outstanding Status of DCI 568 (Internal: PingTime) changed to UNSUPPORTED   1 0 30.09.19 12:31:09 30.09.19 12:31:09


lweidig

Yes, we have not seen it resolved either, and the problem actually seems to grow with each release.

StanHubble

In my opinion this is not broken... rather, your templates are including nodes that they shouldn't, or you have DCIs that are too specific for the template.

Nodes can appear in multiple templates; we have some templates that depend on firmware versions and others that depend on application versions. Each deployment will be different, but in general you should define templates from the general to the specific.

lweidig

Stan:

I would completely agree with you if what you were describing was the problem, but it is NOT! We also use many different templates to fine-tune what is being collected.

The issue is that it is changing items to UNSUPPORTED that are 100% legitimate! In one of our cases it is the polling of an SNMP value that shows the GPS sync state of a device. We have verified many times that the OID is correct, and the console has no issues polling that OID within the MIB Explorer. We have also checked to make sure that the OID does not appear in more than one template and that we are not looking at the wrong one.

It is simply broken!

Michal Hanajik

In short... we have 3 identical servers. Two of them show unsupported DCIs, and the third one is completely OK. All use the same templates.

It's very strange and, I'd say, semi-random.