Agent polling issues

Started by johnny, March 05, 2018, 12:46:36 PM

johnny

Hello all,
I've been using NetXMS as an SNMP server for more than a year. I'm quite pleased with it, but I'm having an issue with agent polling.
I've configured several devices. On the servers (CentOS and Ubuntu) I use the NetXMS agent for polling.
Sometimes, after a NetXMS server reboot (possibly when the "clients" or agents are not ready yet), I have problems polling data.
When I run a status poll on the agent "client" I get this:
[05.03.2018 12:28:43] **** Poll request sent to server ****
[05.03.2018 12:28:43] Poll request accepted
[05.03.2018 12:28:43] Starting status poll for node sftptest
[05.03.2018 12:28:43]    Starting status poll on interface lo
[05.03.2018 12:28:43]       Current interface status is UNKNOWN
[05.03.2018 12:28:43]       Interface status cannot be determined
[05.03.2018 12:28:43]       Interface is UNKNOWN for 21239 polls (1 poll required for status change)
[05.03.2018 12:28:43]       Interface status after poll is UNKNOWN
[05.03.2018 12:28:43]    Finished status poll on interface lo
[05.03.2018 12:28:43]    Starting status poll on interface eth0
[05.03.2018 12:28:43]       Current interface status is NORMAL
[05.03.2018 12:28:43]       Starting ICMP ping
[05.03.2018 12:28:43]       Interface is NORMAL for 21238 polls (1 poll required for status change)
[05.03.2018 12:28:43]       Interface status after poll is NORMAL
[05.03.2018 12:28:43]    Finished status poll on interface eth0
[05.03.2018 12:28:43] Node is connected
[05.03.2018 12:28:43] Finished status poll for node sftptest
[05.03.2018 12:28:43] Node status after poll is NORMAL
[05.03.2018 12:28:43] **** Poll completed successfully ****

If I disable usage of ICMP pings for status polling I get this:
[05.03.2018 12:37:25] **** Poll request sent to server ****
[05.03.2018 12:37:25] Poll request accepted
[05.03.2018 12:37:25] Starting status poll for node sftptest
[05.03.2018 12:37:25]    Starting status poll on interface lo
[05.03.2018 12:37:25]       Current interface status is UNKNOWN
[05.03.2018 12:37:25]       Interface status cannot be determined
[05.03.2018 12:37:25]       Interface is UNKNOWN for 21248 polls (1 poll required for status change)
[05.03.2018 12:37:25]       Interface status after poll is UNKNOWN
[05.03.2018 12:37:25]    Finished status poll on interface lo
[05.03.2018 12:37:25]    Starting status poll on interface eth0
[05.03.2018 12:37:25]       Current interface status is UNKNOWN
[05.03.2018 12:37:25]       Interface status cannot be determined
[05.03.2018 12:37:25]       Interface is UNKNOWN for 1 poll (1 poll required for status change)
[05.03.2018 12:37:25]       Interface status after poll is UNKNOWN
[05.03.2018 12:37:25]    Finished status poll on interface eth0
[05.03.2018 12:37:25] Node is still unreachable
[05.03.2018 12:37:25] Finished status poll for node sftptest
[05.03.2018 12:37:25] Node status after poll is UNKNOWN
[05.03.2018 12:37:25] **** Poll completed successfully ****


From what I see, it looks like the server is not even trying to poll the NetXMS agent on that machine. Also, in the machine's overview, under capabilities, isAgent has switched to No.
Firewall ports are open.
On the switches and routers that I poll via SNMP I've never had this issue.
Normally a reboot of the NetXMS server solves the issue, but is there any way to track down the problem so that it doesn't happen again? It could happen and I might only notice it a few days later.

On the NetXMS server and on the other machines' agents I don't get any logs.
The current NetXMS version is 2.2.1, but I've had this issue with previous versions as well.
The NetXMS server runs on CentOS.

johnny

Dear all,
an update on the case.
I haven't restarted the NetXMS server yet, so I still have the problem I described above.
Today I tried to add a new network device with SNMP, and it seems that the NetXMS server is not able to connect to the device and collect data from it.

I then killed the NetXMS process and started it at debug level.
The previous devices still cannot be polled, and neither can the new network device.
At debug level 6 I checked that the messages look the same for the working and non-working devices,
example:
Sending message CMD_POLLING_INFO (128 bytes)
Sending compressed message CMD_POLLING_INFO (120 bytes)
Sending compressed message CMD_POLLING_INFO (104 bytes)
Sending compressed message CMD_POLLING_INFO (112 bytes)


Also, after a system restart I still cannot poll data from the devices.
The firewall is down on the devices that the NetXMS server cannot poll.
Any ideas?

Also, is there a way to start NetXMS with more logging in the log file?
I'm currently getting:
2018.03.15 09:45:40.503 *I* NetXMS Server started
2018.03.15 09:45:40.503 *I* SocketListener/Clients: listening on 0.0.0.0:4701
2018.03.15 09:45:40.503 *I* SocketListener/MobileDevices: listening on 0.0.0.0:4747
2018.03.15 09:45:40.503 *I* SocketListener/Clients: listening on [0.0.0.0]:4701
2018.03.15 09:45:40.503 *I* SocketListener/AgentTunnels: listening on [0.0.0.0]:4703
2018.03.15 09:45:40.503 *I* SocketListener/MobileDevices: listening on [0.0.0.0]:4747

johnny

#2
Sorry,
the problem with the new device was that its IP address was wrong,
but the problem still remains on the previous device that I couldn't poll data from before the restart.

ADDED:
On a CentOS machine that I cannot poll, I started the agent with debug enabled.
The server still cannot poll data, but when I telnet to the netxms port it is open, and in the client's debug output I get:
[15-Mar-2018 13:08:27.419] [DEBUG] DataCollector: sleeping for 60 seconds
[15-Mar-2018 13:09:27.419] [DEBUG] DataCollector: sleeping for 60 seconds
[15-Mar-2018 13:10:06.673] [DEBUG] Incoming connection from X.X.X.X
[15-Mar-2018 13:10:06.673] [DEBUG] Connection from X.X.X.X accepted
[15-Mar-2018 13:10:06.673] [DEBUG] Session registered for X.X.X.X
[15-Mar-2018 13:10:09.537] [DEBUG] [CS-0(1)] Communication channel closed by peer
[15-Mar-2018 13:10:09.537] [DEBUG] [CS-0(1)] writer thread stopped
[15-Mar-2018 13:10:09.537] [DEBUG] [CS-0(1)] Session with X.X.X.X closed
[15-Mar-2018 13:10:09.537] [DEBUG] [CS-0(1)] Session unregistered
[15-Mar-2018 13:10:09.537] [DEBUG] [CS-0(1)] Receiver thread stopped

So, as you can see, judging by the agent's debug output, the agent doesn't get any connections when I poll data.
When I did this test some months ago, I remember that polling did show up in the debug output.
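
For anyone comparing agent debug output across machines, here is a minimal Python sketch that counts accepted and rejected connections per peer address. The line wording it matches ("Connection from ... accepted", "connection from ... rejected") is taken from the debug output quoted in this thread; the function name and the sample addresses are made up for illustration, and other agent versions may word these messages differently.

```python
import re
from collections import Counter

# Matches e.g. "[DEBUG] Connection from 10.0.0.5 accepted" or
# "connection from 10.0.0.5 rejected" (wording as quoted in this thread).
CONNECTION_RE = re.compile(r"connection from (\S+) (accepted|rejected)", re.IGNORECASE)


def summarize_connections(lines):
    """Count accepted and rejected connections per peer address."""
    summary = {}
    for line in lines:
        match = CONNECTION_RE.search(line)
        if match is None:
            continue  # not a connection accept/reject line
        peer = match.group(1)
        outcome = match.group(2).lower()
        summary.setdefault(peer, Counter())[outcome] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical sample lines in the same format as the agent debug output.
    sample = [
        "[15-Mar-2018 13:10:06.673] [DEBUG] Incoming connection from 10.0.0.5",
        "[15-Mar-2018 13:10:06.673] [DEBUG] Connection from 10.0.0.5 accepted",
        "[15-Mar-2018 13:12:01.100] [DEBUG] connection from 10.0.0.7 rejected",
    ]
    for peer, counts in summarize_connections(sample).items():
        print(peer, dict(counts))
```

If the server's address never appears at all, the server is not reaching the agent; if it appears only as rejected, the agent is refusing it.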

2nd Edit:
How I kind of solved the problem:
In the console, on the NetXMS server, I changed the IP of the problematic node to something else.
Then I created a new node with the correct IP, and it polled the data normally.
I could also see that in the agent debug output.
Then I deleted the test (temp) node with the correct IP and changed the problematic node back to the correct IP, and after 2 seconds everything went back to normal.
Do you have any idea why this is happening?

Tursiops

Hi,

What happens if you run a full configuration poll on the node that is no longer being polled?
That's under the assumption that you know/are sure that the node's agent/SNMP are working perfectly fine.

I've seen situations where the server would stop polling SNMP or the agent itself, but a full configuration poll would just fix it (a normal poll would not work because the node was unreachable - and status polls would never return an ok status). I never found an actual underlying cause or solution for this behaviour, nor could I reliably reproduce it.

Cheers

johnny

Dear Tursiops,
thank you for your answer. I haven't tried a full configuration poll on the problematic node yet.
I will try to recreate the problem and see what happens.

The node's agent is working fine. I can telnet from the NetXMS server to the node and see that in the debug output,
and when I changed that node's IP address to an invalid one and created another new node with the correct IP, I could poll the data.

johnny

#5
The problem appeared again when someone restarted the PC.
The agent is up and I can telnet to the port, but it immediately closes the connection, while on another PC that I can poll, the connection stays open when I telnet.
I ran a full configuration poll and the problem still remains.

[20.03.2018 12:45:52] **** Poll request sent to server ****
[20.03.2018 12:45:52] Poll request accepted
[20.03.2018 12:45:52] Starting configuration poll for node XXX PC
[20.03.2018 12:45:52] Capability reset
[20.03.2018 12:45:52] Checking node's capabilities...
[20.03.2018 12:45:52]    Checking NetXMS agent...
[20.03.2018 12:45:52] Capability check finished
[20.03.2018 12:45:52] Checking interface configuration...
[20.03.2018 12:45:52] Unable to get interface list from node
[20.03.2018 12:45:52]    Interface "unknown" is no longer exist
[20.03.2018 12:45:53] Interface configuration check finished
[20.03.2018 12:45:53] Checking node name
[20.03.2018 12:45:53] Node name is OK
[20.03.2018 12:45:53] Finished configuration poll for node XXX PC
[20.03.2018 12:45:53] Node configuration was not changed after poll
[20.03.2018 12:45:53] **** Poll completed successfully ****
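
The telnet comparison described above (connection dropped immediately on the broken PC, kept open on a working one) can be scripted. This is a rough Python sketch, not a NetXMS tool: the function name, the example host name, and the port 4700 in the usage part are assumptions for illustration (the thread does not state which port was used). It only tells you whether the peer closes the TCP connection within a few seconds; it does not speak the agent protocol.

```python
import socket


def probe_port(host, port, wait_seconds=3.0):
    """Connect to host:port and report what the peer does with the connection.

    Returns "unreachable" if the connection cannot be established,
    "closed_by_peer" if the peer closes or resets it within wait_seconds,
    "data_received" if the peer sends something first,
    and "stays_open" if nothing happens before the timeout.
    """
    try:
        sock = socket.create_connection((host, port), timeout=wait_seconds)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(wait_seconds)
        try:
            data = sock.recv(1)
        except socket.timeout:
            return "stays_open"  # peer kept the connection open and silent
        except OSError:
            return "closed_by_peer"  # e.g. connection reset
        # An empty read means the peer performed an orderly close.
        return "closed_by_peer" if data == b"" else "data_received"


if __name__ == "__main__":
    # Hypothetical host; 4700 is assumed here to be the agent's listening port.
    print(probe_port("agent-host.example.com", 4700))
```

Run it from the NetXMS server itself, since the result depends on which source address the agent sees.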


EDIT:
The agent debug output says:
connection from x.x.x.x rejected

2nd edit:
The problem was DNS resolution.
In the NetXMS agent configuration I had the master server specified by its DNS name, and the PC could not resolve it.
I also added the IP, and the issue is resolved for now.
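
For anyone hitting the same thing, a small Python sketch that checks whether the host names listed as master servers in an agent config actually resolve on that machine. It assumes the config uses a `MasterServers = ...` line with comma-separated entries and `#` comments; the function names, the example host name, and the config path in the usage part are assumptions for illustration, so adjust them to your setup.

```python
import ipaddress
import socket


def parse_master_servers(config_text):
    """Collect the comma-separated entries of all MasterServers lines."""
    entries = []
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "masterservers":
            entries.extend(item.strip() for item in value.split(",") if item.strip())
    return entries


def unresolved_entries(entries, resolve=socket.gethostbyname):
    """Return the entries that are host names and fail to resolve."""
    failed = []
    for entry in entries:
        try:
            ipaddress.ip_network(entry, strict=False)
            continue  # literal address or subnet, no DNS lookup needed
        except ValueError:
            pass  # not an address, treat it as a host name
        try:
            resolve(entry)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(entry)
    return failed


if __name__ == "__main__":
    # Assumed path; use the location of your agent's config file.
    with open("/etc/nxagentd.conf") as config_file:
        servers = parse_master_servers(config_file.read())
    print("unresolvable master servers:", unresolved_entries(servers))
```

Run it on the machine where the agent runs, ideally right after boot, since that is when the resolver may not be available yet.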