Show posts

Messages - aron

#31
Hello

Unfortunately I will be going on holiday for a week (this being the last day), so I will not be in a position to do this constructively in the short term.

I have been trying to push this forward in my own time to see if I can assist. I have been trying to add tools to the build to find the memory leaks, predominantly to get a working build environment and to gain greater familiarity with the code base and how it works. The memleak tool looks ideal, as no modification to the codebase is required to run the memleak tests: http://valgrind.org/

We have been pushing forward with the deployment regardless, which does provide additional useful information. The average time between episodes of degraded performance has generally been around 7 hours. I would say that we have at least doubled the number of DCIs (a combination of NXAgent/SNMP) being collected, but this appears to have had no effect on the uptime. Since the issue is quite elusive and seems to predominantly affect our deployment, I assume it may be something specific to our equipment or the way we have it deployed.

Total objects: 6608, Monitored nodes: 356, Number of DCIs: 3119

I would have said that most of the equipment being probed is run of the mill, but we probably steer away from the norm with our L3 Cisco switch deployments. We have 60 Cisco switches (2950-3750, 24 or 48 port PoE), and NetXMS is seeing all of these units as isBridge, isCdp, isRouter, isSnmp and isSTP. 20 of these are acting as L3 VLAN routers (generally at different locations) with generally around 60-200 VLANs each. Not every VLAN will have an IP address, however each deployment has a similar IP structure (some addresses are configured but disabled). Depending on how the IP addresses are used by NetXMS, it may get confused by duplicate IP addresses, i.e. LocationA.Vlan5 IP = 192.168.5.1/255.255.255.0 and LocationB.Vlan5 IP = 192.168.5.1/255.255.255.0, for each of the 20 locations.
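To make the duplicate-address concern concrete, here is a minimal Python sketch of how the overlap could be listed from an exported interface inventory. The inventory list, its layout and the function name are assumptions for illustration only; this is not how NetXMS stores or compares addresses internally.

```python
from collections import defaultdict
from ipaddress import ip_interface

# Hypothetical inventory rows: (location, interface name, address/netmask).
# The first two rows mirror the LocationA/LocationB example in the post.
INTERFACES = [
    ("LocationA", "Vlan5", "192.168.5.1/255.255.255.0"),
    ("LocationB", "Vlan5", "192.168.5.1/255.255.255.0"),
    ("LocationA", "Vlan6", "192.168.6.1/255.255.255.0"),
]


def find_duplicate_addresses(interfaces):
    """Return {ip: [(location, ifname), ...]} for addresses used more than once."""
    seen = defaultdict(list)
    for location, ifname, addr in interfaces:
        # ip_interface accepts the dotted netmask form for IPv4.
        ip = ip_interface(addr).ip
        seen[str(ip)].append((location, ifname))
    return {ip: owners for ip, owners in seen.items() if len(owners) > 1}


duplicates = find_duplicate_addresses(INTERFACES)
for ip, owners in duplicates.items():
    print(ip, "is configured on", owners)
```

With 20 locations sharing the same VLAN addressing plan, every routed VLAN address would appear in the result once per location.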

As our experience with NetXMS grows, we are also noticing other behaviour that may be the result of the above deployment style or of the issue that is causing the memleak.
- When selecting a Cisco switch > Show L2 Topology, some instances show a correct network map; in other instances the map shows a connection between 2 physically separated units which share no physical connection. What is slightly stranger is that it is inconsistent, i.e. L2 Topology on LocationB.Switch2 shows a connection to LocationA.Switch1, however if the same topology map is done on LocationA.Switch1 it shows no connection back to LocationB.Switch2.
- IP Neighbours does get very confused, but I would expect this due to the IP deployment structure, so I consider this a non-fault but mention it just in case.

I have been trying to see if this should be set up using the VPN connector, however the current GUI does not seem to allow the creation of the connector. Documentation on the forum/manual seems a little thin with regards to the use of the VPN connector.

Sorry, I have provided lots of information that may provide little constructive insight. As soon as I get back I will set up a test rig to do the requested tests.

Regards

Aron

#32
I have changed some of the settings and had a look at the logs to try and see if I can provide any constructive information. Firstly I changed the housekeeping and configuration poll timers from 1 hour to 2 hours. This had no effect on the general time before it had an operational effect. The logs unfortunately did not reveal anything particularly insightful, but when the system goes into its degraded state it reports socket_error =8, which I believe is the Out Of Memory message.
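As a cross-check on the error code, the short Python sketch below maps a handful of low-numbered Winsock codes to their symbolic names. It assumes the logged socket_error value is a raw Winsock error code, which the post does not confirm; the table is deliberately partial and the function name is hypothetical.

```python
# Partial table of Winsock error codes (values as defined in the Windows SDK).
# Assumption: the server log prints the raw Winsock code as socket_error.
WINSOCK_CODES = {
    6: "WSA_INVALID_HANDLE",
    8: "WSA_NOT_ENOUGH_MEMORY",
    87: "WSA_INVALID_PARAMETER",
    995: "WSA_OPERATION_ABORTED",
}


def describe_socket_error(code):
    """Return the symbolic name for a known code, or a placeholder otherwise."""
    return WINSOCK_CODES.get(code, "unknown code %d" % code)


print(describe_socket_error(8))
```

Under that assumption, code 8 does correspond to an out-of-memory condition, which is consistent with the memory growth described in this thread.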

If there is anything particularly constructive I can do please let me know.

Regards

Aron
#33
Unfortunately 1.2.1 does not appear to have resolved this issue at this stage. Attached is the Health graph of the system; the last peak is the upgrade to server 1.2.1.

Regards

Aron
#34
It may quite possibly be due to my implementation of the external script.

DCI - Floating Point. Transformation: delta is either none or per minute, depending on purpose.
Script:
use transform;
return Bit2Mb($1);

The script library is called transform:
sub Byte2Mb(byte)
{
   if (byte > 100000000 | byte < 0)
   {
      return 0;
   }
   else
   {
      return ((byte * 8)/1024)/1024;
   }
}

sub Bit2Mb(bit)
{
   if (bit > 100000000 | bit < 0)
   {
      return 0;
   }
   else
   {
      return ((bit)/1024)/1024;
   }
}

Sorry, the quote seems to be turning the "times by 8" into a smiley. As far as other scripting goes, we do not use it too heavily at the moment as we are still in the early deployment phase. We do use the templating and auto-apply quite heavily; an example is below just in case it helps:

sub main()
{
   return $node->snmpOID == ".1.3.6.1.4.1.10002.1";
}
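
For readers who want to check the arithmetic of the two conversion functions in the transform library above, here is a Python restatement. It is only a sketch: the function names are Python-style stand-ins, and Python's `/` always performs floating-point division, which may not match how NXSL divides integer values.

```python
LIMIT = 100_000_000  # same sanity bound as the posted scripts


def byte_to_mb(value):
    """Bytes to megabits: multiply by 8, then divide by 1024 twice."""
    if value > LIMIT or value < 0:
        return 0  # out-of-range readings are zeroed, as in the original
    return value * 8 / 1024 / 1024


def bit_to_mb(value):
    """Bits to megabits: divide by 1024 twice."""
    if value > LIMIT or value < 0:
        return 0
    return value / 1024 / 1024


print(byte_to_mb(1_048_576))   # 8.0
print(bit_to_mb(1_048_576))    # 1.0
print(bit_to_mb(200_000_000))  # 0
```

Note that any raw value above 100,000,000 is reported as 0, so a counter delta larger than that bound would show up as a drop to zero rather than as a spike.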

Best Regards

Aron
#35
I have been doing some further investigation and noticed the release notes for 1.2.1. I have looked at the scripts I have been using and unfortunately I cannot see where I have used the 2 functions stated in the release notes (unless they are used internally by NetXMS). Most functions I use are fairly generic ones for converting bits to Mb and bytes to Mb, which are all stored in an external script.

Regards

Aron
#36
I have been continuing to narrow down my investigation to keep the updates as clean as possible. At this stage the server (Windows 2003 Server, 1 GB RAM, 3 GB page file) that is hosting NetXMS and the MySQL database has been consuming all memory until it reaches a tipping point, at which it degrades the NetXMS responsiveness and the queue length then increases. I will attach a dashboard that I have been using to narrow down the issue. You will see the memory utilization slowly creeping up until it reaches a point where it has a direct effect on the queues; a restart of the NetXMS service releases the resources and everything then returns to normal. I have included MySQL in the graph to confirm that it is not grabbing all the available memory. I will continue to try to establish whether it is something specific to the deployment/server, however I feel the issue is more likely inside netxmsd, and will continue to try to find something more specific to back this up. Have you seen anything similar in your test bed?

There are quite a lot of memory page faults, but unfortunately I have not had the right collectors configured, so I have not included this for the time being. We did not seem to experience these issues in the previous release of NetXMS. The only thing being reported at startup at the moment is below:
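
One way to quantify the "slowly creeping up" pattern is to fit a straight line to the memory samples and extrapolate to the tipping point. The Python sketch below does this with a plain least-squares fit. The sample values and the 1000 MB limit are made up for illustration and are not measurements from this server; the function names are hypothetical.

```python
def fit_line(samples):
    """Least-squares fit of (hours, used_mb) samples; returns (slope, intercept)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until(limit_mb, samples):
    """Hours after the last sample until the fitted line reaches limit_mb."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None  # no growth trend, so no projected exhaustion
    last_x = samples[-1][0]
    return (limit_mb - intercept) / slope - last_x


# Made-up samples: (hours since service restart, memory used in MB).
SAMPLES = [(0, 300), (1, 400), (2, 500), (3, 600)]

slope, intercept = fit_line(SAMPLES)
print("growth rate (MB/hour):", slope)
print("hours until 1000 MB:", hours_until(1000, SAMPLES))
```

A roughly constant slope across restarts would point to a steady leak, while a slope that scales with the number of DCIs or polls would suggest the leak is tied to a specific activity.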

[21-May-2012 17:16:07] Inconsistent database: interface 5859 linked to non-existing node 5150
[21-May-2012 17:16:07] Failed to load interface object with id 5859 from database

Are there any specific probes you would advise, or any specific logging you would recommend?

Best Regards

Aron
#37
Unfortunately I believe I may have misidentified the issue; it appears that it may be functioning but at a severely reduced capacity. With regards to the server, there does not appear to be any load on the CPU that would explain the increased time. I have increased the number of data collectors from 25 to 40 to see if this will assist.

Responses to questions:
After server reset; mainly Ubuntu but some Windows instances.

Regards

Aron
#38
Hello

I am experiencing a strange issue with the SNMP proxy functionality within the server. We have a fairly distributed deployment and rely on the SNMP proxy functionality quite heavily. After a period of about a day the SNMP proxy functionality locks up with no clear explanation or error (that I can find, but I am still fairly new to the system). The lock-up is not on a per-agent basis and seems to be global to all remote agents (6-8 remote instances). If the poller/configuration proxy flags are removed on the object, the server will correctly poll that agent. I have increased the logging on a remote agent instance and can still see the agent receiving requests from the server; tcpdump confirms that the remote agent is doing ICMP polls but not SNMP requests. From a server perspective there are no outstanding requests in show queue, and show pollers does not show a locked poller for the non-responding instances.

Sorry, I am a little unsure what additional information would be constructive to help identify the issue.

Regards

Aron
#39
Thank you, that has done the trick.
#40
Hello

We are currently rolling out the NetXMS system internally to our servers, which had gone well until we got to the older versions of Windows. The server in question is Windows 2000 SP4.

Currently, after installing, the system reports the following error when running the agent:
nxagentd.exe - Entry Point not found;

The procedure entry point GetVolumePathNamesForVolumeNameA could not be located in KERNEL32.dll

Any assistance would be gratefully received.

Regards

Aron