Messages - blazarov

#31
Hello team,
We use the NXAgent local caching (data reconciliation) function extensively, and it is great.
Recently we ran into a nasty limitation of its SQLite-based implementation.

When a busy agent with lots of DCIs loses connectivity with the server for a relatively long period (several hours or more), its local SQLite cache database (the dc_queue table in particular) quickly gets large. The larger it gets, the slower the select queries become, which results in longer intervals between the reconciliation operations between the agent and the server. This leads to a snowball effect: it gets worse and worse over time, and the agent may never catch up, sync all cached data with the server, and start sending "fresh data" again.
We now have an agent with 7+ million rows in the dc_queue table in SQLite, and every select query takes around 40-50 seconds to complete:

[02-Jun-2017 10:22:31.816] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [43475 ms]
[02-Jun-2017 10:22:32.638] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:23:12.913] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [40224 ms]
[02-Jun-2017 10:23:14.085] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:23:58.584] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44448 ms]
[02-Jun-2017 10:23:59.432] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:24:43.598] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44115 ms]
[02-Jun-2017 10:25:13.633] [DEBUG] ReconciliationThread: timeout on bulk send
[02-Jun-2017 10:26:03.604] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [49917 ms]
[02-Jun-2017 10:26:04.483] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:26:49.148] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44614 ms]
[02-Jun-2017 10:26:50.456] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:27:35.038] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [44006 ms]
[02-Jun-2017 10:27:38.335] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:28:30.247] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [51862 ms]
[02-Jun-2017 10:28:34.130] [DEBUG] ReconciliationThread: 1024 records sent
[02-Jun-2017 10:29:26.638] [DEBUG] Long running query: "SELECT server_id,dci_id,dci_type,dci_origin,status_code,snmp_target_guid,timestamp,value FROM dc_queue WHERE server_id=3095327485888869043 ORDER BY timestamp LIMIT 1024" [52457 ms]
[02-Jun-2017 10:29:27.455] [DEBUG] ReconciliationThread: 1024 records sent

This results in a reconciliation rate of around 1500 records per minute, which is much lower than the rate at which new data arrives, so we are already in a snowball situation that keeps getting worse and will never catch up.
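
As a rough cross-check of that figure, the query durations logged above can be turned into an upper bound on throughput. This is only a sketch: it assumes one 1024-record batch per query and ignores send time and the batch that timed out, so the real rate is somewhat lower than what it prints.

```python
# Query durations in milliseconds, copied from the "Long running query" log lines.
durations_ms = [43475, 40224, 44448, 44115, 49917, 44614, 44006, 51862, 52457]
BATCH_SIZE = 1024  # LIMIT used by the reconciliation query

# Average time spent in one SELECT, in seconds.
mean_seconds = sum(durations_ms) / len(durations_ms) / 1000

# Upper bound: one full batch per query, nothing else taking time.
records_per_minute = BATCH_SIZE / mean_seconds * 60

print(f"mean query time: {mean_seconds:.1f} s")
print(f"upper bound: {records_per_minute:.0f} records/minute")
```

With these nine samples the bound comes out at roughly 1,300 records per minute, the same order as the estimate above.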
So far the only solution we have is to delete the SQLite database. That immediately fixes the situation, but costs us valuable monitoring data. Unfortunately the database format is very different between the agent and the server, and so far we haven't found a working way to "manually" dump the SQLite data and import it into the server database. That would be a nice option for fixing such situations manually.

The hardware that runs the NXAgent is pretty decent, and unfortunately giving the VM more CPU/RAM or moving it to faster storage (we even tried all-flash storage) does not help significantly. It seems to me that SQLite is limited to using just one core.

So after this long introduction I have several questions:

  • Is there an option to use a real database such as MySQL or PostgreSQL for the local agent caching DB? Since this is very critical for us, we can live with an advanced installation or configuration just to make it work.
  • Does my understanding and analysis of the issue and its cause make sense?
  • Any other ideas on how we can solve our problem, or maybe work around it?
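
One thing worth checking is how SQLite plans the logged query. The sketch below is not the agent's code: the table definition is a guess built from the column names in the log, and the index name idx_dc_queue_server_ts is made up for illustration. Whether the real agent schema already has a comparable index or primary key is not known from this post. It only shows that, on a table without a suitable index, "WHERE server_id=? ORDER BY timestamp LIMIT 1024" forces a full scan plus a sort, and that a composite index on (server_id, timestamp) removes both.

```python
import sqlite3

# The reconciliation query from the agent log, with the server ID parameterized.
QUERY = (
    "SELECT server_id,dci_id,dci_type,dci_origin,status_code,"
    "snmp_target_guid,timestamp,value FROM dc_queue "
    "WHERE server_id=? ORDER BY timestamp LIMIT 1024"
)
SERVER_ID = 3095327485888869043


def query_plan(conn):
    """Return SQLite's plan for QUERY as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (SERVER_ID,)).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# ASSUMPTION: column types and the absence of indexes are guesses,
# not the agent's real schema.
conn.execute(
    "CREATE TABLE dc_queue (server_id INTEGER, dci_id INTEGER, dci_type INTEGER,"
    " dci_origin INTEGER, status_code INTEGER, snmp_target_guid TEXT,"
    " timestamp INTEGER, value TEXT)"
)
# Hypothetical sample rows standing in for cached DCI values.
conn.executemany(
    "INSERT INTO dc_queue VALUES (?, ?, 1, 0, 0, '', ?, '0')",
    [(SERVER_ID, i % 50, 1496390000 + i) for i in range(5000)],
)

plan_before = query_plan(conn)
# Hypothetical index: lets SQLite read rows for one server already in timestamp order.
conn.execute("CREATE INDEX idx_dc_queue_server_ts ON dc_queue (server_id, timestamp)")
plan_after = query_plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Running EXPLAIN QUERY PLAN on a copy of the real cache database would show whether the agent's query is in the same situation before anyone touches the live file.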
#32
Hi,
We are also suffering quite a lot because of this, so we would appreciate an idea for a workaround using a transformation script, as well as a planned timeline for the bug fix.

I can contribute my "workaround" so far, but it's quite simple and useless most of the time, because you have to set a static "high limit" that might be very different depending on the device in our setup.

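sub main()
{
   // Static "high limit": 85899345920 = 80 * 2^30
   if ($1 > 85899345920) {
      // Value is above the limit: stop without returning a result
      exit;
   } else {
      // Value is within the limit: return it multiplied by 8
      return $1 * 8;
   }
}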
#33
Hi Victor,
Thanks for your reply. I totally agree with you. Unfortunately I am not into Java or coding in general, so I cannot advise on that.

I just hope these options will make it into one of the upcoming releases :)
#34
Hi,
I was thinking that it would be very good to have a way to access particular graphs outside of the NetXMS console.
Is there such a thing available right now? For example, if I need to show a graph on a standard web site?

I was thinking of two possible solutions:
1) Provide access to a "dynamic graph web API", which is just a web service that you call with GET/POST attributes for the graph (size, DCIs, time period, etc.) and that returns a rendered PNG on the fly. Or maybe you create the graph you want as a predefined graph and just "call" it through the web API to get the rendered PNG.
2) Schedule PNG file creation in a specified location.

Any ideas on how to achieve that?
#35
Thanks Victor,
I understand the logic.

My understanding now is that when the instance discovery filtering script returns False, that particular instance is skipped, or removed if it already exists. Is that true?
If yes, does that mean that if there is a single SNMP error (e.g. a timeout) during a regular instance discovery and the script returns False for that particular instance, it will be deleted from the node's DCI table, just as happens now if I actually delete the interface from the router?

This will be problematic, because NetXMS currently deletes a DCI as soon as it detects its absence during instance discovery, resulting in permanent loss of history data, which is not acceptable for the business. This is a feature that I have already requested in "Feature Requests": an option to keep DCIs (for example in UNSUPPORTED state) after they disappear from instance discovery.

I was thinking of somehow "breaking" the script in order to avoid any changes, instead of returning False, which would eventually result in deleting DCIs along with their history data. Am I on the right track?
#36
Hi,
I developed an NXSL script for interface instance discovery using SNMP and have been using it for over a year now.
Although it works fine, it keeps raising alarms for script execution errors on the server node. I am pretty sure that in 99% of the cases the cause of those errors is just an SNMP timeout. Many of my monitored nodes have slow and unreliable connections, so this is expected and more or less inevitable.

So now I am looking for a way to enhance my script so that it catches and handles such problems and just quietly aborts until the next execution. I have looked in the documentation and in the forum, but couldn't find a solution.
Ideas, anyone?


Here is my instance discovery script:
snmp = CreateSNMPTransport($node);
ifName = SNMPGetValue(snmp, ".1.3.6.1.2.1.2.2.1.2." . $1);
ifName .= " ";
ifName .= SNMPGetValue(snmp, ".1.3.6.1.2.1.31.1.1.1.18." . $1);
if (ifName ~= "Loopback.*") {
   return %(false, $1, ifName);
} else {
   return %(true, $1, ifName);
}


Screenshot of the alarms is attached.

Thanks in advance!
#37
That would be really nice. MIN/AVG/MAX info is really useful regardless of whether it's a single DCI or two.
#38
I will set it up this week and will share my experience as well :)

Thank you very much!
#39
Hi,

These days I am experimenting with the Performance tab and I am pretty happy with the results. The only thing I am missing is a user-editable time range, but I have already registered a feature request for that. :)

I noticed something weird that might be a bug or a feature.

There is no legend at the bottom of the graph unless a second DCI is attached. I've tried all options, but cannot get the legend to appear in a single-DCI graph.
So, is this a bug or am I missing something? :)
Example in the included screenshot.
#40
Hi,
thanks for the extremely useful info!
How do you find the overall performance and stability of the NetXMS console on the Raspberry Pi so far?
I want to do the same thing: install a big TV in the office with an RPi2 running the NetXMS console, so it will only work out if it is stable. Alternatively I will use a normal PC with Linux, but I really like the idea of the Raspberry Pi.
#41
Hi,
I would like to suggest a new feature that allows setting a grace period before deleting the DCIs of missing instances after instance discovery.

This would be useful when monitoring devices whose interfaces change often. It would allow the collected data to be kept for a given period before it is permanently deleted.

Thanks!
#42
Yes, dynamic tunnels, e.g. PPTP or OpenVPN, are a bitch to monitor. Mikrotik has a nice feature that creates a virtual interface statically bound to a specific client, which helps.
My request is more related to statically configured interfaces such as GRE tunnels or dot1q subinterfaces, which do not change that often, but when someone does change them, no one bothers to go to the monitoring system and make the proper changes. So what Victor suggests, a new feature for a grace period, will work perfectly fine for my case. I need it mostly in case we have to check historical stats for an interface that has recently been deleted.
Posting a feature request.
#43
Hello everyone,

I have just noticed something that never crossed my mind so far.

We use the ICMP subagent extensively via nxagent proxies. Therefore we have templates with DCIs like these:
Icmp.AvgPingTime(%{node_primary_ip})
Icmp.PacketLoss(%{node_primary_ip})

When the template is applied to a node, the macro %{node_primary_ip} is replaced with the correct value. My problem is that when we change the node's IP address, it is not updated in the node's DCI, because it seems the macro is expanded only at the moment the template is applied. Is that correct?

If so, any ideas how to work around it?

Thanks in advance!
#44
Thanks Victor,
I'll give it a try and post feedback.
#45
Hi Victor,
thanks for your input. Yes, I've checked everything and everything is as expected, but the "foreach" loop of the filtering script still seems to execute only once, with the value of the first row of the snmpwalk.
Anyway, I just had a great idea: I can achieve the same functionality with a much simpler script, just by reversing the logic. I have already done it and it works just as expected.
Here's what I did:

- Instance discovery method = SNMP Walk Values ; Base SNMP OID: .1.3.6.1.2.1.4.20.1.2

Then my filtering script:
snmp = CreateSNMPTransport($node);
ifName = SNMPGetValue(snmp, ".1.3.6.1.2.1.2.2.1.2." . $1);
if (ifName ~= "Loopback.*") {
   return %(false, $1, ifName);
} else {
   return %(true, $1, ifName);
}

Basically I first SNMP-walk the ipAdEntIfIndex table, which returns all IP addresses, each with the corresponding ifIndex as its value. So I don't need any further filtering: the values contain only the ifIndex of interfaces that have an IP address set.
Then the filtering script only polls the interface name, along with a quick check to filter out the Loopbacks, because I don't need to monitor them.
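
For readers who don't use NXSL, the same two-step logic can be modelled in Python. This is only an illustration with made-up data: the two dictionaries stand in for the SNMP responses (ipAdEntIfIndex and ifDescr), filter_instance is a hypothetical name, and the regular expression is treated as a search anywhere in the name, which is an assumption about how the NXSL "~=" operator behaves.

```python
import re

# Hypothetical sample data standing in for SNMP responses.
# ipAdEntIfIndex (.1.3.6.1.2.1.4.20.1.2): IP address -> ifIndex
ip_ad_ent_if_index = {"10.0.0.1": 1, "192.168.1.1": 2, "127.0.0.1": 3}
# ifDescr (.1.3.6.1.2.1.2.2.1.2): ifIndex -> interface name
if_descr = {
    1: "GigabitEthernet0/0",
    2: "GigabitEthernet0/1",
    3: "Loopback0",
    4: "GigabitEthernet0/2",  # no IP address, so the walk never yields it
}


def filter_instance(if_index):
    """Mirror the filtering script: (accepted, instance, instance name)."""
    name = if_descr[if_index]
    accepted = re.search(r"Loopback.*", name) is None
    return (accepted, if_index, name)


# Step 1: the walk values are the ifIndex of every interface with an IP address.
# Step 2: the filter looks up the name and rejects Loopbacks.
results = [filter_instance(idx) for idx in ip_ad_ent_if_index.values()]
monitored = [(idx, name) for accepted, idx, name in results if accepted]
print(monitored)
```

Interface 4 never reaches the filter because it has no IP address, and interface 3 is rejected by the Loopback check.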

Thank you guys, I am getting to like the NetXMS more and more! :)

I still don't understand what happens when an interface is deleted and instance discovery runs. Are the DCIs completely deleted along with their history? Is there a way to control that, for example to keep them and delete them one month after they disappear?