Simple question - CDM monitoring using snmp - anybody doing this ** all good **

Started by paul, May 23, 2019, 06:50:22 PM


paul

It seemed like a simple idea. Replace existing clunky snmp based monitoring product with netxms. With all devices already set up for SNMP - no need for agent installs - just use templates.

Although the netxms preference clearly is for using a local agent, has anybody gone down the path of using SNMP only for Linux and Windows devices for CPU / Disk / Memory monitoring - or is everybody using agents?

The reason for asking is that installing agents locally is not an option, so wanting to know now whether to progress with using netxms with SNMP or stop now.

Tursiops

We did originally use SNMP for Linux systems (never for Windows, as we needed the agent for other purposes anyway), but switched to agents.
Not that it didn't work, we just found that agents provided additional useful functionality for our Linux systems as well.

In other words: I'm sure you can make it work with SNMP only.

paul

I am sure this has been done 500 times already - and done by people much more knowledgeable with netxms.

Just about every SNMP-based CPU / Disk / Memory monitor uses the HOST-RESOURCES-MIB, defined in RFC 2790, or the UCD-SNMP-MIB, an enterprise MIB from the net-snmp (formerly UCD-SNMP) project; both are standard parts of net-snmp. Windows implements the HOST-RESOURCES-MIB and MIB-II as part of its standard SNMP service.

So, to ask another way, has anybody created any templates, based on SNMP, that collect CPU utilization where there are multiple cores / processors; collect disk storage information (% utilization, % free and Total) where there are multiple drives / mount points; and the same for memory (% used, % free and Total) for Physical, Virtual and Swap?
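
For reference, the disk figures asked about here can all be derived from three hrStorageTable columns in HOST-RESOURCES-MIB: hrStorageAllocationUnits, hrStorageSize and hrStorageUsed. Below is a minimal Python sketch of the arithmetic only, assuming the three values have already been fetched over SNMP. The function name and the sample numbers are made up for illustration and are not part of any NetXMS template.

```python
def storage_stats(allocation_units, size, used):
    """Derive total bytes, % used and % free from one hrStorageTable row.

    allocation_units: hrStorageAllocationUnits (bytes per unit)
    size:             hrStorageSize (total units)
    used:             hrStorageUsed (used units)
    """
    if size <= 0:
        # Entries that report no size have no meaningful percentage.
        return None
    pct_used = 100.0 * used / size
    return {
        "total_bytes": size * allocation_units,
        "pct_used": pct_used,
        "pct_free": 100.0 - pct_used,
    }


# Hypothetical row: 4096-byte units, 1,000,000 units in total, 250,000 used.
stats = storage_stats(4096, 1_000_000, 250_000)
print(stats)
```

The same arithmetic applies to memory, since hrStorageTable also carries rows for physical memory, virtual memory and swap.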

My contribution is to spend time mapping traps to events and alarms. I have 2600 active mibs, about 100 critical, which could benefit from being templated. I can get up to 100,000 traps a week so not having to try and reinvent a wheel many others have already invented, would be great. I believe that is where my time can be better spent.

Did I at least try? - Yes. As per below, the documentation is conflicting - so the smart option is to simply ask those who have already done it  :)

The documentation under lists says not supported
https://www.netxms.org/documentation/adminguide/data-collection.html#list-dcis

The documentation under instance appears to give an example of how to get a list of mount points. Missing is how to set thresholds for each mount point etc. - but it shows this has been done.
https://www.netxms.org/documentation/adminguide/data-collection.html#instance

paul

OK...let's make this even simpler.

Can somebody point me at the old SNMP template for Linux that existed previously?

That would be enough to get me going and I can work out the Windows equivalent by using the Windows equivalent OIDs.


Tursiops

I don't have the old default one, nor do I have a particularly great template. Never had a need to clean it up in regards to auto-bind or instance-discovery, as we're not really using this anymore (it gets auto-applied to a few NAS devices still). But maybe it helps getting you started. I've removed our threshold configuration as it's using a number of custom events.

paul

Absolutely fantastic - made my day :)

Having completed my fresh install - ready to get this going.

paul

The excitement of the template - followed by the agony of "unsupported" :(

but.... a snmpwalk using snmpwalk utility showed it working - but netxms walk was blank...

hmmm...

Even though the device was recognized as snmp and details were showing - netxms had settled on the wrong community string.

Updated community..... and yes - I have data!!

It is only File System so far, and it does have duplicates for some reason - but thank you very much - made my whole year!!!

paul

And even more now green.

Absolutely fantastic day - thanks all :)

paul

And apart from duplicates for the file system entries, everything is now green.

So what did I do to get it working? - apart from correcting the community string - which is a netxms discovery issue, not the template - nothing. Just needed to wait :)

Tursiops

Glad to hear it works for you. :)

The reason for some data not working right away would've been because some items are polled every five minutes or so, while others are only polled once an hour (like total space, there was no expectation of that changing a lot).
So when the first check failed due to incorrect community, it would've taken another hour to fix itself. Right-click and force poll would've fixed it immediately.
You'll probably want to adjust the instance discovery script to filter out file systems you don't care about.
There are some examples on the forum I believe on how to filter by file system type (not just instance name).

Good luck. :)

paul

Thanks for the tip - be patient!!! I lost hours today expecting instance refresh - I know better now!

Well...so how am I progressing? It gets soooo much better :) :)

The template already had a number of exclusions so I can update / alter that as required.

But that is not the best bit....

Given that Windows uses the same HOST-RESOURCES-MIB storage table - I have been able to make a huge start on a Windows template.

Created a new template object and copied across the three file system DCIs.
Updated the Instance Discovery script: deleted the check that returns false unless the file system starts with a /, and replaced it with an exclusion for our CD drives (the Y: drive):
if (mountPoint->value ~= "^Y:(*.*)?" ) return false;
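
The effect of that exclusion can be sketched outside NetXMS. This is a Python illustration, not NXSL. The storage descriptions are hypothetical examples of what a Windows host might report, and the pattern is simplified to `^Y:`, which is enough to match any description that starts with that drive letter.

```python
import re

# Anything whose description starts with "Y:" is excluded.
EXCLUDE = re.compile(r"^Y:")


def keep_instance(descr):
    """Return True if the storage entry should become a DCI instance."""
    return EXCLUDE.search(descr) is None


# Hypothetical storage descriptions from a Windows host.
discovered = ["C:\\ Label:System", "D:\\ Label:Data", "Y:\\", "Physical Memory"]
kept = [d for d in discovered if keep_instance(d)]
print(kept)
```

Further exclusions (other drive letters, memory entries) can be added to the same pattern with `|` alternation.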

Which leaves me 75% towards a fully functioning Windows SNMP template.

CPU is proving tricky, as Windows reports CPU utilization for each processor rather than overall, so I need to be able to collect each processor as well as an aggregated total for all processors.
The OID in question is .1.3.6.1.2.1.25.3.3.1.2 (hrProcessorLoad) - will see how I go.
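
One way to get the overall figure is to average the per-processor values. The sketch below is Python, not NXSL, and only shows the arithmetic. It assumes the per-processor load values (one integer percentage each) have already been collected, for example by walking the OID above. The sample values are invented.

```python
def aggregate_cpu(loads):
    """Average per-processor load percentages into one overall figure."""
    if not loads:
        return None  # no processors reported, nothing to aggregate
    return sum(loads) / len(loads)


# Hypothetical walk result for a four-core host.
per_core = [10, 30, 50, 70]
print(aggregate_cpu(per_core))  # 40.0
```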

Also of note - in the template there is code for a driveAlertCustom - which I have yet to play with. I assume that it is there to allow for different thresholds for different mounts?



Tursiops

I probably should've cleaned that template up a bit more before posting it.
As I said, this is an older template I built before we switched to agents for Linux.

However, we do still use it for NAS devices and the driveAlertCustom is something that was added to allow for different threshold alerts depending on mount points. It's a combination of auto-bind rules, instance discovery and custom attributes.
By now, I would do this very differently and use script thresholds with default alert values in persistent storage and individual overrides as custom attributes.

So my suggestion for you: remove any reference to that from the instance discovery script. It doesn't do anything without the other stuff. And that other stuff is overly complicated. :)

paul

That template saved my sanity - including those *extras*. I have cleaned it out already; I only mentioned it because, as you pointed out, we also need different thresholds.

It may have been an old template - but there is nothing newer. It was a fantastic start and without it, I would have given up and moved on - seriously!!

Where has it now progressed to?
Windows CPU % busy is now working across the board. Thanks to Victor for the assist with this - pulled a few handfuls of hair out on this!!
Windows memory I am getting from the same storage table as the mounts, and as long as I remember that Virtual = Physical + Paging - memory monitoring is also fine.
Uptime in days is collected - both numerically for exception monitoring and in easy to read format for display on Overview.
Windows file systems - automatic discovery of each file system is working.
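
For anyone wanting to reproduce the uptime item above: SNMP reports uptime as TimeTicks, which are hundredths of a second, so both the numeric and the readable form are simple conversions. A minimal Python sketch of that conversion follows. It is not the NXSL used in the template, and the function names and sample value are made up for illustration.

```python
TICKS_PER_SECOND = 100  # SNMP TimeTicks are hundredths of a second
SECONDS_PER_DAY = 86400


def uptime_days(timeticks):
    """Uptime as a number of days, suitable for a numeric threshold."""
    return timeticks / (TICKS_PER_SECOND * SECONDS_PER_DAY)


def uptime_text(timeticks):
    """Uptime as a short readable string for display."""
    seconds = timeticks // TICKS_PER_SECOND
    days, rest = divmod(seconds, SECONDS_PER_DAY)
    hours, rest = divmod(rest, 3600)
    minutes = rest // 60
    return f"{days}d {hours}h {minutes}m"


# Hypothetical value: 3 days, 2 hours and 24 minutes of uptime.
ticks = 26_784_000
print(uptime_days(ticks), uptime_text(ticks))
```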

For me - less than 2 weeks into NetXMS - script thresholds and default alert values in persistent storage are still beyond me. I have yet to write a script, and the only script I have updated is the discovery hook to populate object name from sysdescription if name is not resolved on discovery.

So what is next?
Different defaults depending on the drive: system drive (C:) no more than 70%, data drive (D:) can go to 90%, and paging drive (P:) no threshold, as it only holds the paging file.
Overriding the default per server: some servers we never change can run at a higher utilization, so there is no need to add space just to get under a threshold.

I do need different thresholds for individual mount points so I will need to work that out.

I have no preference which way to go - I just want to get there in the quickest and easiest manner. Once finished (or sooner) I will post the template so that any other new users can get quickly up to speed if they use SNMP based system monitoring.



There are thousands out there like me - inherited SNMP based monitoring along with entrenched mindsets against agents - so I work within the limits I have. There are also those who are frustrated with the monitoring that they have (SolarWinds Orion (NPM) and ManageEngine OpManager) - my two current hair pullers.

Making NetXMS the easy and obvious choice would be fantastic - all it needs, really, are a few additional basic templates. Everyone I have shown it to is impressed - the loading time is fantastic, the ease of navigation (once you get the hang of it), multi line trap display, multi line emails of selected alarms, comments on alarms that can be updated and deleted. So many of the things in my other products that cause me angst are solved with NetXMS.

So, if you would like to do it differently - feel free to make some suggestions. The scope is as follows:
SNMP only
1000+ devices
Windows (Server / Desktop / CE)
Linux / Unix (RHEL/CentOS/Debian/Ubuntu/Solaris/Raspbian)
Cisco switches
Other devices that support SNMP monitoring such as Firewalls, load balancers, etc.
SNMP Traps - need to actively support multi line, 15+ varbind traps - including presentation in readable format in both console and via email.

I know NetXMS can do all the above and I am about 95% there already.

I am happy to grind away and work out the rest - however, specific thresholds for DCIs contained within an instance-populated table require NXSL and NetXMS knowledge I simply do not have (yet), and without examples that I can copy and modify, it is an uphill grind. I have 2,000+ traps to work out how to get into NetXMS - NetXMS only includes 8 by default - so I already have my work cut out.

Am I complaining - no way!! - NetXMS is fantastic and the template you supplied was / is awesome. I just need to work out thresholds both the old way and the new way - (we have NAS stuff as well) - and then I can actually go live with this.

Tursiops

I did some work on my test system in regards to thresholds, as I've had a similar internal request for a while, so I can put it to use on our own system at some point.

For your Volume Space Used (%) DCI (on the instance DCI in your template), add a script threshold as follows (the value for script thresholds needs to be "1"):
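// 1. Per-node, per-file-system override (custom attribute on the node)
if ( GetCustomAttribute($node,"fileSystem_Warning_".$dci->instance) != NULL ) { threshold = GetCustomAttribute($node,"fileSystem_Warning_".$dci->instance); }
// 2. Per-node override (custom attribute on the node)
else if ( GetCustomAttribute($node,"fileSystem_Warning") != NULL ) { threshold = GetCustomAttribute($node,"fileSystem_Warning"); }
// 3. Global per-file-system override (persistent storage)
else if ( ReadPersistentStorage("fileSystem_Warning_".$dci->instance) != "" ) { threshold = ReadPersistentStorage("fileSystem_Warning_".$dci->instance); }
// 4. Global default (persistent storage)
else { threshold = ReadPersistentStorage("fileSystem_Warning"); }
// Nothing configured anywhere: there is no threshold to compare against
if ( ( threshold == null ) || ( threshold == "" ) ) { return null; }
// $1 is the current DCI value; returning 1 marks the threshold as reached
if ( $1 >= threshold ) return 1;
else return 0;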

That'll be for your "Warning" Threshold.
Create two more, one using fileSystem_Error and one using fileSystem_Critical instead of fileSystem_Warning. Those will be your Error and Critical thresholds, assign events as required.

This will allow you to use the following "Persistent Storage" variables (which you must create first):

  • fileSystem_Critical
  • fileSystem_Error
  • fileSystem_Warning
These will be the default thresholds for those states.

You can override them in three ways:

  • A global override for a specific file system, e.g. a Persistent Storage variable called "fileSystem_Critical_P:" with a value of 98 will mean any file system with instance "P:" will trigger a critical threshold at 98 or higher.
  • A per-node override. This overrides all global thresholds and will be the new default for this node for all file systems. Simply create a "fileSystem_Critical" (or _Error or _Warning) Custom Attribute on the node and set it to the required value.
  • A per-node, per-file system override. Same as with the global override for a file system, but create this as a custom attribute as opposed to a Persistent Storage variable. For example, creating a "fileSystem_Critical_P:" custom attribute with a value of 50 will trigger the critical threshold for file system P: on this particular node at 50 or higher.

Mix and match as required.

Priority in thresholds is as follows (highest first):

  • Per-Node & Per-FileSystem Override
  • Per-Node Override
  • Global Per-FileSystem Override
  • Global Default
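
To make the lookup order concrete, here is a small Python model of it. This is not NXSL and does not talk to NetXMS. The two dictionaries merely stand in for a node's custom attributes and for the global Persistent Storage, and the function names and sample values are made up for illustration. Unlike the script, it treats an empty value as unset at every level.

```python
def resolve_threshold(level, instance, node_attrs, storage):
    """Pick the threshold for one file system, most specific source first.

    level:      "Warning", "Error" or "Critical"
    instance:   file system instance, e.g. "P:" or "/var"
    node_attrs: stand-in for the node's custom attributes
    storage:    stand-in for global persistent storage variables
    """
    base = "fileSystem_" + level
    candidates = [
        node_attrs.get(base + "_" + instance),  # per-node, per-file-system
        node_attrs.get(base),                   # per-node
        storage.get(base + "_" + instance),     # global per-file-system
        storage.get(base),                      # global default
    ]
    for value in candidates:
        if value not in (None, ""):
            return float(value)
    return None  # nothing configured anywhere


def breached(value, threshold):
    """True when a threshold exists and the value has reached it."""
    return threshold is not None and value >= threshold


# Hypothetical configuration.
storage = {"fileSystem_Critical": "95", "fileSystem_Critical_P:": "98"}
node_attrs = {"fileSystem_Critical": "90"}

limit = resolve_threshold("Critical", "P:", node_attrs, storage)
print(limit, breached(92, limit))
```

In this example the per-node value of 90 wins over the global "P:" value of 98, because any per-node override ranks above every global setting.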

As always, there may be easier/faster/better/more efficient ways of doing this.
The above worked on my test system (I was testing with Linux, i.e. /, /var and similar instances, but should work with P:, C:, etc. as well) and should get you started.
If nothing else, it might serve as an example of Persistent Storage & Custom Attribute usage. :)

Note: What the above will not do is include the actual disk space usage value in the alert message.

paul

Well - another fantastic piece of assistance - in and already working!! :) :)

What is nice is that when opening Alarm details - the DCI is shown AND each of the threshold alarms that have triggered are shown in events.

The only observation to make is that it does actually include the space usage in the message - it just does not include the threshold from the Persistent Storage variable.

I assume - somehow - that this can be inserted / embedded into the alert somewhere.  Worst case - a feature request to be able to specify a Persistent Storage variable as the "exception" in Threshold definition screen for script thresholds so that the custom threshold is passed across in the event / alarm title rather than "script(1)".

I have not started on specific drive level yet - just getting it up and going is 95% of all battles for me - so very happy so far.