Messages - paul

#1
General Support / Re: Does anyone have ZTE MIB files?
October 06, 2019, 03:01:13 PM
You might find that the ZXR10 group of MIBs from ZTE contains what you need - they are the switch/router MIBs for ZTE. They have helped some and not others.

There are 33 of them in total, and you can find all of them in the snmplabs public MIB repository.

Looking at https://mibs.observium.org/mib/ZXR10-MIB/ suggests trying an snmpwalk of the following (Board Temperature) first and seeing if you get a response back. If you do, you are in luck - one of the above MIBs has what you want - or just use that OID.

zxr10SlotTemperature       .1.3.6.1.4.1.3902.3.2.3.1.10
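
For reference, a walk of that OID with the standard net-snmp tools would look something like this (hostname and community string here are placeholders - adjust for whatever SNMP version your kit runs):

snmpwalk -v2c -c public 10.0.0.1 .1.3.6.1.4.1.3902.3.2.3.1.10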

If there is no response, just do a full snmpwalk and see what OIDs you get back. If they are somewhere in the 3902 range, the details of each range are broken down here:
https://mibs.observium.org/mib/ZXR10-SMI/

Other than that, not much I can think of.

Good Luck.


#2
Tursiops - correct as usual :)
https://wiki.netxms.org/wiki/NXSL:TIME

Would be nice for NXSL TIME to also have mona as the actual month - so I don't have to add 1 to mon myself.

Adding that within SetCustomAttribute will be fun. Something like one of the following I hope - but probably not that lucky.

SetCustomAttribute($node, "timelastcameup",localtime(time())->mday.".".((localtime(time())->mon)+1).".".localtime(time())->year.", ".localtime(time())->hour.":".localtime(time())->min.":".localtime(time())->sec);

or

SetCustomAttribute($node, "timelastcameup",localtime(time())->mday.".".localtime(time())->mon+1.".".localtime(time())->year.", ".localtime(time())->hour.":".localtime(time())->min.":".localtime(time())->sec);

Or going the long way:


day = localtime($1)->mday;
mon = localtime($1)->mon + 1;
year = localtime($1)->year;
hour = localtime($1)->hour;
min = localtime($1)->min;
SetCustomAttribute($node, "timelastcameup", day."-".mon."-".year.",".hour.":".min);


One of these should work.
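
If neither one-liner parses, a tidier fallback might be to call localtime() once and force the precedence with parentheses - an untested sketch using only the fields above:

t = localtime(time());
ts = t->mday.".".(t->mon + 1).".".t->year.", ".t->hour.":".t->min.":".t->sec;
SetCustomAttribute($node, "timelastcameup", ts);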
#3
Announcements / Re: NetXMS 3.0 released
September 12, 2019, 01:57:26 PM
Fantastic work Victor and the team - including those who helped in the testing.

I will wait for the first patch release - or Saturday - whichever comes first :)
#4
I have the following hook in my NodeUp and an equivalent in my node down. Apart from it being the longest way of doing this (first one I found when searching), I get a strange result.

Everything is correct - except the month, which is out by 1. It reports the month as being last month, not this month. I know the local time is correct (or being extracted correctly) because I am comparing the Node_Up and Node_Down event times to the custom attributes - and everything matches exactly - except the month, which is out by one.

SetCustomAttribute($node, "timelastcameup",localtime(time())->mday.".".localtime(time())->mon.".".localtime(time())->year.", ".localtime(time())->hour.":".localtime(time())->min.":".localtime(time())->sec);

Any ideas?
#5
Thanks for the feedback - appreciated.

Memory I can add - I will go to 24GB and see if there is a noticeable difference. Being Windows, shared_buffers tops out at 512MB I believe (a Windows limit), with effective_cache_size to be used instead.

Found that the antivirus exclusions were not working - getting them fixed to leave the Postgres folders alone!!

As we are on SAN-based flash, it is fast but not local, so I will leave the commit and WAL settings where they are for the moment.

Plugged in 24GB and 2 more cores and got the following. It confirms effective_cache_size over shared_buffers for Windows. Will see how these go.

# DB Version: 9.6
# OS Type: windows
# DB Type: oltp
# Total Memory (RAM): 24 GB
# CPUs num: 8
# Data Storage: san

max_connections = 300
shared_buffers = 512MB
effective_cache_size = 18GB
maintenance_work_mem = 1536MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 6844kB
min_wal_size = 2GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
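
After the restart, something like this in psql should confirm what the server actually picked up (just the settings I care most about):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem', 'max_wal_size');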
#6
Having such huge values that are not an accurate reflection completely throws out the ability to use the data. Given that the "events" are not being truly processed, it is probably more like a bug.

Would be nice to be able to set some of these things as an exception, or have them report properly in the first place.

As for performance - the headache continues :( - more performance testing today, hitting our newer 600 EPM and a backlog occurring. Good news though: having managed to get to 600 EPM through the Postgres settings, plus some EPP disabling, we caught up rather than triggering a bounce.

I have since disabled more EPP - down to 100 active - I can get the disabled events from Event log if I really need them. Not an ideal situation - but what options do I have - EVERYTHING else has been tried :(

On a positive note - my Network guys have agreed to stop the MAC move traps. On a negative note - they said they had done it - but traps keep coming. Will get that sorted tomorrow. :)

Holistically speaking, the Event Processor Queue needs to be a DCI item, as being behind can be a calamity if a mission-critical alarm were to be delayed by an hour. Faith in NetXMS is either there or it is not. If we cannot rely on alarms showing when the event comes in, what is the point of NetXMS? If we cannot even monitor when this is happening, how can we have faith?

I have an exception set for Events Processed in Last Minute of > 500 for the last 4 polls, indicating NetXMS "might" be under stress. Having a monitor for Event Processor Queue > 1,000 would be much, much, much better.

Even nicer would be the capacity to process 1,000 events per minute AND have 200 EPP's.

My friendly Network guys made their change - what a difference it made!! It fixes my immediate problem - but NetXMS still needs a better way of reporting when event processing starts banking up.
#7
The Admin guide is helpful for NetXMS settings, but I think my constraint is likely Postgres given I have a stock standard install - default settings.

We have 2,000 nodes, 50k objects, 70k DCIs and 150 EPPs. Once we get above 300 events per minute we start backing up.
Maximum backlog = 113K events - taking 4.5 hours from Event creation to Alarm creation.

I tried https://pgtune.leopard.in.ua/#/ and got the following:

# DB Version: 9.6
# OS Type: windows
# DB Type: oltp
# Total Memory (RAM): 12 GB
# CPUs num: 6
# Data Storage: ssd

max_connections = 300                      (currently 100)
shared_buffers = 512MB                     (currently 128MB)
effective_cache_size = 9GB                 (default - unknown)
maintenance_work_mem = 768MB      (default - 712MB)
checkpoint_completion_target = 0.9
wal_buffers = 16MB                           (default 16MB)
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 4466kB
min_wal_size = 2GB
max_wal_size = 4GB
max_worker_processes = 6
max_parallel_workers_per_gather = 3
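
If I go with these, the plan is to apply them via ALTER SYSTEM rather than hand-editing postgresql.conf - a sketch of what I have in mind (shared_buffers, wal_buffers, max_connections and max_worker_processes still need a service restart; the rest take effect on a reload):

ALTER SYSTEM SET shared_buffers = '512MB';
ALTER SYSTEM SET effective_cache_size = '9GB';
ALTER SYSTEM SET work_mem = '4466kB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();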

Other than going with the above, does anybody have any other suggestions as to which Postgres settings impact or assist NetXMS the most?

#8
When it gets behind, I can see a consistent 300-350 events per minute being processed, instead of the 150-250 range it is normally able to handle. I can put a threshold alert on this!!

Today it got up to 113,000 events behind and had been getting behind since 23:00 last night - 21 hours of incorrect alert times. Still cannot alert on this though.

At bounce time, the difference between event time and alarm time was 4.5 hours.

Getting rid of those MAC move alerts would eliminate this problem (300k per day reduction in events)

The reported Events Processed in Last Minute figure of 2 billion is, I think, not true - on shutdown and restart it looks like everything queued is just treated as actioned, which clears that backlog.

It means that on restart I lost four and a half hours of events that were potentially alarms - which I will never know about :(

Really would like to know why our NetXMS seems only capable of about 350 events per minute.

I suspect the standard/default PostgreSQL install settings are likely a factor, as there is nothing more at the NetXMS level I can alter to improve NetXMS.

Updated as per my post in Config - it did not help - possibly even worse!! The event processing backlog started to climb on restart - getting worse and worse.

The EPPs that I had changed yesterday from Disabled to Enabled ("Do not change alarms") definitely made it worse. Changed them back to Disabled and the Event Processor queue is coming back down - slowly - but at least it is coming back down. Interestingly, Events Processed in Last Minute is now nearly double since disabling them.



#9
Thanks for the pointer - extremely appreciated :)

Sadly (for me) this appears to be correct - I need to disable these at the switch level. If I delete the SNMP_TRAP_PROCESSOR for these, NetXMS will treat them as unmapped traps and process them anyway - as unmapped trap events. Sometimes you just cannot win :(

24,000 traps in last hour - 83% MAC moves.

Command default: the sending of SNMP MAC notification traps is disabled.

no snmp-server enable traps mac-notification [change] [move] [threshold]

Will see if my Network Admins will let me suppress these - if not, they can have these traps in their NMS - which, ironically, they do NOT have.

Does not fix my underlying woeful performance problem - but it does significantly lighten the load.



#10
Well - 2.2.16 did not fix it :(

Did some load testing and was seeing a consistent 2,000 messages queued, which meant a 5-minute delay between event time and alarm creation time.

Whilst looking tonight, some clown pulled a management interface and triggered a total of 5,000 events over 5 minutes - blowing my event-to-alarm time out again.

As it turns out, 2,500 of them are Cisco MAC move events (.1.3.6.1.4.1.9.9.215.1.3.3.0) - which I do not care about - AT ALL!!

What is my best option for this specific event to unburden event processing?
a) Delete the SNMP trap configuration so it never becomes an event, taking away half the Event Processor load.
b) Leave the configuration, but set the Event Processing Policy rule to "Do not change alarms" and "Stop processing".
c) Some other option.

Other than the above, what else can anyone suggest I do to speed up Event Processing?

FYI - We are up at 152 EPP's now and growing.

#11
Another fantastic explanation. Thanks Victor.

Most important statement: Is unreachable universal? = yes
In both 2.2 and 3.0 "unreachable" means that NetXMS server cannot reach the node for any reason.

Second most important: clarity on the flags - whether it is the same flag with different values or separate flags:
"NETWORK_PATH_PROBLEM is an additional indicator"

Will look at undoing my customization at some point, though I do like the last-time-up and last-time-down custom attributes I currently add :)


#12
Great - "state" attribute works for me ;)

The only thing to know is that a Node, Unreachable because of a Network_Path_problem, will also be SNMP_Unreachable.

So what status will that be?

In my perfect world - a Network_Path_Problem is where my ADSL/BDSL link is down but my 3G backup kicks in - so has a problem but is reachable.

For a Network_Path_Problem where I only have ADSL/BDSL, the upstream device - the router - is Unreachable, and so is my switch. Both of those I consider Down.

This would match the 2.2.16 flag of being unreachable - but has 3.0 undone that by having the switch as Network_Path_Problem?

Don't get me wrong - I love the fact that "state" has the various possible values - but it might be better for each of those to be a flag in their own right.

If each one were its own flag, I could have one container for nodes_down (to see who is impacted) and one excluding Network_Path_Problem - so I can see where the actual problem is.

#13
Fantastic - thanks Victor.

Coding style - mine is mostly a hack-and-slash approach based on something else I have seen as an example or that I know works. Once it works, rather than go back and clean it up - more trial and error - I move on to the next highest-priority task.

The more I do with NetXMS, the better I will get.

Between this and the runtime flag for unreachable - very, very happy person I am today :)
#14
Wow - great to know. As Tursiops notes though, I need to see what these represent in v3 before pushing forward with this completely.

I did implement a solution based on custom attributes, hooked into SYS_NODE_DOWN and SYS_NODE_UNREACHABLE / SYS_NODE_UP. My status polling hook did not have access to the outcome of the status poll (which I now assume the runtime flag does have), which meant that if SYS_NODE_UP was not triggered, the container would not update. I can at least get that hook working properly - at last.

The only real problem I have with my solution - and potentially also when using the runtime flag - is that the container holding down devices stays red even when the last device is removed. If I progress to using the runtime flag rather than binding and unbinding via script, I expect that problem will also be solved.

I think it stays red because I unbind before SYS_NODE_UP changes the node colour back to green by clearing the SYS_NODE_DOWN alarm. It is a timing issue, and I do not want to add a "wait 5 seconds" to the various hooks - I can live with nothing being down but the container staying red.
#15
My maps had issues going to 2.2.16.

Only had 2, so I deleted and recreated them.