NetXMS grinds to a crawl - Event Processing effectively stops

Started by paul, July 20, 2019, 09:50:32 AM

Previous topic - Next topic

paul

I think I know what is happening and perhaps even what is contributing - but do not know why.

Each night at the moment we get to a point where Events themselves come in and are registered as events but Event Processor starts to grind and Alarms stop being generated - slowly coming through hours later.

A  sh q gives me 330042 items in the Event Processor queue - it is clearly backing up.

Restarting NetXMS core works - it picks up those 300k events and processes them - clearing that queue - but not the ideal way to fix.

There seems to be no settings or options that would provide Event Processing with extra capacity to perform Event Processing Policy.

My mitigation thoughts - reduce number of alarms - but what else can I do?

as a bare minimum - can I add Event Processor as a DCI so I can set a threshold on Event Processor queue - at least let me know when I am in trouble!!

paul

Happened again today - 250,000 queued - another NetXMS restart needed.

Appears that the queue is slowing dropping - about 1,000 per minute - do not have 4 hours to wait for NetXMS to catch up.

An extra 2 cores and 4 GB memory did not help. 

Can see, consistently, CPU usage climbs 15-20% when "grind-to-a-halt" starts.

threads before restart = 491 CPU average = 23% - Events queued 250,000
threads after restart = 482 CPU average = 5% - Events queued zero.

Screen prints attached.

Tursiops

Are you monitoring "Server.TotalEventsProcessed" on your server (DCI)?
It could help identify sudden spikes in events giving you a place to start looking, e.g. does this always happen at the same time? Or after roughly the same time period?
What is the server doing at that time (other than piling up events)? Are there resource issues? Is the housekeeper running? Are there database locks?

For comparison, our system is currently handling 220 events per seconds on average without any issues. We're running 60 EPPs some with somewhat complex filter scripts, but I believe your number is more than double that?
Are you using "Stop event processing" to stop processing EPPs after a matching policy? Without it every event will be checked against all policies. Unless your events match a whole range of policies this shouldn't have a massive impact, but throwing it out there anyway.

paul

Thanks for the feedback.

As you can see in the attached CPU usage chart, apart from the first one that happened at around 16:00, they all seem to start around 10:00 each day.

The last one that required a service bounce was on the 23rd. On that day, I changed some of the Events to have the "stop processing" ticked.

My daily backup starts at 14:10 so it is not that.

I dug out more details from Events Processed and can see that there is a spike when the problem seems to start. Looks like "something" triggers a dump of events at around 10:00 (the high Event Processor queue) of which NetXMS tries to process - getting through about 600 per minute - but due to the where volume of events, simply cannot get through them fast enough.

I will progress through the rest of the Event Processor policies and change them to "Stop event processing" as even though I do have some events that have multiple policies - I do not have any event that has two policies that actually do anything.

I would have thought that, from a speed perspective, that NetXMS knows which policies are triggered for each event and is not doing a full Event Processor Policy scan of every policy for each event - if it did, no wonder it grinds :( - but does not explain my issue - my issue seems to involve a bulk dumping of events to process.

Will put an exception trigger of 500 for events processed in last minute - must exceed this for 5 minutes - and see if I can pick it up this way.

Tursiops

Event processed per minute, are you transforming that into a per-second average or is that literally 500 events per minute that are being processed?
If it's the latter, while it sure looks like your system is hitting a ceiling, it wouldn't be a lot of events to process.
As I mentioned, ours is handling ~220 events per second.
A recent test I ran as part of a question on Telegram showed that our system could handle ~4500 events per second.

In addition to finding what's causing the daily event spikes (the occasional 1,000,000,000+ event spikes on your monthly graph are obviously roll-overs of some sort which you might want to filter out or cap the y-value on the graph), maybe have a look at the NetXMS server disk I/O, CPU and RAM.

paul

No transforming - straight from the Performance tab - NetXMS template standard monitors.
NetXMS server: events processed for last minute

As for capacity to process - when we restart we rip through the backlog in a minute which, for a backlog of 300k equates to about 3-5,000 per second.

I think that in reality, we quite possibly really are only processing about 300 "events" per minute - not very much at all.

We do get a large number of traps that I have set not to write to event log and hence no "event processing" - might look to see if anything their.

DISK I/O queue is average of 0.5 apart from when the 10 minutes when the backup runs - spikes out to 15 for a minute or so.  Sits on FLASH storage so expect this to be fast.

CPU - plenty of cores and also plenty of memory - now. I noticed the "physical memory used by process" had a steady climb - I assume it was the every growing Node count. Deleted my 3000 duplicate nodes and got 1GB back - had added 4GB just to be safe - will watch to see how this goes.

Not in huge hurry to reproduce this - just really strange that it happens.

The really annoying part is that as Event Processor queue is not a performance DCI, I cannot track this issue any better.

On a huge plus side - if you are able to process 220 events per second and I am doing about 300 per minute,  I clearly have capacity to "lean" on NetXMS more. I have ORION in my sights and my only concern (apart from the grind) was capacity.

I think that the 4GB extra ram, the 2 extra cores and the deletion of my duplicate Nodes has probably pushed this issue down so it is not triggering like it was.  Will keep an eye on this though - and would really like to see Event Processor Queue as a Template item - NetXMS critically depends upon this - so instrument it - please!!



paul

Stumbled across this 2.2.16 fix
NX-869 (nxevent sometimes hanges after sending event)

I wonder.....

Will do an upgrade to 2.2.16 or 3.0 in a couple of weeks so will see then.

Victor Kirhenshtein

Hi,

did you check what events were processed during spike? That can gave an idea where to look. Also, do you have filter scripts in your event processing policy?

Best regards,
Victor

paul

Went to 2.2.16 as my primary work around.

No filtering scripts but all policies had the default "stop processing" unticked.

I still seem to have a CPU usage increase around 10:00 each day - but now with my 4000 duplicate nodes deleted plus "stop processing" ticked, do not seem to be being impacted.

I know how to probably check (now I have a few minutes to think about it) - open up the alarms with create time when I think problem starts and compare to event time - they are normally the same so any difference indicates Event Processor is lagging.

Will look a bit closer and see if anything stands out.





paul

Well - 2.2.16 did not fix it :(

Did some load testing and was seeing a consistent 2,000 messages queued which meant a 5 minute delay between event time and Alarm creation time.

Whilst looking tonight, some clown pulled a mgmt interface and triggered a total of  5,000 events over 5 minutes - blowing my Event = Alarm time out again.

As it turns out, 2,500 of them are CISCO MAC Move events .1.3.6.1.4.1.9.9.215.1.3.3.0. - which I do not care about - AT ALL!!

What is my best option for this specific event to unburden event processing?
a) Delete the SNMP trap configuration so it does not become an event and takes away half the load of Event Processor.
b) Leave the configuration, but set the Event Processing Policy to "Do not change alarms" and "Stop Processing"
c) some other option.

Other than the above, what else can anyone suggest I do to speed up Event Processing?

FYI - We are up at 152 EPP's now and growing.


Tursiops

To stop the event from coming in in the first place, you would have to reconfigure your Ciscos. I'm guessing no snmp-server enable traps mac-notification should do it.
Otherwise, as long as the traps come in, NetXMS will have to process "something"?

paul

Thanks for the pointer - extremely appreciated :)

Sadly (for me) this appears to be correct - I need to disable these at Switch level. If I delete the SNMP_TRAP_PROCESSOR for these NetXMS will treat this as unmapped trap and process it anyway - as an unmapped trap event. Sometimes, you just cannot win :(

24,000 traps in last hour - 83% MAC moves.

Command Default
The sending of SNMP MAC notification traps is disabled.

no snmp-server enable traps mac-notification [ change] [ move] [ threshold]

Will see if my Network Admins will let me suppress these - if not - they can have these traps in their NMS - which, ironically they do NOT have.

Does not fix my underlying woeful performance problem - but does provide a significant lightening on the load.




paul

When it gets behind, I can see a consistent 300 - 350 events per minute being processed instead of the 150-250 range it is able to handle. I can put a threshold alert on this!!

Today it got up to 113,000 events behind and had been getting behind since 23:00 last night - 21 hours of incorrect alert times. Still cannot alert on this though.

At bounce time, Event time ==Alert time was 4.5 hour difference.

Getting rid of those MAC move alerts would eliminate this problem (300k per day reduction in events)

The reporting that Events Processed in last minute of 2 Billion I think is not true - shutdown and restart looks like it just treats all queued as actioned - and clears that backlog.

It means that on restart, I lost four and a half hours of events potentially being potential alarms - which I will never know :(

Really would like to know why our NetXMS seems only capable of about 350 events per minute.

I suspect PostgreSQL standard/default install settings likely factor as there is nothing more at a NetXMS level I can alter to improve NetXMS. 

Updated as per post in Config - did not help - possibly even worse!! Event processing started to climb on restart - getting worse and worse.

The EPP's that yesterday I had changed from Disabled to Enabled - Do not change alarms - definitely made it worse. Changed back to disabled and Event Processor queue coming back down - slowly - but at least it is coming back down. Interestingly - Events processed in last minute is now nearly double since disabled.




Tursiops

The following won't help with your performance issue, but should explain the 2 billion events.

The 2 billion events would be the total number of events resetting to 0. In version 3 you can open the raw dci value graph to confirm that. So it would basically be the value of all events that the system had processed since last startup.
I see that with a few counters like that. In version 2 I used to add some code to just drop values that seem crazy high. In version 3 I'm wondering if I can use the raw dci data to detect a decrease in value, at which point I'd simply replace the current simple delta with the current raw dci value (as going down on a total value counter should only happen on a reset to 0). Similar with average delta per minute, except I'd need to take the timestamp into account as well.

paul

Having such huge values which are not an accurate reflection throws out completely the ability to use the data. Given that the "events" are not being truly processed, probably more like a bug.

Would be nice to have some of these things ti be able to set as an exception, or have them report properly in the first place.

As for performance - headache continues :( - more performance testing today, hitting our newer 600 EPM and backlog occurring. Good news though, having managed to get to 600 EPM through the Postgres settings, plus some EPP disabling, we caught up rather than triggering a bounce.

I have since disabled more EPP - down to 100 active - I can get the disabled events from Event log if I really need them. Not an ideal situation - but what options do I have - EVERYTHING else has been tried :(

On a positive note - my Network guys have agreed to stop the MAC move traps. On a negative note - they said they had done it - but traps keep coming. Will get that sorted tomorrow. :)

Holistically speaking, Event Processor Queue needs to be a DCI item as being behind can be a calamity if a mission critical alarm was to be delayed by an hour.  Faith in NetXMS is either there or it is not. If e cannot rely on alarms showing when the event comes in - what is the point of NetXMS? If we cannot even monitor when this is happening - how can we have faith?

I have an exception set for Events Processed in last minute of > 500 for last 4 polls;  indicating NetXMS "might" be under stress. Having a monitor for Event Processor Queue > 1,000 would be much, much, much better.

Even nicer would be the capacity to process 1,000 events per minute AND have 200 EPP's.

My friendly Network guys made their change - what a difference it made!! Fixes my immediate problem - but NetXMS still needs better way of reporting when Event Processing starts banking up.