Event Processing stops

Started by woodrivercontrols, February 22, 2025, 04:44:08 AM

woodrivercontrols

Just recently we've started having an issue where NetXMS completely stops event processing. The events log in nxmc shows the last processed event and never updates, and the event processing queue begins to pile up. Restarting netxmsd.service clears the queue, and processing works again for a while.
The VM has plenty of resources, and it doesn't seem like memory, CPU, or I/O is the limiting factor.
I set the debug level to 6 but was unable to see anything in the logs that looked relevant. However, debug level 6 does spit out a ton of logging, so I may have missed something. Is there a higher debug level I should be using? My biggest concern is that the logs at higher levels put out so much information that I would need to catch the issue within minutes at our current log retention. Can I limit the logging to specific topics, or just keep the logs for longer?
Any other troubleshooting advice would be welcome as well.
Thanks

Filipp Sudanov

If you do
debug event.* 8
in server debug console, this will set debug level 8 only for event.* debug tags. This should produce way less lines in the log.

It could be that the event processing thread locks up. When the situation starts, pls run this script three times at 20-30 second intervals:
https://github.com/netxms/netxms/blob/master/tools/capture_netxmsd_threads.sh
The script requires gdb to be installed on the system. It produces files in the /tmp folder, pls share these with us.

What version of NetXMS are you using?
What is the value of Events.Processor.PoolSize in Configuration->Server Configuration?

woodrivercontrols

Setting debug event.* 8 helped immensely with the logs. Processing locked up yesterday while I was out, but I was still able to view what I needed because the logs hadn't been overwritten yet.
The last event that shows up in the Events Log tool in NXMC is SYS_SSH_OK for CN-RKr-BH at 17:58:21. In the screenshot I can see what I believe to be the issue beginning right after that: the "SQL Query Failed" log entries start at that point and fill up pretty much everything since then.
[Screenshot: logs at time of lockup]

I have the files generated by running the script, but I am unsure of the best way to share them on the forum. How would you like me to share those?

Filipp Sudanov

I've sent a link in private message

Filipp Sudanov

Hi,
Yes, I see the files. Sorry, can you please install the netxms-dbg package on the system, then capture and upload the files again?

Also please provide the following:
What version of NetXMS are you using?
What is the value of Events.Processor.PoolSize in Configuration->Server Configuration?

woodrivercontrols

I am not sure if this is relevant, but our database is located on a separate VM. Running 'sudo apt-get install netxms-dbg' installed several new packages on the system and broke our installation. I have updated everything and brought it back online now, and will send the new files momentarily.

Our running version was 5.1.3, it is now 5.1.4. The issue continues.
Events.Processor.PoolSize is 1200.

Filipp Sudanov

Yes, because netxms-dbg by default wanted to install 5.1.4, and since the versions of all packages have to be the same, it pulled in the newer version of all netxms packages. There's a way to tell apt-get to install a specific version of a package.

On Events.Processor.PoolSize - it's actually limited to 128, and its effect depends on what you have in Events.Processor.QueueSelector. By default that is %z, so that events from each zone go to a separate queue. So if you have just one zone, everything still goes into one queue. Using something else for the queue selector might be dangerous. Overall, unless your system is really huge, one event processor should be enough, but EPP rules should not take a long time to do their job.

So, what's happening on your system: you have an EPP rule which calls an action (which probably sends a notification). Preparing the text for this action calls a script via the %[script_name] macro. The thing is that while this script is running, it blocks further EPP processing, and your script is talking to some API. I've sent some more details in PM.

Once the text is prepared, subsequent processing of this action is detached from EPP processing and not blocking anything.
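
To make the blocking effect concrete, here is a rough sketch in Python (not NXSL, and not how netxmsd is actually implemented). It assumes a single loop that handles events in order; `slow_text_preparation` and `process_events` are hypothetical names standing in for a macro script that waits on an external API and for the event processor.

```python
import threading
import time

def slow_text_preparation(event):
    # Hypothetical stand-in for a %[script_name] macro that waits on an API.
    time.sleep(0.2)
    return f"notification for event {event}"

def process_events(events, offload):
    """Handle events in order; return how long the processing loop was busy."""
    workers = []
    start = time.monotonic()
    for event in events:
        if offload:
            # Slow work is handed off, so the loop moves on to the next event.
            worker = threading.Thread(target=slow_text_preparation, args=(event,))
            worker.start()
            workers.append(worker)
        else:
            # Slow work runs inline, so every later event has to wait for it.
            slow_text_preparation(event)
    elapsed = time.monotonic() - start
    for worker in workers:
        worker.join()
    return elapsed

blocking = process_events(range(5), offload=False)
detached = process_events(range(5), offload=True)
print(f"inline: {blocking:.2f}s, detached: {detached:.2f}s")
```

With the slow step inline, the loop is busy for roughly the sum of all script run times; if a script never returns, the loop never reaches the next event.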

So here is what you can improve:

- Instead of a notification sending action, you can have a script action, and in that script you can talk to the API and call SendNotification() - this won't delay EPP processing.

- Depending on what this API is doing, maybe you can have a scheduled task for an NXSL script that would pull the data, parse it (NXSL has support for JSON parsing) and store it in persistent storage or in custom attributes of some object. Then it would be faster to get the data from there.

- When calling a web service from a script, there's an acceptCached parameter that can speed up operations with a web API: https://netxms.org/documentation/nxsl-latest/#_instance_methods_17
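
The idea behind the last two suggestions (reuse a recent answer instead of calling the API every time) can be sketched like this. This is plain Python for illustration only, not the NXSL web service API; `CachedFetcher`, `get` and `max_age_seconds` are hypothetical names, and the real acceptCached behaviour should be checked in the NXSL documentation linked above.

```python
import time

class CachedFetcher:
    """Return a stored value while it is fresh enough, otherwise fetch again."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch          # callable that performs the slow API request
        self._clock = clock          # injectable clock, handy for testing
        self._value = None
        self._fetched_at = None

    def get(self, max_age_seconds):
        now = self._clock()
        is_fresh = (
            self._fetched_at is not None
            and now - self._fetched_at <= max_age_seconds
        )
        if is_fresh:
            return self._value       # fast path: no request is made
        self._value = self._fetch()  # slow path: call the API and remember it
        self._fetched_at = now
        return self._value
```

A caller that can tolerate data up to a minute old would use `fetcher.get(60)`, so most calls return immediately without touching the remote service.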

woodrivercontrols

Thank you for the help. I discovered a section in the script that could loop infinitely if the web service it was accessing was offline. I fixed this oversight in the script and moved the script to be the Event Processing action as per your first suggestion, and we haven't had the issue since.
Thanks again!
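
For anyone hitting the same problem: the general fix for a loop that retries forever while a service is offline is to cap the number of attempts. The actual script was not posted, so this is only a generic sketch in Python rather than NXSL; `call_with_retries` and its parameters are hypothetical.

```python
import time

def call_with_retries(call, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Try `call` a limited number of times instead of looping until it works."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except OSError as error:  # ConnectionError and TimeoutError are OSError subclasses
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # brief pause before the next attempt
    # Give up so the caller can continue instead of hanging forever.
    raise RuntimeError(
        f"service unavailable after {max_attempts} attempts"
    ) from last_error
```

The caller can catch the final error and fall back to a default message, so an offline service costs a few seconds at most.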