Topics - tarnmensch

General Support / nxagentd.exe eating up memory, maybe related to the Oracle subclient

September 17, 2015, 02:45:56 PM

Hello again!

So far our network and its monitoring have been behaving quite normally, but one single server (the only one with the Oracle subagent enabled) has been doing strange things lately. Maybe these observations are somehow related to each other:

The server dropped out of our DNS twice. Its IP was still the same, but other nodes no longer knew its name. A restart solved the problem.
That restart took almost 30 minutes. The event log shows that some, but not all, OracleConsole processes failed to start within 16 minutes. In the meantime, you can't access the desktop (even via the VM console), but the databases are already reachable.
Two weeks passed between the first and the second DNS dropout, and in both cases, a week beforehand, we could no longer connect via RDP as domain users, only as the local admin.
When pinging itself, the server uses IPv6 by default, even though that protocol is disabled. When I tried to enable IPv6, it told me that I needed a network card for that - how the hell did I open that RDP connection if there's no LAN interface?!
One of the six databases (4 of them, including this one, are monitored using the Oracle.DBInfo.IsReachable parameter) worked fine until last week, but now it keeps crashing within hours. There are different errors; one of them was that all 150 available sessions were in use - definitely not by that many users.
Right then we noticed that the nxagentd.exe process had reached a memory consumption of 2GB. After restarting the service, it went back to about 25MB but started growing again - see attached screenshot. CPU utilisation is low, though (only occasionally rising to 1 or 2%). (Unfortunately, I didn't monitor the agent prior to that finding.)

We're using Oracle 11.2.0.1.0 x64. This is the policy "oracle" (with the original names and passwords replaced), which is only applied to that server:

<config>
	<agent>
		<subagent>oracle.nsm</subagent>
	</agent>
	<oracle>
		<databases>
			<database id="1">
				<id>111</id>
				<tnsname>111</tnsname>
				<username>111</username>
				<password>xxx</password>
			</database>
			<database id="2">
				<id>222</id>
				<tnsname>222</tnsname>
				<username>222</username>
				<password>xxx</password>
			</database>
			<database id="3">
				<id>333</id>
				<tnsname>333</tnsname>
				<username>333</username>
				<password>xxx</password>
			</database>
			<database id="4">
				<id>444</id>
				<tnsname>444</tnsname>
				<username>444</username>
				<password>xxx</password>
			</database>
			<database id="5">
				<id>555</id>
				<tnsname>555</tnsname>
				<username>555</username>
				<password>xxx</password>
			</database>
		</databases>
	</oracle>
</config>

The failing database is a test instance. I imported a current datapump export from the live system and ran an update afterwards. This has worked fine many times before, so I'm not sure where the errors come from. It could be the datapump export/import, the update, the suspiciously huge nxagentd.exe or just another strange error on a server that seems to have some severe problem.

Could the (sub-)agent be opening more and more sessions, growing linearly and finally killing our database? (Why just this one?)
Is there a way to tell where these huge amounts of data come from?
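
To check whether the growth really is linear, the agent's memory could be sampled at intervals and fitted with a straight line. The following is only a rough sketch under assumptions: it parses the CSV output of the Windows command `tasklist /FI "IMAGENAME eq nxagentd.exe" /FO CSV /NH`, and the function names `parse_tasklist_kib` and `growth_per_hour` are made up for illustration. It says nothing about where the memory goes, only how fast it grows.

```python
import csv
import io
import re


def parse_tasklist_kib(csv_text, image_name="nxagentd.exe"):
    """Return the memory usage in KiB of the first matching process, or None.

    Expects rows like: "nxagentd.exe","1234","Services","0","25,600 K"
    The thousands separator depends on the locale, so every non-digit
    character in the memory column is dropped.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 5 and row[0].lower() == image_name.lower():
            digits = re.sub(r"\D", "", row[4])
            return int(digits) if digits else None
    return None


def growth_per_hour(samples):
    """Least-squares slope of (seconds, KiB) samples, in KiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must have different timestamps")
    return num / den * 3600
```

Calling `parse_tasklist_kib` on the command output every few minutes and passing the collected `(seconds, KiB)` pairs to `growth_per_hour` gives a growth rate that can be compared with the polling interval of the Oracle parameters.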

Thanks a lot to anyone who can help a bit here - I'm feeling a little uncomfortable about monitoring that server, as I fear the subagent is somehow messing with that system. As long as it's just the test database, it's OK, but 4 of 6 are our live databases.

Best regards from Germany

General Support / How to find out the event parameters?

September 07, 2015, 10:07:22 AM

Hello again!

(As always, sorry for a probably quite silly question...)

I've been learning a lot about NetXMS lately (even though I'm still far away from using complex scripts etc.), but I just can't figure out how to know the parameters for custom events.

I created the events DC_AGENT_REACHABLE and DC_AGENT_UNREACHABLE to be triggered by an AgentStatus DCI. But when an alarm was created and later on terminated, I noticed both events had different parameters. From testing and reading the descriptions of other DC events, I got this list:

DC_AGENT_REACHABLE parameters

1. Parameter name
2. Item description
3. Data collection item ID
4. ?!
5. Threshold value
6. Current value

DC_AGENT_UNREACHABLE parameters

1. Parameter name
2. Item description
3. Threshold value
4. Current value
5. Data collection item ID
6. Instance
7. Repeat flag

Please, can someone tell me why two (as far as I know) similarly created events have different parameters, and how to find them out without triggering test alarms? (It's been bugging me that I've accidentally sent test mails to the department mailing list a few times - partly with alarms like "There's less than 200% free space on drive C:".)
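
To illustrate why the parameter order matters, here is a minimal sketch of positional substitution in a message template. It is not NetXMS code: the function `expand` is hypothetical, it only handles numeric placeholders such as %1 to %99, and the parameter values in the usage note are invented.

```python
import re


def expand(template, params):
    """Replace %1..%99 in template with the matching 1-based entry of params.

    Placeholders without a matching parameter are left unchanged so that
    a wrong index stays visible in the resulting message.
    """
    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(params):
            return str(params[index - 1])
        return match.group(0)

    return re.sub(r"%(\d{1,2})", substitute, template)
```

With the invented list `["Agent.Status", "Agent status", "0", "1", "42"]`, the template `"Value %4 (threshold %3) for %2"` expands to `"Value 1 (threshold 0) for Agent status"`. If the same template is used with an event whose parameters come in a different order, the wrong values end up in the message, which is how texts like the "200% free space" one can happen.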

Thank you for your great work and support!

General Support / [solved, but found bug] Automatic alarm termination won't work in just one case

August 28, 2015, 03:52:53 PM

Hi there!

Sorry if I made some stupid mistake, but I just can't find it... My newly created DCI generates an alarm but doesn't terminate it automatically, though the other alarms that seem to be set up the same way work fine. Here's what I've got:

DCI: System.CPU.Usage15, triggering the event DC_HIGH_CPU_UTIL after 3 poll values greater than 70, deactivation event is set to DC_HIGH_CPU_UTIL_OK.
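Activation event: DC_HIGH_CPU_UTIL, Minor, Write to event log, Message: CPU-Last dauerhaft über %3 (Derzeitiger Wert: %4 für %2) [in English: CPU load persistently above %3 (current value: %4 for %2)]
Deactivation event: DC_HIGH_CPU_UTIL_OK, Normal, Write to event log, Message: CPU-Last wieder unter %3 (Derzeitiger Wert: %4 für %2) [in English: CPU load back below %3 (current value: %4 for %2)]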
Generation policy: if DC_HIGH_CPU_UTIL generate alarm %m with key HIGH_CPU_%i and execute action UHD-Mail
Termination policy: if DC_HIGH_CPU_UTIL_OK terminate alarms with key HIGH_CPU_%i and execute action UHD-Mail

To test the alarm, I added 71 in a transformation script and set the polling interval to 9 seconds. After the third poll, the alarm is triggered and an e-mail is sent. But when I reset the transformation script to "+0", the alarm doesn't get terminated, no e-mail is sent and there's not even a log entry. If I observed correctly, the "Last Values" tab even shows the threshold as OK as soon as the first poll is done, instead of waiting for 3 polls.
The fact that no log entry is created tells me that the event processing policy isn't even involved; it seems the event just isn't triggered.
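
The expected behaviour of such a threshold can be modelled roughly as follows. This is a simplified sketch, not the server's actual implementation: the class `Threshold` is hypothetical, and it assumes that the condition only matches when all of the last 3 values are above the limit, so a single lower value is enough to fire the deactivation event (which would match the "Last Values" observation above).

```python
from collections import deque


class Threshold:
    """Toy model of a 'last N polls greater than limit' threshold."""

    def __init__(self, limit, samples, activation_event, deactivation_event):
        self.limit = limit
        self.samples = samples
        self.activation_event = activation_event
        self.deactivation_event = deactivation_event
        self.window = deque(maxlen=samples)  # keeps only the last N values
        self.active = False

    def poll(self, value):
        """Feed one collected value; return the event fired by it, or None."""
        self.window.append(value)
        matched = (len(self.window) == self.samples
                   and all(v > self.limit for v in self.window))
        if matched and not self.active:
            self.active = True
            return self.activation_event
        if not matched and self.active:
            self.active = False
            return self.deactivation_event
        return None
```

Feeding the values 74, 73, 75, 3, 2 into `Threshold(70, 3, "DC_HIGH_CPU_UTIL", "DC_HIGH_CPU_UTIL_OK")` fires the activation event on the third poll and the deactivation event on the fourth. In this model a deactivation event is always produced, so a missing event points to something outside the threshold logic itself.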

Maybe I'm just blind, but please, can someone show me what I'm doing wrong here?

Thanks a lot!

Edit:

I think part of the problem was my transformation script. Instead of deleting it, I changed it from "$1+71" to "$1+0", but from that point on, the graph shows the value 0 almost all the time, with some short peaks to +1 as well as -1. As a CPU load of -1% is rather unlikely, I deactivated the transformation script, and now the values stay around 2 to 4.
Testing the same settings with System.CPU.Usage works just fine... Is it just incorrect to set a transformation script or new thresholds to trigger and terminate alarms? (For some scenarios it's rather hard to simulate failures for testing purposes, so setting normal values as critical ones is a very easy way to test the alarm processing.)

Whoops, sorry - but maybe this topic should stay up, just in case someone else tries the same... Here's my conclusion:

To test your alarms, don't change your transformation scripts just to get your values into and out of critical ranges - in my case terminating the alarm didn't work.
If you don't need your transformation script, delete it. Setting it to "$1+0" may cause errors.

General Support / SYS_AGENT_UNREACHABLE alarm not until 3 polls fail

August 10, 2015, 09:37:08 AM

Hello everyone!

Let me first thank you for a great piece of software as well as a very useful forum! I've been using NetXMS for about 2 months now, so I'm still learning a lot here.

My configuration works quite well, but here's my problem: To get a warning if an agent fails, I included SYS_AGENT_UNREACHABLE in the event processing policy, but I can't find a way to avoid alarms that are caused by reboots or just short network issues. Is there any way to make the alarm wait for 3 failed polls, like you can do for every data collection item as well as for the general node reachability state?

Sorry if I overlooked some solution here. The only idea I can think of is to set the AgentCommandTimeout variable to some very high value, but I think this could influence the NetXMS behaviour generally.
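
What I have in mind is roughly the following: an alarm should only be raised if the agent stays unreachable for longer than some grace period. This is only a sketch of the idea outside of NetXMS, not a server feature; the function `confirmed_outages` is made up, and it assumes that the matching recovery event is called SYS_AGENT_OK.

```python
UNREACHABLE = "SYS_AGENT_UNREACHABLE"
OK = "SYS_AGENT_OK"  # assumed name of the recovery event


def confirmed_outages(events, grace_seconds, now):
    """Return the times at which an alarm should be raised.

    events is a list of (timestamp_seconds, event_name) sorted by time.
    An outage only counts if no recovery event arrives within
    grace_seconds after the first unreachable event.
    """
    alarms = []
    down_since = None  # time of the first unreachable event of an outage
    for ts, name in events:
        if name == UNREACHABLE:
            if down_since is None:
                down_since = ts
        elif name == OK:
            if down_since is not None and ts - down_since >= grace_seconds:
                alarms.append(down_since + grace_seconds)
            down_since = None
    # An outage that is still ongoing counts once the grace period is over.
    if down_since is not None and now - down_since >= grace_seconds:
        alarms.append(down_since + grace_seconds)
    return alarms
```

With a status polling interval of 60 seconds, a grace period of 180 seconds would roughly correspond to 3 failed polls, so a reboot that finishes within a minute would not produce an alarm.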

Thanks a lot, bye!
