
Messages - tarnmensch

#1
Sorry, just stumbled upon this one again - of course, the Oracle version I posted was totally wrong. What I meant was 11.2.0.1.0.

Best regards,
Nicki
#2
Quote from: tarnmensch on September 22, 2015, 02:54:47 PM
To answer your questions: It's the 1.2.17 client running on Windows Server 2008 R2.
Do you mean the Oracle version posted above, or are you asking for some other client version? (I'm not too familiar with databases, you know... :-[ )

Best regards,
Nicki
#3
Hey Victor!

Thank you for pushing me right to the solution!  :)

Setting the debug level showed me that the fifth database specified in my Oracle policy caused errors. I hadn't noticed that the login wasn't working because I don't monitor that database; I only included it in the policy so I could do so later. The database user was locked, so an error message popped up every minute. I didn't know the agent tries to establish the connection even if no DCI will ever poll a value from it. Thinking about it, that's logical.

Anyway, the error should never cause the agent process to eat all the available memory!

To answer your questions: It's the 1.2.17 client running on Windows Server 2008 R2.

Cheers
Nicki

Edit:
Regarding the attachment: The agent process grew for a day, was then restarted twice, and still started to grow again. Then the Oracle policy was deactivated and the database user was unlocked. After another agent restart to get the Oracle subagent working again, the memory consumption seems to stay low. I hope it isn't still rising, just more slowly, but that wouldn't be too serious, as the server reboots automatically every few weeks.

2nd edit:
20 hours later, memory consumption has stayed between 22 and 24 MB. I think it's simply a bug that failing connections make the subagent grow steadily.
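
To tell "flat" from "still rising, just more slowly", a few readings can be run through a least-squares fit. This is only a minimal sketch: the function names and the sample readings are made up for illustration and are not taken from the attached screenshot.

```python
def growth_rate(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def hours_until(samples, limit_mb):
    """Hours after the last sample until limit_mb is reached, or None if not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    last_mb = samples[-1][1]
    return max(0.0, (limit_mb - last_mb) / rate)


# Hypothetical readings: (hours since agent restart, memory in MB).
leaking = [(0, 25), (1, 35), (2, 45), (3, 55)]
stable = [(0, 23), (5, 24), (10, 22), (20, 23)]

print(growth_rate(leaking), hours_until(leaking, 2048))
print(growth_rate(stable), hours_until(stable, 2048))
```

A slope near zero (or negative) over a long enough window suggests the process really has stopped growing; a small positive slope would show up here long before it becomes visible in Task Manager.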
#4
Hi,

thanks a lot, Victor, you're doing a great job as always!
#5
Hello again!

So far our network and its monitoring are behaving quite normally, but one single server (the only one with the Oracle subagent enabled) has been doing strange things lately. Maybe these observations are somehow related to each other:

  • The server dropped out of our DNS twice. Its IP was still the same, but the name wasn't known anymore by other nodes. A restart solved the problem.
  • That restart took almost 30 minutes. The event log says that some OracleConsole processes didn't start within 16 minutes, but not all of them. In the meantime, you can't access the desktop (even via VM console), but the databases are already reachable.
  • Two weeks passed between the first and the second DNS dropout, and in both cases, a week beforehand, we could no longer connect via RDP as domain users, only as the local admin.
  • When pinging itself, the server uses IPv6 by default, even though that protocol is disabled. When I tried to enable IPv6, it told me that I needed a network card for that  ??? - how the hell did I open that RDP connection if there's no LAN interface?!
  • One of the six databases (4 of them, including this one, are monitored using the Oracle.DBInfo.IsReachable parameter) worked fine until last week, but now it keeps crashing within hours. The errors vary; one of them was that all 150 available sessions were in use - definitely not by that many users.
  • Right then we noticed that the nxagentd.exe process had reached a memory consumption of 2GB. After restarting the service, it went back to about 25MB but kept growing again - see attached screenshot. CPU utilisation is low, though (just a few times rising to 1 or 2%). (Unfortunately, I didn't monitor the agent prior to that finding.)

We're using Oracle 11.2.0.1.0 x64. This is the policy "oracle" (without the original names and passwords), only applied to that server:

<config>
  <agent>
    <subagent>oracle.nsm</subagent>
  </agent>
  <oracle>
    <databases>
      <database id="1">
        <id>111</id>
        <tnsname>111</tnsname>
        <username>111</username>
        <password>xxx</password>
      </database>
      <database id="2">
        <id>222</id>
        <tnsname>222</tnsname>
        <username>222</username>
        <password>xxx</password>
      </database>
      <database id="3">
        <id>333</id>
        <tnsname>333</tnsname>
        <username>333</username>
        <password>xxx</password>
      </database>
      <database id="4">
        <id>444</id>
        <tnsname>444</tnsname>
        <username>444</username>
        <password>xxx</password>
      </database>
      <database id="5">
        <id>555</id>
        <tnsname>555</tnsname>
        <username>555</username>
        <password>xxx</password>
      </database>
    </databases>
  </oracle>
</config>
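
Since the agent tries to connect to every database listed in the policy, whether or not a DCI polls it, it can help to list what the policy actually contains. The following is a minimal sketch that only reads the XML; it assumes the element names shown in the policy above, and the function name `list_databases` is made up for illustration. It does not talk to the agent or to Oracle.

```python
import xml.etree.ElementTree as ET


def list_databases(policy_xml):
    """Return (id attribute, tnsname, username) for every <database> in the policy."""
    root = ET.fromstring(policy_xml)
    entries = []
    for db in root.findall("./oracle/databases/database"):
        # Passwords are deliberately left out of the result.
        entries.append((db.get("id"), db.findtext("tnsname"), db.findtext("username")))
    return entries


# Shortened, anonymized policy in the same shape as the one posted above.
SAMPLE = """
<config>
  <agent>
    <subagent>oracle.nsm</subagent>
  </agent>
  <oracle>
    <databases>
      <database id="1">
        <id>111</id>
        <tnsname>111</tnsname>
        <username>111</username>
        <password>xxx</password>
      </database>
      <database id="2">
        <id>222</id>
        <tnsname>222</tnsname>
        <username>222</username>
        <password>xxx</password>
      </database>
    </databases>
  </oracle>
</config>
"""

for entry in list_databases(SAMPLE):
    print(entry)
```

Comparing this list with the databases that actually have DCIs makes entries like an unmonitored fifth database stand out.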


The failing database is a test instance. I imported a current Data Pump export from the live system and ran an update afterwards. This has worked fine many times before, so I'm not sure where the errors come from now. It could be the Data Pump export/import, the update, the suspiciously huge nxagentd.exe or just another strange error on a server that seems to have some severe problem.

Could the (sub-)agent be opening more and more sessions, growing linearly and finally killing our database? (Why just this one?)
Is there a way to tell where these huge amounts of data come from?

Thanks a lot to anyone who can help a bit here - I'm feeling a little uncomfortable about monitoring that server, as I fear the subagent is somehow messing with that system. As long as it's just the test database, it's ok, but 4 of the 6 are our live databases.

Best regards from Germany
#6
General Support / How to find out the event parameters?
September 07, 2015, 10:07:22 AM
Hello again!

(As always, sorry for a probably quite silly question...)

I've been learning a lot about NetXMS lately (even though I'm still far away from using complex scripts etc.), but I just can't figure out how to know the parameters for custom events.

I created the events DC_AGENT_REACHABLE and DC_AGENT_UNREACHABLE to be triggered by an AgentStatus DCI. But when an alarm was created and later on terminated, I noticed both events had different parameters. From testing and reading the descriptions of other DC events, I got this list:

DC_AGENT_REACHABLE parameters

  • Parameter name
  • Item description
  • Data collection item ID
  • ?!
  • Threshold value
  • Current value
DC_AGENT_UNREACHABLE parameters

  • Parameter name
  • Item description
  • Threshold value
  • Current value
  • Data collection item ID
  • Instance
  • Repeat flag

Please, can someone tell me why two (as far as I know) similarly created events have different parameters, and how to find them out without triggering test alarms? (It's been bugging me that I've accidentally sent test mails to the department mailing list a few times - partly with alarms like "There's less than 200% free space on drive C:" :o )
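
To see what a message template would produce for each ordering without sending a real alarm, the positional substitution can be simulated offline. This is a minimal sketch and not NetXMS code: the `expand` function is made up, it only handles the numbered %1, %2, ... macros, and treating a missing parameter as an empty string is an assumption. The parameter values are invented; only their order follows the two lists above.

```python
import re


def expand(template, params):
    """Replace %1, %2, ... with the matching positional parameter (1-based)."""
    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(params):
            return str(params[index - 1])
        return ""  # assumption: a missing parameter expands to nothing

    return re.sub(r"%(\d+)", substitute, template)


# Order as observed for DC_AGENT_UNREACHABLE (values are made up).
unreachable = ["AgentStatus", "Agent status", "0", "1", "42", "", "0"]
# Order as observed for DC_AGENT_REACHABLE; "?" marks the unknown fourth parameter.
reachable = ["AgentStatus", "Agent status", "42", "?", "0", "0"]

template = "Threshold %3, current value %4 for %2"
print(expand(template, unreachable))
print(expand(template, reachable))
```

With the first ordering the template reads sensibly; with the second, %3 and %4 pick up the DCI ID and the unknown parameter, which is the kind of mismatch that otherwise only shows up in a test mail.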

Thank you for your great work and support!
#7
Hi there!

Sorry if I made some stupid mistake, but I just can't find it... My newly created DCI generates an alarm but doesn't terminate it automatically, though the other alarms that seem to be set up the same way work fine. Here's what I've got:


  • DCI: System.CPU.Usage15, triggering the event DC_HIGH_CPU_UTIL after 3 poll values greater than 70, deactivation event is set to DC_HIGH_CPU_UTIL_OK.
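  • Activation event: DC_HIGH_CPU_UTIL, Minor, Write to event log, Message: CPU-Last dauerhaft über %3 (Derzeitiger Wert: %4 für %2) [in English: CPU load persistently above %3 (current value: %4 for %2)]
  • Deactivation event: DC_HIGH_CPU_UTIL_OK, Normal, Write to event log, Message: CPU-Last wieder unter %3 (Derzeitiger Wert: %4 für %2) [in English: CPU load back below %3 (current value: %4 for %2)]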
  • Generation policy: if DC_HIGH_CPU_UTIL generate alarm %m with key HIGH_CPU_%i and execute action UHD-Mail
  • Termination policy: if DC_HIGH_CPU_UTIL_OK terminate alarms with key HIGH_CPU_%i and execute action UHD-Mail

To test the alarm, I set the transformation script to "$1+71" and the polling interval to 9 seconds. After the third poll, the alarm is triggered and an e-mail is sent. But when I reset the transformation script to "$1+0", the alarm isn't terminated, no e-mail is sent and there's not even a log entry. If I saw it correctly, the "Last Values" tab even shows the threshold as OK right after the first poll, instead of waiting for 3 polls.
The fact that there's no log entry created tells me that the event processing policy isn't even involved, it seems the event just won't be triggered.
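
For reference, this is how I would expect a "last 3 values greater than 70" threshold to behave. It is only a simplified model written for illustration, not how the server is actually implemented: the threshold counts as breached while all of the last N values exceed the limit, so a single normal value is enough to clear it. The function name and the poll values are made up; the event names are the ones from my setup.

```python
from collections import deque


def threshold_events(values, limit, samples):
    """Return (poll_index, event) pairs as the threshold switches on or off."""
    window = deque(maxlen=samples)
    active = False
    events = []
    for index, value in enumerate(values):
        window.append(value)
        # Breached only when the window is full and every value is above the limit.
        breached = len(window) == samples and all(v > limit for v in window)
        if breached and not active:
            events.append((index, "DC_HIGH_CPU_UTIL"))
        elif active and not breached:
            events.append((index, "DC_HIGH_CPU_UTIL_OK"))
        active = breached
    return events


# Made-up polls: four values pushed above 70, then back to normal.
polls = [73, 74, 75, 76, 3, 2, 4]
print(threshold_events(polls, limit=70, samples=3))
```

In this model the deactivation event fires on the first normal poll, which would match the "Last Values" tab showing OK immediately. It does not explain why no deactivation event was logged in my case.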

Maybe I'm just blind, but please, can someone show me what I'm doing wrong here?

Thanks a lot!


Edit:

  • I think part of the problem was my transformation script. Instead of deleting it, I changed it from "$1+71" to "$1+0", but from that point on, the graph showed the value 0 almost all the time, with some short peaks to +1 as well as -1 :o. As a CPU load of -1% is rather unlikely, I deactivated the transformation script, and now the values stay around 2 to 4.
  • Testing the same settings with System.CPU.Usage works just fine... Is it just incorrect to set a transformation script or new thresholds to trigger and terminate alarms? (For some scenarios it's rather hard to simulate failures for testing purposes, so setting normal values as critical ones is a very easy way to test the alarm processing.)


Whoops, sorry - but maybe this topic should stay up, just in case someone tries the same... Here's my conclusion:

  • To test your alarms, don't change your transformation scripts just to get your values into and out of critical ranges - in my case terminating the alarm didn't work.
  • If you don't need your transformation script, delete it. Setting it to "$1+0" may cause errors.
#8
Hello Victor,

Thank you for your reply and the great work - I finally found time to change to the AgentStatus parameter.

Best regards
#9
Hello everyone!

Let me first thank you for a great piece of software as well as a very useful forum! I've been using NetXMS for about 2 months now, so I'm still learning a lot here.

My configuration works quite well, but here's my problem: To get a warning if an agent fails, I included SYS_AGENT_UNREACHABLE in the event processing policy, but I can't find a way to avoid alarms caused by reboots or short network issues. Is there any way to make the alarm wait for 3 failed polls, like you can do for every data collection item as well as for the general node reachable state?

Sorry if I overlooked some solution here. The only idea I can think of is to set the AgentCommandTimeout variable to some very high value, but I think this could influence the NetXMS behaviour generally.
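
To illustrate what I mean by "wait for 3 failed polls": given a history of agent polls, only outages that last at least three consecutive polls should raise an alarm. This is a minimal sketch of that filter with made-up data and a made-up function name; it is not an existing NetXMS setting.

```python
from itertools import groupby


def long_outages(reachable, min_failed_polls=3):
    """Return (start_index, length) for each run of failed polls of at least the given length."""
    outages = []
    index = 0
    for ok, run in groupby(reachable):
        length = len(list(run))
        if not ok and length >= min_failed_polls:
            outages.append((index, length))
        index += length
    return outages


# Made-up poll history: True = agent answered, False = agent unreachable.
history = [True, False, True, True, False, False, False, False, True]
print(long_outages(history))
```

Here the single failed poll at position 1 (a reboot or a short network hiccup) is ignored, while the four failed polls starting at position 4 would count as a real outage.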

Thanks a lot, bye!