nxagentd.exe eating up memory, maybe related to the Oracle subclient

Started by tarnmensch, September 17, 2015, 02:45:56 PM


tarnmensch

Hello again!

So far our network and its monitoring are behaving quite normally, but one single server (the only one with the Oracle subclient enabled) has been doing strange things lately. Maybe these observations are somehow related to each other:

  • The server dropped out of our DNS twice. Its IP was still the same, but the name wasn't known anymore by other nodes. A restart solved the problem.
  • That restart took almost 30 minutes. The event log shows that some of the OracleConsole processes (but not all of them) didn't start within 16 minutes. In the meantime, you can't access the desktop (even via VM Console), but the databases are already reachable.
  • Two weeks passed between the first and the second DNS dropout. In both cases, about a week beforehand, we could no longer connect via RDP as domain users, only as the local admin.
  • When pinging itself, the server uses IPv6 by default, even though that protocol is disabled. When I tried to enable IPv6, it told me that I needed a network card for that - how the hell did I open up that RDP connection if there's no LAN interface?!
  • One of the six databases (4 of them, including the special one, are monitored using the Oracle.DBInfo.IsReachable parameter) worked fine until last week, but now it keeps crashing within hours. The errors vary; one of them was that all 150 available sessions were in use - definitely not by that many users.
  • Right then we noticed that the nxagentd.exe process had reached a memory consumption of 2GB. After restarting the service, it went back to about 25MB but kept growing again - see attached screenshot. CPU utilisation is low, though (just a few times rising to 1 or 2%). (Unfortunately, I didn't monitor the agent prior to that finding.)
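
To put a number on "kept growing again", the growth rate can be estimated from a few memory readings taken after a restart. The following is only a minimal sketch: the sample values are hypothetical (they are not read off the attached screenshot), and the function names are made up for illustration. It fits a straight line through the readings and projects when the process would reach a given size.

```python
from typing import Sequence, Tuple


def growth_rate_mb_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) over time (hours since restart)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


def hours_until(limit_mb: float, current_mb: float, rate: float) -> float:
    """Hours until the process reaches limit_mb at a constant growth rate."""
    if rate <= 0:
        return float("inf")  # not growing, so the limit is never reached
    return max(0.0, (limit_mb - current_mb) / rate)


# Hypothetical readings: (hours since service restart, working set in MB).
samples = [(0.0, 25.0), (1.0, 33.0), (2.0, 41.0), (3.0, 49.0)]
rate = growth_rate_mb_per_hour(samples)
print(f"growth: {rate:.1f} MB/h")
print(f"hours until 2 GB: {hours_until(2048.0, samples[-1][1], rate):.1f}")
```

A roughly constant slope would point to something happening at a fixed interval, for example a periodic poll or reconnect attempt, rather than to load from users.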

We're using Oracle 11.2.0.1.0 x64. This is the policy "oracle" (without the original names and passwords), only applied to that server:

<config>
   <agent>
      <subagent>oracle.nsm</subagent>
   </agent>
   <oracle>
      <databases>
         <database id="1">
            <id>111</id>
            <tnsname>111</tnsname>
            <username>111</username>
            <password>xxx</password>
         </database>
         <database id="2">
            <id>222</id>
            <tnsname>222</tnsname>
            <username>222</username>
            <password>xxx</password>
         </database>
         <database id="3">
            <id>333</id>
            <tnsname>333</tnsname>
            <username>333</username>
            <password>xxx</password>
         </database>
         <database id="4">
            <id>444</id>
            <tnsname>444</tnsname>
            <username>444</username>
            <password>xxx</password>
         </database>
         <database id="5">
            <id>555</id>
            <tnsname>555</tnsname>
            <username>555</username>
            <password>xxx</password>
         </database>
      </databases>
   </oracle>
</config>


The failing database is a test instance. I imported a current datapump export from the live system and ran an update afterwards. This has worked fine many times before, so I'm not sure where the errors come from now. It could be the datapump export/import, the update, the suspiciously huge nxagentd.exe or just another strange error on a server that seems to have some severe problem.

Could the (sub)agent be opening more and more sessions, growing linearly until it finally kills our database? (Why just this one?)
Is there a way to tell where these huge amounts of data come from?
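
One way to check the session question is to ask the database which client programs hold its sessions. The sketch below assumes a DB-API 2.0 style cursor from whatever Oracle driver is available (the driver and connection setup are not shown) and an account allowed to read the `v$session` view. The function name and the `FakeCursor` stand-in are made up for illustration, and whether the agent shows up under the program name `nxagentd.exe` is an assumption.

```python
from typing import Any, Dict

# v$session is a standard Oracle dynamic performance view; reading it
# requires a suitably privileged account.
SESSIONS_BY_PROGRAM_SQL = (
    "SELECT program, COUNT(*) FROM v$session "
    "GROUP BY program ORDER BY COUNT(*) DESC"
)


def sessions_by_program(cursor: Any) -> Dict[str, int]:
    """Return a mapping of client program name to open session count."""
    cursor.execute(SESSIONS_BY_PROGRAM_SQL)
    return {(program or "(unknown)"): int(count)
            for program, count in cursor.fetchall()}


class FakeCursor:
    """Stand-in for a real driver cursor so the sketch runs without a database."""

    def __init__(self, rows):
        self._rows = rows
        self.last_sql = None

    def execute(self, sql):
        self.last_sql = sql

    def fetchall(self):
        return list(self._rows)


# Made-up rows, purely to show the shape of the result.
demo = FakeCursor([("nxagentd.exe", 140), ("sqlplus.exe", 3), (None, 7)])
for program, count in sessions_by_program(demo).items():
    print(f"{program}: {count}")
```

If one program accounts for most of the 150 sessions, that narrows down the culprit far better than the agent's memory figure alone.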

Thanks a lot to anyone who can help a bit here - I'm feeling a little uncomfortable about monitoring that server, as I fear the subagent is somehow messing with that system. As long as it's just the test database, it's ok, but 4 of 6 are our live databases.

Best regards from Germany

Victor Kirhenshtein

Hi,

What version of the agent are you running, and on what OS? Could you run the agent for some time with debug level 6 and send me the log file?

Best regards,
Victor


tarnmensch

Hey Victor!

Thank you for pushing me right to the solution!  :)

Setting the debug level showed me that the fifth database specified in my Oracle policy caused errors: I didn't notice the login wasn't working, as I don't monitor that database; I just included it in the policy to be able to do so later. The database user was locked, so an error message was logged every minute - I didn't know the agent tries to establish the connection even if no DCI will ever poll a value. Thinking about it, it's logical.

Anyway, the error should never cause the agent process to eat all the available memory!

To answer your questions: It's the 1.2.17 client running on Windows Server 2008 R2.

Cheers
Nicki

Edit:
Regarding the attachment: The agent process was growing for a day, then it was restarted twice and still started to grow again. Then the Oracle policy was deactivated and the database user was unlocked. After another agent restart to get the Oracle subclient working again, the memory consumption seems to stay low. I hope it isn't still rising, just more slowly, but that wouldn't be too serious, as the server reboots automatically every few weeks.

2nd edit:
20 hours later, memory consumption is still between 22 and 24MB. I think it's just a bug that failing connections make the subagent grow steadily.
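
Since the agent tries to connect to every database listed in the policy, whether or not a DCI polls it, it can help to compare the policy against the databases that are actually monitored. This is a minimal sketch using only the Python standard library: the element names follow the policy posted above, while the function names and the set of monitored ids are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def configured_databases(policy_xml: str) -> List[Tuple[str, str]]:
    """Return (id, tnsname) for every <database> entry in the policy."""
    root = ET.fromstring(policy_xml)  # root element is <config>
    result = []
    for db in root.findall("./oracle/databases/database"):
        db_id = (db.findtext("id", default="") or "").strip()
        tns = (db.findtext("tnsname", default="") or "").strip()
        result.append((db_id, tns))
    return result


def unmonitored(policy_xml: str, monitored_ids: Iterable[str]) -> List[str]:
    """Ids listed in the policy that no DCI uses (the agent still connects)."""
    monitored = set(monitored_ids)
    return [db_id for db_id, _ in configured_databases(policy_xml)
            if db_id not in monitored]


# Shortened policy with the same placeholder names as in the first post.
POLICY = """
<config>
   <oracle>
      <databases>
         <database id="1">
            <id>111</id>
            <tnsname>111</tnsname>
         </database>
         <database id="5">
            <id>555</id>
            <tnsname>555</tnsname>
         </database>
      </databases>
   </oracle>
</config>
"""

# Assumed here: only database 111 has DCIs configured on the server.
print(unmonitored(POLICY, {"111"}))
```

Entries reported this way are candidates for removal from the policy until they are really needed, or at least for a manual login test.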

Victor Kirhenshtein

Hi,

I've made some checks and was unable to detect a memory leak when the agent cannot connect to the database. What Oracle client are you using?

Best regards,
Victor

tarnmensch

Quote from: tarnmensch on September 22, 2015, 02:54:47 PM
To answer your questions: It's the 1.2.17 client running on Windows Server 2008 R2.
Do you mean the Oracle version posted above, or are you asking for some other client version? (I'm not too familiar with databases, you know... :-[ )

Best regards,
Nicki

tarnmensch

Sorry, just stumbled upon this one again - of course, the Oracle version I posted was totally wrong. What I meant was 11.2.0.1.0.

Best regards,
Nicki

VladimirV

Hi,
we have the same issue. Memory consumption increased tenfold in a month.

Oracle DB server:
Win 2012R2
Oracle DB 11.2
NetXMS agent 2.1.2


Victor Kirhenshtein

Hi,

Please try upgrading the agent to 2.2.8. There have been a lot of fixes since 2.1.2.

Best regards,
Victor