Topics - jermudgeon

#1
I have 82421 collectible DCIs and an additional 25552 that are missing a corresponding node.

nxdbmgr check does not find and purge these.

There do not appear to be any leftover data in the corresponding idata tables.

select
   count(*)
   from items i
   left join object_properties p on i.node_id = p.object_id
   where p.name is null
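For anyone who wants to try the same kind of check away from the production database, here is a minimal sketch of the orphan count using Python's built-in sqlite3. The two tables are hypothetical cut-down stand-ins for the real NetXMS schema, and the sketch tests the join key (object_id) for NULL rather than name, on the assumption that a matched row could legitimately have an empty name.

```python
import sqlite3

def count_orphan_items(conn):
    """Count rows in items whose node_id has no matching object_properties row."""
    row = conn.execute(
        "SELECT count(*) FROM items i "
        "LEFT JOIN object_properties p ON i.node_id = p.object_id "
        "WHERE p.object_id IS NULL"  # NULL join key means no match was found
    ).fetchone()
    return row[0]

def demo_connection():
    """Build an in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_properties (object_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute("CREATE TABLE items (item_id INTEGER PRIMARY KEY, node_id INTEGER)")
    conn.executemany(
        "INSERT INTO object_properties VALUES (?, ?)",
        [(1, "node-a"), (2, "node-b")],
    )
    # Items 12 and 13 point at node ids that do not exist.
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(10, 1), (11, 2), (12, 99), (13, 98)],
    )
    return conn
```

Calling count_orphan_items(demo_connection()) counts the two items whose node is missing.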
#2
Postgres backend is functioning, but netxms reports 'Server has lost connection with backend database'.

sh dbcp
netxmsd: sh dbcp
0x7f8745e44780 04.May.2020 17:42:14 dbwrite.cpp:457
0x7f873cbd5d80 04.May.2020 17:42:15 dcitem.cpp:1309
0x7f873cbd5960 04.May.2020 17:42:05 syncer.cpp:238
3 database connections in use

Stuck since yesterday.

sh q

netxmsd: sh q
Data collector                   : 0
DCI cache loader                 : 46673
Template updates                 : 0
Database writer                  : 5
Database writer (IData)          : 3354870
Database writer (raw DCI values) : 28133
Event processor                  : 1334
Event log writer                 : 0
Poller                           : 1319
Node discovery poller            : 1187
Syslog processing                : 0
Syslog writer                    : 0
Scheduler                        : 0


dbcp reset doesn't do much:

netxmsd: dbcp reset
Resetting database connection pool
Database connection pool reset completed
netxmsd: sh dbcp
0x7f8745e44780 04.May.2020 17:42:14 dbwrite.cpp:457
0x7f873cbd5d80 04.May.2020 17:42:15 dcitem.cpp:1309
0x7f873cbd5960 04.May.2020 17:42:05 syncer.cpp:238
3 database connections in use

#3
General Support / 3.3 timescale upgrade procedure
May 05, 2020, 06:53:08 PM
               
WARNING: Background upgrades pending. Please run nxdbmgr background-upgrade when possible.
[jaustin@jaustin systems]$ nxdbmgr background-upgrade                                     
NetXMS Database Manager Version 3.3.285 Build 3.3-285-gfe2e9b646f (UNICODE)
                                                                           
Running background upgrade procedure for version 33.6
Converting table idata_sc_default                   
Converting table idata_sc_7     
Converting table idata_sc_30                                                   
Converting table idata_sc_90                                             
Converting table idata_sc_180                                   
Converting table idata_sc_other                                                                                                                                           
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

----

I restarted the background-upgrade process and got this:

WARNING: Background upgrades pending. Please run nxdbmgr background-upgrade when possible.
[jaustin@jaustin systems]$ nxdbmgr background-upgrade
NetXMS Database Manager Version 3.3.285 Build 3.3-285-gfe2e9b646f (UNICODE)

Running background upgrade procedure for version 33.6
Converting table idata_sc_default
SQL query failed (42P01 ERROR:  relation "v33_5_idata_sc_default" does not exist
LINE 1: ...stamp(idata_timestamp),idata_value,raw_value FROM v33_5_idat...
                                                             ^):
INSERT INTO idata_sc_default (item_id,idata_timestamp,idata_value,raw_value) SELECT item_id,to_timestamp(idata_timestamp),idata_value,raw_value FROM v33_5_idata_sc_default
Background upgrade procedure for version 33.6 failed


----

So apparently the upgrade process can't handle a resume: it breaks on the tables that have already been migrated, because it had already dropped the v33_5 source tables for the conversions that succeeded.

I manually completed the inserts from the remaining tables using ON CONFLICT DO NOTHING, and dropped the remaining v33_5 tables.

However, the same problem exists -- I can't complete the upgrade because the v33_5 tables no longer exist.
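To illustrate what a resumable version of this step might look like, here is a rough sketch. It is not how nxdbmgr works internally; it uses SQLite as a stand-in for Postgres, INSERT OR IGNORE in place of ON CONFLICT DO NOTHING, and it leaves out the to_timestamp() conversion. The function names are my own.

```python
import sqlite3

def table_exists(conn, name):
    """Return True if a table with this name exists (SQLite catalog lookup)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)
    ).fetchone()
    return row is not None

def migrate_table(conn, source, target):
    """Copy rows from source to target, then drop source. Safe to re-run.

    source and target are assumed to be trusted table names, not user input.
    """
    if not table_exists(conn, source):
        return "skipped"  # source already dropped by an earlier successful run
    with conn:  # copy and drop are committed together, or rolled back together
        conn.execute(
            f"INSERT OR IGNORE INTO {target} "
            f"SELECT item_id, idata_timestamp, idata_value, raw_value FROM {source}"
        )
        conn.execute(f"DROP TABLE {source}")
    return "migrated"

def demo():
    """Simulate a run that was interrupted after copying one of two rows."""
    conn = sqlite3.connect(":memory:")
    cols = (
        "(item_id INTEGER, idata_timestamp INTEGER, idata_value TEXT, "
        "raw_value TEXT, PRIMARY KEY (item_id, idata_timestamp))"
    )
    conn.execute("CREATE TABLE idata_sc_default " + cols)
    conn.execute("CREATE TABLE v33_5_idata_sc_default " + cols)
    conn.execute("INSERT INTO v33_5_idata_sc_default VALUES (1, 100, '5', '5')")
    conn.execute("INSERT INTO v33_5_idata_sc_default VALUES (1, 200, '6', '6')")
    conn.execute("INSERT INTO idata_sc_default VALUES (1, 100, '5', '5')")
    first = migrate_table(conn, "v33_5_idata_sc_default", "idata_sc_default")
    second = migrate_table(conn, "v33_5_idata_sc_default", "idata_sc_default")
    rows = conn.execute("SELECT count(*) FROM idata_sc_default").fetchone()[0]
    return first, second, rows
```

The point of the sketch is only that a missing source table is treated as "already done" instead of as an error, which is the behaviour I expected from a resumed background upgrade.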
#4
General Support / Rate of deleted objects
March 23, 2020, 05:25:57 PM
Is there a way to speed up deletion of objects? syncer.cpp appears to be deleting objects one at a time.

In this particular case, we're removing many individual interfaces prior to deletion of the parent node. The time to delete each interface is approximately 1 second, and then the next interface is queued for deletion.

Is there no way to batch the deletions so they happen faster?
#5
General Support / Large-scale NetXMS deployments?
March 17, 2020, 08:10:13 PM
Given the lack of a comprehensive NetXMS scaling guide, I'm interested in talking with people who run any of the following:

1) Deployments with horizontal load balancing of polling, discovery, etc.
2) Deployments with large devices — hundreds of interfaces, thousands of MAC addresses
3) Deployments with at least 5000 nodes, scaling up to 20k-30k nodes

In particular, I'm looking for solid scaling values for:

A) netxmsd RAM usage per node
B) average nodes / poller (thread pool POLLERS) (in the context of avg DCI per node)

I'm having issues around the 100k DCI mark, but it doesn't appear to be the DCIs themselves that are the bottleneck.

TIA
Jeremy
#6
General Support / Netxms 3.1.261 and device discovery
December 12, 2019, 10:11:19 PM
I have a large class of devices (perhaps mostly Cisco?) that are failing discovery in an odd way.

1) Devices are detected with isSNMP=Yes, GENERIC driver, and added to db
2) Manual inspection shows that devices were added with SNMP 'public' community
3) Devices respond to a snmpwalk using 'public' with the following:
iso.3.6.1.2.1 = No more variables left in this MIB View (It is past the end of the MIB tree)
Note that this is a different response than with an invalid string; with an invalid string, queries just time out. With this (ACLed) string, 'public' simply has no allowed views.
4) Devices (and discovery) are configured with a *different* SNMP string which does actually work via a walk, but not via discovery.

Is there a way to change discovery behavior to try 'public' *last*? There doesn't appear to be an order in the SNMP Configuration that's relevant.
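To make the request concrete, this is roughly the ordering I have in mind, sketched in Python. It is not NetXMS discovery code; probe is a hypothetical callback that returns True when a device answers for a given community, and the community names in the usage note are made up.

```python
def order_communities(candidates, last="public"):
    """Return candidates without duplicates, with `last` moved to the end."""
    seen = set()
    ordered = []
    for community in candidates:
        if community not in seen:
            seen.add(community)
            ordered.append(community)
    if last in seen:
        ordered.remove(last)
        ordered.append(last)
    return ordered

def pick_community(candidates, probe):
    """Return the first community (in adjusted order) that probe() accepts."""
    for community in order_communities(candidates):
        if probe(community):
            return community
    return None
```

With a probe that accepts both 'public' and a working string, pick_community(["public", "s3cret"], probe) settles on "s3cret", because 'public' is only tried once everything else has failed.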

Is there a way to batch change configured SNMP communities on nodes? Batch 'properties' change doesn't seem to exist in the UI. Better yet, can I do this with NXSL? I'm not seeing an attribute that lets me check or set the SNMP community.

Thanks
#7
I'm running netxms 3.x with Postgresql 10 (timescaledb). Occasionally the database crashes and automatically restarts/recovers. However, netxmsd does not reconnect to the database. Are there any settings to tweak this? My understanding (reading old forum messages) is that netxms *should* reconnect, but it does not, and requires a full restart of netxmsd to reconnect to the db.
#8
Netxms server consuming RAM and crashing. Valgrind log attached.

Options:
valgrind --log-file=/home/jaustin/vg.log --leak-check=full --undef-value-errors=no netxmsd -D3
#9
General Support / netxmsd 3.1 crashing on startup
December 04, 2019, 05:59:16 PM
I'm getting a segfault on startup:

<snip>
DCObject::filterInstanceList(.1.3.6.1.4.1.2636.3.60.1.1.1.1.7.{instance} [242469]): instance "548" removed by filtering script
2019.12.04 06:56:55.606 *D* [                   ] DCObject::filterInstanceList(.1.3.6.1.4.1.2636.3.60.1.1.1.1.8.{instance} [243303]): instance "507" name set to "0"
2019.12.04 06:56:55.606 *D* [                   ] DCObject::filterInstanceList(.1.3.6.1.4.1.2636.3.60.1.1.1.1.8.{instance} [243303]): instance "507" removed by filtering script
2019.12.04 06:56:55.609 *D* [                   ] DataCollectionTarget::doInstanceDiscovery(js2.jber7079.mxu.acsalaska.net [18199]): read 25 values
2019.12.04 06:56:55.610 *D* [                   ] DCObject::filterInstanceList(.1.3.6.1.4.1.2636.3.60.1.1.1.1.1.{instance} [243281]): instance "514" name set to "???"
2019.12.04 06:56:55.610 *D* [                   ] DCObject::filterInstanceList(.1.3.6.1.4.1.2636.3.60.1.1.1.1.1.{instance} [243281]): instance "514" removed by filtering script

Thread 372 "$POLLERS/WRK" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff74e2e700 (LWP 6249)]
0x00007ffff79bfbfe in Node::reconcileWithDuplicateNode(Node*) () from /usr/lib/x86_64-linux-gnu/libnxcore.so.31
(gdb) bt
#0  0x00007ffff79bfbfe in Node::reconcileWithDuplicateNode(Node*) () from /usr/lib/x86_64-linux-gnu/libnxcore.so.31
#1  0x00007ffff79d73c4 in Node::configurationPoll(PollerInfo*, ClientSession*, unsigned int) ()
   from /usr/lib/x86_64-linux-gnu/libnxcore.so.31
#2  0x00007ffff796e3d2 in DataCollectionTarget::configurationPollWorkerEntry(PollerInfo*, ClientSession*, unsigned int) ()
   from /usr/lib/x86_64-linux-gnu/libnxcore.so.31
#3  0x00007ffff793f621 in ?? () from /usr/lib/x86_64-linux-gnu/libnxcore.so.31
#4  0x00007ffff698d337 in ?? () from /usr/lib/x86_64-linux-gnu/libnetxms.so.31
#5  0x00007ffff698d10e in ?? () from /usr/lib/x86_64-linux-gnu/libnetxms.so.31
#6  0x00007ffff4ec04a4 in start_thread (arg=0x7fff74e2e700) at pthread_create.c:456
#7  0x00007ffff3a15d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
#10
General Support / 3.0.2284 and TimescaleDB view 'idata'
October 02, 2019, 01:54:47 AM
Upgraded to 3.0.2284 and ran database upgrade script. DCIs migrated successfully.

Database performance seems much worse:

2019.10.01 14:46:11.901 *E* [db.driver          ] SQL query failed (Query = "SELECT idata_value,idata_timestamp FROM idata WHERE item_id=233275 ORDER BY idata_timestamp DESC LIMIT 1"): Internal error (call to PQsendQuery failed)
2019.10.01 14:46:11.901 *E* [db.driver          ] SQL query failed (Query = "SELECT idata_value,idata_timestamp FROM idata WHERE item_id=233276 ORDER BY idata_timestamp DESC LIMIT 1"): Internal error (call to PQsendQuery failed)

Graphs load slowly, etc.

Running 'explain select' on the above failed queries results in:

Limit  (cost=2080.96..2082.61 rows=1 width=14) (actual time=19.277..20.802 rows=1 loops=1)
  ->  Merge Append  (cost=2080.96..31743.08 rows=18037 width=14) (actual time=19.128..19.128 rows=1 loops=1)
        Sort Key: _hyper_7_5752_chunk.idata_timestamp DESC
        ->  Index Scan Backward using "5752_5752_idata_sc_default_pkey" on _hyper_7_5752_chunk  (cost=0.29..13.67 rows=11 width=8) (actual time=0.012..0.012 rows=0 loops=1)
              Index Cond: (item_id = 233258)
        ->  Index Scan Backward using "5755_5755_idata_sc_default_pkey" on _hyper_7_5755_chunk  (cost=0.29..13.67 rows=11 width=8) (actual time=0.004..0.004 rows=0 loops=1)
              Index Cond: (item_id = 233258)

and continuing on with index scan backward.

Note that running similar queries on the individual hypertables that make up the idata view is VERY fast.
#11
Anyone else using timescaledb with postgres?

When I look at timescaledb maintenance, typically the drop_chunks function would be used to reclaim disk space. However, drop_chunks requires TIMESTAMP, TIMESTAMPTZ, or DATE column types. For 'idata' and 'tdata', for example, the timestamp column is actually an int4.

Any pointers on how to determine whether NetXMS is actually pruning old entries to reclaim disk space? It's certainly dropping DCIs correctly from an access standpoint -- queries only return the expected time ranges.
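One check I can sketch myself: since the timestamp column is an integer, a cutoff can be computed as Unix time and used to count rows older than the retention window. This is only an illustration in Python; the function names are mine, the column name is passed in instead of assumed, and retention_days has to be whatever retention actually applies to the DCIs stored in that table.

```python
from datetime import datetime, timedelta, timezone

def retention_cutoff_epoch(retention_days, now=None):
    """Return the Unix time (whole seconds) that lies retention_days before now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - timedelta(days=retention_days)).timestamp())

def expired_rows_query(table, column, retention_days, now=None):
    """Build a query counting rows older than the retention window.

    table and column are assumed to be trusted identifiers, not user input.
    """
    cutoff = retention_cutoff_epoch(retention_days, now)
    return f"SELECT count(*) FROM {table} WHERE {column} < {cutoff}"
```

For example, expired_rows_query("idata", "idata_timestamp", 30) produces a count query for rows more than 30 days old. A count that keeps growing would suggest rows are not being pruned; it says nothing about whether the freed space has been returned to the operating system.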


#12
Does anyone have experience querying DC table data directly using Postgres? I'm working in Grafana, and having good success with non-table DCIs. For complicated reasons I need to be able to create custom queries rather than relying on NetXMS' built-in data export features. By parsing dctable.cpp, I believe I get the gist of how tables are constructed using dc_table_columns, with the cells themselves stored in tdata, but I'm not understanding how the tdata_value field actually works, or whether it's even possible to parse it outside of NetXMS itself.

Another option would be if the Grafana API plugin for NetXMS supported tables, but I'm not sure anyone's working on that.
#13
General Support / Grafana and java errors
June 13, 2019, 03:27:56 AM
Core: 2.2.15-2
Web Svc: 2.2.15-2 running under Tomcat8

Grafana is installed and data collector successfully configured. Alarm queries function.

DCIs are visible in enumeration query, but actually trying to graph a DCI results in an error:
Grafana:
Object
xhrStatus:"complete"
request:Object
method:"GET"
url:"api/datasources/proxy/1/grafana/datacollection"
data:null
params:Object
interval:600000
from:_
to:_
targets:"[{"dci":{"name":"xxxxxxxxxx-removed","id":"190615","$$hashKey":"object:518"},"dciTarget":{"id":"20810","name":"xxxxx-removed"},"legend":"xxxxx-removed","refId":"A","type":"DCI"}]"
response:Object
description:"org.json.JSONArray.iterator()Ljava/util/Iterator;"
error:46


Tomcat:
org.netxms.websvc.WebSvcStatusService                   | Internal error
java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;

#14

Version:
NetXMS Server Version 2.2.15 Build 9523 (d67b96f) (UNICODE)
NXCP: 4.48.1.18 (AES-256, Blowfish-256, 3DES, AES-128, Blowfish-128)
Built with: g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516


I have some Radwin HSU devices that have odd SNMP characteristics. (For example, the ping3 driver breaks discovery of Radwin devices.)

They don't report their management IP on the correct interface -- the management interface is discoverable but has a loopback IP on it rather than the correct management IP.

One consequence is that the correct network subnet is not created upon device discovery, as NetXMS has no way of determining the correct mask for the IP, and we are not currently discovering the parent router(s) that provide the gateway (and correct mask) for these devices.

Even though these discovered nodes do not appear in "Entire Network", I am able to bind the "missing" devices to a container using the following function:
/* Function utilized to find objects that are missing a parent subnet
* in Entire Network. This should not be happening, but it do.
*
*/

sub BindUnboundNodes() {
    parents = GetNodeParents($node);

    // counts parent subnets, despite the variable name
    parentContainers = 0;
    foreach(p : parents) {
        // we are only interested in subnets
        if (p->type == 1) {
            // if we find any subnet at all, this object is bound
            parentContainers++;
        }
    }

    toBind = parentContainers == 0;

    if (toBind) {
        trace(0, "Node '" . $node->name . "' has no parent subnet.");
    }

    return toBind;
}


Another odd behavior is that if I put the relevant discovery networks into a zone other than Default, devices fail to be added at all.

#15
General Support / Duplicate node detection vs. DNS?
June 05, 2019, 06:16:12 PM
Enabled server flags:
EnableZoning 1
UseDNSNameForDiscoveredNodes 1
SyncNodeNamesWithDNS 1
NetworkDiscovery.EnableParallelProcessing 1

Behavior:
SNMP-capable devices with multiple IP addresses in discovery networks are discovered multiple times. What we want is for the first discovered IP to be the primary host name.
Reverse DNS for each IP is unique.
For example,
ip1.node.some.domain <- primary, wanted
ip2.node.some.domain <- not wanted
ip3.node.some.domain <- not wanted

Shouldn't sysOID duplicate detection prevent the creation of duplicate nodes?
#16
General Support / DCI deletion failure
May 14, 2019, 10:12:51 PM
Testing DCI templating with Postgres+timescale.

Deleting some objects results in a corrupted SQL call:

https://paste.ee/p/VRduy
#17
General Support / Hook::CreateSubnet
May 14, 2019, 07:16:38 PM
I'm testing Hook::CreateSubnet functionality in 2.2.14 HEAD.

- Is there documentation of the NXSL 'Subnet' class?

- I note that the local server node is automatically discovered and added to the node database. Multiple instances (!) of one of the IPv4 subnets get created. (In this case, there are multiple IPv4 addresses on the server in that subnet.) Is the subnet creation logic not checking for a pre-existing subnet when multiple addresses exist on the same interface?
#18
I have been seeing some strange behavior for a few days. Database performance seems fine, running on flash. Attempting to handle/terminate/resolve more than even a few alarm entries at once results in netxmsd pegging a CPU core.

netxmsd: show dbstats
SQL query counters:
   Total .......... 2061537
   SELECT ......... 861140
   Non-SELECT ..... 1200397
   Long running ... 0
   Failed ......... 0
Background writer requests:
   DCI data ....... 20263
   DCI raw data ... 20262
   Others ......... 49


netxmsd: show msgwq
0 active queues
Housekeeper thread state is RUNNING


'show pollers' shows roughly half of the pollers in cleanup and the other half awaiting execution.

netxmsd: show queues
Data collector                   : 0
DCI cache loader                 : 0
Template updates                 : 0
Database writer                  : 0
Database writer (IData)          : 0
Database writer (raw DCI values) : 0
Event processor                  : 0
Event log writer                 : 0
Poller                           : 0
Node discovery poller            : 0
Syslog processing                : 0
Syslog writer                    : 0
Scheduler                        : 0



Show stats will time out while the CPU core is pegged.

netxmsd: show watchdog
Thread                                           Interval Status
----------------------------------------------------------------------------
Item Poller                                      10       Running
Syncer Thread                                    30       Sleeping
Poll Manager                                     5        Sleeping
Ad hoc scheduler                                 5        Sleeping
Recurrent scheduler                              5        Sleeping



Stopping the netxmsd process and repairing the DB will resolve the stuck CPU temporarily.

Viewing the logs, I see many messages saying that "Poll Manager" does not respond to the watchdog thread.
Anything further to check?
#19
General Support / Template DCI disappearance?
May 06, 2019, 11:46:35 PM
In console version 2.2.13, I am able to move DCI template items to template groups. As template groups don't support DCI items, the moved items disappear.

Is this expected behavior?
#20
General Support / Radwin discovery failing
May 06, 2019, 11:39:22 PM
I tried discovering some Radwin Jet devices. Subscriber units ("HSU") discovered fine using SNMPv2c.

Base stations ("HBS"), however, didn't discover, even though discovery configuration had valid IP ranges and community strings.

If I manually add a base station to discovery by IP, I observe that NetXMS runs through all the community strings first with v2c, and then with v1. The valid v1 string returns a response like this:

(tcpdump)
12:34:18.143917 IP aaaa.38220 > bbbb.161:  C="xxxx" GetRequest(58)  .1.3.6.1.2.1.1.2.0 .1.3.6.1.2.1.1.1.0 .1.3.6.1.4.1.35160.1.1.0
12:34:18.158012 IP bbbb.161 > aaaa.38220:  C="xxxx" GetResponse(58)  genErr@3 .1.3.6.1.2.1.1.2.0= .1.3.6.1.2.1.1.1.0= .1.3.6.1.4.1.35160.1.1.0=

I also note that if I use snmpget:

[jaustin@jaustin tmp]$ snmpget -v1 -c xxxx bbbb  .1.3.6.1.2.1.1.2.0 .1.3.6.1.2.1.1.1.0 .1.3.6.1.4.1.35160.1.1.0
12:36:01.720684 IP aaaa.36646 > bbbb.161:  C="xxxx" GetRequest(59)  .1.3.6.1.2.1.1.2.0 .1.3.6.1.2.1.1.1.0 .1.3.6.1.4.1.35160.1.1.0
12:36:01.737880 IP bbbb.161 > aaaa.36646:  C="xxxx" GetResponse(59)  genErr@3 .1.3.6.1.2.1.1.2.0= .1.3.6.1.2.1.1.1.0= .1.3.6.1.4.1.35160.1.1.0=
Error in packet
Reason: (genError) A general failure occured
Failed object: iso.3.6.1.4.1.35160.1.1.0

12:36:01.737996 IP aaaa.36646 > bbbb.161:  C="xxxx" GetRequest(42)  .1.3.6.1.2.1.1.2.0 .1.3.6.1.2.1.1.1.0
12:36:01.757800 IP bbbb.161 > aaaa.36646:  C="xxxx" GetResponse(66)  .1.3.6.1.2.1.1.2.0=.1.3.6.1.4.1.4458.20.5.1.1 .1.3.6.1.2.1.1.1.0="Wireless Link"
iso.3.6.1.2.1.1.2.0 = OID: iso.3.6.1.4.1.4458.20.5.1.1
iso.3.6.1.2.1.1.1.0 = STRING: "Wireless Link"

-----------

So snmpget will discard the invalid OID and try again.
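The retry behaviour I'm seeing from snmpget can be sketched like this. It is a simulation in Python, not net-snmp or NetXMS code: send_request and fake_agent are hypothetical stand-ins, and the only thing taken from the capture above is that the error response names the position of the failing OID (genErr@3).

```python
def get_with_retry(oids, send_request):
    """Query oids; when an error names a varbind, drop that OID and retry.

    send_request(oids) must return (error_index, values), where error_index
    is 0 on success or the 1-based position of the failing OID.
    Returns (values_by_oid, failed_oids).
    """
    remaining = list(oids)
    failed = []
    while remaining:
        error_index, values = send_request(remaining)
        if error_index == 0:
            return dict(zip(remaining, values)), failed
        if not 1 <= error_index <= len(remaining):
            raise RuntimeError("agent reported an error without a usable index")
        failed.append(remaining.pop(error_index - 1))
    return {}, failed

def fake_agent(oids):
    """Hypothetical agent: errors on any OID it does not know, like the HBS."""
    known = {
        ".1.3.6.1.2.1.1.2.0": ".1.3.6.1.4.1.4458.20.5.1.1",
        ".1.3.6.1.2.1.1.1.0": "Wireless Link",
    }
    for position, oid in enumerate(oids, start=1):
        if oid not in known:
            return position, []
    return 0, [known[oid] for oid in oids]
```

Fed the same three OIDs as in the capture, get_with_retry drops .1.3.6.1.4.1.35160.1.1.0 after the first error and gets sysObjectID and sysDescr on the second request, which is what I would like discovery to do as well.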

I note that 1.3.6.1.4.1.35160 is from the ping3 device driver.

I also note that I *can* set snmp.testOID as an override on a device *that already exists*, but this does not help devices that are failing discovery.

Any recommendations? I don't have any ping3 devices, so theoretically I could disable the driver; however, that doesn't fix the underlying logical problem.