Possible Bug with HOOK:AlarmStateChange on Alarm Creation

Started by graeChris, May 02, 2023, 05:23:19 PM

graeChris

I finally got external alerting working with OpsGenie, and I think I discovered a bug in alarm generation.

I am using the following code inside HOOK:AlarmStateChange:

sub main()
{
    /* Node that receives the posted events (looked up by object ID 100) */
    eventserver = FindNodeObject($node, 100);
    global alarmstate = $alarm->state;
    global eventname = $alarm->eventName;
    global sourceobj = FindObject($alarm->sourceObject);
    global nameobj = sourceobj->guid;

    trace(0, "Alarm State: " . alarmstate . " Alarm: " . eventname . nameobj);
    switch(alarmstate)
    {
        /* Alarm State is Outstanding */
        case "0":
            break;
        /* Alarm State is Acknowledged */
        case "1":
            PostEvent(eventserver, "Xms_Alarm_Ack", "ACKALARM", eventname . nameobj);
            break;
        /* Alarm State is Resolved */
        case "2":
            PostEvent(eventserver, "Xms_Alarm_Resolve", "RSLVALARM", eventname . nameobj);
            break;
        /* Alarm State is Sticky Acknowledged */
        case "17":
            PostEvent(eventserver, "Xms_Alarm_StickyAck", "SACKALARM", eventname . nameobj);
            break;
    }
}
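
The state values handled in the script above can be read as a base state plus a flag: 17 is 0x10 + 1, that is, "acknowledged" with a sticky bit set. Here is a minimal standalone Python sketch of that reading. It assumes that bit 0x10 is the sticky flag and that the low four bits carry the base state, which is what the value 17 suggests; the names STICKY_FLAG, STATE_MASK and decode_alarm_state are made up for illustration and are not NetXMS identifiers.

```python
# Assumption: bit 0x10 marks "sticky"; the low four bits carry the base state.
STICKY_FLAG = 0x10
STATE_MASK = 0x0F

# Base states as named in the hook script's comments.
BASE_STATES = {
    0: "outstanding",
    1: "acknowledged",
    2: "resolved",
    3: "terminated",
}


def decode_alarm_state(raw_state):
    """Split a raw alarm state into (base state name, sticky flag)."""
    value = int(raw_state)  # accepts 17 as well as "17"
    base = value & STATE_MASK
    sticky = bool(value & STICKY_FLAG)
    return BASE_STATES.get(base, "unknown"), sticky


if __name__ == "__main__":
    for raw in (0, 1, 2, 3, 17):
        print(raw, decode_alarm_state(raw))
```

Decoding the state this way would let a script treat "acknowledged" and "sticky acknowledged" in one branch instead of matching the literal value 17.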

I originally had code for state 3 (terminated). This is where the bug seems to occur.
When the alarm is created, HOOK:AlarmStateChange runs. That is not a big deal, since it lets us call scripts when an alarm is created, and the alarm should then have a state of 0. However, I noticed via the trace() function that some of the alarms were being created with a state of 3 instead of 0.

Currently we have an EPP that uses a server action to send a curl request to the OpsGenie API for alert generation. We have OpsGenie integrated directly with Teams and Jira for ticket management. I would be happy to post a guide on how to do this if anybody else wants to know how we implemented it.

The code for state 3 (terminated) created an event "Xms_Alarm_Term". The EPP processed this event and used a server action to send a curl request to OpsGenie to close the alert. The practical implication of an alarm being created with a state of 3 is that the alert is created and closed in OpsGenie at exactly the same moment.
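
For readers who want a picture of what such a server action might send, here is a rough Python sketch that only builds the request and does not send anything. It is not our actual implementation. It assumes the OpsGenie Alert API endpoints under https://api.opsgenie.com/v2/alerts with GenieKey authentication, and it assumes the alert alias is the same "event name + GUID" string the hook passes along. The function name, the event-to-action mapping and the "source" value are illustrative only.

```python
import json
from urllib.parse import quote

# Assumption: default (US) OpsGenie API host; EU accounts use a different host.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"

# Assumed mapping from events posted by the hook to OpsGenie alert actions.
EVENT_ACTIONS = {
    "Xms_Alarm_Ack": "acknowledge",
    "Xms_Alarm_Term": "close",
}


def build_opsgenie_request(event_name, alias, api_key):
    """Return a description of the HTTP request for an event, or None."""
    action = EVENT_ACTIONS.get(event_name)
    if action is None:
        return None  # event is not forwarded to OpsGenie in this sketch
    # The alias goes into the URL path, so it must be percent-encoded.
    url = "{}/{}/{}?identifierType=alias".format(
        OPSGENIE_ALERTS_URL, quote(alias, safe=""), action
    )
    headers = {
        "Authorization": "GenieKey {}".format(api_key),
        "Content-Type": "application/json",
    }
    body = json.dumps({"source": "NetXMS"})
    return {"method": "POST", "url": url, "headers": headers, "body": body}


if __name__ == "__main__":
    print(build_opsgenie_request("Xms_Alarm_Term", "demo alias", "<api-key>"))
```

With a mapping like this, a hook that fires with state 3 right after alarm creation produces a "close" request immediately after the "create" request, which matches the behavior described above.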

I suspect the problem is in the copy constructor in alarm.cpp.

https://github.com/netxms/netxms/blob/master/src/server/core/alarm.cpp#L378

I could be wrong, but that code appears to be part of the alarm constructor that duplicates an alarm if the event occurs again. I believe the cause is line 378:
m_state = src->m_state;
That line appears to set the new alarm's state to the state of the source alarm. This is only an issue if the source alarm is terminated, as it creates a new copy of the alarm with a state of 3.

I could very well be wrong about this, but it makes sense in my head.

Filipp Sudanov

I spent some time trying to replicate this, but without success.

There are two different things: creation of an alarm and increasing the count for an already existing alarm.

When a new alarm is created, HOOK:AlarmStateChange is not executed (it runs only when the status changes for an already existing alarm).

But when the alarm already exists and its status changes (e.g. the alarm was previously acknowledged and changes to outstanding), this script will be executed.
If the alarm is terminated by EPP, the script will also be executed and $alarm->state will be equal to 3.


Can you give the exact sequence of what was happening on your system? Was the alarm already present on the node, or was it newly created? Which event is the alarm created from? Are there any EPP rules that could modify or terminate the alarm?

graeChris

I haven't been able to duplicate this issue, but I think I discovered what caused the problem. Our server needed to be restarted as it was using too much of its allocated resources. I believe this was causing the unexpected behavior, as the database was also not updating values that were set for the object_properties.name field.
Here is the sequence of events:

1.) Agent disabled on endpoint for alarm creation -> Alarm created via EPP {status=0}
      1.1) Alert is created in OpsGenie
2.) Alarm marked as acknowledged by user -> HOOK:AlarmStateChange is executed (confirmed via trace()) {status=1}
      2.1) Event is processed and OpsGenie alert is acknowledged via curl
3.) Agent enabled on endpoint -> Alarm terminated via EPP -> HOOK:AlarmStateChange is executed (confirmed via trace()) {status=3}
      3.1) Alert is closed in OpsGenie
4.) Agent disabled on endpoint again for alarm creation -> Alarm created via EPP {status=0} -> Log shows via trace() that HOOK:AlarmStateChange has executed and the alarm has {status=3}
      4.1) Alert is created in OpsGenie via curl
      4.2) Alert is marked as closed in OpsGenie via curl

I think this is more of a server-related problem, as we were running low on disk space at the time.


Victor Kirhenshtein

The copy constructor is not a problem for execution of the AlarmStateChange hook: the script is executed from Alarm::executeHookScript, which in turn is only called after m_state for the source alarm was changed.
I currently cannot see how the hook could have been called with state 3 for a new alarm. In the described sequence of events, when did you restart the server?

Best regards,
Victor

graeChris

Sorry for the delay! I've been in the hospital, but I'm all better now. 

We restarted the server shortly after this happened, but only because it stopped returning the correct parameters to Grafana SQL calls and we realized that disk space was low.

Victor Kirhenshtein

One possibility I see is out-of-order execution of hook scripts, which is theoretically possible. The server schedules execution of the hook script in a background thread pool, passing a copy of the alarm in its current state to the script. It is possible that, due to database problems, the thread pool was full of tasks waiting on the database, so an earlier script execution task was waiting for a free thread, the next task was scheduled in the meantime, and then both were started in parallel, possibly in the wrong order.
I will think about it, but we probably have to serialize executions of alarm state change hooks.
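
To illustrate what serializing could mean here, below is a small standalone Python sketch (a generic model, not NetXMS server code). It keeps one shared thread pool but runs tasks that share a key, for example an alarm ID, one at a time and in submission order. The class KeyedSerialExecutor, the hook function and the alarm ID 42 are all hypothetical names chosen for the example.

```python
import threading
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class KeyedSerialExecutor:
    """Shares one thread pool, but runs tasks with the same key one at a
    time and in submission order (hypothetical helper, not NetXMS code)."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        # key -> queued tasks; a key is present while a worker drains it
        self._pending = {}

    def submit(self, key, fn, *args):
        with self._lock:
            if key in self._pending:
                # A worker already handles this key: queue behind it.
                self._pending[key].append((fn, args))
                return
            self._pending[key] = deque()
        self._pool.submit(self._drain, key, fn, args)

    def _drain(self, key, fn, args):
        while True:
            try:
                fn(*args)
            except Exception:
                pass  # a failing task must not block later tasks for this key
            with self._lock:
                queue = self._pending[key]
                if not queue:
                    del self._pending[key]
                    return
                fn, args = queue.popleft()

    def shutdown(self):
        self._pool.shutdown(wait=True)


def run_demo():
    """Submit three state-change hooks for one alarm; the first one is slow."""
    seen = []

    def hook(alarm_id, state_snapshot):
        if state_snapshot == 0:
            time.sleep(0.05)  # simulate the earliest hook being delayed
        seen.append((alarm_id, state_snapshot))

    executor = KeyedSerialExecutor(max_workers=4)
    for state in (0, 1, 3):
        executor.submit(42, hook, 42, state)
    executor.shutdown()
    return seen


if __name__ == "__main__":
    print(run_demo())  # [(42, 0), (42, 1), (42, 3)]
```

If the same three hooks were submitted directly to a plain thread pool, the delayed first task could finish last, so a consumer might see state 3 before state 0. Keying the queue by alarm keeps hooks for different alarms running in parallel while preserving order within each alarm.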

Best regards,
Victor