MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair)

Started by testos, November 08, 2012, 11:23:17 AM

testos

Hello.

Recently I was told that we need to implement the MTBF (Mean Time
Between Failures) and MTTR (Mean Time To Repair) measures for each node, but I do not know how.

Is there any way to obtain these measures? Has anybody implemented
this in NetXMS?

Best regards.

Victor Kirhenshtein

Hi!

The key question here is how you determine whether a node is in a failed state. After that, you can combine internal DCIs and custom attributes to do the calculations.

Best regards,
Victor

testos

Hi.

These are nodes that have only SNMP capabilities. The failed state of a node is determined by the node's Internal State OID.

Quote: After that you can do some combination of internal DCIs and custom attributes to do calculations.

Can you expand on this?



Best regards.

Victor Kirhenshtein

Hi!

One possible way to implement the MTBF calculation is the following:

1. Assume that you have two events, one for downtime start and one for downtime end.
2. We will use the following custom attributes for each node:

mtbf
mtbfTotalUptime
mtbfNumFailures
mtbfTimeUp

3. On downtime end, run the following script:


SetCustomAttribute($node, "mtbfTimeUp", time());


4. On downtime start, run the following script:


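// Uptime since the last downtime end (assumes "mtbfTimeUp" was set by the script in step 3)
uptime = time() - GetCustomAttribute($node, "mtbfTimeUp");
// Failure counter; a missing attribute is treated as zero
mtbfNumFailures = GetCustomAttribute($node, "mtbfNumFailures");
if (mtbfNumFailures == null)
   mtbfNumFailures = 0;
mtbfNumFailures++;
// Accumulated uptime; a missing attribute is treated as zero
mtbfTotalUptime = GetCustomAttribute($node, "mtbfTotalUptime");
if (mtbfTotalUptime == null)
   mtbfTotalUptime = 0;
mtbfTotalUptime += uptime;
// MTBF = total uptime / number of failures
mtbf = mtbfTotalUptime / mtbfNumFailures;
// Store the updated values back on the node
SetCustomAttribute($node, "mtbfNumFailures", mtbfNumFailures);
SetCustomAttribute($node, "mtbfTotalUptime", mtbfTotalUptime);
SetCustomAttribute($node, "mtbf", mtbf);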


After this script runs, the custom attribute "mtbf" will contain the MTBF in seconds.

5. If you want to see the current MTBF value as a DCI for the node, you can create a DCI with source "Internal" and name "Dummy", and use the following transformation script:


return GetCustomAttribute($node, "mtbf");


MTTR can be calculated in a similar way.
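
As a rough illustration of the bookkeeping described in the steps above, here is a minimal sketch in Python rather than NXSL, so that it can be run outside NetXMS. The function name mtbf_mttr, the event tuples and the sample timestamps are assumptions made up for the example. It keeps the same running totals as the custom attributes and adds the matching totals for MTTR. Unlike the script above, it skips a downtime start that has no recorded downtime end before it.

```python
def mtbf_mttr(events):
    """Return (mtbf, mttr) in seconds from chronological (timestamp, kind) events.

    kind is "up" (downtime end) or "down" (downtime start).
    Either value is None when there is nothing to average yet.
    """
    time_up = None       # timestamp of the last downtime end (like "mtbfTimeUp")
    time_down = None     # timestamp of the last downtime start
    total_uptime = 0     # like "mtbfTotalUptime"
    total_downtime = 0
    num_failures = 0     # like "mtbfNumFailures"
    num_repairs = 0

    for timestamp, kind in events:
        if kind == "down" and time_up is not None:
            # Downtime start: the interval since the last "up" was uptime.
            total_uptime += timestamp - time_up
            num_failures += 1
            time_up = None
            time_down = timestamp
        elif kind == "up":
            if time_down is not None:
                # Downtime end: the interval since the last "down" was repair time.
                total_downtime += timestamp - time_down
                num_repairs += 1
                time_down = None
            if time_up is None:
                time_up = timestamp

    mtbf = total_uptime / num_failures if num_failures else None
    mttr = total_downtime / num_repairs if num_repairs else None
    return mtbf, mttr


# Hypothetical sample: two one-hour and two-hour uptimes, two 10-minute outages.
events = [(0, "up"), (3600, "down"), (4200, "up"), (11400, "down"), (12000, "up")]
mtbf, mttr = mtbf_mttr(events)
print(mtbf, mttr)
```

With these sample events the total uptime is 10800 seconds over two failures and the total downtime is 1200 seconds over two repairs.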

Best regards,
Victor

testos

Hi Victor.

Thank you very much for your help.
Based on your proposal, I think this requirement can be simplified to just two steps:
1. Create a template called, say, "Availability" with the four DCIs shown in the image
       "Availability template DCIs.png". The transformation scripts for each DCI are:
     
       For "Failures" DCI
       
return GetCustomAttribute($node, "NumFailures");

       For "MTBF (hours)" DCI
       
return GetCustomAttribute($node, "mtbf");

       For "MTTR (hours)" DCI
       
return GetCustomAttribute($node, "mttr");

       For "Node availability (percentage)" DCI
       
// This script calculates MTTR, MTBF and perAvailability parameters and stores them in custom attributes
// Initialize some custom attributes the first time.
// Undefined attributes are created by SetCustomAttribute function automatically
CurrentStatus = GetDCIValue($node, FindDCIByName($node, "Status"));
PreviousState = GetCustomAttribute($node, "PreviousState");
if (PreviousState == null)
{ // In the first time, previous state is null
   SetCustomAttribute($node, "PreviousState", CurrentStatus);
   SetCustomAttribute($node, "TimeStamp", time());   
   SetCustomAttribute($node, "NumFailures", 0); 
   SetCustomAttribute($node, "TotalUptime", 0);
   SetCustomAttribute($node, "TotalDowntime", 0);
   return 100;
}

// From here the 2nd and subsequent times
NumFailures = GetCustomAttribute($node, "NumFailures");
LastTime = time() - GetCustomAttribute($node, "TimeStamp");

// Status is up
if (CurrentStatus == 0)
{
   if (PreviousState != CurrentStatus)
   {   // just changed to up
      // update mttr
      TotalDowntime = GetCustomAttribute($node, "TotalDowntime") + LastTime;
      mttr = TotalDowntime / ((NumFailures == 0) ? 1 : NumFailures) / 3600;   // to prevent division by zero
      SetCustomAttribute($node, "TotalDowntime", TotalDowntime);
      SetCustomAttribute($node, "mttr", mttr);
   }
   else
   {      // still up
      // update mtbf
      TotalUptime = GetCustomAttribute($node, "TotalUptime") + LastTime;
      mtbf = TotalUptime / ((NumFailures == 0) ? 1 : NumFailures) / 3600;   // to prevent division by zero
      SetCustomAttribute($node, "TotalUptime", TotalUptime);
      SetCustomAttribute($node, "mtbf", mtbf);
   }
}

// Status is down
if (CurrentStatus == 4)
{
   if (PreviousState != CurrentStatus)
   {   // just changed to down
      // update mtbf
      NumFailures++;
      TotalUptime = GetCustomAttribute($node, "TotalUptime") + LastTime;
      mtbf = TotalUptime / NumFailures / 3600;
      SetCustomAttribute($node, "NumFailures", NumFailures);
      SetCustomAttribute($node, "TotalUptime", TotalUptime);
      SetCustomAttribute($node, "mtbf", mtbf);
   }
   else
   {   // still down
      // update mttr
      TotalDowntime = GetCustomAttribute($node, "TotalDowntime") + LastTime;
      mttr = TotalDowntime / NumFailures / 3600;
      SetCustomAttribute($node, "TotalDowntime", TotalDowntime);
      SetCustomAttribute($node, "mttr", mttr);
      
   }
}

// Note: the keyword must be a lowercase "if"; a capitalized "If" is a syntax error
if (CurrentStatus == 0 || CurrentStatus == 4)
{
   // Save previous state and timestamp
   SetCustomAttribute($node, "PreviousState", CurrentStatus);
   SetCustomAttribute($node, "TimeStamp", time());   

   // perAvailability section
   TotalUptime = GetCustomAttribute($node, "TotalUptime");
   TotalDowntime = GetCustomAttribute($node, "TotalDowntime");
   perAvailability = TotalUptime / (TotalUptime + TotalDowntime) * 100;
   SetCustomAttribute($node, "perAvailability", perAvailability);

   return perAvailability;
}



2. Apply the template manually to the required nodes, or apply it automatically to nodes selected by a filter script (Properties -> Automatic Apply Rules).


This way, we avoid having to define custom attributes by hand (they are created by the fourth transformation script), events, actions, event processing policy rules, etc.
In addition, the four DCIs are updated at each polling interval.
The script only supports the up (Normal = 0) and down (Critical = 4) node statuses.
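
To make the per-poll accounting easier to follow, here is a minimal sketch of the same idea in Python rather than NXSL, so it can be tried outside NetXMS. The function poll, the state dictionary (standing in for the node's custom attributes) and the sample poll times are assumptions made up for the example, and it is a simplification: polls with a status other than 0 or 4 are ignored completely.

```python
def poll(state, status, now):
    """Account for one poll and return the availability in percent.

    state  -- dict standing in for the node's custom attributes
    status -- 0 (up) or 4 (down); anything else is ignored
    now    -- poll time in seconds
    """
    if status not in (0, 4):
        return None
    if "previous" not in state:
        # First poll: initialize the counters, nothing to measure yet.
        state.update(previous=status, timestamp=now,
                     failures=0, uptime=0, downtime=0)
        return 100.0

    elapsed = now - state["timestamp"]
    if status == 4 and state["previous"] == 0:
        # Just changed to down: the elapsed interval was uptime, count a failure.
        state["failures"] += 1
        state["uptime"] += elapsed
    elif status == 0 and state["previous"] == 0:
        # Still up.
        state["uptime"] += elapsed
    else:
        # Still down, or just changed back to up: the interval was downtime.
        state["downtime"] += elapsed

    state["previous"] = status
    state["timestamp"] = now
    total = state["uptime"] + state["downtime"]
    return 100.0 * state["uptime"] / total if total else 100.0


# Hypothetical run: one poll per minute, with the node down for two polls.
state = {}
for now, status in [(0, 0), (60, 0), (120, 4), (180, 4), (240, 0), (300, 0)]:
    availability = poll(state, status, now)
print(state["failures"], state["uptime"], state["downtime"], availability)
```

In this run the node accumulates 180 seconds of uptime and 120 seconds of downtime over one failure. MTBF and MTTR follow by dividing those totals by the failure count.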

Best regards.

millerpaint

Hi Testos,

This is something that we can use for sure.  Were you successful using the simplified 2 step implementation?


-Kevin C.

Victor Kirhenshtein

Hi!

That's really cool! I'll put this into wiki as well.

Best regards,
Victor

testos

millerpaint,
I apply this template to the nodes for which I need to check whether my Internet Service Provider meets the availability contracted in its Service Level Agreement, i.e. all remote nodes.

Best regards.

Marco Incalcaterra

Hi,

Is the template intended to be used on any node? I applied it to two nodes running the NetXMS agent and left it running for a couple of days, but I only get:

Failure 0
MTBF (hours) empty
MTTR (hours) empty
Node availability (percentage) 0

Any hints on where I'm wrong?

Best regards,
Marco

Victor Kirhenshtein

Hi!

I've found a syntax error in the transformation script. The if statement starts with a capital letter:

If (CurrentStatus == 0 || CurrentStatus == 4)

After changing it to lowercase, the script seems to work.

Best regards,
Victor

Marco Incalcaterra

Hi Victor!

Yes, that was the problem. Thank you very much for your help!

Best regards,
Marco