2.0.3 segfaults - dirty workaround

Started by KjellO, April 19, 2016, 02:13:45 PM


KjellO

Just for your information...
I was upgrading from 1.2.17 to 2.0.3, but the server segfaulted shortly after startup. Running it under gdb showed that the crashes occurred in different parts of the code from run to run, which was a bit confusing, but all of them seemed to be related to either loading the DCI cache or applying templates.

In some template auto-apply scripts we use DCI values to determine whether a template should be applied. As we know, it takes some time before all DCI values are populated. In the old server this could cause nodes to be thrown out of templates at server start and then included again at the next template-apply round. The problem we had with 2.0.3 was probably that the server tried to reload the cache for DCIs that had just vanished because AutoApply removed them from the nodes. This caused all kinds of null pointer crashes. Yes, this approach to applying templates may not be the recommended way, but in our case it is a neat way to distinguish between similar nodes running different versions that require different DCIs.

Our quick and dirty solution was to create the Apply template thread at a later point and delay it a bit before it starts working. The server then started as expected.

Hope this does not have other unwanted side effects.


~/netxms-2.0.3/src/server/core$ diff -u objects.cpp_orig objects.cpp
--- objects.cpp_orig    2016-04-10 12:46:07.272845568 +0200
+++ objects.cpp 2016-04-12 08:41:45.788328287 +0200
@@ -69,7 +69,12 @@
  */
static THREAD_RESULT THREAD_CALL ApplyTemplateThread(void *pArg)
{
-       DbgPrintf(1, _T("Apply template thread started"));
+   DbgPrintf(1, _T("Apply template thread started"));
+
+   DbgPrintf(2, _T("Delaying start of ApplyTemplateThread for 5min..."));
+   sleep(300);
+   DbgPrintf(2, _T("ApplyTemplateThread continuing."));
+
    while(1)
    {
       TEMPLATE_UPDATE_INFO *pInfo = (TEMPLATE_UPDATE_INFO *)g_pTemplateUpdateQueue->getOrBlock();
@@ -241,8 +246,8 @@
        // Initialize service checks
        SlmCheck::init();

-   // Start template update applying thread
-   ThreadCreate(ApplyTemplateThread, 0, NULL);
+   // Start template update applying thread, moved to end of LoadObjects
+   //ThreadCreate(ApplyTemplateThread, 0, NULL);
}

/**
@@ -1773,6 +1778,10 @@
    // Start map update thread
    ThreadCreate(MapUpdateThread, 0, NULL);

+   // Start template update applying thread. Moved from ObjectsInit
+   ThreadCreate(ApplyTemplateThread, 0, NULL);
+
+
    return TRUE;
}


Victor Kirhenshtein

Hi,

it seems that the actual problem was cache loading without locking: CacheLoadingThread calls updateDciCache on every node, which in turn calls updateCacheSize on every DCI, and there was no lock at the DCI level inside, so any access to a DCI at that moment may cause unpredictable results. Because you have NXSL scripts accessing DCI data running at the same moment, as well as template removal, that is what happens. I've changed cache loading to use proper locks on DCIs, which should solve this issue (the changes are already pushed to the develop and stable-2.0 branches).
Btw, you could avoid template unbinding on startup by checking whether the return value of the GetDCIValue function is null and, if so, aborting the auto-apply script (using the abort statement). The system will make no change to the current binding status if the script completes with a runtime error.
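
To illustrate the kind of per-DCI locking described above, here is a minimal sketch. It is not the actual NetXMS change: the `Item` and `Node` classes are hypothetical stand-ins, and only the method names `updateDciCache` and `updateCacheSize` are borrowed from the description. The idea is that each item guards its own cache with a mutex, and the node hands the cache loader a snapshot of shared pointers, so an item removed by a template unbind stays alive until the loader is done with it.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a DCI (not the NetXMS class).
class Item
{
public:
   // Resize the cache while holding the item-level lock.
   void updateCacheSize(size_t size)
   {
      std::lock_guard<std::mutex> guard(m_lock);
      m_cache.assign(size, 0.0);
   }

   // Readers take the same lock, so they never see a half-built cache.
   size_t cacheSize()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      return m_cache.size();
   }

private:
   std::mutex m_lock;
   std::vector<double> m_cache;
};

// Hypothetical stand-in for a node that owns a list of items.
class Node
{
public:
   void addItem()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      m_items.push_back(std::make_shared<Item>());
   }

   // What a template unbind would do: drop the items from the node.
   void removeAllItems()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      m_items.clear();
   }

   size_t itemCount()
   {
      std::lock_guard<std::mutex> guard(m_lock);
      return m_items.size();
   }

   // Cache loader entry point. Returns the number of items processed.
   size_t updateDciCache(size_t size)
   {
      std::vector<std::shared_ptr<Item>> snapshot;
      {
         // Hold the list lock only long enough to copy the pointers.
         std::lock_guard<std::mutex> guard(m_lock);
         snapshot = m_items;
      }
      // Items in the snapshot stay valid even if removed concurrently.
      for (auto& item : snapshot)
         item->updateCacheSize(size);
      return snapshot.size();
   }

private:
   std::mutex m_lock;
   std::vector<std::shared_ptr<Item>> m_items;
};
```

With this layout, one thread can call `updateDciCache` while another calls `removeAllItems` without either touching freed memory. The trade-off is that the loader may still fill the cache of an item that has just been removed, which is wasted work but harmless.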

Best regards,
Victor

KjellO

Great! Thanks for your support and for pointing out the abort statement, very useful in these scenarios.

Best regards,
Kjell