Segfault after migration from 3.9 to 4.0 or 4.1

Started by hoeth, May 10, 2022, 03:07:29 PM


hoeth

Hi,

We tried upgrading from NetXMS 3.9.176 to NetXMS 4 on Debian (buster and bullseye, with PostgreSQL 11 and 13, respectively). nxdbmgr did its job of upgrading the database without complaining, but netxmsd segfaults after about one minute. We tried NetXMS 4.0 and 4.1.283, with the same effect.

Setting up a new NetXMS installation with the same scripts, templates, events, nodes, etc. works fine, but we would like to keep the data history. Has anybody else observed this behavior? Is there anything special we need to do?

Victor Kirhenshtein

Hi,

Do you have a coredump from the crash? If not, could you try running netxmsd under gdb? To run it under the debugger, follow these instructions:

1. Stop the netxmsd service
2. Install the netxms-server-dbg package
3. Run

gdb netxmsd

4. At the (gdb) prompt, enter

run -D2

5. When the server crashes, you'll get the (gdb) prompt again. Enter the command

bt

and provide the output.
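
For reference, the session should look roughly like this (output abbreviated; the crash message and stack frames are placeholders):

   $ gdb netxmsd
   (gdb) run -D2
   ...
   Thread N "netxmsd" received signal SIGSEGV, Segmentation fault.
   (gdb) bt
   #0  ...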

Best regards,
Victor

fldiet

Hi,

I'm a coworker of hoeth.

Here is the backtrace you requested:
(gdb) bt
#0  CalculateIPChecksum (data=data@entry=0x7fff6b574430, len=len@entry=18446744073709551597) at tools.cpp:470
#1  0x00007ffff795ec0a in PingRequestProcessor::sendRequestV4 (this=0x7ffff79cd560 <s_processorV4>, request=0x7fff6b5764d0) at icmp.cpp:494
#2  0x00007ffff795f345 in PingRequestProcessor::sendRequest (request=0x7fff6b5764d0, this=0x7ffff79cd560 <s_processorV4>) at icmp.cpp:222
#3  PingRequestProcessor::ping (this=this@entry=0x7ffff79cd560 <s_processorV4>, addr=..., timeout=timeout@entry=1500, rtt=rtt@entry=0x7fff6b576604, packetSize=packetSize@entry=1, dontFragment=dontFragment@entry=false) at icmp.cpp:686
#4  0x00007ffff795f45c in PingLoop (dontFragment=false, packetSize=1, rtt=0x7fff6b576604, timeout=1500, numRetries=0, addr=..., p=0x7ffff79cd560 <s_processorV4>) at icmp.cpp:776
#5  IcmpPing (addr=..., numRetries=numRetries@entry=1, timeout=1500, rtt=rtt@entry=0x7fff6b576604, packetSize=1, dontFragment=dontFragment@entry=false) at icmp.cpp:796
#6  0x00007ffff7cc31bf in Node::icmpPollAddress (this=this@entry=0x7fffc2cbf810, conn=conn@entry=0x0, target=0x7fff6e4bd000 L"PRI", addr=...) at node.cpp:11322
#7  0x00007ffff7cccf62 in Node::icmpPoll (this=0x7fffc2cbf810, poller=<optimized out>) at node.cpp:11280
#8  0x00007ffff7d41358 in Pollable::doIcmpPoll (this=0x7fffc2cc0300, poller=0x7fff8dfb9700) at pollable.cpp:237
#9  0x00007ffff7c0acd1 in __ThreadPoolExecute_Wrapper_1<Pollable, PollerInfo*> (arg=0x7fff8df9b620) at ../../../include/nms_threads.h:1101
#10 0x00007ffff79959ce in ProcessSerializedRequests (data=0x7fff8dfa1880) at tp.cpp:472
#11 0x00007ffff7995736 in WorkerThread (threadInfo=0x7fff85fa0820) at tp.cpp:211
#12 0x00007ffff79970fa in ThreadCreate_Wrapper_1<WorkerThreadInfo*> (context=0x7fff85fa0830) at ../../include/nms_threads.h:542
#13 0x00007ffff77eefa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#14 0x00007ffff6f11eff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Kind regards,

  Florian

Victor Kirhenshtein

Hi,

Looks like you have the server configuration parameter ICMP.PingSize set to 1 (or some other small value). It has to be set to at least 46. There is a bug in the server: it does not check this value for validity, and an incorrect value causes a crash later on.
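
For illustration, here is a sketch of the kind of validity check that is missing (hypothetical code, not the actual NetXMS source; the function name is made up, and the exact lower bound is refined later in this thread):

   #include <cstdint>

   // Hypothetical clamp for the configured ICMP.PingSize value.
   // The lower bound of 28 = IP header (20) + ICMP header (8); see below.
   static uint32_t sanitizePingSize(uint32_t configured)
   {
      const uint32_t minPingSize = 28;
      return (configured < minPingSize) ? minPingSize : configured;
   }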

Best regards,
Victor

fldiet

Hi,

Setting ICMP.PingSize back to 46 from its lowered value fixed the segmentation faults after upgrading.
Interestingly, the low value did not seem to cause any issues at runtime prior to the upgrade from NetXMS 3.9.176.

Thank you very much!


Best regards,

  Florian

hoeth

Yes, we set the ping size to 1 in order to reduce traffic: most of our nodes are connected through mobile internet, so we pay for the traffic. What's the reason for the 46-byte minimum? And how did NetXMS 3.9 handle this? Did it simply fall back to larger packets?

Victor Kirhenshtein

Actually, as I started thinking about it, the minimum size is 28, not 46. Because this value includes both the IP header (20 bytes) and the ICMP header (8 bytes), it cannot be less than that. 46 is the minimum payload size of an Ethernet frame. If you are using only Ethernet for communications, setting the ping size to any value below 46 will not reduce traffic, as the payload will be padded to the minimum length anyway. However, if you are using communication channels capable of sending shorter frames, then reducing the ping size further can make sense.
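
To make the arithmetic concrete, a small standalone sketch (constants taken from the sizes above; not NetXMS code):

   #include <cstdio>

   int main()
   {
      const int ipHeader      = 20;                    // IPv4 header
      const int icmpHeader    = 8;                     // ICMP echo header
      const int minPacket     = ipHeader + icmpHeader; // 28: smallest valid ICMP.PingSize
      const int ethMinPayload = 46;                    // Ethernet pads shorter payloads

      printf("minimum ICMP.PingSize: %d bytes\n", minPacket);
      printf("below %d bytes, Ethernet padding cancels any savings\n", ethMinPayload);
      return 0;
   }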

hoeth

Ah, I thought ICMP.PingSize referred to the payload, not to the whole packet.

I'm now looking at https://wiki.netxms.org/wiki/Server_Configuration_Variables, which says "Size of ICMP packets (in bytes, excluding IP header size) used for status polls." So I guess the minimum would be 8, correct? Or is that documentation wrong and the IP header is included in ICMP.PingSize?

Victor Kirhenshtein

The documentation is wrong, I checked the source code :) Will fix that.


fldiet

I have done some more testing of our servers' behaviour with different ICMP.PingSize values.
This time the value was changed in a working NetXMS 4.1.283 instance (not going through an upgrade from 3.9).
After each restart of NetXMS to apply the new value, I waited 5 minutes before deciding whether the error occurred; when it did occur, it always happened well within 2 minutes.

Here is the outcome:
28 - no problems, as expected
20 - no problems, unexpected
19 - segmentation fault as discussed

Could it be that the documentation is partly correct, and the value excludes not the IP header (20 bytes) but the ICMP header (8 bytes)?


Victor Kirhenshtein

The actual code that caused the crash looks like this:

   int bytes = request->packetSize - sizeof(IPHDR);   // checksum length = total packet size minus IP header
   packet.m_icmpHdr.m_wChecksum = 0;                  // checksum field must be zero while it is computed
   packet.m_icmpHdr.m_wChecksum = CalculateIPChecksum(&packet, bytes);


If the total packet size is less than 20, bytes will be negative, which causes the crash inside CalculateIPChecksum: its length parameter is an unsigned size_t, so the negative value is converted to a huge length and the function reads far past the packet buffer. With a packet size between 20 and 27, the result will be positive and CalculateIPChecksum will calculate a checksum for the requested number of bytes, but an invalid ICMP packet will be sent (containing only part of the ICMP header).
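
For illustration, a minimal standalone sketch of that conversion (not NetXMS code: the IP header is stood in by a 20-byte struct, and the checksum function by a stub that only prints the length it receives). With packetSize = 1, as in the original report, it reproduces the len value from frame #0 of the backtrace on a 64-bit platform:

   #include <cstdio>
   #include <cstddef>
   #include <cstdint>

   // Stand-in for the 20-byte IPv4 header (mirrors sizeof(IPHDR) == 20).
   struct FakeIpHeader { uint8_t raw[20]; };

   // Stub with the same kind of length parameter as CalculateIPChecksum:
   // size_t is unsigned, so a negative int wraps around to a huge value.
   static void checksumStub(const void *data, size_t len)
   {
      (void)data;
      printf("len as seen by the checksum function: %zu\n", len);
   }

   int main()
   {
      int packetSize = 1;                                  // ICMP.PingSize = 1
      int bytes = packetSize - (int)sizeof(FakeIpHeader);  // 1 - 20 = -19
      printf("bytes = %d\n", bytes);
      // Implicit int -> size_t conversion: -19 becomes 2^64 - 19 =
      // 18446744073709551597, exactly the len value in stack frame #0.
      checksumStub(nullptr, bytes);
      return 0;
   }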

Best regards,
Victor

fldiet

Ah, that clarifies the situation.
Thank you very much for your insight!

Best regards,

  Florian