nxpush "Request timed out" on nodes with variable connectivity only.

Started by fldiet, June 16, 2021, 11:29:49 AM

Previous topic - Next topic

fldiet

Hello,

Prerequisite information:
We are pushing data to NetXMS from the machine hosting NetXMS itself with a
series of nxpush calls - one for each target node.

Observation:
We are experiencing a node-specific issue, concerning a set of local nxpush
sessions timing out statistically for specific nodes only.  Those nodes are all
experiencing connectivity issues. There seems to be a correlation between
occurrences of timeouts and network issues.

Question:
At the event of a new nxpush session, are there any internal checks taking
place requiring connectivity of the node? Could a recurring process like a poll
get stuck at the connectivity issue and interfere?


Kind regards,

  Florian

Filipp Sudanov

Collection of server log would be beneficial here. Can you set debug level 6 for client.* tag (issue "debug client.* 6" in Tools->Server Console or you can set this in server configuration file).
nxpush is creating the same connection as nxmc, so sessions of both will be visible under this tag. Each new connecting gets a new session id which is visible in the debug tag, e.g. client.session.1234.

Then, when you encounter a nxpush nang, you should be able to find and grep out a corresponding session from the log file.

fldiet

I configured the debug level as requested, restarted netxmsd, set up a series of identical nxpush requests sequentially without artificial delay and observed the result of nxpush next to netxmsd's log.

Here are some basic observances:

  • Typically the nxpush requests take 0.172-0.300s.
  • Isolated hangs take around 5.05s ( low variance ) and do not influence following pushes time.
  • Clusters of hangs back to back ( around 5-10 ) raise the following successful attempts time:

         
    • Hang cluster

             
      • 46.2s ( first following successful )
      •   2.2s ( second following successful )
      •   0.2s ( third  following successful )
    • Hang cluster

             
      • 25.3s
      •   2.9s
      •   0.2s
    • Hang cluster

             
      • 12.9s
      •   0.2s
    • Hang cluster

             
      • 17.4s
      •   0.2s
Here is a snippet of the logs in the neighbourhood of an isolated hang ( answering with CMD_KEEPALIVE to CMD_PUSH_DCI_DATA ), including the surrounding successful ones:

2021.06.16 15:30:01.735 *D* [client.session     ] Client session with ID 84 registered
2021.06.16 15:30:01.735 *D* [client.session.84  ] Received message CMD_GET_SERVER_INFO
2021.06.16 15:30:01.735 *D* [client.session.84  ] Server time zone: UTC+00UTC
2021.06.16 15:30:01.735 *D* [client.session.84  ] Sending message CMD_REQUEST_COMPLETED (616 bytes)
2021.06.16 15:30:01.736 *D* [client.session.84  ] Received message CMD_LOGIN
2021.06.16 15:30:01.736 *D* [client.session.84  ] authentication type 0
2021.06.16 15:30:01.736 *D* [client.session.84  ] Protocol level compression is supported by client
2021.06.16 15:30:01.736 *D* [client.session.84  ] User <ourUserForNxpush>@127.0.0.1 authenticated (language=en clientInfo="nxpush/3.8.382 (Linux 4.19.0-16-amd64; libnxcl 3.8.382)")
2021.06.16 15:30:01.736 *D* [client.session.84  ] Sending compressed message CMD_LOGIN_RESP (312 bytes)
2021.06.16 15:30:01.777 *D* [client.session.84  ] Received message CMD_PUSH_DCI_DATA
2021.06.16 15:30:02.545 *D* [client.session.84  ] Sending message CMD_REQUEST_COMPLETED (32 bytes)
2021.06.16 15:30:02.545 *D* [client.session.84  ] readSocket: connection closed
2021.06.16 15:30:02.545 *D* [client.session.84  ] Session closed
2021.06.16 15:30:02.545 *D* [client.session     ] Client session with ID 84 unregistered
2021.06.16 15:30:02.545 *D* [client.session.84  ] Socket closed
2021.06.16 15:30:02.545 *D* [client.session.84  ] Session object destroyed


2021.06.16 15:30:02.560 *D* [client.session     ] Client session with ID 84 registered
2021.06.16 15:30:02.560 *D* [client.session.84  ] Received message CMD_GET_SERVER_INFO
2021.06.16 15:30:02.560 *D* [client.session.84  ] Server time zone: UTC+00UTC
2021.06.16 15:30:02.560 *D* [client.session.84  ] Sending message CMD_REQUEST_COMPLETED (616 bytes)
2021.06.16 15:30:02.561 *D* [client.session.84  ] Received message CMD_LOGIN
2021.06.16 15:30:02.561 *D* [client.session.84  ] authentication type 0
2021.06.16 15:30:02.561 *D* [client.session.84  ] Protocol level compression is supported by client
2021.06.16 15:30:02.561 *D* [client.session.84  ] User <ourUserForNxpush>@127.0.0.1 authenticated (language=en clientInfo="nxpush/3.8.382 (Linux 4.19.0-16-amd64; libnxcl 3.8.382)")
2021.06.16 15:30:02.561 *D* [client.session.84  ] Sending compressed message CMD_LOGIN_RESP (312 bytes)
2021.06.16 15:30:02.602 *D* [client.session.84  ] Received message CMD_PUSH_DCI_DATA
2021.06.16 15:30:05.416 *D* [client.session.84  ] Sending message CMD_KEEPALIVE (32 bytes)
2021.06.16 15:30:05.423 *D* [client.session.84  ] Sending message CMD_NOTIFY (48 bytes)
2021.06.16 15:30:05.495 *D* [client.session.84  ] Sending message CMD_NOTIFY (48 bytes)
2021.06.16 15:30:07.601 *D* [client.session.84  ] readSocket: connection closed
2021.06.16 15:30:07.602 *D* [client.session.84  ] Waiting for pending requests...


2021.06.16 15:30:07.627 *D* [client.session     ] Client session with ID 71 registered
2021.06.16 15:30:07.627 *D* [client.session.71  ] Received message CMD_GET_SERVER_INFO
2021.06.16 15:30:07.627 *D* [client.session.71  ] Server time zone: UTC+00UTC
2021.06.16 15:30:07.627 *D* [client.session.71  ] Sending message CMD_REQUEST_COMPLETED (616 bytes)
2021.06.16 15:30:07.628 *D* [client.session.71  ] Received message CMD_LOGIN
2021.06.16 15:30:07.628 *D* [client.session.71  ] authentication type 0
2021.06.16 15:30:07.628 *D* [client.session.71  ] Protocol level compression is supported by client
2021.06.16 15:30:07.628 *D* [client.session.71  ] User <ourUserForNxpush>@127.0.0.1 authenticated (language=en clientInfo="nxpush/3.8.382 (Linux 4.19.0-16-amd64; libnxcl 3.8.382)")
2021.06.16 15:30:07.628 *D* [client.session.71  ] Sending compressed message CMD_LOGIN_RESP (312 bytes)
2021.06.16 15:30:07.668 *D* [client.session.71  ] Received message CMD_PUSH_DCI_DATA
2021.06.16 15:30:07.989 *D* [client.session.71  ] Sending message CMD_REQUEST_COMPLETED (32 bytes)
2021.06.16 15:30:07.990 *D* [client.session.71  ] readSocket: connection closed
2021.06.16 15:30:07.990 *D* [client.session.71  ] Session closed
2021.06.16 15:30:07.990 *D* [client.session     ] Client session with ID 71 unregistered
2021.06.16 15:30:07.990 *D* [client.session.71  ] Socket closed
2021.06.16 15:30:07.990 *D* [client.session.71  ] Session object destroyed


Victor Kirhenshtein

Hi,

it looks that DCI data processing for some reason takes too long. Do you have any transformation scripts or script thresholds on those push DCIs?

Best regards,
Victor

fldiet

Hi,

Yes, there are some transformation scripts.  But those are limited in
complexity as well as number ( ~700 DCIs ) and should therefore be manageable.
  • 88% of DCIs do not have a transformation script
  • 9% of DCIs have a lower bound or static normalization ( if;else / division )
  • 3% of DCIs show dynamic normalization using up to two
    'GetDCIValue($node, FindDCIByName($node, "..."));' invokes
    ( all referenced quantities are push-DCIs )

There are no more than 30 *DCI*() invokes for said client in total.
Thresholds boil down to very basic diff(), last() and average() calls.
These DCIs and their threshold events do not trigger any further scripts.

---

Multiple other nodes under very similar circumstances ( they inherit from the
same DCI templates or more, and are being pushed to at the same frequency ) on
the very same server instance are handled perfectly fine.  We are talking ~5
afflicted and 50 unaffected nodes.

Here is a quick plot I did for navigating/cross-referencing netxmsd's logs for
hangs in my last post.
I added the data of an unaffected node ( taken briefly after ) comparing the
time nxpush took for a set of attempts to an affected nodes behaviour.
https://i.imgur.com/CUrEvin.png

Am I off track for assuming there should be an external variable connecting the affected nodes?


Kind regards,

  Florian

Filipp Sudanov

Let's try to exclude the option that transformation scripts get randomly delayed. Try wrapping transformation scripts in something like
trace(0, $object->id . ":" . $dci->id . " tranformation script start");
...
trace(0, $object->id . ":" . $dci->id . " tranformation script end");


first parameter of trace() defines the log level. This way you should get some stats on the actual duration of the transformation scripts.