Better notification/Slack integration? Need to stop the spam

Started by Millenium7, December 06, 2017, 01:11:17 AM

Millenium7

We currently have Centreon in place for monitoring. It is far too difficult to automate, hence we're looking at NetXMS, but Centreon does have pretty good notification support, certainly a lot better than what NetXMS offers at the moment: https://wiki.netxms.org/wiki/Slack.com_integration

The above guide works, but the result is unusable for us. We just had an outage on the router where the NetXMS server is located, and once it came back up everybody got bombarded with hundreds of 'Node Down' messages followed by hundreds of 'Node Up' messages. I presume the NetXMS monitor has a delay, because the messages all came through approximately 1 second apart, meaning complete spam for about 10 minutes straight.
I also intend to incorporate a lot of other parameters such as temperature/voltage/SNR monitoring, which would have amplified the spam by 100x.

So my questions are these:
1) Is there a better Slack add-on than the SMS system, one providing a bit more flexibility, e.g. categories? (Centreon can mark messages as low/medium/high priority with colors.)

2) Setting dependencies on devices, so that e.g. if a main router goes down, notifications are not sent for everything behind that router, because obviously everything behind it will also be unreachable. We only want a message that the router is down, with all notifications for devices behind it suppressed.

3) Dependencies for DCIs on devices. Obviously if a client radio loses connection then I don't need notifications about 0 SNR, 0 signal strength, etc.

4) Is there a way to buffer the messages for e.g. 2 minutes and then send all of them at the same time? That way, if a few separate devices all go down at around the same time, at least we don't get spam every second.

5) Re-sending notifications for some core devices. At the moment if a node goes down we only ever get one message. I want important devices such as core routers or radio links to keep sending a 'down' notification every 10 minutes until they come back up.

6) Notification schedule? I haven't found it in NetXMS yet. We do want monitoring to continue, but we don't want any Slack messages after 10pm or before 6am.

Victor Kirhenshtein

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
1) Is there a better Slack add-on than the SMS system, one providing a bit more flexibility, e.g. categories? (Centreon can mark messages as low/medium/high priority with colors.)

Currently no. We plan to replace the current model, in which everything except email is implemented as SMS drivers, with more flexible "communication channels". Then we will be able to make the Slack connector more flexible.

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
2) Setting dependencies on devices, so that e.g. if a main router goes down, notifications are not sent for everything behind that router, because obviously everything behind it will also be unreachable. We only want a message that the router is down, with all notifications for devices behind it suppressed.

The server is supposed to do this automatically if it knows the IP path from itself to the device. Please check whether the server has the necessary IP topology information. If it does, then it's a bug. If not, and you cannot add the intermediate devices to monitoring, you can use filtering scripts to check the router's status when handling a node down event and skip event processing if the router is down.
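
To illustrate the filtering idea only, here is a minimal sketch of the logic in Python rather than NXSL. The `UPSTREAM` table, the node names, and `should_notify` are all hypothetical and not part of NetXMS; a real filtering script would look up the router object's status instead.

```python
# Hypothetical topology: node name -> upstream device name (None for the root).
UPSTREAM = {
    "core-router": None,
    "tower-ap": "core-router",
    "client-radio": "tower-ap",
}


def should_notify(node, down_nodes):
    """Return True only if no device upstream of `node` is already down."""
    parent = UPSTREAM.get(node)
    while parent is not None:
        if parent in down_nodes:
            return False  # an upstream device already explains the outage
        parent = UPSTREAM.get(parent)
    return True


# Everything behind the core router is unreachable at once.
down = {"core-router", "tower-ap", "client-radio"}
for name in sorted(down):
    print(name, should_notify(name, down))
```

Only the topmost failed device passes the check, so a single outage produces a single notification.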

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
3) Dependencies for DCIs on devices. Obviously if a client radio loses connection then I don't need notifications about 0 SNR, 0 signal strength, etc.

You can probably use persistent storage and/or custom properties to set certain flags, and then check them in other thresholds or event processing rules. For example, set a flag when the client disconnects and check it when processing the "signal strength 0" event; if the flag is set, ignore the event. Or create a script threshold and don't even trigger it if the client is disconnected.
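
A rough sketch of that flag pattern, written in Python purely to show the logic. The `storage` dictionary stands in for persistent storage or custom properties, and the key format and function names are assumptions made for this example.

```python
# Stand-in for persistent storage / custom properties (hypothetical key names).
storage = {}


def on_client_disconnect(node):
    """Set the flag when the 'client disconnected' event is processed."""
    storage[f"{node}.disconnected"] = True


def on_client_reconnect(node):
    """Clear the flag once the client is back."""
    storage.pop(f"{node}.disconnected", None)


def should_process_signal_event(node):
    """Ignore SNR / signal strength events while the client is flagged as disconnected."""
    return not storage.get(f"{node}.disconnected", False)
```

The disconnect rule has to be evaluated before the signal strength rule for the flag to be in place when it is checked.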

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
4) Is there a way to buffer the messages for e.g. 2 minutes and then send all of them at the same time? That way, if a few separate devices all go down at around the same time, at least we don't get spam every second.

No.

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
5) Re-sending notifications for some core devices. At the moment if a node goes down we only ever get one message. I want important devices such as core routers or radio links to keep sending a 'down' notification every 10 minutes until they come back up.

You can create a script DCI that returns 1 if the node is down (the script can get this from $node->flags) and 0 if not, and set up a threshold as usual with the repeat interval set as needed.
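
The effect of a threshold with a repeat interval can be pictured with the small Python model below. It is an illustration of the behaviour only, not how the server implements it; `make_repeater`, `should_send`, and the 600 second default (the 10 minutes from the question) are assumptions for this sketch.

```python
REPEAT_INTERVAL = 600  # seconds, i.e. the 10 minutes asked for in the question


def make_repeater(interval=REPEAT_INTERVAL):
    """Return a function deciding whether a 'down' notification is due."""
    last_sent = {}  # node name -> timestamp of the last notification

    def should_send(node, is_down, now):
        if not is_down:
            last_sent.pop(node, None)  # node recovered, so reset its timer
            return False
        previous = last_sent.get(node)
        if previous is None or now - previous >= interval:
            last_sent[node] = now
            return True
        return False

    return should_send
```

Each poll of the DCI corresponds to one call of `should_send`; a notification goes out on the first failed poll and then again whenever the interval has elapsed while the node is still down.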

Quote from: Millenium7 on December 06, 2017, 01:11:17 AM
6) Notification schedule? I haven't found it in NetXMS yet. We do want monitoring to continue, but we don't want any Slack messages after 10pm or before 6am.

You can use a filtering script in notification rules. A simple script that allows actions to be executed only between 6am and 10pm could look like this:

// Allow actions only from 06:00 up to (not including) 22:00 local time
now = localtime();
return (now->hour >= 6) && (now->hour < 22);
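
The boundary hours are easy to get wrong: an inclusive comparison against 22 would still let messages through until 22:59. The same check, sketched in Python so the edges can be tested (the function name and parameters are chosen for this example only):

```python
from datetime import datetime


def in_notification_window(now, start_hour=6, end_hour=22):
    """True from start_hour:00 up to, but not including, end_hour:00."""
    return start_hour <= now.hour < end_hour


# Check the edges of the window on an arbitrary date.
for hour, minute in [(5, 59), (6, 0), (21, 59), (22, 0)]:
    moment = datetime(2017, 12, 6, hour, minute)
    print(moment.strftime("%H:%M"), in_notification_window(moment))
```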


Best regards,
Victor

Millenium7

Thanks for the response, Victor.

It seems like a lot of additional scripting is required to get close to the sort of notifications we would like.
I don't mind adding scripts if they are easy to manage and change, and if I can write a very simple guide for our other staff on how to do so. We don't all have time to learn the NetXMS scripting language and how/where to change values, and we don't want to spend a significant portion of our time monitoring the monitoring system.

Since we will have multiple types of alarms (node down, low SNR, station disconnect, etc.), we don't want to have to change 100 scripts if we decide to change our notification schedule.
Can we execute actions from within a script?
If so, would something like this be the best way to manage it?
1) In Actions Configuration, edit 'Slack Notification', change it from 'send SMS' to 'Execute NXSL Script', and type in the script name "CheckScheduleTime"
2) Go to Script Library and create the script 'CheckScheduleTime'. Add something like:

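NotiStart = 6;   // first hour in which notifications are allowed
NotiEnd = 22;    // notifications stop at this hour (10pm)
now = localtime();
// DoAction() is a placeholder name for "execute another action from a script"
if (now->hour >= NotiStart && now->hour < NotiEnd)
  DoAction("Push Slack Notification");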


3) Create another action called 'Push Slack Notification' and set it up to send the Slack SMS

This way everything under 'event processing policy' can be set up with just one action, 'Slack Notification'. That action calls a script which checks the time, and the script calls the action which actually sends the Slack notification. Then if we change our hours (e.g. go on holidays and want to disable notifications) we don't have to edit every single action in 'event processing policy'. We only have to change the time in the "CheckScheduleTime" script. Is that correct?