Broadcast Storm

Started by csharve, October 06, 2021, 03:42:37 PM

csharve

Hello all! I have some newbie questions regarding upgrades. I am new to my role and this is my first experience with NetXMS. Some background. We have been using NetXMS v2.1 and it came down from upper management that we need to get current software up to date, which can be difficult in a production facility! I was tasked with upgrading v2.1 to the most recent version.

After doing some research it looks as if there is no need to make an intermediary upgrade to another version first. I found info on these forums that I can upgrade right to 3.9, is that correct?

Following the Admin Guide, I stopped the NetXMS server and performed "nxdbmgr check". The check returned no errors, everything showed "Passed". Next I ran the "netxms-server-3.9.298-x64.exe" server upgrade. After the upgrade completed the service did not start, so I performed "nxdbmgr upgrade" to the database. When the database upgrade completed, the service started back up on its own and I started to receive NetXMS emails that many of my nodes changed state to UP. All seems good, right? This is where things went downhill.

I performed the management console upgrade next. I went with the default settings, and all looked good. I was then contacted by a few people in operations who were seeing odd system alarms. My facility has 5 different production plants, all running their own distributed control system (DCS). Every unit that runs the control system (26 total) started giving the following alarm: "Broadcast Storm ended duration 100 seconds". As you can imagine, this raised major concerns! Our DCS vendor told us something is flooding the network, and out of fear of the unknown I stopped the NetXMS service, which in turn stopped all the broadcast storm alarms. Obviously, I cannot take the chance of shutting down 5 production plants!

So the question is, what is happening??? I use NetXMS to monitor servers, switches, firewalls, and operator consoles. There are many, many devices that could be added to that list but they are not monitored through NetXMS, but rather through the DCS. Is v3.9 scanning everything across all plants and causing this? Is it a Network Discovery issue? Can I turn that off through a CLI? I don't even know if the management console is up and running because I was forced to kill the server before I could even start the console!

Obviously, the server upgrade was successful and it connected to the database successfully, hence all the emails from NetXMS telling me my nodes were all changing state to UP. I have a fear to even start the server back up to check the console because the alarms will start to come in again and I don't know what issues it could cause (and apparently the vendor doesn't either). We did run a quick test just to verify it was NetXMS, started server up, alarms came in, shut it down.

Of course I do have the option of rolling back to v2.1, can use a recovery point on the SNMP server and I backed up the database prior to upgrading, but I'd like to be able to get up to v3.9. Eventually I have no choice! Database is SQL.

Any suggestions on a course of action? Is my inexperience causing me to miss something here? Unfortunately, my mentor retired 3 years early and it has left me as the only plant resource for OT management. Trial by fire! Thank you for any suggestions!!!

Filipp Sudanov

So, if I understand correctly, the error "Broadcast Storm ended duration ..." was produced by some other equipment, not NetXMS.

First of all, it would be good to understand how the monitoring is performed. How many nodes are there in NetXMS? How many nodes per production plant? How are these nodes being polled: is it SNMP or something else? Are there any NXSL scripts used for monitoring?

The recommendation would be to set up a test environment: just another computer with NetXMS that would be monitoring, e.g., just one of your plants, or maybe just some of its nodes.

There should not be too much difference in how 3.9 is polling the devices, but it could be that e.g. there are some additional SNMP requests performed on each status poll.

The other useful tool is Wireshark. You can install it on the same machine where NetXMS is running. It will capture all network packets (you can limit it, e.g., to one IP address), so you can see how many packets there are per minute. At least for SNMP, if there are a lot of excessive packets, you can look inside and see exactly which OID is being requested.

csharve

Thank you for the response Filipp.

Yes, the Broadcast Storm messages are from the units that run our process. According to the specs, these units see a Broadcast Storm as 120 packets within one second.
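To compare a packet capture against that figure, a minimal sketch along these lines might help. It takes a list of packet arrival times in seconds (for example, exported from a Wireshark capture) and reports the largest number of packets seen in any one-second window. The function names are hypothetical, and two things are assumptions rather than vendor facts: that the limit is reached at exactly 120 packets, and that every captured packet counts toward it (the spec may only count broadcasts).

```python
from collections import deque

# Assumption: the DCS units flag a storm at 120 packets within one second.
# Whether the real limit is ">= 120" or "> 120", and which packet types
# count, is not known from the vendor spec quoted in this thread.
STORM_THRESHOLD = 120


def max_packets_per_window(timestamps, window=1.0):
    """Return the largest number of packets in any span shorter than `window` seconds."""
    recent = deque()
    peak = 0
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop packets that are a full window (or more) older than the newest one.
        while ts - recent[0] >= window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak


def looks_like_storm(timestamps, threshold=STORM_THRESHOLD):
    """True if any one-second window holds at least `threshold` packets."""
    return max_packets_per_window(timestamps) >= threshold
```

Feeding it only the timestamps of packets addressed to one DCS unit (or only broadcast frames) would show whether the NetXMS server alone could push a unit past the limit.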

As far as monitoring goes, we have around 150 nodes across our site, nodes are polled using SNMP and no NXSL scripts.

The strange part is, we do not monitor these units in NetXMS, they are monitored through our DCS. I did not expect v3.9 to start polling everything across the network. I thought since this was an upgrade, using the existing DB, the settings (particularly Network discovery turned off) would remain the same. It seems to me that Network Discovery is scanning all of our systems, though my understanding of these concepts is minimal!

The problem with a test environment to monitor one of our plants is this: the moment I install NetXMS and the Core server starts up, the broadcast alarms begin. I don't even have a chance to install the management server. In our process, the risk of shutting down the process units is too great to chance it. I'm forced to stop the core server.

Wireshark is the way to go, it is already installed on the machine where NetXMS is running, but again, I'm risking a shutdown just by starting NetXMS.

I think my only solution is to create a one to one connection to where the database exists, so I can start NetXMS v3.9 without being connected to any other networks. Then I can run the management console and verify the Network Discovery configuration. Why the database was created on a different server is beyond me, I didn't work in my current role when NetXMS was installed.

Please forgive my lack of knowledge on this. While I have an IT background, this is my first role in a production environment where even subtle changes can have dire consequences within the DCS. In the past, mistakes meant someone couldn't connect to various systems. Now, a mistake could shut down a chemical process. Needless to say, my stress levels are a tad bit higher!!!

Thank you for your input, and if you have any other suggestions I would love to hear them. I truly appreciate it!


Kevo

120 pps seems awfully low to trigger a broadcast storm warning. That's only about 3Mbps of traffic if my calcs are correct.

Is that 120 number restricted to only certain kinds of packets or was there any other specifics?
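As a rough sanity check on the 120 pps figure, the sketch below converts a packet rate into bandwidth for the smallest and largest standard untagged Ethernet frames (64 and 1518 bytes). The helper name is hypothetical, and the calculation ignores preamble and inter-frame gap, so treat the numbers as ballpark only.

```python
def throughput_mbps(packets_per_second, frame_bytes):
    """Convert a packet rate and frame size into megabits per second."""
    return packets_per_second * frame_bytes * 8 / 1_000_000


# 64 and 1518 bytes are the minimum and maximum untagged Ethernet frame sizes.
for frame_bytes in (64, 1518):
    rate = throughput_mbps(120, frame_bytes)
    print(f"120 pps at {frame_bytes} bytes/frame = {rate:.2f} Mbps")
```

With full-size frames this works out to roughly 1.5 Mbps, and far less for small SNMP or ARP packets, so the threshold is low by either estimate.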


csharve

Agreed. It seems much too low, but our DCS is 30 years old and despite upgrades over the years they are not exactly a state of the art provider these days. With that said, these units are also controlling the process, so what amount of traffic they are handling on that side, I do not know. According to the manufacturer, "the unit is protecting itself against what it views as a threat to processing stability". They are Pentium III for darn sake!

I have no specifics on what kind of packets.

I'm starting to think I will migrate the SQL database directly onto the server running NetXMS. Once I migrate the database, I should be able to disconnect the VM from the network, start the core service, connect to the local database, run the management console and adjust my poll settings.

Stupid question, but I am trying to avoid catastrophe: as long as the NetXMS core service is disabled, can I connect the VM back to the network without fear of this broadcast storm? In other words, if the core service isn't running, NetXMS will not poll any nodes, correct? Obviously, I cannot run the management console without the service running, but I just want to ensure NetXMS will not run when I connect back to the network.

Thank you for the help!

Filipp Sudanov

Yes, sure, if the NetXMS Core service is stopped, it won't poll anything.