Monitoring Hyper-V cluster on Windows Server 2008 R2 Enterprise ed.

Started by Marco Incalcaterra, July 30, 2012, 07:18:18 PM

Marco Incalcaterra

Hi Victor,

I have a Windows 2008 R2 cluster with a group of virtual machines managed through the Hyper-V hypervisor. The virtual machines move between cluster nodes depending on load and/or the maintenance state of the nodes.
I would like to create a pie chart in the dashboard showing the total number of machines in the "running", "turned off" and "saved" states for the entire cluster (not for each node). I have successfully exported the relevant values from each node through the WinPerf subagent.
When I tried to add a new parameter to the data collection configuration of the Cluster node, it was just "cloned" to each node, which is not what I need.
Is there a way to get a global value for each sampled parameter for the entire cluster (summing the values from every node)? This would let me monitor the cluster as a whole instead of each node.

Thank you for your help.

Best regards,
Marco

Victor Kirhenshtein

Hi!

You have to create an additional DCI on any node, which will accumulate values from all cluster nodes. You can create a DCI with source "Internal" and parameter "Dummy", and use the NXSL functions GetDCIValueByName or GetDCIValueByDescription in its transformation script to get values from the individual cluster nodes. Don't forget to turn off the trusted node check, or add the target node to the trusted nodes list on all cluster nodes (see also this page: http://wiki.netxms.org/wiki/SG:Security_Issues).

Best regards,
Victor

Marco Incalcaterra

Quote from: Victor Kirhenshtein on July 30, 2012, 10:51:12 PM

OK, now I have all the accumulated values properly collected on each node, but how can I put them in the dashboard as a single entity for the pie chart, so that if a node is down the values are taken from one of the other nodes? Is that possible?

Best regards,
Marco

Victor Kirhenshtein

You need only one such DCI, not one on each node. As its source is "Internal", it will be collected even if the node is down, because it does not involve any communication with the node - everything happens inside the server process. You can put this DCI on your dashboard. In more detail, you should have a configuration similar to this:

Cluster node A:
   "agent" DCI for running VMs
   "agent" DCI for stopped VMs
   "agent" DCI for saved VMs

Cluster node B:
   "agent" DCI for running VMs
   "agent" DCI for stopped VMs
   "agent" DCI for saved VMs

Cluster node C:
   "agent" DCI for running VMs
   "agent" DCI for stopped VMs
   "agent" DCI for saved VMs

Any node (could be any cluster node, the management server itself, or any other):
  "internal" DCI, parameter name "Dummy", for total running VMs
  "internal" DCI, parameter name "Dummy", for total stopped VMs
  "internal" DCI, parameter name "Dummy", for total saved VMs

  each of these DCIs should have a transformation script similar to this:

return GetDCIValueByName(FindNodeObject($node, "node-a"), "running_vms_parameter") +
   GetDCIValueByName(FindNodeObject($node, "node-b"), "running_vms_parameter") +
   GetDCIValueByName(FindNodeObject($node, "node-c"), "running_vms_parameter");


The only remaining problem is that when a cluster node is down and the last value of its parameter is not 0, you will get incorrect totals. You can add additional checks to the script, like this:


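// Return the last value of paramName on nodeName, or 0 if it cannot be trusted
sub getValue(nodeName, paramName)
{
   node = FindNodeObject($node, nodeName);
   if (node == null)
      return 0;   // node not found or not accessible from the current node

   // GetDCIObject expects a DCI ID, so resolve the parameter name first
   dciId = FindDCIByName(node, paramName);
   if (dciId == 0)
      return 0;   // no such DCI on this node

   dci = GetDCIObject(node, dciId);
   if ((dci->status != 0) || (dci->errorCount > 0))
      return 0;   // DCI is not active or its last polls failed

   return GetDCIValue(node, dciId);
}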

return getValue("node-a", "running_vms_parameter") +
   getValue("node-b", "running_vms_parameter") +
   getValue("node-c", "running_vms_parameter");
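
If the list of cluster nodes grows, the same check can be written once and applied in a loop. The following is only a sketch under a few assumptions: that the server's NXSL version supports array literals (%(...)) and foreach, that FindDCIByName is available to translate a parameter name into a DCI ID, and that "node-a", "node-b", "node-c" and "running_vms_parameter" are the same placeholders as above.

```nxsl
// Sum one parameter over several nodes, skipping nodes whose DCI is
// missing, disabled, or failing. All names below are placeholders.
sub getValue(nodeName, paramName)
{
   node = FindNodeObject($node, nodeName);
   if (node == null)
      return 0;   // not found, or blocked by the trusted node check

   dciId = FindDCIByName(node, paramName);   // 0 when no such DCI exists
   if (dciId == 0)
      return 0;

   dci = GetDCIObject(node, dciId);
   if (dci == null)
      return 0;
   if ((dci->status != 0) || (dci->errorCount > 0))
      return 0;   // DCI not active, or recent polls failed

   value = GetDCIValue(node, dciId);
   if (value == null)
      return 0;   // nothing collected yet
   return value;
}

total = 0;
foreach(n : %("node-a", "node-b", "node-c"))
{
   total = total + getValue(n, "running_vms_parameter");
}
return total;
```

With this layout, the scripts for the "stopped" and "saved" totals would differ only in the parameter name passed to getValue.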


Best regards,
Victor

Marco Incalcaterra

Quote from: Victor Kirhenshtein on July 31, 2012, 12:22:58 PM
You need only one such DCI, not one on each node. As its source is "Internal", it will be collected even if the node is down, because it does not involve any communication with the node - everything happens inside the server process. You can put this DCI on your dashboard. In more detail, you should have a configuration similar to this:

Hi Victor,

got it! Thank you!

I have another question. As far as I understand, the cluster container can have DCI objects, and those objects are propagated to all the nodes. I think it would be useful to be able to choose whether they are propagated or not. In the latter case, this kind of transformation could be set directly on a DCI of the cluster container instead of somewhere else. IMHO, information related to the cluster should be tied to the cluster container, not to any other node.

Best regards,
Marco

Victor Kirhenshtein

Currently cluster objects work like a special kind of template - DCIs defined on them are applied to the underlying nodes, exactly as from a template. If cluster objects also have active DCIs collected on their own, it will somewhat break the current concept of node objects being the only objects where data collection occurs. Also, there are cases when you don't have a cluster object but still need some common values - for example, the total number of users connected to a farm of terminal servers. Using one of the nodes is not very elegant or intuitive, as such values relate to the node group, not to only one node. One possible solution would be to add data collection capabilities to container objects as well. Or we can introduce some new object class for such DCIs. Another question is thresholds on such parameters - what object should be the event source in case of a threshold violation? I would like to arrive at a clear and elegant solution. Any thoughts?

Best regards,
Victor

Marco Incalcaterra

Quote from: Victor Kirhenshtein on July 31, 2012, 11:44:26 PM

For my specific case, adding DCIs to containers could be good and generic enough. Since templates already exist, I don't know if it is a good idea to have the same mechanism replicated in the Cluster object.

Still in my case, a threshold like "number of VMs that are in off state > X" would not break any rule, because it is still relevant to the value handled (transformed) by the parameter I defined. If I need to monitor the number of VMs running on a specific node, I could add a specific DCI object to that node (as I did in my first approach).

I really don't know if removing template behavior from clusters and adding DCI collection to all containers (maybe only with Internal origin) is a smart solution; at the moment I cannot see other cases that would not be properly handled by this approach. You have a wider view of the entire system: do you know of any "strange" situation that cannot be handled this way?

Best regards,
Marco.

Marco Incalcaterra

Quote from: Jmp_3f8h on August 01, 2012, 11:15:51 AM
Still in my case, a threshold like "number of VMs that are in off state > X" would not break any rule, because it is still relevant to the value handled (transformed) by the parameter I defined. If I need to monitor the number of VMs running on a specific node, I could add a specific DCI object to that node (as I did in my first approach).

I forgot to mention that in this scenario the event source for the alarm should be the container and not a specific node. But I don't know if promoting a container to an alarm source has other side effects.

Marco

Victor Kirhenshtein

Quote from: Jmp_3f8h on August 01, 2012, 01:17:39 PM
I forgot to mention that in this scenario the event source for the alarm should be the container and not a specific node. But I don't know if promoting a container to an alarm source has other side effects.

This can cause some side effects. For example, you currently have the $node variable in filtering scripts in the event processing policy. If container and/or cluster objects can be event sources as well, you will have to check which object class you are dealing with before accessing object properties, for example. It will be quite a complicated change, mostly because a lot of places in the code will need to be checked.
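
To illustrate the kind of guard such a change would require, here is a purely hypothetical filtering script. It assumes the event source would be exposed through a variable named $object (a name chosen for this sketch, not a feature confirmed in this thread), that classof returns "Node" for node objects, and that isAgent is a node-only attribute.

```nxsl
// Hypothetical event processing policy filter.
// $object is an ASSUMED variable holding the event source, whatever its class.
if ($object == null)
   return false;

if (classof($object) != "Node")
   return false;   // container or cluster: node-only attributes are unavailable

// From here on it is safe to use node-specific attributes.
return $object->isAgent;
```

Every existing script that reads node attributes directly would need a check of this kind, which is why the change touches so many places.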

Best regards,
Victor

Marco Incalcaterra

Quote from: Victor Kirhenshtein on August 01, 2012, 02:31:57 PM

I was thinking about a $container variable to be checked for events generated by those objects. Probably a superclass object $some_entity, extended by $node and $container, could replace the current $node in most places. But even without considering the implementation effort (heavy, as far as I understand), the real problem is that I'm not totally convinced that such an approach would be 100% semantically correct.

Marco

PS Maybe my comments are totally foolish, since I don't know how it is currently implemented. Sorry in advance! :)