Flexible Operations Management

Dashboard as a Hierarchy of Services

With RealOpInsight, any monitored platform is viewed as a hierarchy of business services built upon a service dependency tree. A service dependency tree, illustrated in the figure below, comprises two kinds of services:

  • IT Services: An IT service is a service linked to a basic IT capability (e.g. a process). Located at the lowest level of the service hierarchy, every IT service is associated with a probe in the underlying monitoring system. A probe, also called a data point in RealOpInsight terminology, can be a Nagios check, a Zabbix trigger, a Zenoss component, a Pandora FMS module, or any equivalent item exposed by the backend monitoring system. The statuses of IT services are updated periodically from status data retrieved from the underlying monitoring systems.
  • Business Services: Also called a business process, a business service may be an application or any high-level service that provides added value to end users or to other applications (e.g. operating systems, network, storage, Web applications, database services, download services, etc.). Within a RealOpInsight service hierarchy, a business service may depend on one or more subservices, and its subservices can be IT services, other business services, or both.

Note

Bear in mind that, to ensure the consistency of a business service view, an IT service must not have subservices. Additionally, every IT service must be linked to an actual probe within the underlying monitoring systems. Read the section Determining Service Status to learn how event severities are handled and propagated within the service hierarchy.
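The two consistency constraints stated in this note can be sketched as a small tree check. This is only an illustrative model, not RealOpInsight's own data structure: the Service class, the validate function and the probe identifiers are hypothetical, and the sketch assumes that any leaf of the tree is meant to be an IT service.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """A node of the service dependency tree (hypothetical model)."""
    name: str
    probe: Optional[str] = None  # data point identifier; set only for IT services
    subservices: List["Service"] = field(default_factory=list)

    @property
    def is_it_service(self) -> bool:
        return self.probe is not None


def validate(service: Service) -> List[str]:
    """Return the consistency problems found in the tree rooted at `service`."""
    problems = []
    if service.is_it_service and service.subservices:
        problems.append(f"IT service '{service.name}' must not have subservices")
    if not service.is_it_service and not service.subservices:
        # Assumption: a leaf is an IT service, so it needs a probe.
        problems.append(f"leaf '{service.name}' is not linked to a probe")
    for child in service.subservices:
        problems.extend(validate(child))
    return problems


# Probe identifiers below are made up for the example.
website = Service("My Website", subservices=[
    Service("Load Balancer", subservices=[
        Service("balancer process", probe="lb01/check_procs"),
    ]),
    Service("Web Server1", subservices=[
        Service("http check", probe="web01/check_http"),
    ]),
])
print(validate(website))  # an empty list means the tree is consistent
```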

../_images/hierarchical-business-services.png

Illustration of a business service hierarchy

Delegating Management

Delegating management is a requirement for monitoring effectively and at scale, and RealOpInsight has been designed with that in mind. Delegation enables administrators and operations managers to:

  • Monitor effectively at scale by dividing your IT into subsets of resources managed by separate groups of users.
  • Separate the management of your critical applications from that of services with lower business value.
  • Delegate the management of critical services to users who are able to address the related problems quickly.
  • Manage multi-tenant IT environments shared by organizations or enterprise departments.

To learn how this works in RealOpInsight, read the documentation related to user and view management.

Determining Service Status

Since a service may depend on several subservices, and since each subservice may in turn depend on other subservices, the severity status of each service is calculated from the severity statuses of its direct subservices. This is done on the basis of status propagation rules, status calculation rules, and weighting. To achieve that, the properties of each service (severity status, weight, status propagation rule, status calculation rule) define how the service and its subservices interpret each other’s severity status.

../_images/severity-weight-propagation.png

Each service propagates to its parent a severity and a weight associated with that severity. The overall status of the parent service is computed using a calculation algorithm that aggregates the severities and the weights propagated by the subservices.

Severity Status Model

When RealOpInsight retrieves status data from the underlying monitoring systems, the severities and their details (e.g. event messages) are extracted, processed and injected into the RealOpInsight Incident Management Engine. Since RealOpInsight supports heterogeneous monitoring systems with distinct severity models, the Incident Management Engine provides a unified severity model onto which the severity models of the underlying monitoring systems are mapped. Hence, every severity status that enters the Incident Management Engine is converted to its equivalent in the RealOpInsight Severity Model.

The RealOpInsight Severity Model comprises five severity levels (NORMAL, MINOR, MAJOR, CRITICAL, UNKNOWN), which are listed with their equivalents in the following table.

Severity   Nagios State   Zabbix Severity        Zenoss Severity
--------   ------------   --------------------   ---------------
NORMAL     OK             CLEAR                  CLEAR
MINOR      -              INFORMATION, WARNING   DEBUG
MAJOR      WARNING        AVERAGE                WARNING
CRITICAL   CRITICAL       HIGH, DISASTER         ERROR, CRITICAL
UNKNOWN    UNKNOWN        NOT CLASSIFIED         -
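The conversion described by this table can be sketched as a lookup. The snippet below is an illustration under assumptions, not the engine's actual code: the names Severity, SEVERITY_MAP and to_unified are hypothetical, the numeric values follow the convention given later under Status Calculation Rules (Normal=0 through Unknown=4), and falling back to UNKNOWN for unmapped values is an assumption.

```python
from enum import IntEnum


class Severity(IntEnum):
    NORMAL = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3
    UNKNOWN = 4


# Native severity names per monitoring system, as read from the table.
SEVERITY_MAP = {
    "nagios": {
        "OK": Severity.NORMAL,
        "WARNING": Severity.MAJOR,
        "CRITICAL": Severity.CRITICAL,
        "UNKNOWN": Severity.UNKNOWN,
    },
    "zabbix": {
        "CLEAR": Severity.NORMAL,
        "INFORMATION": Severity.MINOR,
        "WARNING": Severity.MINOR,
        "AVERAGE": Severity.MAJOR,
        "HIGH": Severity.CRITICAL,
        "DISASTER": Severity.CRITICAL,
        "NOT CLASSIFIED": Severity.UNKNOWN,
    },
    "zenoss": {
        "CLEAR": Severity.NORMAL,
        "DEBUG": Severity.MINOR,
        "WARNING": Severity.MAJOR,
        "ERROR": Severity.CRITICAL,
        "CRITICAL": Severity.CRITICAL,
    },
}


def to_unified(source: str, native: str) -> Severity:
    """Convert a native severity name to the unified model.

    Assumption: anything not listed in the table maps to UNKNOWN.
    """
    return SEVERITY_MAP.get(source.lower(), {}).get(native.upper(), Severity.UNKNOWN)


print(to_unified("nagios", "WARNING").name)  # MAJOR
print(to_unified("zabbix", "WARNING").name)  # MINOR
```

Note that the same native name can map to different unified levels: WARNING is MAJOR for Nagios and Zenoss but MINOR for Zabbix.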

Service Weighting

Weighting enables you to associate a weight factor with a service. The weight factor can be any real number between min=0 and max=10, determining how important the service is to its parent service compared to other sibling services.

  • A service without an assigned weight factor is assumed to have a weight equal to 1 (default weight).
  • A service having a weight factor equal to zero (min=0) is assumed to be a service that doesn’t have any impact on its parent service (neutral service). Its status is ignored when computing the status of its parent service.
  • A service having the maximum weight factor (max=10) is assumed to be essential to its parent service. Being an essential service means that if the service is down or unavailable, the parent cannot operate properly, so the parent's status should be set to down too. For example, the load balancer in the website architecture presented above is an essential service: if the load balancer is down, the website will be unavailable too. Conversely, the web servers taken individually are not essential, since they are replicated.

Note

In practice, weighting is used in conjunction with severity calculation rules and severity propagation rules when computing the overall status of a service according to the individual statuses of its subservices.
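As a minimal sketch of the three weighting cases above (default, neutral, essential), the helper below normalizes a weight and names its role. The function names and the "regular" label are hypothetical; only the bounds 0 and 10 and the default of 1 come from the text.

```python
MIN_WEIGHT, DEFAULT_WEIGHT, MAX_WEIGHT = 0.0, 1.0, 10.0


def normalize_weight(weight=None) -> float:
    """Return the effective weight, applying the default when none is assigned."""
    if weight is None:
        return DEFAULT_WEIGHT
    if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
        raise ValueError(f"weight must be between {MIN_WEIGHT} and {MAX_WEIGHT}")
    return float(weight)


def role(weight=None) -> str:
    """Classify a subservice by its weight (labels are illustrative)."""
    w = normalize_weight(weight)
    if w == MIN_WEIGHT:
        return "neutral"    # ignored when computing the parent's status
    if w == MAX_WEIGHT:
        return "essential"  # its severity directly constrains the parent
    return "regular"


print(role(), role(0), role(10), role(2.5))  # regular neutral essential regular
```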

Status Propagation Rules

A propagation rule defines how the status of a service shall be propagated to its parent service. RealOpInsight supports the following rules:

Decreased
The severity of the service is decreased before being propagated to its parent service.
Increased
The severity of the service is increased before being propagated to its parent service.
Unchanged
The severity of the service is propagated as is to its parent service.

Note

In the service dependency hierarchy, the status of a given service is computed by aggregating the propagated severities of its subservices through the severity calculation rule defined for that service.
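The three propagation rules can be sketched as a one-level shift of the numeric severity. The text does not say by how much a severity is increased or decreased, nor what happens at the ends of the scale, so the sketch assumes a shift of one level, kept within Minor..Critical, and leaves Normal and Unknown unchanged. The names Propagation and propagate are hypothetical.

```python
from enum import Enum

NORMAL, MINOR, MAJOR, CRITICAL, UNKNOWN = range(5)


class Propagation(Enum):
    DECREASED = -1
    UNCHANGED = 0
    INCREASED = 1


def propagate(severity: int, rule: Propagation) -> int:
    """Return the severity that a service passes on to its parent.

    Assumptions: the shift is one level, problem severities stay within
    Minor..Critical, and Normal and Unknown are never altered.
    """
    if severity in (NORMAL, UNKNOWN):
        return severity
    return max(MINOR, min(CRITICAL, severity + rule.value))


print(propagate(MAJOR, Propagation.INCREASED))  # 3 (Critical)
print(propagate(MAJOR, Propagation.DECREASED))  # 1 (Minor)
print(propagate(MAJOR, Propagation.UNCHANGED))  # 2 (Major)
```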

Status Calculation Rules

A calculation rule defines how the severity of a service shall be computed according to the weights and the severities propagated by its direct subservices. RealOpInsight supports the following rules:

Worst/Most Critical Severity
The status of the service is determined by the highest (most critical) severity propagated by its direct subservices.
Weighted Severity

The status of the service is determined by the maximum between the weighted average of the severities propagated by non-essential subservices, and the maximum of the severities propagated by essential subservices.

Formally, consider a service S with n subservices that propagate the severities s1, s2, ..., sn along with the weights w1, w2, ..., wn, respectively. Let max_essential_severity be the maximum of the severities propagated by essential subservices, nonessential_weighted_average_severity be the weighted average of the severities propagated by non-essential subservices, and overall_severity be the overall severity of S computed from the weights and severities propagated by its direct subservices.

In general, the weighted average of the severities s1, s2, ..., sn propagated by n subservices with the respective weights w1, w2, ..., wn is given by the following expression:

weighted_average = ROUND ( (w1*s1 + w2*s2 + ... + wn*sn) / (w1 + w2 + ... + wn) )

The overall severity of a service can be determined as follows:

overall_severity = MAX (nonessential_weighted_average_severity, max_essential_severity)

Note

In practice, to enable the evaluation of these expressions, each severity is associated by convention with a non-negative integer: Normal=0, Minor=1, Major=2, Critical=3, Unknown=4.
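The Weighted Severity rule can be sketched in a few lines using the integer convention from this note. The sketch rests on several assumptions that the text does not settle: ROUND is taken as rounding half up (Python's built-in round rounds halves to even, so it is avoided here), a service with no non-essential subservices gets an average of 0, and subservices with weight 0 are skipped as neutral. The function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

ESSENTIAL_WEIGHT = 10  # the maximum weight marks an essential subservice


def round_half_up(x: float) -> int:
    """Round to the nearest integer, halves going up (assumed meaning of ROUND)."""
    return math.floor(x + 0.5)


def weighted_severity(children: Iterable[Tuple[int, float]]) -> int:
    """Compute overall_severity from (severity, weight) pairs of direct subservices."""
    max_essential = 0
    weighted_sum = 0.0
    total_weight = 0.0
    for severity, weight in children:
        if weight >= ESSENTIAL_WEIGHT:
            max_essential = max(max_essential, severity)
        elif weight > 0:  # weight 0 means neutral: ignored
            weighted_sum += weight * severity
            total_weight += weight
    average = round_half_up(weighted_sum / total_weight) if total_weight else 0
    return max(average, max_essential)


# One essential subservice in Normal, three non-essential ones with mixed weights.
print(weighted_severity([(0, 10), (3, 1), (2, 1), (0, 2)]))  # ROUND(5/4) = 1 (Minor)
# The same non-essential subservices, but the essential one is Critical.
print(weighted_severity([(3, 10), (3, 1), (2, 1), (0, 2)]))  # 3 (Critical)
```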

Weighted Severity with Thresholds

This rule extends the Weighted Severity rule described above with the ability to escalate a severity when given thresholds of similar events are reached. For example, you may want to escalate the severity from Major to Critical if more than 50% of services are in Major state.

When using weighted severity with thresholds, the overall severity of a service is the maximum between the weighted average of the severities propagated by its direct subservices, and the maximum of the severities generated by thresholds. The weighted severity average is computed in the same way as in the Weighted Severity rule presented above.

The threshold evaluation is weighted. If a service S depends on two subservices in the states s1 and s2, with the weights w1 and w2 respectively, then the percentage of subservices of S in state s1 is given by w1 / (w1 + w2), while the percentage of subservices of S in state s2 is given by w2 / (w1 + w2).

Note

Given that a service can have more than one threshold rule defined, the rules are evaluated starting with the rule that has the highest resulting severity. For example, suppose we have the following threshold rules: (R1) 50% Minor => Major, and (R2) 100% Minor => Critical. The rule R2 will be evaluated before the rule R1, since Critical is higher than Major. If the evaluation of (R2) fails, i.e. if fewer than 100% of events are Minor, then (R1) will be evaluated.
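The weighted threshold evaluation and the rule ordering described in this note can be sketched as follows. The rule representation (percent, watched severity, resulting severity) and the function name are hypothetical. The sketch also assumes that a rule fires when the weighted share is greater than or equal to its percentage, which matches the 100% example above but is not stated explicitly, and that the caller passes only the subservices that should be counted.

```python
from typing import List, Tuple

NORMAL, MINOR, MAJOR, CRITICAL, UNKNOWN = range(5)

# A rule (percent, watched, result) reads: "percent % of `watched` => `result`".
Rule = Tuple[float, int, int]


def threshold_severity(children: List[Tuple[int, float]], rules: List[Rule]) -> int:
    """Return the severity produced by the first matching rule, or NORMAL if none fires.

    `children` holds (severity, weight) pairs; rules are tried from the
    highest resulting severity down to the lowest.
    """
    total = sum(weight for _, weight in children)
    if total == 0:
        return NORMAL
    for percent, watched, result in sorted(rules, key=lambda r: r[2], reverse=True):
        share = 100.0 * sum(w for s, w in children if s == watched) / total
        if share >= percent:  # assumption: reaching the threshold is enough
            return result
    return NORMAL


R1 = (50, MINOR, MAJOR)
R2 = (100, MINOR, CRITICAL)

# Two of three equally weighted subservices are Minor: R2 fails, R1 fires.
print(threshold_severity([(MINOR, 1), (MINOR, 1), (NORMAL, 1)], [R1, R2]))  # 2 (Major)
# All subservices are Minor: R2 fires first.
print(threshold_severity([(MINOR, 1), (MINOR, 2)], [R1, R2]))  # 3 (Critical)
```

The value returned here is only the threshold part; under this rule it would still be combined with the weighted average by taking the maximum.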

Use Cases of Status Updates

Let us consider a business service dependency tree corresponding to the website architecture presented previously. In this architecture, we have a load balancer and three web servers working as replicated web request handlers behind the load balancer. Let My Website be the business service representing the health of the website, Load Balancer be the load balancer business service, and Web Server1, Web Server2 and Web Server3 be the business services representing the three web servers. Based on the associated business service tree, we present below four use cases to illustrate how status aggregation works.

Use Case 1: All Essential Services in Normal State

Consider that:

  • the load balancer is an essential service and is in the Normal state;
  • the other services (Web Server1, Web Server2 and Web Server3) are all non-essential and have the same weight w=1, while being respectively in the states Normal, Minor and Major;
  • the calculation rule used to compute the status of the website is Weighted Severity.

Hence, the maximum severity of essential services is Normal, while the weighted average of the severities propagated by non-essential subservices is:

nonessential_weighted_average_severity = ROUND ( (1*0 + 1*1 + 1*2) / (1 + 1 + 1) ) = 1 = Minor

Finally, the overall severity of the My Website service is MAX (Normal, Minor) = Minor.

Use Case 2: Essential Services in Non-Normal State

From the previous example, now consider that the load balancer is in Critical state and that the other services have the same statuses as previously (i.e. respectively Normal, Minor, Major).

Hence, the overall severity of the My Website service shall be equal to MAX (Critical, Minor) = Critical.

Use Case 3: Thresholds and Status Escalation

Now consider that:

  • the load balancer is still working as essential service and has the severity Normal;
  • the other services (Web Server1, Web Server2 and Web Server3) are all non-essential and have the same weight w=1, while being respectively in the states Normal, Minor and Minor;
  • the calculation rule used to compute the overall severity of the My Website service is Weighted Severity with Thresholds, with the following threshold rules defined: 50% of Minor => Major, 60% of Major => Critical.

Hence:

  • the maximum severity of essential services shall be Normal;

  • the weighted average of the severities of non-essential subservices shall be:

    nonessential_weighted_average_severity = ROUND ( (1*0 + 1*1 + 1*1) / (1 + 1 + 1) ) = ROUND (0.67) = 1 = Minor

  • the percentage of non-essential services in Minor state is 2/3 ~= 67%, meaning that the first threshold (50% of Minor) is exceeded. Therefore the status shall be escalated from Minor to Major.

  • The overall severity of the My Website service shall be equal to MAX (Normal, Minor, Major) = Major, due to the exceeded threshold.

Use Case 4: Thresholds, Weighting and Status Escalation

We consider in this use case that:

  • the load balancer is an essential service and is in Normal state;
  • the services Web Server1 and Web Server2 have the same weight w1=w2=1 and are both in Normal state;
  • the service Web Server3 has the weight w3=4 and is in Major state;
  • the calculation rule used to compute the overall severity of the My Website service is Weighted Severity with Thresholds, with the following threshold rules set: 50% of Minor => Major, 60% of Major => Critical.

Hence:

  • the maximum severity of essential services shall be Normal;

  • the weighted average of the severities of non-essential subservices shall be:

    nonessential_weighted_average_severity = ROUND ( (1*0 + 1*0 + 4*2) / (1 + 1 + 4) ) = ROUND (1.33) = 1 = Minor

  • the weighted percentage of non-essential services in Major state is 4/6 ~= 67%, meaning that the second threshold (60% of Major) is exceeded. Hence the status shall be escalated from Major to Critical.

  • The overall severity of the My Website service shall be equal to MAX (Normal, Minor, Critical) = Critical, due to the exceeded threshold.
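The four use cases can be replayed with a compact sketch that combines the weighted average, the essential maximum and the threshold rules. It is an illustration under assumptions rather than RealOpInsight's implementation: ROUND is taken as rounding half up, a rule fires when the weighted share reaches its percentage, thresholds are computed over non-essential subservices only (as in the use cases), and the names overall_severity, load_balancer and RULES are hypothetical.

```python
import math

NORMAL, MINOR, MAJOR, CRITICAL, UNKNOWN = range(5)
ESSENTIAL = 10  # the maximum weight marks an essential subservice


def overall_severity(children, rules=()):
    """children: (severity, weight) pairs; rules: (percent, watched, result) triples."""
    essential = [s for s, w in children if w >= ESSENTIAL]
    regular = [(s, w) for s, w in children if 0 < w < ESSENTIAL]
    total = sum(w for _, w in regular)
    # Weighted average of non-essential severities, rounded half up.
    average = math.floor(sum(s * w for s, w in regular) / total + 0.5) if total else NORMAL
    # Threshold rules, tried from the highest resulting severity down.
    escalated = NORMAL
    for percent, watched, result in sorted(rules, key=lambda r: r[2], reverse=True):
        share = 100 * sum(w for s, w in regular if s == watched) / total if total else 0
        if share >= percent:
            escalated = result
            break
    return max([average, escalated] + essential)


def load_balancer(severity):
    return (severity, ESSENTIAL)


RULES = [(50, MINOR, MAJOR), (60, MAJOR, CRITICAL)]

case1 = overall_severity([load_balancer(NORMAL), (NORMAL, 1), (MINOR, 1), (MAJOR, 1)])
case2 = overall_severity([load_balancer(CRITICAL), (NORMAL, 1), (MINOR, 1), (MAJOR, 1)])
case3 = overall_severity([load_balancer(NORMAL), (NORMAL, 1), (MINOR, 1), (MINOR, 1)], RULES)
case4 = overall_severity([load_balancer(NORMAL), (NORMAL, 1), (NORMAL, 1), (MAJOR, 4)], RULES)
print(case1, case2, case3, case4)  # 1 3 2 3, i.e. Minor, Critical, Major, Critical
```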

Contextual Event Messages

Contextual event messages allow you to set specific messages to show on the operations console when events occur or are resolved. When events occur, monitoring systems often raise generic messages without any contextual information (e.g. hostname), and often in a language that your operations staff are not familiar with. To help your operations staff in such situations, RealOpInsight lets you define the messages shown on the operations console when events occur or are resolved, so that they are easier to understand and more useful.

For example, assume that for monitoring the root partition of a database server, we have the following definition in our Nagios configuration:

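define service{
   use                  local-service              ; inherit settings from the local-service template
   host_name            mysql-server               ; host on which the check runs
   service_description  Root Partition
   check_command        check_local_disk!30%!10%!  ; warning below 30% free, critical below 10% free
}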

If the free space on that partition drops below 30% (warning threshold), Nagios reports an alert such as DISK WARNING - free space: / 58483 MB (28% inode=67%).

However, instead of this basic message, RealOpInsight allows you to show a more readable message such as The free space in the root partition of the machine mysql-server is less than 30%, by setting the following template message in the RealOpInsight Editor: The free space in the root partition of the machine {hostname} is less than {threshold}, where {hostname} and {threshold} are contextual tags provided by RealOpInsight. They are automatically replaced at runtime with contextual information. See the Contextual Tags section for the complete list of supported tags.

Contextual messages are set when editing the description file of your business views. Read the RealOpInsight Editor User Manual for more details.

Contextual Tags

Contextual messages can include one or more contextual tags from the following list. Warning: the curly braces are part of the tag name and are required.

  • {hostname}: replaced with the hostname of the machine to which the incident is related.
  • {threshold}: replaced with the threshold defined in the check command. Currently this tag is only supported for Nagios.
  • {plugin_output}: replaced with the native message returned by the command (e.g. PING ok - Packet loss = 0%, RTA = 0.80 ms).
  • {check_name}: replaced with the name of the check component (e.g. check_local_disk).
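Tag substitution of this kind can be sketched with a regular expression that only recognizes the four tags listed above. This is an illustration of the idea, not the way RealOpInsight performs the replacement: the render function and the context dictionary are hypothetical, and leaving a tag untouched when no value is available is an assumption.

```python
import re

SUPPORTED_TAGS = ("hostname", "threshold", "plugin_output", "check_name")
_TAG = re.compile(r"\{(%s)\}" % "|".join(SUPPORTED_TAGS))


def render(template: str, context: dict) -> str:
    """Replace supported tags with values from `context`.

    Assumption: a supported tag with no value in `context` is left as is,
    and anything else in braces is not treated as a tag.
    """
    return _TAG.sub(lambda m: str(context.get(m.group(1), m.group(0))), template)


message = render(
    "The free space in the root partition of the machine {hostname} is less than {threshold}",
    {"hostname": "mysql-server", "threshold": "30%"},
)
print(message)
# The free space in the root partition of the machine mysql-server is less than 30%
```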