Introduction
...
Shinken Enterprise supports optional detection of hosts and services that are flapping. Flapping occurs when a service or host changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can indicate configuration problems (e.g. thresholds set too low), troublesome services, or real network problems.
How Flap Detection Works
Whenever Shinken Enterprise checks the status of a host or service, it checks whether that host or service has started or stopped flapping. It does this by:
- Storing the results of the last 21 checks of the host or service
- Analyzing the historical check results and determining where state changes/transitions occur
- Using the state transitions to determine a percent state change value (a measure of change) for the host or service
- Comparing the percent state change value against low and high flapping thresholds
A host or check is determined to have started flapping when its percent state change first exceeds a high flapping threshold.
A host or check is determined to have stopped flapping when its percent state change goes below a low flapping threshold (assuming that it was previously flapping).
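The two rules above amount to a simple hysteresis check. The sketch below is only an illustration: the function name `update_flapping` and the threshold values 25% and 50% are assumptions, not Shinken Enterprise identifiers or defaults.

```python
def update_flapping(is_flapping: bool, percent_change: float,
                    low_threshold: float, high_threshold: float) -> bool:
    """Return the new flapping status after one check (illustrative only)."""
    if not is_flapping and percent_change > high_threshold:
        return True   # just started flapping
    if is_flapping and percent_change < low_threshold:
        return False  # just stopped flapping
    return is_flapping  # neither condition met: status is unchanged


# Hypothetical thresholds: low = 25%, high = 50%.
status = False
for pct in (10.0, 55.0, 40.0, 20.0):
    status = update_flapping(status, pct, 25.0, 50.0)
    print(pct, status)
```

Note that a value between the two thresholds (40% here) leaves the status as it was, which is what keeps the flapping state itself from oscillating.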
Example
...
The image below shows a chronological history of service states from the most recent 21 checks. OK states are shown in green, WARNING states in yellow, CRITICAL states in red, and UNKNOWN states in orange.
...
Let's look at the mechanics in more detail, using a check as the example.
The historical check results are examined to determine where state changes/transitions occur. A state change occurs when an archived state differs from the archived state that immediately precedes it chronologically. Since we keep the results of the last 21 service checks in the array, there can be at most 20 state changes. The value 20 can be changed in the main configuration file. In this example there are 7 state changes, indicated by the blue arrows in the image above.
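Counting state changes comes down to comparing each stored result with the one immediately before it. The sketch below assumes a 21-entry history held in a `collections.deque`; the names `HISTORY_SIZE` and `count_state_changes` are hypothetical and do not come from Shinken Enterprise.

```python
from collections import deque

HISTORY_SIZE = 21  # number of stored check results, as described in the text


def count_state_changes(history) -> int:
    """Count entries whose state differs from the entry just before them."""
    states = list(history)
    return sum(1 for prev, cur in zip(states, states[1:]) if cur != prev)


# The deque silently drops the oldest result once 21 entries are stored.
history = deque(maxlen=HISTORY_SIZE)
for state in ["OK", "OK", "WARNING", "OK", "CRITICAL", "CRITICAL"]:
    history.append(state)

print(count_state_changes(history))  # 3
```

With 21 stored results there are 20 adjacent pairs, which is why 20 is the maximum number of state changes.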
The flap detection logic uses the state changes to determine an overall percentage state change for the check. This is a measure of volatility/change for the service. Services that never change state will have a 0% state change value, while services that change state each time they're checked will have 100% state change. Most services will have a percentage state change somewhere in between.
When calculating the percentage state change for the check, the flap detection algorithm gives more weight to new state changes compared to older ones. Specifically, the flap detection routines are currently designed to make the newest possible state change carry 50% more weight than the oldest possible state change. The image below shows how recent state changes are given more weight than older state changes when calculating the overall or total percent state change for a particular service.
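The weighting can be sketched as follows. The exact curve is not given in this section, so the linear ramp from 0.8 to 1.2 is an assumption, chosen only because it makes the newest possible change weigh 50% more than the oldest while keeping the average weight at 1. The function name, the parameter names, and the convention that position 1 is the oldest possible change are likewise hypothetical.

```python
def weighted_percent_change(change_positions, possible_changes=20,
                            oldest_weight=0.8, newest_weight=1.2):
    """Weighted percent state change (illustrative, assumed linear ramp).

    change_positions: 1-based positions of observed state changes,
    where 1 is the oldest possible change and possible_changes the newest.
    """
    step = (newest_weight - oldest_weight) / (possible_changes - 1)
    total = sum(oldest_weight + (pos - 1) * step for pos in change_positions)
    return total / possible_changes * 100.0


# Seven changes out of 20 possible, mostly in the older half of the history.
positions = [3, 4, 5, 9, 12, 16, 19]
print(round(weighted_percent_change(positions), 1))  # 34.4
```

Because most of these changes sit in the older half of the history, the weighted value ends up a little below the unweighted 7/20 = 35%.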
Using the images above, let's do a calculation of percentage state change for the service. You will notice that there are a total of 7 state changes (at t3, t4, t5, t9, t12, t16, and t19). Without any weighting of the state changes over time, this would give us a total state change of 35%:
(7 observed state changes / 20 possible state changes) * 100 = 35 %
Since the flap detection logic gives newer state changes a higher weight than older state changes, the actual calculated percentage state change will be slightly less than 35% in this example. Let's say that the weighted percentage of state change turned out to be 31%...
...
If neither of those two conditions is met, the flap detection logic won't do anything else with the service, since it is either not currently flapping or it is still flapping.
Flap Detection for Checks
Shinken Enterprise checks to see if a service is flapping whenever the service is checked (either actively or passively).
The flap detection logic for services works as described in the example above.
Flap Detection for Hosts
Host flap detection works in a similar way to service flap detection, with one important difference: Shinken Enterprise will attempt to check to see if a host is flapping whenever:
...
Why is this done? With services we know that the minimum amount of time between consecutive flap detection routines is going to be equal to the service check interval. However, you might not be monitoring hosts on a regular basis, so there might not be a host check interval that can be used in the flap detection logic. Also, it makes sense that checking a service should count towards the detection of host flapping. Services are attributes of, or things associated with, hosts after all... In any case, that's the best method I could come up with for determining how often flap detection could be performed on a host, so there you have it.
Flap Detection Thresholds
Shinken Enterprise uses several variables to determine the percentage state change thresholds it uses for flap detection. For both hosts and services, there are global high and low thresholds and host- or service-specific thresholds that you can configure. Shinken Enterprise will use the global thresholds for flap detection if you do not specify host- or service-specific thresholds.
This screenshot shows the global and host- or check-specific variables that control the various thresholds used in flap detection.
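The fallback from specific to global thresholds can be pictured with the sketch below. The function and parameter names are hypothetical, and the sketch assumes that each threshold falls back independently, which this page does not confirm.

```python
from typing import Optional, Tuple


def effective_thresholds(global_low: float, global_high: float,
                         specific_low: Optional[float] = None,
                         specific_high: Optional[float] = None) -> Tuple[float, float]:
    """Pick the thresholds to use for one host or service (illustrative only)."""
    # Assumption: an unset specific threshold falls back to the global one.
    low = global_low if specific_low is None else specific_low
    high = global_high if specific_high is None else specific_high
    return low, high


# Hypothetical values: global 25/50, one service overriding only the high threshold.
print(effective_thresholds(25.0, 50.0))                      # (25.0, 50.0)
print(effective_thresholds(25.0, 50.0, specific_high=70.0))  # (25.0, 70.0)
```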
States Used For Flap Detection
Normally Shinken Enterprise will track the results of the last 21 checks of a host or service, regardless of the check result (host/service state), for use in the flap detection logic.
You can exclude certain host or service states from use in flap detection logic by using the "flap_detection_options" directive in your host or service definitions. This directive allows you to specify which host or service states (e.g. "UP", "DOWN", "OK", "CRITICAL") you want to use for flap detection. If you don't use this directive, all host or service states are used in flap detection.
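One way to picture the effect of excluding states is that results in an excluded state are simply never added to the history used for flap detection. This is an assumed reading of the directive, not a confirmed description of the Shinken Enterprise internals, and the names `record_result` and `tracked_states` are hypothetical.

```python
from collections import deque


def record_result(history, state, tracked_states=None) -> bool:
    """Store a check result only if its state is used for flap detection.

    tracked_states=None mimics the default where every state is used.
    Returns True when the result was stored.
    """
    if tracked_states is None or state in tracked_states:
        history.append(state)
        return True
    return False


# Hypothetical service that only tracks OK and CRITICAL for flap detection.
history = deque(maxlen=21)
for state in ["OK", "WARNING", "CRITICAL", "UNKNOWN", "OK"]:
    record_result(history, state, tracked_states={"OK", "CRITICAL"})

print(list(history))  # ['OK', 'CRITICAL', 'OK']
```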
Flap Handling
When a service or host is first detected as flapping, Shinken Enterprise will:
...
- Remove the block on notifications for the service or host (notifications will still be bound to the normal notification logic).
Enabling Flap Detection
In order to enable the flap detection features in Shinken Enterprise, you'll need to:
...


