...
A monitoring system needs to meet the expected requirements. The first thing you, as the system/network administrator, need to do is get management buy-in for deploying a supervisory monitoring and data acquisition system that meets corporate goals. The second is to define the scope of the monitoring system and its particularities.
...
- For each status or performance data item, determine whether it meets the scope and goals of the project.
How
...
Shinken Enterprise is
...
scalable
Shinken can scale out horizontally on multiple servers or vertically with more powerful hardware. Shinken deals automatically with distributed status retention. There is also no need to use external clustering or HA solutions.
...
- Number of active checks per second (the type of active check has a major impact)
- Number of check results per second (hosts and checks combined)
And, to a lesser extent, performance data: it is not expected to overload a Graphite server instance, since a single server with a hardware RAID 10 of SSD disks can handle up to 80K updates per second with millions of metrics.
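As a rough illustration of the sizing factors above, the following sketch estimates the active check rate and the resulting metric rate, then compares it with the 80K updates per second figure quoted for Graphite. The deployment numbers (30,000 checks, a 300 second interval, 4 metrics per check) are hypothetical, and the even spreading of checks over the interval is a simplifying assumption rather than a description of the scheduler.

```python
GRAPHITE_UPDATES_PER_SECOND = 80_000  # figure quoted in the text for a single server


def checks_per_second(num_checks: int, interval_seconds: float) -> float:
    """Average check rate, assuming checks are spread evenly over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_checks / interval_seconds


def metrics_per_second(check_rate: float, metrics_per_check: float) -> float:
    """Average metric rate produced by the given check rate."""
    return check_rate * metrics_per_check


# Hypothetical deployment used only to illustrate the arithmetic.
rate = checks_per_second(num_checks=30_000, interval_seconds=300)
metric_rate = metrics_per_second(rate, metrics_per_check=4)
headroom = GRAPHITE_UPDATES_PER_SECOND / metric_rate

print(f"checks/s: {rate:.0f}, metrics/s: {metric_rate:.0f}, headroom: {headroom:.0f}x")
```

With these assumed numbers the metric rate stays far below the quoted Graphite capacity, which is why the check rates usually dominate the sizing.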
...
Active checks benefit from Shinken Enterprise's powerful availability algorithms for fault isolation and false positive elimination.
...
The broker is a key component of the scalable architecture. Only a single broker can be active per scheduler. A broker can process broks (messages) from multiple schedulers. In most modern deployments, Livestatus is the broker module that provides status information to the web frontends (Nagvis, Multisite, Thruk, etc.) or Shinken Enterprise's own WebUI module. The broker needs memory and processing power.
Dependency model
Shinken Enterprise has a great dependency resolution model. Automatic root cause isolation, at a host level, is one method that Shinken Enterprise provides. This is based on explicitly defined parent/child relationships. This means that on a check or host failure, Shinken Enterprise will automatically reschedule an immediate check of the parent(s). Once the root failure(s) are found, any children will be marked with an unknown status instead of soft down.
This model is very useful in reducing false positives. What needs to be understood is that it depends on defining a dependency tree. A dependency tree is restricted to a single scheduler. Shinken Enterprise provides a distributed architecture, which needs at least two trees to make sense.
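The parent/child isolation described above can be sketched as a walk up the dependency tree. This is a minimal illustration, not Shinken Enterprise's implementation: the `Host` class, the `is_up` probe (standing in for the immediate recheck of a parent), and the state names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Host:
    name: str
    parents: List[str] = field(default_factory=list)


def find_root_causes(hosts: Dict[str, Host], failed: str,
                     is_up: Callable[[str], bool]) -> Set[str]:
    """Walk up from a failed host; a root cause is a down host with no down parent."""
    roots, seen, stack = set(), set(), [failed]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Stand-in for the immediate recheck of each parent.
        down_parents = [p for p in hosts[name].parents if not is_up(p)]
        if down_parents:
            stack.extend(down_parents)
        else:
            roots.add(name)
    return roots


def mark_states(hosts: Dict[str, Host], roots: Set[str]) -> Dict[str, str]:
    """Mark root causes DOWN and everything below them UNKNOWN."""
    children: Dict[str, List[str]] = {name: [] for name in hosts}
    for host in hosts.values():
        for parent in host.parents:
            children[parent].append(host.name)
    states = {name: "UP" for name in hosts}
    for root in roots:
        states[root] = "DOWN"
    stack = [c for root in roots for c in children[root]]
    while stack:
        name = stack.pop()
        if states[name] == "UP":  # not yet visited
            states[name] = "UNKNOWN"
            stack.extend(children[name])
    return states


# Hypothetical tree: router -> switch -> web1, web2; backup has no parent.
hosts = {h.name: h for h in [
    Host("router"), Host("switch", ["router"]),
    Host("web1", ["switch"]), Host("web2", ["switch"]), Host("backup"),
]}
unreachable = {"router", "switch", "web1", "web2"}
roots = find_root_causes(hosts, "web1", lambda n: n not in unreachable)
states = mark_states(hosts, roots)
print(roots, states)
```

In this example a failure first noticed on `web1` is traced back to `router`, so only one host is reported as down and the hosts behind it are reported as unknown.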
...
Metrics, or performance data in Nagios parlance, are embedded in check results. A check result can have zero or more performance metrics associated with it.
These are transparently passed off to systems outside of Shinken Enterprise using a Broker module. The Graphite broker module can easily send more than 2000 metrics per second. We have not tested the upper limit. Graphite itself can be configured to reach upper bounds of 80K metrics per second.
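To make the idea of metrics embedded in a check result concrete, the sketch below parses a Nagios-style performance data string (`label=value[UOM];warn;crit;min;max`) and formats each metric as a Graphite plaintext protocol line (`path value timestamp`). The `host.service.label` path scheme and the function names are assumptions made for illustration; they are not taken from the Graphite broker module.

```python
import re
from typing import List

# One metric: optional single-quoted label, "=", numeric value, optional unit.
# Thresholds after the unit (;warn;crit;min;max) are deliberately ignored here.
PERFDATA_RE = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<plain>[^\s=';]+))"
    r"=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[a-zA-Z%]*)"
)


def sanitize(part: str) -> str:
    """Replace characters that would break a Graphite path with underscores."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", part).strip("_")


def to_graphite_lines(host: str, service: str, perfdata: str,
                      timestamp: int) -> List[str]:
    """Turn a perfdata string into Graphite plaintext lines (possibly none)."""
    lines = []
    for match in PERFDATA_RE.finditer(perfdata):
        label = match.group("quoted") or match.group("plain")
        # Hypothetical naming scheme: host.service.label
        path = ".".join(sanitize(p) for p in (host, service, label))
        lines.append(f"{path} {match.group('value')} {timestamp}")
    return lines


lines = to_graphite_lines(
    "web1", "HTTP", "time=0.042s;1;2;0 size=5120B;;;0", 1700000000
)
for line in lines:
    print(line)
```

A check result without performance data simply yields an empty list, which matches the "zero or more" case described above.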
...