This document describes how you can monitor an Oracle database server such as:
There are some steps you’ll need to follow in order to monitor a new database machine. They are:
To make your life a bit easier, a few configuration tasks have already been done for you:
We suppose here that the Oracle machine you want to monitor is named srv-lin-1 and is a linux. Please change the configuration and commands according with the real name of your server of course. |
You will need to configure the user for all your oracle databases. |
Connect to your database as sysadmin on the oracle server :
srv-lin-1:oracle# sqlplus "/ as sysdba" |
And then create your shinken account on the database:
CREATE USER shinken IDENTIFIED BY shinkenpassword;GRANT CREATE SESSION TO shinken; GRANT SELECT any dictionary TO shinken; GRANT SELECT ON V_$SYSSTAT TO shinken; GRANT SELECT ON V_$INSTANCE TO shinken; GRANT SELECT ON V_$LOG TO shinken; GRANT SELECT ON SYS.DBA_DATA_FILES TO shinken; GRANT SELECT ON SYS.DBA_FREE_SPACE TO shinken; |
And for old 8.1.7 database only:
---- if somebody still uses Oracle 8.1.7... GRANT SELECT ON sys.dba_tablespaces TO shinken; GRANT SELECT ON dba_temp_files TO shinken; GRANT SELECT ON sys.v_$Temp_extent_pool TO shinken; GRANT SELECT ON sys.v_$TEMP_SPACE_HEADER TO shinken; GRANT SELECT ON sys.v_$session TO shinken; |
To see if the connection to the database named PROD is ok, just launch :
/var/lib/shinken/libexec/check_oracle_health --connect "(DESCRIPTION =(ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = srv-lin-1)(PORT = 1521)))(CONNECT_DATA =(SID = PROD)))" --user "shinken" --password "shinkenpassword" --mode connection-time |
The parameters common to every Oracle server monitored shoud be setted in the host template "oracle-connection-method".
The parameters specifics to each server, such as database name for example, have to be done in the host's data itself.
It's a best practice to use the same database monitoring user on every Oracle server monitored. Doing so, you can configure the database user/password only once in the host template "oracle-connection-method". |
You can monitor multiple database on the same host in typing the name of the database separated by a comma. Exemple : DB1,DB2,DB3 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| tnsping | listener | |||
| process-usage | Percentage of max possible processes | 0% to 100% | 80 | 90 |
| sga-shared-pool-free | Free Memory in the Shared Pool | 0% to 100% | 10: | 5: |
| tablespace-usage | Used diskspace in the tablespace | 0% to 100% | 90 | 98 |
| corrupted-blocks | Number of corrupted blocks in database | 0 to n | 1 | 10 |
| invalid-objects | Sum of faulty Objects, Indices, Partitions | 0 to n | 0.1 | 0.1 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| datafile-io-traffic | Sum of IO-Operationes from Datafiles per second | n/sec | 1000 | 5000 |
| datafiles-existing | Percentage of max possible datafiles | 0% to 100% | 80 | 90 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| enqueue-contention | Enqueue wait/request-Ratio | 0% to 100% | 1 | 10 |
| enqueue-waiting | How many percent of the elapsed time since the last run has an Enqueue spend with waiting | 0% to 100% | 0.00033 | 0.00033 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| event-waits | Waits/sec from system events | n/sec | 10 | 100 |
| event-waiting | How many percent of the elapsed time has an event spend with waiting | 0% to 100% | 0.1 | 0.5 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| flash-recovery-area-usage | Used diskspace in the flash recovery area | 0% to 100% | 90 | 98 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| latch-contention | Latch misses/gets-ratio. With –name a Latchname or Latchnumber can be passed over. (See list-latches) | 0% to 100% | 1 | 2 |
| latch-waiting | How many percent of the elapsed time since the last run has a Latch spend with waiting | 0% to 100% | 0.1 | 1 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| pga-in-memory-sort-ratio | Percentage of sorts in the memory | 0% to 100% | 99: | 90: |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| switch-interval | Interval between RedoLog File Switches | 0 to n | 600: | 60: |
| retry-ratio | Retry-Rate in the RedoLog Buffer | 0% to 100% | 1 | 10 |
| redo-io-traffic | Redolog IO in MB/sec | n/sec | 199 | 200 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| rman-backup-problems | Number of RMAN-errors during the last three days | 0 to n | 1 | 2 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| roll-header-contention | Rollback Segment Header Contention | 0% to 100% | 1 | 2 |
| roll-block-contention | Rollback Segment Block Contention | 0% to 100% | 1 | 2 |
| roll-hit-ratio | Rollback Segment gets/waits Ratio | 0% to 100% | 99: | 98: |
| roll-extends | Rollback Segment Extends n | n/sec | 1 | 100 |
| roll-wraps | Rollback Segment Wraps n | n/sec | 1 | 100 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| sga-data-buffer-hit-ratio | Hitrate in the Data Buffer Cache | 0% to 100% | 98: | 95: |
| sga-library-cache-gethit-ratio | Hitrate in the Library Cache (Gets) | 0% to 100% | 98: | 95: |
| sga-library-cache-pinhit-ratio | Hitrate in the Library Cache (Pins) | 0% to 100% | 98: | 95: |
| sga-library-cache-reloads | Reload-Rate in the Library Cache | n/sec | 10 | 10 |
| sga-dictionary-cache-hit-ratio | Hitrate in the Dictionary Cache | 0% to 100% | 95: | 90: |
| sga-latches-hit-ratio | Hitrate of the Latches | 0% to 100% | 98: | 95: |
| sga-shared-pool-reloads | Reload-Rate in the Shared Pool | 0% to 100% | 1 | 10 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| soft-parse-ratio | Percentage of soft-parse-ratio | 0% to 100% | 90: | 98: |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| stale-statistics | Sum of objects with obsolete optimizer statistics | n | 10 | 100 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| sysstat | Changes/sec for any value from v$sysstat | n/sec | 10 | xx |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| tablespace-fragmentation | Free Space Fragmentation Index | 100 to 1 | 30: | 20: |
| tablespace-can-allocate-next | Checks if there is enough free tablespace for the next Extent | |||
| tablespace-remaining-time | Sum of remaining days until a tablespace is used by 100%. The rate of increase will be calculated with the values from the last 30 days. (With the parameter –lookback different periods can be specified) | Days | 90: | 30: |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| seg-top10-logical-reads | Sum of the userprocesses under the top 10 logical reads | n | 1 | 9 |
| seg-top10-physical-reads | Sum of the userprocesses under the top 10 physical reads | n | 1 | 9 |
| seg-top10-buffer-busy-waits | Sum of the userprocesses under the top 10 buffer busy waits | n | 1 | 9 |
| seg-top10-row-lock-waits | Sum of the userprocesses under the top 10 row lock waits | n | 1 | 9 |
| Check | Detail | Check range | Default Warning | Default Critical |
|---|---|---|---|---|
| session-usage | Percentage of max possible sessions | 0% to 100% | 80 | 90 |
| Connected users | Number of currently connected users | 0 to n | 50 | 100 |
Contains all Oracle host templates listed previously.
Using the Oracle-full template could lead to slow issues. You should use the really needed templates instead. |