These are what I consider to be some best-practice recommendations for monitoring.
#1 Define what is to be monitored.
- Is it a HOST?
- Is it a SERVICE?
#2 Determine what level of health monitoring is desirable.
- Basic Monitoring Probes (typically active checks)
- Simple TCP/IP port up/down
- Disk space, free memory, swap utilization, etc
- File exists (or does not exist, or is not of at least X size)
- Process is running/not running
- etc
- Log Monitoring (typically passive checks)
- System logs should be collected and analyzed.
- Application logs should be collected and analyzed.
- Any log monitoring software should issue alerts via the notification tool's API, so that notification, escalation, etc. are handled in one place.
- Application monitoring (combination of active/passive)
- Simple service monitoring: send a string to a port and expect a response back (which might be validated with a regex)
- Complex service monitoring which requires a custom plugin
- Services for which Nagios plugins already exist
- Passive alerts sent directly from the application
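As a rough illustration of the "send a string to a port and expect a response" style of active check, here is a minimal Python sketch. The function name `check_tcp_service` and its parameters are my own placeholders, not part of any existing plugin; the only outside convention it relies on is the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re
import socket

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_service(host, port, send=None, expect=None, timeout=5.0):
    """Run one active check against host:port and return (status, message).

    send   -- optional bytes to write after connecting
    expect -- optional regex (str) that the first reply must match
    With neither argument this is a plain port up/down probe.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if send is not None:
                sock.sendall(send)
            if expect is None:
                return OK, f"TCP OK - {host}:{port} accepts connections"
            # Read a single chunk; enough for short banner-style replies.
            reply = sock.recv(4096).decode("utf-8", errors="replace")
    except OSError as exc:  # refused, unreachable, timed out, ...
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"

    if re.search(expect, reply):
        return OK, f"TCP OK - response matched /{expect}/"
    return CRITICAL, f"TCP CRITICAL - unexpected response {reply!r}"
```

A wrapper script would print the message and exit with the status. Anything more involved than this (multi-step protocols, authentication) is where a custom plugin becomes worthwhile.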
#3 Determine what hosts and services require active checks.
- How often is the service/host checked?
- How is “flapping” handled? Flapping is when a host or service changes state (up/down) repeatedly across recent checks.
- Who is notified at failure?
- What is the notification escalation policy?
- How do you respond to alerts, quiesce the notification, etc?
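To make the flapping question concrete, here is a simplified sketch of flap detection based on the percentage of state changes over a window of recent checks. It is loosely modeled on the percent-state-change idea Nagios uses, but it is unweighted (Nagios, as I understand it, weights recent changes more heavily), and the class name, window size and thresholds are illustrative assumptions rather than recommended values.

```python
from collections import deque


class FlapDetector:
    """Flag a host or service as flapping from its recent check results."""

    def __init__(self, history=21, high_threshold=50.0, low_threshold=25.0):
        self.states = deque(maxlen=history)  # oldest results fall off
        self.high = high_threshold           # start flapping at/above this %
        self.low = low_threshold             # stop flapping below this %
        self.flapping = False

    def percent_change(self):
        """Share of adjacent result pairs whose state differs, in percent."""
        states = list(self.states)
        if len(states) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return 100.0 * changes / (len(states) - 1)

    def record(self, state):
        """Store one check result and return the updated flapping flag."""
        self.states.append(state)
        pct = self.percent_change()
        if not self.flapping and pct >= self.high:
            self.flapping = True
        elif self.flapping and pct < self.low:
            self.flapping = False
        return self.flapping
```

Using two thresholds gives some hysteresis, so a service does not bounce in and out of the flapping state itself. Notifications would typically be suppressed while `flapping` is true.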
#4 Collect metrics when possible.
- Health checks should collect metrics whenever possible. For instance, a “server health” plugin, which monitors CPU, swap, memory, disk space, etc., should return the metrics for the resources it monitors. This avoids polling the host twice, once for resource utilization and once for alerting.
- Metrics collection may be resource intensive on the monitoring server.
- What is the long term policy for metrics collection, aggregation and reporting?
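A health check that also returns its metric might look like the sketch below. It uses the Nagios plugin convention of appending performance data after a `|` in the form `'label'=value[UOM];warn;crit;min;max`. The function names and the 80/90 percent thresholds are placeholders I picked for illustration.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(label, used_pct, warn_pct, crit_pct):
    """Turn one utilization figure into (status, plugin output line)."""
    if used_pct >= crit_pct:
        status, word = CRITICAL, "CRITICAL"
    elif used_pct >= warn_pct:
        status, word = WARNING, "WARNING"
    else:
        status, word = OK, "OK"
    # Performance data: 'label'=value[UOM];warn;crit;min;max
    perfdata = f"'{label}'={used_pct:.1f}%;{warn_pct:g};{crit_pct:g};0;100"
    return status, f"DISK {word} - {label} is {used_pct:.1f}% full | {perfdata}"


def disk_check(path="/", warn_pct=80.0, crit_pct=90.0):
    """Check disk usage for path; alert status and metric come from one poll."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return evaluate(path, used_pct, warn_pct, crit_pct)
```

A plugin script would print the returned line and exit with the returned status; the monitoring server can then alert on the status and graph the value after the `|` from the same poll.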
#5 Define the alerting/notification requirements.
- What is the SLA on the service?
- What is the SLA on the response to an alert?
- Who is to be notified when the service fails?
- What is the escalation policy?
#6 Define the notification escalation policy based on the SLA for the response.
- When an alert is not answered who is it escalated to?
- How is it escalated (email, SMS, etc.)?
- What are the incentives for timely alert/notification response?
- Disincentives for non-timely response?
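One way to pin down an escalation policy is to write it as data: who is notified, how, and after how many minutes without an acknowledgement. The sketch below is purely hypothetical; the contacts, channels and timings are placeholders and should come from the response SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    contact: str        # who is notified at this step
    channel: str        # how they are notified


# Hypothetical policy; every value here is a placeholder.
POLICY = [
    EscalationStep(0, "on-call engineer", "email"),
    EscalationStep(15, "on-call engineer", "sms"),
    EscalationStep(30, "team lead", "sms"),
    EscalationStep(60, "operations manager", "phone"),
]


def steps_due(policy, minutes_unacknowledged):
    """Return every step whose delay has elapsed, earliest first."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s for s in ordered if s.after_minutes <= minutes_unacknowledged]


def current_step(policy, minutes_unacknowledged):
    """Return the most recent step that applies, or None before the first."""
    due = steps_due(policy, minutes_unacknowledged)
    return due[-1] if due else None
```

For example, `current_step(POLICY, 45)` returns the team lead step, since the 30-minute mark has passed and the 60-minute mark has not.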
#7 Pick the right tools.
- I like Nagios. It is pluggable, supports metrics collection and it is reliable.
- Try not to reinvent the wheel too much.
- Try to come up with integrated monitoring/metrics collection and reporting. Keep as little operational data in your operational databases as possible. Limit operational monitoring databases to one month of historical data, in addition to maintaining status and notification information. Move historical information into a dedicated metrics reporting database schema.
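The one-month retention idea can be sketched as a periodic job that moves old rows out of the operational table. The example below uses SQLite from the Python standard library only to stay self-contained; the table names `check_results` and `check_history` are hypothetical, and in practice the history table would live in the dedicated reporting schema rather than next to the operational one.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # "one month" approximated as 30 days


def create_schema(conn):
    """Create the hypothetical operational and history tables."""
    for table in ("check_results", "check_history"):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "host TEXT, service TEXT, state TEXT, "
            "checked_at TEXT)"  # ISO-8601 UTC timestamps, sortable as text
        )
    conn.commit()


def archive_old_checks(conn, now=None):
    """Move results older than the retention window into check_history.

    Returns the number of rows removed from the operational table.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat(timespec="seconds")
    with conn:  # single transaction: copy first, then delete
        conn.execute(
            "INSERT INTO check_history (host, service, state, checked_at) "
            "SELECT host, service, state, checked_at FROM check_results "
            "WHERE checked_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM check_results WHERE checked_at < ?", (cutoff,)
        ).rowcount
    return moved
```

Running the copy and the delete in one transaction means a failure part way through leaves the operational table untouched.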