Context
For the hosting services my company offers, I ended up managing an ESXi server, so there is no reason not to monitor it; I would even say it is all the more necessary! It is easy to fall into the classic virtualization trap of loading the server with lots of VMs, as if its performance grew along with the load... :D
Prerequisites
Installing the vSphere SDK for Perl:
Download the tar.gz: VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz
$ tar xvfz VMware-vSphere-Perl-SDK-5.1.0-780721.x86_64.tar.gz
$ cd vmware-vsphere-cli-distrib
Two variables need to be changed in the installer script (vmware-install.pl) so that the SDK installs without trouble:
my $httpproxy =0;
my $ftpproxy =0;
to:
my $httpproxy =1;
my $ftpproxy =1;
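The edit above can also be scripted. Here is a minimal sketch, assuming the two assignments appear in the installer exactly as shown (one per line, starting with `my`); the function name `enable_proxy_flags` is made up for this example and is not part of the SDK. It keeps a copy of the original file next to the patched one.

```shell
# Hypothetical helper: flip "my $httpproxy =0;" and "my $ftpproxy =0;" to 1.
# Assumes each assignment sits on its own line, as in the excerpt above.
enable_proxy_flags() {
    file=$1
    # Keep a backup; writing back with ">" preserves the file's permissions.
    cp -p "$file" "$file.orig" || return 1
    # Only lines declaring $httpproxy or $ftpproxy are touched.
    sed -E '/^my \$(http|ftp)proxy[[:space:]]*=/ s/=[[:space:]]*0;/=1;/' \
        "$file.orig" > "$file"
}

# Usage (from inside vmware-vsphere-cli-distrib):
#   enable_proxy_flags vmware-install.pl
```

If the installer's layout differs from the excerpt, the pattern simply matches nothing and the file is left unchanged, so it is worth checking the result with `grep proxy vmware-install.pl` afterwards.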
# ./vmware-install.pl
Installing the Nagios plugin
Download the plugin here: http://www.op5.org/community/plugin-inventory/op5-projects/check-esx-plugin
$ cd /usr/local/nagios/libexec/
$ wget -O check_vmware_api.pl 'http://git.op5.org/git/?p=nagios/op5plugins.git;a=blob_plain;f=check_vmware_api.pl;hb=HEAD'
# chown nagios:nagios check_vmware_api.pl
# chmod 755 check_vmware_api.pl
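Since the file comes from a web front end to a git repository, it can be worth checking that what was saved is really a Perl script and not an HTML error page. A rough sketch, assuming the plugin starts with a Perl shebang line (an assumption about the file, not something stated here); `looks_like_perl_script` is a hypothetical name:

```shell
# Hypothetical sanity check: succeed only if the file's first line is a
# shebang that mentions perl. An HTML error page would fail this test.
looks_like_perl_script() {
    head -n 1 "$1" | grep -Eq '^#!.*perl'
}

# Usage:
#   looks_like_perl_script check_vmware_api.pl || echo "download looks wrong" >&2
```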
Let's run the command a first time; here is what we get:
$ ./check_vmware_api.pl --help
check_vmware_api.pl 0.7.0
This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).
VMWare Infrastructure plugin
Usage: check_vmware_api.pl -D | -H [ -C ] [ -N ]
-u -p | -f
-l [ -s ] [ -T ] [ -i ]
[ -x ] [ -o ]
[ -t ] [ -w ] [ -c ]
[ -V ] [ -h ]
-?, --usage
Print usage information
-h, --help
Print detailed help screen
-V, --version
Print version information
--extra-opts=[section][@file]
Read options from an ini file. See http://nagiosplugins.org/extra-opts
for usage and examples.
-H, --host=
ESX or ESXi hostname.
-C, --cluster=
ESX or ESXi clustername.
-D, --datacenter=
Datacenter hostname.
-N, --name=
Virtual machine name.
-u, --username=
Username to connect with.
-p, --password=
Password to use with the username.
-f, --authfile=
Authentication file with login and password. File syntax :
username=
password=
-w, --warning=THRESHOLD
Warning threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-c, --critical=THRESHOLD
Critical threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-l, --command=COMMAND
Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
-s, --subcommand=SUBCOMMAND
Specify subcommand
-S, --sessionfile=SESSIONFILE
Specify a filename to store sessions for faster authentication
-x, --exclude=
Specify black list
-o, --options=
Specify additional command options (quickstats, ...)
-T, --timestamp=
Timeshift in seconds that could fix issues with "Unknown error". Use values like 5, 10, 20, etc
-i, --interval=
Sampling Period in seconds. Basic historic intervals: 300, 1800, 7200 or 86400. See config for any changes.
Supports literval values to autonegotiate interval value: r - realtime interval, h - historical interval specified by position.
Default value is 20 (realtime). Since cluster does not have realtime stats interval other than 20(default realtime) is mandatory.
-M, --maxsamples=
Maximum number of samples to retrieve. Max sample number is ignored for historic intervals.
Default value is 1 (latest available sample).
--trace=
Set verbosity level of vSphere API request/respond trace
-t, --timeout=INTEGER
Seconds before plugin times out (default: 30)
-v, --verbose
Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ - blank or not specified parameter, o - options, T - timeshift value, b - blacklist) :
VM specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
+ usagemhz - CPU usage in MHz
+ wait - CPU wait time in ms
+ ready - CPU ready time in ms
^ all cpu info(no thresholds)
* mem - shows mem info
+ usage - mem usage in percentage
+ usagemb - mem usage in MB
+ swap - swap mem usage in MB
+ swapin - swapin mem usage in MB
+ swapout - swapout mem usage in MB
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ active - active mem usage in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
^ all net info(except usage and no thresholds)
* io - shows disk I/O info
+ usage - overall disk usage in MB/s
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
^ all disk io info(no thresholds)
* runtime - shows runtime info
+ con - connection state
+ cpu - allocated CPU in MHz
+ mem - allocated mem in MB
+ state - virtual machine state (UP, DOWN, SUSPENDED)
+ status - overall object status (gray/green/red/yellow)
+ consoleconnections - console connections to VM
+ guest - guest OS status, needs VMware Tools
+ tools - VMWare Tools status
+ issues - all issues for the host
^ all runtime info(except con and no thresholds)
Host specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemhz - CPU usage in MHz
o quickstats - switch for query either PerfCounter values or Runtime info
^ all cpu info
o quickstats - switch for query either PerfCounter values or Runtime info
* mem - shows mem info
+ usage - mem usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemb - mem usage in MB
o quickstats - switch for query either PerfCounter values or Runtime info
+ swap - swap mem usage in MB
o listvm - turn on/off output list of swapping VM's
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
o listvm - turn on/off output list of ballooning VM's
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
+ nic - makes sure all active NICs are plugged in
^ all net info(except usage and no thresholds)
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
^ all disk io info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
* runtime - shows runtime info
+ con - connection state
+ health - checks cpu/storage/memory/sensor status
o listitems - list all available sensors(use for listing purpose only)
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ storagehealth - storage status check
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ temperature - temperature sensors
o blackregexpflag - whether to treat blacklist as regexp
b - blacklist status objects
+ sensor - threshold specified sensor
+ maintenance - shows whether host is in maintenance mode
+ list(vm) - list of VMWare machines and their statuses
+ status - overall object status (gray/green/red/yellow)
+ issues - all issues for the host
b - blacklist issues
^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
* service - shows Host service info
+ (names) - check the state of one or several services specified by (names), syntax for (names):,,...,
^ show all services
* storage - shows Host storage info
+ adapter - list bus adapters
b - blacklist adapters
+ lun - list SCSI logical units
b - blacklist LUN's
+ path - list logical unit paths
b - blacklist paths
^ show all storage info
* uptime - shows Host uptime
o quickstats - switch for query either PerfCounter values or Runtime info
* device - shows Host specific device info
+ cd/dvd - list vm's with attached cd/dvd drives
o listall - list all available devices(use for listing purpose only)
DC specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemhz - CPU usage in MHz
o quickstats - switch for query either PerfCounter values or Runtime info
^ all cpu info
o quickstats - switch for query either PerfCounter values or Runtime info
* mem - shows mem info
+ usage - mem usage in percentage
o quickstats - switch for query either PerfCounter values or Runtime info
+ usagemb - mem usage in MB
o quickstats - switch for query either PerfCounter values or Runtime info
+ swap - swap mem usage in MB
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
^ all mem info(except overall and no thresholds)
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
^ all net info(except usage and no thresholds)
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
^ all disk io info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
* runtime - shows runtime info
+ list(vm) - list of VMWare machines and their statuses
+ listhost - list of VMWare esx host servers and their statuses
+ listcluster - list of VMWare clusters and their statuses
+ tools - VMWare Tools status
b - blacklist VM's
+ status - overall object status (gray/green/red/yellow)
+ issues - all issues for the host
b - blacklist issues
^ all runtime info(except cluster and tools and no thresholds)
* recommendations - shows recommendations for cluster
+ (name) - recommendations for cluster with name (name)
^ all clusters recommendations
Cluster specific :
* cpu - shows cpu info
+ usage - CPU usage in percentage
+ usagemhz - CPU usage in MHz
^ all cpu info
* mem - shows mem info
+ usage - mem usage in percentage
+ usagemb - mem usage in MB
+ swap - swap mem usage in MB
o listvm - turn on/off output list of swapping VM's
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
o listvm - turn on/off output list of ballooning VM's
^ all mem info(plus overhead and no thresholds)
* cluster - shows cluster services info
+ effectivecpu - total available cpu resources of all hosts within cluster
+ effectivemem - total amount of machine memory of all hosts in the cluster
+ failover - VMWare HA number of failures that can be tolerated
+ cpufainess - fairness of distributed cpu resource allocation
+ memfainess - fairness of distributed mem resource allocation
^ only effectivecpu and effectivemem values for cluster services
* runtime - shows runtime info
+ list(vm) - list of VMWare machines in cluster and their statuses
+ listhost - list of VMWare esx host servers in cluster and their statuses
+ status - overall cluster status (gray/green/red/yellow)
+ issues - all issues for the cluster
b - blacklist issues
^ all cluster runtime info
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
o used - output used space instead of free
o breif - list only alerting volumes
o regexp - whether to treat name as regexp
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
^ all datastore info
o used - output used space instead of free
o breif - list only alerting volumes
o blacklistregexp - whether to treat blacklist as regexp
b - blacklist VMFS's
T (value) - timeshift to detemine if we need to refresh
Copyright (c) 2008 op5
After a quick test, we get an error like this one:
CHECK_VMWARE_API.PL CRITICAL - Server version unavailable at ...
Certificate verification is the culprit. If you do not want to pass the certificate as a parameter, use this option:
--no-certificate-checking
or add this at the top of the Perl script:
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
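The Perl line above sets an environment variable that LWP reads, so the same effect can be had without editing the script by setting the variable for a single run. A small sketch; the wrapper name `run_without_hostname_check` is hypothetical, and the host and credentials in the usage comment are placeholders:

```shell
# Hypothetical wrapper: run any command with LWP hostname verification
# disabled. This weakens TLS checks, so keep it to trusted networks.
run_without_hostname_check() {
    env PERL_LWP_SSL_VERIFY_HOSTNAME=0 "$@"
}

# Usage (placeholders, adapt to your setup):
#   run_without_hostname_check ./check_vmware_api.pl -H esx.example.org \
#       -u monitor -p secret -l runtime -s status
```

The variable only applies to the wrapped command, so other Perl programs on the machine keep their normal certificate checks.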
Configuring Nagios
We will store the ESXi login credentials in the etc/resource.cfg file, which must not be accessible through the CGIs:
$USER09$=username
$USER10$=password
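The plugin's help output above also lists a `-f, --authfile` option, with a file made of a `username=` line and a `password=` line. Passing credentials that way keeps the password out of the process list, which `-p` does not. A minimal sketch for creating such a file; the helper name `write_authfile` and the path in the usage comment are assumptions made for this example:

```shell
# Hypothetical helper: write an auth file in the "username=" / "password="
# syntax shown in the plugin's help, readable by its owner only.
write_authfile() {
    authfile=$1
    user=$2
    pass=$3
    (
        umask 077   # a newly created file gets mode 0600
        printf 'username=%s\npassword=%s\n' "$user" "$pass" > "$authfile" &&
            chmod 600 "$authfile"   # also tighten a pre-existing file
    )
}

# Usage (path is an assumption; the file should belong to the nagios user):
#   write_authfile /usr/local/nagios/etc/esx_auth monitor 'secret'
#   ./check_vmware_api.pl -H esx.example.org -f /usr/local/nagios/etc/esx_auth -l cpu -s usage
```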
Next, the commands need to be configured:
# 'check_esx_cpu' command definition
define command{
command_name check_esx_cpu
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l cpu -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_mem' command definition
define command{
command_name check_esx_mem
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l mem -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_net' command definition
define command{
command_name check_esx_net
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l net -s usage -w $ARG1$ -c $ARG2$
}
# 'check_esx_runtime' command definition
define command{
command_name check_esx_runtime
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l runtime -s status
}
# 'check_esx_ioread' command definition
define command{
command_name check_esx_ioread
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s read -w $ARG1$ -c $ARG2$
}
# 'check_esx_iowrite' command definition
define command{
command_name check_esx_iowrite
command_line $USER1$/check_vmware_api.pl -H $HOSTADDRESS$ -u $USER09$ -p $USER10$ -l io -s write -w $ARG1$ -c $ARG2$
}
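Before relying on these definitions, it helps to run one of the command lines by hand and look at its exit status, since that status is what Nagios turns into a service state (0, 1, 2 and 3 are the standard plugin return codes). A small sketch; `nagios_state_name` is a hypothetical helper, and the values in the usage comment are placeholders:

```shell
# Map a plugin exit status to the corresponding Nagios state name.
# Anything outside 0..3 is not a standard plugin status.
nagios_state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OUT_OF_RANGE ;;
    esac
}

# Usage (placeholders, adapt to your setup):
#   ./check_vmware_api.pl -H esx.example.org -u monitor -p secret \
#       -l cpu -s usage -w 80 -c 90
#   nagios_state_name $?
```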
Then the usual host definition:
define host{
use generic-host
host_name myesx1
alias myesx1
address XXX.XXX.XXX.XXX
}
And the service definitions:
define service{
use generic-service
host_name myesx1
service_description ESXi CPU Load
check_command check_esx_cpu!80!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi Memory usage
check_command check_esx_mem!80!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi Network usage
check_command check_esx_net!102400!204800
}
define service{
use generic-service
host_name myesx1
service_description ESXi Runtime status
check_command check_esx_runtime
}
define service{
use generic-service
host_name myesx1
service_description ESXi IO read
check_command check_esx_ioread!40!90
}
define service{
use generic-service
host_name myesx1
service_description ESXi IO write
check_command check_esx_iowrite!40!90
}
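The two values after each `!` end up in `$ARG1$` and `$ARG2$`, that is, in the plugin's `-w` and `-c` options. In the Nagios threshold format, a bare number N means "alert when the value is outside the range 0 to N". The sketch below only illustrates that rule for bare numbers; it is not the plugin's code, it ignores the richer range syntax (such as `10:20` or `@10:20`), and `threshold_state` is a made-up name:

```shell
# Hypothetical illustration of bare-number thresholds (e.g. -w 80 -c 90):
# a value outside 0..N breaches threshold N; critical wins over warning.
threshold_state() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        v += 0; w += 0; c += 0          # force numeric comparison
        if (v < 0 || v > c)  print "CRITICAL"
        else if (v > w)      print "WARNING"
        else                 print "OK"
    }'
}

# Usage, with the CPU thresholds from the service above:
#   threshold_state 85 80 90
```

With `check_esx_cpu!80!90`, a CPU usage of exactly 80 is therefore still inside the warning range and does not raise an alert.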
Conclusion
There you go, that does the trick: you now have the beginnings of monitoring for your ESX server! For finer-grained monitoring, I invite you to read this documentation: http://www.op5.com/how-to/monitoring-vmware-esx-3-x-esxi-vsphere-4-and-vcenter-server
Original post of Slobberbone.