We will cover the following steps:
- Install/configure check_by_ssh plugin
- Configure service check_http to check jenkins web interface availability
- Configure service to check for jenkins number of open files (ulimit issue)
- Configure service to check for jenkins number of threads
1. Install/configure check_by_ssh plugin
- client side
  - install client package
    - apt-get install nagios-nrpe-server
  - enable nagios shell
    - usermod -s /bin/bash nagios
- server side
  - install server package
    - apt-get install nagios3
  - enable nagios shell
    - usermod -s /bin/bash nagios
  - create ssh keypair
    - su - nagios
    - ssh-keygen -N ""
  - copy public key to clients using scp or other methods e.g. salt, chef
    - add public key to ~nagios/.ssh/authorized_keys
  - check that the ssh connection is working
    - su - nagios
    - ssh <client-ip-address>
  - check that check_by_ssh is working
    - /usr/lib/nagios/plugins/check_by_ssh -l nagios -H <client-dns-name> -C "hostname"
    - output: <client-dns-name>
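With more than a handful of clients, the last verification step gets tedious to repeat by hand. The following is a minimal sketch of a helper that runs the same "hostname" smoke test against several hosts. The function names and the CHECK_BY_SSH override variable are my own assumptions for illustration; only the plugin path and the -l, -H and -C options come from the steps above.

```shell
#!/bin/bash
# Sketch: run the "hostname" smoke test against several clients.
# CHECK_BY_SSH can be overridden; the default is the plugin path used above.
CHECK_BY_SSH="${CHECK_BY_SSH:-/usr/lib/nagios/plugins/check_by_ssh}"

# Run check_by_ssh against one host and report whether the remote command worked.
verify_client() {
    local host="$1" out
    if out=$("$CHECK_BY_SSH" -l nagios -H "$host" -C "hostname" 2>&1); then
        echo "OK: $host answered: $out"
    else
        echo "FAILED: $host: $out"
        return 1
    fi
}

# Check every host given as an argument; return non-zero if any of them failed.
verify_clients() {
    local host failed=0
    for host in "$@"; do
        verify_client "$host" || failed=1
    done
    return "$failed"
}
```

Sourced into a shell of the nagios user on the server, it could be called as verify_clients <client-dns-name> <another-client-dns-name>.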
2. Configure service check_http to check jenkins web interface availability
For the http check I used the check_http command, which is already provided as a nagios plugin (/usr/lib/nagios/plugins/). The corresponding nagios service section (services_nagios2.cfg) looks like this:
define service {
        hostgroup_name          jenkins-servers
        service_description     jenkins nginx http redirect
        check_command           check_http!-p 8080
        use                     generic-service
        notification_interval   0;
}
In my case I have nginx proxies in front of the jenkins servers to handle https and to redirect plain http requests to https. This check tests whether the http redirect listening on port 8080 is available. However, it can also be used to check a plain jenkins instance.
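Before relying on the service definition, it can help to run the plugin by hand and see which state Nagios would derive from it. The sketch below wraps check_http and translates its exit code using the standard plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the CHECK_HTTP override variable are assumptions for illustration, not part of the plugin package.

```shell
#!/bin/bash
# Sketch: run check_http by hand and translate its exit code.
CHECK_HTTP="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"

# Map a Nagios plugin exit code to its state name.
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Probe host:port (port defaults to 8080) and print "<state>: <plugin output>".
try_check_http() {
    local host="$1" port="${2:-8080}" out rc=0
    out=$("$CHECK_HTTP" -H "$host" -p "$port" 2>&1) || rc=$?
    echo "$(status_name "$rc"): $out"
    return "$rc"
}
```

Calling try_check_http <client-dns-name> 8080 then prints the state followed by whatever the plugin reported.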
3. Configure service to check for jenkins number of open files (ulimit issue)
Checking the number of open files for a specific user is a little trickier. I used a perl script found at http://exchange.nagios.org/directory/Plugins/Uncategorized/Operating-Systems/Linux/check-open-files/details, which I placed in a plugins subfolder of the nagios home directory (on the clients). Then I added a check_by_ssh_open_files custom command to the custom_commands.cfg nagios configuration file. The command uses check_by_ssh to call the plugin installed on the client side.
define command {
        command_name    check_by_ssh_open_files
        command_line    $USER1$/check_by_ssh -o StrictHostKeyChecking=no -l nagios -H $HOSTADDRESS$ -C "/var/lib/nagios/plugins/check_unix_open_files.pl -a $ARG1$ -w $ARG2$,$ARG2$ -c $ARG3$,$ARG3$"
}
After that I defined a custom service that uses the check_by_ssh_open_files command with the username jenkins, the warning level set to 2048 and the critical level set to 4096 open files. Note that on standard installations the ulimit is typically set to 4096 open files at most.
define service {
        hostgroup_name          jenkins-servers
        service_description     check open files jenkins
        check_command           check_by_ssh_open_files!jenkins!2048!4096
        use                     generic-service
        notification_interval   0;
}
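The thresholds only make sense relative to the limit that is actually in effect for the jenkins process, so it is worth looking that up on the client. The following is a minimal sketch that reads it from Linux procfs; it is not how the perl plugin works, and the function names are my own. It assumes /proc/<pid>/limits is readable for the process in question, while listing /proc/<pid>/fd only works for your own processes or as root.

```shell
#!/bin/bash
# Sketch: inspect open-file limit and usage of a single process via Linux procfs.

# Print the soft "Max open files" limit of a PID (empty output if unreadable).
open_files_limit_of_pid() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits" 2>/dev/null
}

# Print the number of file descriptors a PID currently holds.
# Prints 0 if the fd directory does not exist or cannot be read.
open_fds_of_pid() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}
```

Given the PID of the jenkins java process, open_files_limit_of_pid shows the soft limit the warning and critical levels should stay below.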
4. Configure service to check for jenkins number of threads
Checking the number of threads for a specific process is even more complex. I found two ways to do it. The first is to use the check_procs plugin, which is part of nagios. The problem is that you have to recompile the plugin with a special ps command syntax so that it lists all threads of a process instead of only the processes. Although I figured out how to download, configure and compile the plugin code, I wasn't able to figure out the specific ps options. You somehow have to define the command, the format used to parse its output, and so on. I found the following parameters on the web:
--with-ps-command="/bin/ps -eo 's uid pid ppid vsz rss pcpu etime comm args'" \
--with-ps-format='%s %d %d %d %d %d %f %s %s %n' \
--with-ps-cols=10 \
--with-ps-varlist='procstat,&procuid,&procpid,&procppid,&procvsz,&procrss,&procpcpu,procetime,procprog,&pos'
After wasting more than an hour on that, I decided to write a simple bash script that is sufficient for my requirements and for nagios. Here it is, my first nagios checker script...
#!/bin/bash
# Nagios plugin exit codes
RET_OK=0
RET_WARN=1
RET_CRIT=2
RET_UNKNOWN=3

user="$1"
warn="$2"
crit="$3"

# Resolve the user name to a UID; report UNKNOWN if the user does not exist
id=`id -u "$user"`
if [ $? -ne 0 ]
then
    echo "UNKNOWN: USAGE ./check_threads.sh <user> <warn> <crit>"
    exit $RET_UNKNOWN
fi

# ps auxH prints one line per thread; count the lines that mention the user
count=`ps auxH | grep "$user" | wc -l`
if [ $? -ne 0 ]
then
    echo "UNKNOWN: USAGE ./check_threads.sh <user> <warn> <crit>"
    exit $RET_UNKNOWN
fi

# Compare the count against the warning and critical thresholds
if [ "$count" -lt "$warn" ]
then
    echo "THREADS OK: $count processes/threads with UID = $id ($user)"
    exit $RET_OK
elif [ "$count" -lt "$crit" ]
then
    echo "WARNING - $count processes/threads with UID = $id ($user)"
    exit $RET_WARN
else
    echo "CRITICAL - $count processes/threads with UID = $id ($user)"
    exit $RET_CRIT
fi
This is the corresponding custom command...
define command {
        command_name    check_by_ssh_threads
        command_line    $USER1$/check_by_ssh -o StrictHostKeyChecking=no -l nagios -H $HOSTADDRESS$ -C "/var/lib/nagios/plugins/check_threads.sh $ARG1$ $ARG2$ $ARG3$"
}
and this is the corresponding service, which checks whether the number of jenkins threads exceeds 512 (warning) or 1024 (critical).
define service {
        hostgroup_name          jenkins-servers
        service_description     check number threads jenkins
        check_command           check_by_ssh_threads!jenkins!512!1024
        use                     generic-service
        flap_detection_enabled  0
        notification_interval   0;
}
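One caveat of counting with ps and grep is that every ps line containing the user name is counted, which may include the grep process itself or processes of other users that merely mention the name in their arguments. As an alternative, here is a minimal sketch that sums the "Threads:" field of /proc/<pid>/status for all processes whose real UID matches. It assumes Linux procfs and awk; the function names are hypothetical and not part of any nagios plugin.

```shell
#!/bin/bash
# Sketch: count threads per numeric UID by reading Linux procfs.

# Print the thread count of one process (0 if it does not exist).
threads_of_pid() {
    local n
    n=$(awk '/^Threads:/ {print $2}' "/proc/$1/status" 2>/dev/null)
    echo "${n:-0}"
}

# Print the total thread count of all processes whose real UID equals $1.
threads_of_uid() {
    local uid="$1" total=0 f n
    for f in /proc/[0-9]*/status; do
        # One awk run per process: remember the thread count and whether
        # the real UID (second field of the "Uid:" line) matches.
        n=$(awk -v uid="$uid" '
            /^Uid:/     { if ($2 == uid) owned = 1 }
            /^Threads:/ { threads = $2 }
            END         { print (owned ? threads : 0) }' "$f" 2>/dev/null)
        total=$((total + ${n:-0}))
    done
    echo "$total"
}
```

In check_threads.sh the count could then be obtained with threads_of_uid "$id", reusing the UID the script already resolves with id -u.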
Resources:
- http://www.nagios-wiki.de/nagios/plugins/check_by_ssh
- http://exchange.nagios.org/directory/Plugins/Uncategorized/Operating-Systems/Linux/check-open-files/details
- http://esisteinfehleraufgetreten.wordpress.com/2009/09/25/installing-nagios-or-icinga/
- http://www.nagios-wiki.de/nagios/plugins/check_http
- http://www.nagios.org/documentation
Here is a version of your script that checks for the count of threads with a specific name:
#!/bin/bash
RET_OK=0
RET_WARN=1
RET_CRIT=2
RET_UNKNOWN=3
name="$1"
warn="$2"
crit="$3"
if [[ -z "$1" ]]
then
echo Please set threadname / program to search for
exit $RET_UNKNOWN
fi
if [[ -z "$2" ]]
then
echo Please set warn threshold
exit $RET_UNKNOWN
fi
if [[ -z "$3" ]]
then
echo Please set critical threshold
exit $RET_UNKNOWN
fi
count=`ps -eLf | grep "$name" | wc -l`
if [ $? -ne 0 ]
then
echo "UNKNOWN: USAGE ./check_threads.sh <name> <warn> <crit>"
exit $RET_UNKNOWN
fi
if [ $count -lt $warn ]
then
echo "THREADS OK: $count processes/threads with name $name|threads=$count;$warn;$crit;0"
exit $RET_OK
elif [ $count -lt $crit ]
then
echo "WARNING - $count processes/threads with name $name|threads=$count;$warn;$crit;0"
exit $RET_WARN
else
echo "CRITICAL - $count processes/threads with name $name|threads=$count;$warn;$crit;0"
exit $RET_CRIT
fi
exit $RET_UNKNOWN
Hi, thx for sharing your version... Cheers