Thursday, December 12, 2013

nagios: check number of threads, open files and http for a service using check_by_ssh and a simple bash script

Recently, some of the jenkins servers in our build farm occasionally spawned up to 16k threads. To diagnose the problem and to monitor the services for availability issues, I created some monitoring scripts for nagios. This tutorial shows how to monitor jenkins using nagios, a nice tool for monitoring infrastructure and services. We will use the check_by_ssh plugin to execute the checks on the target system.

We will cover the following steps:
  1. Install/configure check_by_ssh plugin
  2. Configure service check_http to check jenkins web interface availability
  3. Configure service to check for jenkins number of open files (ulimit issue)
  4. Configure service to check for jenkins number of threads
1. Install/configure check_by_ssh
  •  client side
    • install client package
      • apt-get install nagios-nrpe-server
    • enable nagios shell
      • usermod -s /bin/bash nagios
  • server side
    • install server package
      • apt-get install nagios3
    • enable nagios shell
      • usermod -s /bin/bash nagios
    • create ssh keypair
      • su - nagios
      • ssh-keygen -N ""
    • copy public key to clients using scp or other methods e.g. salt, chef
      •  add public key to ~nagios/.ssh/authorized_keys
  • check that the ssh connection is working
    • su - nagios
    • ssh <client-ip-address>
  • check that check_by_ssh is working
    • /usr/lib/nagios/plugins/check_by_ssh -l nagios -H <client-dns-name> -C "hostname"
      • output: <client-dns-name>
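
The "add public key to ~nagios/.ssh/authorized_keys" step can be scripted. The following is only a minimal sketch: the function name install_pubkey and the paths in the usage note are made up for illustration, and it assumes sshd runs with its default StrictModes setting, which is why the permissions are tightened.

```shell
#!/bin/bash
# Append a public key to <home>/.ssh/authorized_keys unless it is already there.
# Usage: install_pubkey <public-key-file> <home-directory>
install_pubkey() {
    local keyfile="$1" home="$2"
    local sshdir="$home/.ssh" auth="$home/.ssh/authorized_keys"

    [ -r "$keyfile" ] || { echo "cannot read $keyfile" >&2; return 1; }

    # with StrictModes enabled, sshd rejects keys in group/world-writable locations
    mkdir -p "$sshdir" && chmod 700 "$sshdir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1

    # -F fixed string, -x whole line, -q quiet: only append keys not yet installed
    if ! grep -qxFf "$keyfile" "$auth"; then
        cat "$keyfile" >> "$auth"
    fi
}
```

Run as the nagios user on the client, a call could look like install_pubkey /tmp/nagios_server.pub ~nagios (the key file path is hypothetical). Calling it twice does not duplicate the key.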
2. Configure service check_http to check jenkins web interface availability
For the http check I used the check_http command, which is already provided as a nagios plugin (in /usr/lib/nagios/plugins/). The corresponding nagios service section (services_nagios2.cfg) looks like this:

define service {
        hostgroup_name                  jenkins-servers
        service_description             jenkins nginx http redirect
        check_command                   check_http!-p 8080
        use                             generic-service
        notification_interval           0;
}
In my case, nginx proxies sit in front of the jenkins servers to handle https and to redirect plain http requests to https. This check verifies that the http redirect listening on port 8080 is available. The same check can also be used for plain jenkins instances.
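
To see why a redirect still counts as "available", it helps to look at how an HTTP status code maps to a nagios state. The sketch below mirrors what check_http does with its default settings as far as I know (2xx and 3xx are OK, 4xx is WARNING, 5xx or no answer is CRITICAL). The function name http_state is made up for illustration and this is not the plugin's actual code.

```shell
#!/bin/bash
# Map an HTTP status code to a nagios state name and exit code.
# Usage: http_state <status-code>
http_state() {
    local code="$1"

    # anything that is not a plain number cannot be a valid status code
    case "$code" in
        ''|*[!0-9]*) echo "CRITICAL"; return 2 ;;
    esac

    if [ "$code" -ge 500 ]; then
        echo "CRITICAL"; return 2      # server errors
    elif [ "$code" -ge 400 ]; then
        echo "WARNING"; return 1       # client errors
    elif [ "$code" -ge 100 ]; then
        echo "OK"; return 0            # success and redirects
    else
        echo "CRITICAL"; return 2      # e.g. 000, which curl prints when it cannot connect
    fi
}
```

For a manual probe you could feed it the status code from curl, for example http_state "$(curl -s -o /dev/null -w '%{http_code}' http://<client-dns-name>:8080/)".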

3. Configure service to check for jenkins number of open files (ulimit issue)

Checking the number of open files for a specific user is a little trickier. I used a perl script found at http://exchange.nagios.org/directory/Plugins/Uncategorized/Operating-Systems/Linux/check-open-files/details, which I placed in a plugins subfolder of the nagios home directory (on the clients). Then I added a check_by_ssh_open_files custom command to the custom_commands.cfg nagios configuration file. The command uses check_by_ssh to call the plugin installed on the client side.
define command {
        command_name    check_by_ssh_open_files
        command_line    $USER1$/check_by_ssh -o StrictHostKeyChecking=no -l nagios -H $HOSTADDRESS$ -C "/var/lib/nagios/plugins/check_unix_open_files.pl -a $ARG1$ -w $ARG2$,$ARG2$ -c $ARG3$,$ARG3$"
        }
After that I defined a custom service that uses the check_by_ssh_open_files command with the username jenkins, the warning level set to 2048 and the critical level set to 4096 open files. Note that on standard installations the ulimit is typically set to at most 4096 open files.
define service {
        hostgroup_name                  jenkins-servers
        service_description             check open files jenkins
        check_command                   check_by_ssh_open_files!jenkins!2048!4096
        use                             generic-service
        notification_interval           0;
}
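
To get a feeling for what such a check measures, here is a minimal sketch that sums the entries under /proc/<pid>/fd for every process owned by a user. It is not the perl plugin used above. The function name count_open_files and the optional second argument (an alternative proc root, handy for testing) are made up for illustration. Note that /proc/<pid>/fd is only readable by the process owner or root, so in practice it would have to run as the jenkins user or via sudo.

```shell
#!/bin/bash
# Sum the open file descriptors of all processes owned by a user (Linux only).
# Usage: count_open_files <user> [proc-root]
count_open_files() {
    local user="$1" root="${2:-/proc}"
    local total=0 dir n

    for dir in "$root"/[0-9]*; do
        # keep only process directories that belong to the requested user
        [ -n "$(find "$dir" -maxdepth 0 -user "$user" 2>/dev/null)" ] || continue
        # every entry in fd/ is one open descriptor; unreadable directories count as 0
        n=$(ls -A "$dir/fd" 2>/dev/null | wc -l)
        total=$((total + n))
    done
    echo "$total"
}
```

The result of count_open_files jenkins could then be compared against the warning and critical levels in the same way as the thresholds in the service definition.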

4. Configure service to check for jenkins number of threads
Checking the number of threads for a specific process is even more complex. I found two solutions. The first is to use the check_procs plugin, which is part of the nagios plugins. The problem is that you have to recompile the plugin with a special ps command syntax so that it lists all threads of a process instead of only the processes. Although I figured out how to download, configure and compile the plugin code, I wasn't able to figure out the specific ps options: you have to define the command, the format used to parse its output, and so on. I found the following parameters on the web:

    --with-ps-command="/bin/ps -eo 's uid pid ppid vsz rss pcpu etime comm args'" \
    --with-ps-format='%s %d %d %d %d %d %f %s %s %n' \
    --with-ps-cols=10 \
    --with-ps-varlist='procstat,&;procuid,&;procpid,&;procppid,&;procvsz,&;procrss,&;procpcpu,procetime,procprog,&pos'

After wasting more than an hour on that, I decided to write a simple bash script that satisfies both my requirements and those of nagios. Here it is, my first nagios checker script...


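#!/bin/bash
# check_threads.sh: nagios check for the number of threads owned by a user

RET_OK=0
RET_WARN=1
RET_CRIT=2
RET_UNKNOWN=3

user="$1"
warn="$2"
crit="$3"

# all three arguments are required
if [ $# -ne 3 ]
then
echo "UNKNOWN: USAGE ./check_threads.sh <user> <warn> <crit>"
exit $RET_UNKNOWN
fi

# resolve the numeric UID; this fails if the user does not exist
id=`id -u "$user" 2>/dev/null`
if [ $? -ne 0 ]
then
echo "UNKNOWN: user $user does not exist"
exit $RET_UNKNOWN
fi

# -L prints one line per thread, -u selects by user and "-o lwp=" prints
# only the thread ID without a header line; unlike "ps auxH | grep $user"
# this does not count the grep process or unrelated lines that match
count=`ps -L -u "$id" -o lwp= | wc -l`
if [ "$count" -lt "$warn" ]
then
echo "THREADS OK: $count processes/threads with UID = $id ($user)"
exit $RET_OK
elif [ "$count" -lt "$crit" ]
then
echo "WARNING - $count processes/threads with UID = $id ($user)"
exit $RET_WARN
else
echo "CRITICAL - $count processes/threads with UID = $id ($user)"
exit $RET_CRIT
fi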
This is the corresponding custom command...
define command {
        command_name    check_by_ssh_threads
        command_line    $USER1$/check_by_ssh -o StrictHostKeyChecking=no -l nagios -H $HOSTADDRESS$ -C "/var/lib/nagios/plugins/check_threads.sh $ARG1$ $ARG2$ $ARG3$"
        }

And this is the corresponding service, which checks whether the number of jenkins threads exceeds 512 (warning) or 1024 (critical).
 
define service {
        hostgroup_name                  jenkins-servers
        service_description             check number threads jenkins
        check_command                   check_by_ssh_threads!jenkins!512!1024
        use                             generic-service
        flap_detection_enabled          0
        notification_interval           0;
}
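
If ps is not an option, the thread count can also be read from /proc. The following is a minimal sketch under a few assumptions: it relies on the "Uid:" and "Threads:" fields of /proc/<pid>/status on Linux, the function name count_threads is made up for illustration, and the optional second argument (an alternative proc root) only exists to make testing easier.

```shell
#!/bin/bash
# Sum the "Threads:" field of /proc/<pid>/status for all processes
# whose effective UID matches the given numeric UID (Linux only).
# Usage: count_threads <numeric-uid> [proc-root]
count_threads() {
    local uid="$1" root="${2:-/proc}"
    local total=0 f n

    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        # "Uid:" lists real, effective, saved and filesystem UID; field 3 is the effective one
        n=$(awk -v uid="$uid" '
            $1 == "Uid:" && $3 == uid { owned = 1 }
            $1 == "Threads:"          { threads = $2 }
            END { if (owned) print threads + 0; else print 0 }
        ' "$f" 2>/dev/null)
        # a process may exit while we read it; treat that as 0 threads
        total=$((total + ${n:-0}))
    done
    echo "$total"
}
```

A call such as count_threads "$(id -u jenkins)" returns a single number that can be compared against the warning and critical thresholds.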

Resources: 

  • http://www.nagios-wiki.de/nagios/plugins/check_by_ssh
  • http://exchange.nagios.org/directory/Plugins/Uncategorized/Operating-Systems/Linux/check-open-files/details
  • http://esisteinfehleraufgetreten.wordpress.com/2009/09/25/installing-nagios-or-icinga/
  • http://www.nagios-wiki.de/nagios/plugins/check_http
  • http://www.nagios.org/documentation

Comments:

  1. Here is a version of your script that checks for the count of threads with a specific name:

    #!/bin/bash

    RET_OK=0
    RET_WARN=1
    RET_CRIT=2
    RET_UNKNOWN=3

    name="$1"
    warn="$2"
    crit="$3"

    if [[ -z "$1" ]]
    then
    echo Please set threadname / program to search for
    exit $RET_UNKNOWN
    fi

    if [[ -z "$2" ]]
    then
    echo Please set warn threshold
    exit $RET_UNKNOWN
    fi

    if [[ -z "$3" ]]
    then
    echo Please set critical threshold
    exit $RET_UNKNOWN
    fi

    # "grep -v grep" keeps the grep process itself out of the count
    count=`ps -eLf | grep -- "$name" | grep -v grep | wc -l`
    if [ $? -ne 0 ]
    then
    echo "UNKNOWN: USAGE ./check_threads.sh <name> <warn> <crit>"
    exit $RET_UNKNOWN
    fi
    if [ "$count" -lt "$warn" ]
    then
    echo "THREADS OK: $count processes/threads with name $name|threads=$count;$warn;$crit;0"
    exit $RET_OK
    elif [ "$count" -lt "$crit" ]
    then
    echo "WARNING - $count processes/threads with name $name|threads=$count;$warn;$crit;0"
    exit $RET_WARN
    else
    echo "CRITICAL - $count processes/threads with name $name|threads=$count;$warn;$crit;0"
    exit $RET_CRIT
    fi
    exit $RET_UNKNOWN

  2. Hi, thx for sharing your version... Cheers
