Chield dies

Eivind Bengtsson ebe at dmi.dk
Wed May 12 12:25:40 CEST 2010


Hi guys I've just had an issue on one of our 4 varnishservers - that I 
cannot explain.. The varnishd-child just dies.

I have a guess though - Might it have something to do with me (a check 
on my behalf) telnetting to the serviceport to check the backend.health?
Is my perl-fu not strong enough?


After the child is dead, I do not see the alarm until 10 mins later ... 
and startup varnishd - I have no varnishlog from the incident but as I 
start up varnish everything works again...

Every 5 minutes my nagios does four checks on the varnishmachines, to 
check health of the backend.
command[check_varnish_backends]=/usr/lib/nagios/plugins/check_varnish_backend_health 
-all
command[check_varnish_backend_euros08]=/usr/lib/nagios/plugins/check_varnish_backend_health 
-h euros08
command[check_varnish_backend_euros12]=/usr/lib/nagios/plugins/check_varnish_backend_health 
-h euros12
command[check_varnish_backend_carbon]=/usr/lib/nagios/plugins/check_varnish_backend_health 
-h carbon

I've inlined the .pl doing the nagioscheck and the syslog output from 
varnishd.

Does anyone have a clue ? Sorry in advance for the rathe long mail ..
Regards

/Eivind

# varnishd -V
varnishd (varnish-2.0.6)
Copyright (c) 2006-2009 Linpro AS / Verdens Gang A
______________________________________
SYSLOG:

May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 open from telnet 
127.0.0.1:46759 127.0.0.1:6082
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 result 200 "debug.health"
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 closed
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 open from telnet 
127.0.0.1:46761 127.0.0.1:6082
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 result 200 "debug.health"
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 closed
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 open from telnet 
127.0.0.1:46762 127.0.0.1:6082
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 result 200 "debug.health"
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 closed
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 open from telnet 
127.0.0.1:46764 127.0.0.1:6082
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 result 200 "debug.health"
May 12 10:05:54 varnish1 varnishd[19448]: CLI 7 closed
....
May 12 10:07:58 varnish1 varnishd[19448]: Child (19449) not responding 
to ping, killing it.
May 12 10:08:00 varnish1 varnishd[19448]: Child (19449) not responding 
to ping, killing it.
May 12 10:08:00 varnish1 varnishd[19448]: Child (19449) died signal=3
May 12 10:08:02 varnish1 varnishd[19448]: Child cleanup complete
May 12 10:08:02 varnish1 varnishd[19448]: child (18174) Started
May 12 10:08:04 varnish1 kernel: [6214551.451581] Inbound IN=eth0 OUT= 
MAC=00:14:22:1e:ba:b8:00:0e:84:3b:ec:c0:08:00 SRC=66.151.125.8 
DST=130.225.244.83 LEN=32 TOS=0x00 PREC=0x00 TTL=1 ID=1802 PROTO=UDP 
SPT=11057 DPT=33442 LEN=12
May 12 10:08:06 varnish1 kernel: [6214553.449867] Inbound IN=eth0 OUT= 
MAC=00:14:22:1e:ba:b8:00:0e:84:3b:ec:c0:08:00 SRC=66.151.125.8 
DST=130.225.244.83 LEN=32 TOS=0x00 PREC=0x00 TTL=1 ID=1802 PROTO=UDP 
SPT=11057 DPT=33442 LEN=12
May 12 10:08:07 varnish1 varnishd[19448]: Pushing vcls failed: CLI 
communication error
May 12 10:08:07 varnish1 varnishd[19448]: Stopping Child
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Closed fds: 
4 5 6 10 11 13 14
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Child starts
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said managed to 
mmap 2147483648 bytes of 2147483648
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Ready
....
May 12 10:08:07 varnish1 varnishd[19448]: Pushing vcls failed: CLI 
communication error
May 12 10:08:07 varnish1 varnishd[19448]: Stopping Child
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Closed fds: 
4 5 6 10 11 13 14
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Child starts
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said managed to 
mmap 2147483648 bytes of 2147483648
May 12 10:08:07 varnish1 varnishd[19448]: Child (18174) said Ready
....
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Probe("GET 
/vandstand/servlet/ImageServlet HTTP/1.1
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Host: 
carbon.dmi.dk
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Connection: 
close
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: last message repeated 2 times
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said ", 3, 10)
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Probe("GET 
/ HTTP/1.1
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Host: localhost
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Connection: 
close
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: last message repeated 2 times
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said ", 3, 5)
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Probe("GET 
/dmi/dmi.css HTTP/1.1
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Host: 
euros08.dmi.dk
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said Connection: 
close
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said
May 12 10:08:08 varnish1 varnishd[19448]: last message repeated 2 times
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) said ", 2, 10)
May 12 10:08:08 varnish1 varnishd[19448]: Child (18174) ended
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said Probe("GET 
/dmi/dmi.css HTTP/1.1
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said Host: 
euros12.dmi.dk
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said Connection: close
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said
May 12 10:08:08 varnish1 varnishd[19448]: last message repeated 2 times
May 12 10:08:08 varnish1 varnishd[19448]: Child (-1) said ", 2, 10)
May 12 10:08:08 varnish1 varnishd[19448]: Child cleanup complete
....
May 12 10:21:42 varnish1 varnishd[19448]: Manager got SIGINT
May 12 10:21:43 varnish1 varnishd[19184]: child (19185) Started
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Closed fds: 
4 5 6 10 11 13 14
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Child starts
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said managed to 
mmap 2147483648 bytes of 2147483648
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Ready
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Probe("GET 
/vandstand/servlet/ImageServlet HTTP/1.1
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Host: 
carbon.dmi.dk
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said Connection: 
close
May 12 10:21:43 varnish1 varnishd[19184]: Child (19185) said
...And so on....

_________________________________
# cat /usr/lib/nagios/plugins/check_varnish_backend_health
#!/usr/bin/perl

use strict;
use Net::Telnet ();
use Getopt::Long;
my $version = '';
my $help = '';
my $result = '';
my $status = '';
my $host = '';
my $all = '';
my $verbose = '';
my $handle = new Net::Telnet ( Port=> '6082', Timeout => 8, prompt => 
'/\.\n|\n\n/' );

## Print Help
sub usage {
   my $msg = shift;
   if (defined $msg) {
      print "$msg\n";
   }
   print << "END";
        Usage: check_varnish_backend_health option
        Options:
                --help
                -h|--host hostname
                --all
                --verbose
                --version
        NB! Only one option at the time! :-)
END
   exit(3);
}

#Get options from command line
$result = GetOptions(
        "host|h=s"      => \$host,
        "all"           => \$all,
        "version"       => \$version,
        "verbose"       => \$verbose,
        "help"          => \$help
);

if ($help)      {usage();}
if ($version)   {usage('Version is 0.9_EBE');}
if ($all)       {if ($host){usage('Either --all _or_ --host hostname')};}
if ($verbose)   {if ($host){usage('Either --verbose _or_ --host 
hostname')};}

my $ip = "127.0.0.1";
$handle->open("$ip");
my @lines = $handle->cmd("debug\.health");
$handle->close;
my %backends_state_hash;
my @sick_backends;
foreach (@lines) {
if (/Backend/) {
        my @line = split(' ',$_);
        $status = pop(@line);
        my $be = @line[1];
        if ($verbose) {
                print "Host: $be => $status\n";
                }
        if ($status eq 'Sick'){
                push(@sick_backends,$be);
                }
        $backends_state_hash{$be} = $status;
        }
}

if ($host) {
        if ( $backends_state_hash{$host} eq 'Healthy'){
                print "OK: Host $host is Healthy";
                exit 0;
        }
        elsif ( $backends_state_hash{$host} eq 'Sick'){
                print "Critical: Host $host is Sick";
                exit 2;
        }
        else {
                print "Warning: Host $host is not Healthy, nor is it Sick";
                exit 1;
        }
}
my $number_of_sick_backends=@sick_backends;
my $number_of_backends = keys %backends_state_hash;
if ( $number_of_backends ==  0 ) {
        print "Warning: No backends are probed";
        exit 1;
}
if ( $number_of_sick_backends == 0 ) {
        print "OK: All hosts are Healthy";
        exit 0;
}
if ( $number_of_sick_backends == $number_of_backends ) {
        print "Critical: All hosts are Sick - @sick_backends";
        exit 2;
}
else {
        print "Warning: Some hosts are sick: @sick_backends";
        exit 1;
}
______________________________________

-- 
Eivind Bengtsson
Systemadministrator - Cand.merc.(dat)
Danmarks Meteorologiske Institut
Lyngbyvej 100
2100 København Ø
Direkte tlf. : 39157544
Email: ebe at dmi.dk
echo 'This is not a pipe.' | cat -> /dev/tty




More information about the varnish-misc mailing list