two varnishd fails

Sat May 22 18:49:56 CEST 2010

Hello all,

I am a varnish-cache user and most of the time everything goes smoothly, 
but today I have encountered 2 problems with it.
I have two deployments of varnish, on different web clusters:
- number1 web cluster
- number2 web cluster
Both have pretty much the same vcl.

The first problem was today with number1 web cluster.
/var/log/messages:
May 22 11:13:03 kernel: varnishd[21817] general protection 
rip:3497e00b79 rsp:62ed6970 error:0
May 22 11:13:32 kernel: varnishd[21958] general protection 
rip:3497e00b79 rsp:721f5970 error:0
May 22 11:14:48 kernel: varnishd[22115] general protection 
rip:3497e00b79 rsp:42fe2970 error:0
May 22 11:15:21 kernel: varnishd[22476] general protection 
rip:3497e00b79 rsp:7264e970 error:0
May 22 11:15:32 kernel: varnishd[22575] general protection 
rip:3497e00b79 rsp:5b135970 error:0
May 22 11:16:06 kernel: varnishd[22611]: segfault at 0000000000000010 
rip 0000003497e00b6a rsp 0000000045d40970 error 4
May 22 11:16:21 kernel: varnishd[22746] general protection 
rip:3497e00b79 rsp:4c5bd970 error:0
May 22 11:16:33 kernel: varnishd[22876] general protection 
rip:3497e00b79 rsp:5b6ad970 error:0
May 22 11:16:39 kernel: varnishd[22889] general protection 
rip:3497e00b79 rsp:485a5970 error:0
May 22 11:16:51 kernel: varnishd[22969] general protection 
rip:3497e00b79 rsp:5d4c9970 error:0

When trying to access any website hosted on number1 cluster, I would get 
Connection refused.

The second problem was also today, with number2 web cluster.
/var/log/messages:
May 22 16:17:16 varnishd[2451]: Child (25232) not responding to ping, 
killing it.
May 22 16:17:18 varnishd[2451]: Child (25232) not responding to ping, 
killing it.
May 22 16:17:18 varnishd[2451]: Child (25232) died signal=3
May 22 16:17:18 varnishd[2451]: child (22499) Started
May 22 16:17:23 varnishd[2451]: Pushing vcls failed: CLI communication error
May 22 16:17:30 varnishd[2451]: Child (22499) said Closed fds: 4 5 6 11 
12 14 15
May 22 16:17:30 varnishd[2451]: Child (22499) said Child starts
May 22 16:17:34 varnishd[2451]: Child (22499) said managed to mmap 
2147483648 bytes of 2147483648
May 22 16:17:34 varnishd[2451]: Child (22499) said Ready
May 22 16:17:36 varnishd[2451]: Child (22499) ended

Here also I got Connection refused.
What I noticed in the second log was that the hours were stuck somehow. 
It failed at exactly 19:18, and I restarted varnish around at 19:24, but 
the logs matching the restart were also "in-the-past".
May 22 16:23:05 varnishd[2451]: Manager got SIGINT
May 22 16:23:07 varnishd[23228]: child (23229) Started
May 22 16:23:07 varnishd[23228]: Child (23229) said Closed fds: 4 5 6 10 
11 13 14
May 22 16:23:07 varnishd[23228]: Child (23229) said Child starts
May 22 16:23:07 varnishd[23228]: Child (23229) said managed to mmap 
2147483648 bytes of 2147483648
May 22 16:23:07 varnishd[23228]: Child (23229) said Ready
Everything else was logged at the correct hours (I am using ntp), so I 
am dismissing the fact that the server's time was not synched.

Also, it is very strange that two completely unrelated varnish clusters 
would fail in the same day, at 9 hours distance.
Using CentOs 5.4, 2.6.18-164.15.1.el5 and varnishd (varnish-2.0.5) on 
both machines.

Any help on debugging these problems is kindly appreciated!