two varnishd fails

Thu May 27 12:15:38 CEST 2010

Hello all,

I am a varnish-cache user and most of the time everything goes smoothly,
but I have encountered 2 serious problems with it.
I have two deployments of varnish on different web clusters:
- number1 web cluster
- number2 web cluster
Both have pretty much the same vcl.

  The first problem was last week with number1 web cluster.
/var/log/messages:
May 22 11:13:03 kernel: varnishd[21817] general protection
rip:3497e00b79 rsp:62ed6970 error:0
May 22 11:13:32 kernel: varnishd[21958] general protection
rip:3497e00b79 rsp:721f5970 error:0
May 22 11:14:48 kernel: varnishd[22115] general protection
rip:3497e00b79 rsp:42fe2970 error:0
May 22 11:15:21 kernel: varnishd[22476] general protection
rip:3497e00b79 rsp:7264e970 error:0
May 22 11:15:32 kernel: varnishd[22575] general protection
rip:3497e00b79 rsp:5b135970 error:0
May 22 11:16:06 kernel: varnishd[22611]: segfault at 0000000000000010
rip 0000003497e00b6a rsp 0000000045d40970 error 4
May 22 11:16:21 kernel: varnishd[22746] general protection
rip:3497e00b79 rsp:4c5bd970 error:0
May 22 11:16:33 kernel: varnishd[22876] general protection
rip:3497e00b79 rsp:5b6ad970 error:0
May 22 11:16:39 kernel: varnishd[22889] general protection
rip:3497e00b79 rsp:485a5970 error:0
May 22 11:16:51 kernel: varnishd[22969] general protection
rip:3497e00b79 rsp:5d4c9970 error:0

  When trying to access any website hosted on number1 cluster, I would get
Connection refused.
   Based on what the log says, it seems that I get these protection faults everyday.

The second problem was also on the same day, with number2 web cluster.
/var/log/messages:
May 22 16:17:16 varnishd[2451]: Child (25232) not responding to ping,
killing it.
May 22 16:17:18 varnishd[2451]: Child (25232) not responding to ping,
killing it.
May 22 16:17:18 varnishd[2451]: Child (25232) died signal=3
May 22 16:17:18 varnishd[2451]: child (22499) Started
May 22 16:17:23 varnishd[2451]: Pushing vcls failed: CLI communication error
May 22 16:17:30 varnishd[2451]: Child (22499) said Closed fds: 4 5 6 11
12 14 15
May 22 16:17:30 varnishd[2451]: Child (22499) said Child starts
May 22 16:17:34 varnishd[2451]: Child (22499) said managed to mmap
2147483648 bytes of 2147483648
May 22 16:17:34 varnishd[2451]: Child (22499) said Ready
May 22 16:17:36 varnishd[2451]: Child (22499) ended

  Here also I got Connection refused.
  What I noticed in the second log was that the hours were stuck somehow.
  It failed at exactly 19:18, and I restarted varnish around at 19:24, but
the logs matching the restart were also "in-the-past".
May 22 16:23:05 varnishd[2451]: Manager got SIGINT
May 22 16:23:07 varnishd[23228]: child (23229) Started
May 22 16:23:07 varnishd[23228]: Child (23229) said Closed fds: 4 5 6 10 11 13 14
May 22 16:23:07 varnishd[23228]: Child (23229) said Child starts
May 22 16:23:07 varnishd[23228]: Child (23229) said managed to mmap
2147483648 bytes of 2147483648
May 22 16:23:07 varnishd[23228]: Child (23229) said Ready

Everything else was logged at the correct time (I am using ntp), so I
am dismissing the fact that the server was not in sync.

Also, it is very strange that two completely unrelated varnish clusters
would fail in the same day, at 9 hours distance.
Using CentOs 5.4, 2.6.18-164.15.1.el5 and varnishd (varnish-2.0.5) on
both machines.

Any help on debugging these problems is kindly appreciated!