Child Died

Mon Sep 7 11:58:01 CEST 2009

On Wed, Sep 02, 2009 at 10:12:00AM -0400, maillists0 at gmail.com wrote:
> I just started my first instance of varnish in production. Within 12 hours,
> there were alerts from our monitoring system that Varnish was taking 90% of
> the cpu. Right after that, I find these messages in /var/log/messages,
> several times over a 2 minute period:

Did you check syslog for assert errors too?

> varnishd[12461]: Child (20086) not responding to ping, killing it.
> 
> The child restarted, and the stats and cache all disappeared.
> 
> This is a machine with 8 gigs of ram and a pair of slightly older quad core
> xeons. The storage method is file with a 50 gig limit. At its peak, the
> machine is serving around 40 requests a second, about 5000k a second. The
> configs are the defaults.
> 
> What should my first steps be to troubleshoot this? Is there a likely
> culprit?

The first thing I'd do is check syslog for assert errors. If it's being
killed in the same place each time, something must be wrong (... ).
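
A rough sketch of that check, assuming the messages land in
/var/log/messages as in your report. The helper name and the
"assert|panic" pattern are my own guesses at what to search for, not
anything varnish ships:

```shell
# Hypothetical helper: list varnishd log lines that look like assert
# errors or panics. The log file defaults to /var/log/messages (the file
# named in the report); the search pattern is an assumption about the
# wording of the messages.
find_varnish_asserts() {
    logfile="${1:-/var/log/messages}"
    grep 'varnishd' "$logfile" | grep -i -E 'assert|panic'
}
```

Run it as «find_varnish_asserts» or pass another log file as the first
argument. If the same function or line number keeps showing up, include
those lines when you report back.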

Secondly, I'd check the value of cli_timeout. This default has changed over
time, but a very busy varnish can be slow to reply to pings from the
management thread, and thus get killed needlessly. You can check it with
the telnet interface or «varnishadm -T localhost:yourmanagementport
param.show cli_timeout». The new default is 10s, which should be enough,
though it still might be too low for extremely busy threads.
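
Something along these lines would do for checking and, if needed,
raising it. The function names and the VARNISHADM override are made up
for illustration; only the param.show and param.set CLI commands are
varnish's own:

```shell
# VARNISHADM may be overridden (for a dry run, say); defaults to the
# real binary.
VARNISHADM="${VARNISHADM:-varnishadm}"

# $1 = management address, e.g. localhost:yourmanagementport
show_cli_timeout() {
    "$VARNISHADM" -T "$1" param.show cli_timeout
}

# $1 = management address, $2 = new timeout in seconds
set_cli_timeout() {
    "$VARNISHADM" -T "$1" param.set cli_timeout "$2"
}
```

For example «show_cli_timeout localhost:yourmanagementport», then
«set_cli_timeout localhost:yourmanagementport 20» if the value turns out
to be low (20 is an arbitrary figure, not a recommendation). A value set
this way lasts only until varnishd is restarted; to keep it, pass it with
-p cli_timeout=... on the varnishd command line.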

You may also want to supply a varnishstat -1 (after varnish has had a
chance to warm up) and any custom VCL to the list.
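
If it helps, a small sketch for capturing that output to a file you can
attach. The helper name, the VARNISHSTAT override and the default file
name are arbitrary choices of mine:

```shell
# VARNISHSTAT may be overridden; defaults to the real binary.
VARNISHSTAT="${VARNISHSTAT:-varnishstat}"

# Write a one-shot counter dump (varnishstat -1) to a file and print the
# file name. $1 = output file; the timestamped default is arbitrary.
dump_varnishstat() {
    outfile="${1:-varnishstat-$(date +%Y%m%d-%H%M%S).txt}"
    "$VARNISHSTAT" -1 > "$outfile" && echo "$outfile"
}
```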

-- 
Kristian Lyngstøl
Redpill Linpro AS
Tlf: +47 21544179
Mob: +47 99014497