Varnish hangs / requests time out
kristian at redpill-linpro.com
Thu Mar 5 13:30:23 CET 2009
On Wed, Mar 04, 2009 at 06:16:10PM +1300, Ross Brown wrote:
> When this problem happens, the backend is still reachable and happily
> serving images. It is not a particularly busy period for us (600
> requests/sec/Varnish server - approx 350Mbps outbound each - we got up to
> nearly 3 times that level without incident previously) but for some
> reason unknown to us, the servers just suddenly stop processing requests
> and worker processes increase dramatically.
> After the lockup happened last time, I tried firing up varnishlog and
> hitting the server directly - my requests were not showing up at all. The
> *only* entries in the varnish log were related to worker processes being
> killed over time - no PINGs, PONGs, load balancer healthchecks or
> anything related to 'normal' varnish activity. It's as if varnishd has
> completely locked up, but we can't understand what causes both our
> varnish servers to exhibit this behaviour at exactly the same time, nor
> why varnish does not detect it and attempt a restart. After a restart,
> varnish is fine and behaves itself.
> There is nothing to indicate an error with the backend, nor anything in
> syslog to indicate a Varnish problem. Pointers of any kind would be
> appreciated :)
Have you checked dmesg? Do you have any estimate of how close together
these freezes are (seconds, minutes or tens of minutes apart, for instance)?
Your hit rate is quite low (around 78%) and it doesn't look like you have
grace enabled, which I strongly recommend turning on. If dmesg doesn't
reveal any trouble, I'd start by setting up grace (req.grace = 30s; and
obj.grace = 30s; will get you far) and then focus on getting that hit rate
up. If all you're serving is images, chances are that you should be able to
top 99%, which would make Varnish considerably more resilient to hiccups
from the backends.
You also have a few backend failures which could easily trigger Bad Things
with no grace and a low hit rate.
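
For reference, a minimal sketch of where those two grace settings would go,
assuming Varnish 2.0-style VCL (where the fetched object is still called
obj in vcl_fetch). The 30s values are just the ones suggested above, not
something tuned for your traffic:

```vcl
# Sketch only: assumes Varnish 2.0.x VCL syntax.
# These subs are combined with the built-in default logic, so the
# usual vcl_recv / vcl_fetch handling still runs after them.

sub vcl_recv {
    # Accept objects up to 30s past their TTL for this request
    # while a fresh copy is being fetched from the backend.
    set req.grace = 30s;
}

sub vcl_fetch {
    # Keep fetched objects around for 30s beyond their TTL so that
    # there is something to serve during that grace window.
    set obj.grace = 30s;
}
```

With both set, requests that arrive while an expired object is being
refreshed can be answered from the stale copy instead of piling up on the
backend.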
You should also consider starting varnishd with -p cli_timeout=20 or
similar, as the default can be far too aggressive on a busy site. Any
syslog or varnishlog entries related to this would be helpful for further
Redpill Linpro AS
Tlf: +47 21544179
Mob: +47 99014497
More information about the varnish-misc