varnish crashes

Sun Jan 24 18:57:20 CET 2010

On Jan 24, 2010, at 7:23 AM, Angelo Höngens wrote:
>> What is thread_pool_max set to?  Have you tried lowering it?   We have
>> found that on systems with very high cache-hit ratios, 16 threads per
>> CPU is the sweet spot to avoid context-switch saturation.
> 
> [angelo@nmt-nlb-03 ~]$ varnishadm -T localhost:81 param.show | grep thread_pool
> 
> thread_pool_add_delay      20 [milliseconds]
> thread_pool_add_threshold  2 [requests]
> thread_pool_fail_delay     200 [milliseconds]
> thread_pool_max            500 [threads]
> thread_pool_min            5 [threads]
> thread_pool_purge_delay    1000 [milliseconds]
> thread_pool_timeout        300 [seconds]
> thread_pools               2 [pools]
> 
> Thread_pool_max is set to 500 threads.. But I just increased it to 4000
> (as per http://varnish.projects.linpro.no/wiki/Performance), as 'top'
> shows me it's using around 480~490 threads now..
> 
> You suggest lowering it, what would be the effect of that? I would think
> it would run out of threads or something? Well, we'll see what happens
> with the increased threads..

Increasing concurrency is unlikely to solve the problem, although setting the number of thread pools to the number of CPUs is probably a good idea.

Assuming a high hit ratio and high CPU utilization (you haven't posted figures for either), lowering concurrency (i.e. reducing thread_pool_max) can help reduce the CPU contention caused by context switching.
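
To make the sizing concrete, here is a rough Python sketch that turns the two suggestions above (one thread pool per CPU, and the 16-threads-per-CPU figure quoted earlier) into varnishadm param.set commands. It only prints the commands and never runs them. The helper names are made up for illustration, the localhost:81 admin address is simply the one from your paste, and I am assuming thread_pool_max is a total across all pools; whether it is per pool or total depends on your Varnish version, so check param.show before applying anything.

```python
import os

# Heuristic quoted earlier in this thread; it is not a Varnish default.
THREADS_PER_CPU = 16


def suggest_thread_params(cpu_count, threads_per_cpu=THREADS_PER_CPU):
    """Suggest values, treating thread_pool_max as a total across pools (assumption)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return {
        "thread_pools": cpu_count,
        "thread_pool_max": cpu_count * threads_per_cpu,
    }


def param_set_commands(params, admin_addr="localhost:81"):
    """Format varnishadm param.set invocations; they are returned, not executed."""
    return [
        "varnishadm -T %s param.set %s %d" % (admin_addr, name, value)
        for name, value in sorted(params.items())
    ]


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    for cmd in param_set_commands(suggest_thread_params(cpus)):
        print(cmd)
```

Treat the numbers as a starting point to measure against, not as a recommendation for your workload.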

If maximum concurrency is reached, incoming connections will be deferred to the TCP listen(2) backlog (the overflowed_requests counter in varnishstat increases when this happens). When a request reaches the head of the queue, it will be picked up by a processing thread. The net effect is some additional latency, but probably not as much as you're experiencing if your CPU is swamped with context switches.

There are a few cases where increasing thread_pool_max can help, in particular when you have a high cache-miss ratio and slow origin servers. But if CPU utilization is already high, it will only make the problem worse.

BTW, on FreeBSD you can view the current length of the listen(2) backlog via "netstat -aL". By default, varnishd's listen(2) backlog is 512; as long as you don't see the length hit that value you should be OK.
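
If you would rather script that check than eyeball it, something like the following Python sketch could work. It assumes that "netstat -aL" prints rows of the form "proto qlen/incqlen/maxqlen local-address" under a two-line header; that layout is my assumption and may differ between FreeBSD releases, so compare it with real output on your box first. The function names are hypothetical, and the 512 limit is just the default mentioned above.

```python
import re

# Default varnishd listen(2) backlog cited in this message.
VARNISH_BACKLOG = 512

# Assumed row layout: "tcp4  0/0/128  *.80" (proto, qlen/incqlen/maxqlen, address).
_QUEUE_RE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)$")


def parse_listen_queues(text):
    """Return one dict per row matching the assumed layout; skip headers and junk."""
    rows = []
    for line in text.splitlines():
        match = _QUEUE_RE.match(line.strip())
        if match:
            rows.append({
                "proto": match.group(1),
                "qlen": int(match.group(2)),
                "incqlen": int(match.group(3)),
                "maxqlen": int(match.group(4)),
                "local": match.group(5),
            })
    return rows


def saturated(rows, limit=VARNISH_BACKLOG):
    """Return the rows whose current queue length has reached the limit."""
    return [row for row in rows if row["qlen"] >= limit]
```

You would feed it the captured output of the command, for example saturated(parse_listen_queues(output)), and look at whatever it returns for the address varnishd listens on.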

--Michael