thread pool issues
Dennis Hendriksen
dennis.hendriksen at kalooga.com
Tue Jun 14 17:20:36 CEST 2011
Hi Kristian,
Thank you for your suggestions. We've upgraded Varnish to 2.1.5, which
decreases the default thread_pool_add_delay from 20ms to 2ms. I've
included a varnishstat listing below. The numbers reflect live testing
(our experience with synthetic tests is that it is very hard to imitate
real-life behavior).
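For reference, we apply runtime parameter changes roughly like this
(assuming the management interface listens on localhost:6082; add -S
<secretfile> if your setup uses one):

  varnishadm -T localhost:6082 param.set thread_pool_add_delay 2
  varnishadm -T localhost:6082 param.show thread_pool_add_delay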
> I would typically recommend something closer to minimum 500, pools 2 and
> max 5000.
Currently we use 8 pools because the server has 2x4 CPU cores. Is there
an advantage to using fewer pools than the number of CPU cores? When we
increase the number of threads, the problem with "N worker threads
limited" is solved! :-)
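For anyone following along, the startup flags for Kristian's suggested
values would look roughly like this (the listen address and storage
size are placeholders, not our real settings):

  varnishd -a :80 -T localhost:6082 -s malloc,4G \
    -p thread_pools=2 \
    -p thread_pool_min=500 \
    -p thread_pool_max=5000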
> How many connections (not requests) are you doing during these tests?
ls -1 /proc/<varnish pid>/fd | wc -l gives us ~1300 (single load) and
~2600 (double load) file descriptors (=connections?).
> Do you use keep-alive and long-lasting connections? You may want to see
> if reducing session_linger helps.
Requests mostly arrive from web browsers.
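We haven't experimented with session_linger yet; if we do, I expect a
runtime change along these lines (the 10ms value is only an
illustration, not a tested setting):

  varnishadm -T localhost:6082 param.set session_linger 10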
netstat -tna | wc -l
~12000 tcp connections (single load)
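To separate established keep-alive connections from e.g. TIME_WAIT
buildup, a per-state breakdown like this helps:

  netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn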
Unfortunately, after facing double load, Varnish now becomes very
unresponsive after a while. Client requests are not answered by Varnish,
resulting in long waiting times (10+ seconds) or timeouts. We do not
have bandwidth issues.
Is it possible that in our use case we've reached the limit of what
Varnish can handle?
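While reproducing the hang, we plan to watch just the thread-related
counters; assuming your varnishstat build supports field selection
(-f), something like this should do:

  varnishstat -f n_wrk,n_wrk_max,n_wrk_queue,n_wrk_drop,client_drop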
Greetings and thanks for the help so far!
Dennis
varnishstat -1
client_conn 696307 177.40 Client connections accepted
client_drop 0 0.00 Connection dropped, no sess/wrk
client_req 965174 245.90 Client requests received
cache_hit 925943 235.91 Cache hits
cache_hitpass 5 0.00 Cache hits for pass
cache_miss 39125 9.97 Cache misses
backend_conn 4568 1.16 Backend conn. success
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
backend_fail 3 0.00 Backend conn. failures
backend_reuse 34683 8.84 Backend conn. reuses
backend_toolate 79 0.02 Backend conn. was closed
backend_recycle 34768 8.86 Backend conn. recycles
backend_unused 0 0.00 Backend conn. unused
fetch_head 0 0.00 Fetch head
fetch_length 24818 6.32 Fetch with Length
fetch_chunked 14426 3.68 Fetch chunked
fetch_eof 0 0.00 Fetch EOF
fetch_bad 0 0.00 Fetch had bad headers
fetch_close 1 0.00 Fetch wanted close
fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
fetch_zero 0 0.00 Fetch zero len
fetch_failed 0 0.00 Fetch failed
n_sess_mem 2235 . N struct sess_mem
n_sess 1787 . N struct sess
n_object 34379 . N struct object
n_vampireobject 0 . N unresurrected objects
n_objectcore 34516 . N struct objectcore
n_objecthead 22424 . N struct objecthead
n_smf 0 . N struct smf
n_smf_frag 0 . N small free smf
n_smf_large 0 . N large free smf
n_vbe_conn 6 . N struct vbe_conn
n_wrk 280 . N worker threads
n_wrk_create 280 0.07 N worker threads created
n_wrk_failed 0 0.00 N worker threads not created
n_wrk_max 9693 2.47 N worker threads limited
n_wrk_queue 0 0.00 N queued work requests
n_wrk_overflow 0 0.00 N overflowed work requests
n_wrk_drop 0 0.00 N dropped work requests
n_backend 4 . N backends
n_expired 385 . N expired objects
n_lru_nuked 0 . N LRU nuked objects
n_lru_saved 0 . N LRU saved objects
n_lru_moved 370058 . N LRU moved objects
n_deathrow 0 . N objects on deathrow
losthdr 0 0.00 HTTP header overflows
n_objsendfile 0 0.00 Objects sent with sendfile
n_objwrite 815230 207.70 Objects sent with write
n_objoverflow 0 0.00 Objects overflowing workspace
s_sess 696245 177.39 Total Sessions
s_req 965174 245.90 Total Requests
s_pipe 4 0.00 Total pipe
s_pass 120 0.03 Total pass
s_fetch 39245 10.00 Total fetch
s_hdrbytes 285675067 72783.46 Total header bytes
s_bodybytes 10667879292 2717931.03 Total body bytes
sess_closed 30597 7.80 Session Closed
sess_pipeline 1238 0.32 Session Pipeline
sess_readahead 537 0.14 Session Read Ahead
sess_linger 955973 243.56 Session Linger
sess_herd 891554 227.15 Session herd
shm_records 39223429 9993.23 SHM records
shm_writes 4022999 1024.97 SHM writes
shm_flushes 0 0.00 SHM flushes due to overflow
shm_cont 1578 0.40 SHM MTX contention
shm_cycles 15 0.00 SHM cycles through buffer
sm_nreq 0 0.00 allocator requests
sm_nobj 0 . outstanding allocations
sm_balloc 0 . bytes allocated
sm_bfree 0 . bytes free
sma_nreq 71633 18.25 SMA allocator requests
sma_nobj 66455 . SMA outstanding allocations
sma_nbytes 608883602 . SMA outstanding bytes
sma_balloc 2206748168 . SMA bytes allocated
sma_bfree 1597864566 . SMA bytes free
sms_nreq 0 0.00 SMS allocator requests
sms_nobj 0 . SMS outstanding allocations
sms_nbytes 0 . SMS outstanding bytes
sms_balloc 0 . SMS bytes allocated
sms_bfree 0 . SMS bytes freed
backend_req 39247 10.00 Backend requests made
n_vcl 2 0.00 N vcl total
n_vcl_avail 1 0.00 N vcl available
n_vcl_discard 1 0.00 N vcl discarded
n_purge 1 . N total active purges
n_purge_add 1 0.00 N new purges added
n_purge_retire 0 0.00 N old purges deleted
n_purge_obj_test 0 0.00 N objects tested
n_purge_re_test 0 0.00 N regexps tested against
n_purge_dups 0 0.00 N duplicate purges removed
hcb_nolock 0 0.00 HCB Lookups without lock
hcb_lock 0 0.00 HCB Lookups with lock
hcb_insert 0 0.00 HCB Inserts
esi_parse 0 0.00 Objects ESI parsed (unlock)
esi_errors 0 0.00 ESI parse errors (unlock)
accept_fail 0 0.00 Accept failures
client_drop_late 0 0.00 Connection dropped late
uptime 3925 1.00 Client uptime
backend_retry 2 0.00 Backend conn. retry
dir_dns_lookups 0 0.00 DNS director lookups
dir_dns_failed 0 0.00 DNS director failed lookups
dir_dns_hit 0 0.00 DNS director cached lookups hit
dir_dns_cache_full 0 0.00 DNS director full dnscache
fetch_1xx 0 0.00 Fetch no body (1xx)
fetch_204 0 0.00 Fetch no body (204)
fetch_304 0 0.00 Fetch no body (304)
On Fri, 2011-06-10 at 16:29 +0200, Kristian Lyngstol wrote:
> Greetings,
>
> On Fri, Jun 10, 2011 at 08:32:11AM +0200, Dennis Hendriksen wrote:
> > We're running Varnish 2.0.6 on a dual quad core server which is doing
> > about 500 req/s with a 97% hit ratio, serving mostly images. When we
> > increase the load to about 800 req/s, we encounter two problems that
> > seem to be related to the thread pool increase.
>
> You really should see if you can't move to at least Varnish 2.1.5.
>
> > When we double the Varnish load, the "N worker threads limited"
> > counter increases rapidly (100k+) while "N worker threads created"
> > does not increase (8 pools, min pool size 25, max pool size 1000).
> > Varnish is unresponsive and client connections hang.
>
> That'll give you 200 threads at startup.
>
> I would typically recommend something closer to minimum 500, pools 2 and
> max 5000.
>
> You also want to reduce thread_pool_add_delay from the (2.0.6)
> default of 20ms to, for instance, 2ms. The delay limits the rate at
> which new threads are started, and 20ms is often way too slow.
>
> How many connections (not requests) are you doing during these tests?
>
> > At other times we see the number of worker threads increasing but again
> > connections 'hang' while Varnish doesn't show any dropped connections
> > (only overflows).
>
> Do you use keep-alive and long-lasting connections? You may want to see
> if reducing session_linger helps.
>
> Are you testing with real traffic or synthetic tests?
>
> If possible, varnishstat -1 output would be useful.
>
> - Kristian
>