lru nuke causes varnish to stop responding to client requests

Matt Schurenko MSchurenko at
Fri Sep 30 20:06:40 CEST 2011

Hey Damon,

Our backends are ok. I have two other varnish servers in front of the same backends, varnish1 and varnish2. varnish1 is a VM with 32GB of RAM that is using a file storage backend of 40GB. varnish2 is a physical machine with 48GB of RAM and a file storage backend of 48GB (this is identical to varnish3, the problem server). varnish1 and varnish2 seem to start to lru nuke much earlier than varnish3. To me it seems like the problem is that varnish3 lets the cache get as full as possible, then starts to kind of panic and lru nuke as much as it can, which causes the worker threads to skyrocket. That's just my interpretation; I could be completely wrong. I also noticed that the n_object counter gets much higher on varnish3. Both varnish1 and varnish2 keep ~1.5 million objects in the cache. On varnish3 the number of objects kept climbing past 6 million, at which point it started to lru nuke and stopped responding to clients. I observed this happen three consecutive times on varnish3: the number of objects keeps climbing with no lru activity whatsoever, then it hits ~6 million objects and starts to nuke. This usually takes about 48 hours to occur.

I originally tried using the malloc storage type but kept having problems determining how big I should make the cache. On a server with 32GB I tried setting malloc to 30GB. I noticed that the server used a lot of swap space, so I decreased malloc to 28GB. I still saw a lot of swap usage; in fact it used up to 13GB of swap. Since I read that mmap performs better than swapping, I decided to use a storage type of file.

Matt Schurenko
Systems Administrator

airG(r) Share Your World
Suite 710, 1133 Melville Street
Vancouver, BC  V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: MSchurenko at
From: Damon Snyder [mailto:damon at]
Sent: September-30-11 9:56 AM
To: Matt Schurenko
Cc: varnish-misc at
Subject: Re: lru nuke causes varnish to stop responding to client requests

Hi Matt,
Oh, sorry, I didn't notice the worker thread count at 3000. I would suggest switching to -smalloc,38G (total mem * 0.8). If you have a lot of small objects or objects of the same size, you could be encountering some excessive nuking.

Also, what are your backends doing when this happens? Is the nuking a coincidence or is there an issue further down the stack?

On Thu, Sep 29, 2011 at 5:42 PM, Matt Schurenko <MSchurenko at> wrote:
Sorry, I forgot to mention that I already adjusted thread behaviour via varnishadm while varnish was running. I had it set to use min 50, max 3000 and a thread timeout of 120s. I think the reason n_wrk_overflow and n_wrk_drop are so high is this problem. Before the lru nuke happens the number of worker threads is ~100. As soon as it starts to nuke, the number of threads jumps to the max. I am monitoring some stats with MRTG. I seem to remember that on the other varnish server it would begin to lru nuke long before the cache got full. For this one there is no lru nuke activity until it reaches a certain point and then boom: 3000 threads are used up and no more new clients can connect.
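The runtime adjustment described above can be done with varnishadm's param.set. A sketch using the Varnish 2.1 parameter names; the -T address is an assumption (it was scrubbed from the archive), so use whatever varnishd was started with:

```shell
# Adjust worker-thread parameters on a running varnishd (Varnish 2.1).
# -T localhost:6082 is a hypothetical management address.
varnishadm -T localhost:6082 param.set thread_pool_min 50
varnishadm -T localhost:6082 param.set thread_pool_max 3000
varnishadm -T localhost:6082 param.set thread_pool_timeout 120
```

Note that parameters set this way do not survive a varnishd restart; to make them permanent they also need to go on the command line.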

Matt Schurenko
Systems Administrator

airG(r) Share Your World
Suite 710, 1133 Melville Street
Vancouver, BC  V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: MSchurenko at
From: Damon Snyder [mailto:damon at]
Sent: September-29-11 5:30 PM
To: Matt Schurenko
Cc: varnish-misc at
Subject: Re: lru nuke causes varnish to stop responding to client requests

Hi Matt,
It looks like you really need to bump up the number of worker threads. From your stats:

n_wrk_queue              2861         0.02 N queued work requests
n_wrk_overflow          83534         0.52 N overflowed work requests
n_wrk_drop              10980         0.07 N dropped work requests

You have a lot of requests sitting on the queue waiting for a worker, and a lot of requests that varnish has given up trying to fulfill. You can bump the number of workers up using the -w command line option to varnishd. I would suggest something like -w 400,1000,120 to start with (the default is -w2,500,300). This says: use 400 threads at minimum, 1000 at maximum, and set the idle timeout to 120s. According to the stats explanation doc, your n_wrk_queue and n_wrk_drop should be 0. If you see these numbers going up again, use -w 500,2000,120 or something like that.
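As a sketch of how -w relates to the thread parameters (the "..." stands for the rest of the existing command line, which stays unchanged):

```shell
# -w min,max,timeout is shorthand for the three worker-thread parameters:
varnishd ... -w 400,1000,120
# roughly equivalent to setting them individually with -p:
varnishd ... -p thread_pool_min=400 -p thread_pool_max=1000 -p thread_pool_timeout=120
```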

Hope this helps,

On Thu, Sep 29, 2011 at 4:34 PM, Matt Schurenko <MSchurenko at> wrote:
I've been having this problem for a couple weeks now on one of our varnish servers. I have posted a couple times already. What happens is that the server in question runs fine until the cache gets full. When it starts to lru nuke, the number of worker threads jumps up to thread_pool_max and varnish stops responding to any client requests. I have tried this with CentOS 5.4, 5.7 and now Slackware (all 64-bit) and the behaviour is the same.

I am using varnish version 2.1.5 on a Dell C6105 with 48GB of RAM.

Here is my varnishd command line:

/usr/local/sbin/varnishd -s file,/tmp/varnish-cache,48g -T<> -a<> -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers 384 -p connect_timeout 4.0

Here is the output from varnishstat -1:

client_conn          38582763       240.38 Client connections accepted
client_drop             10950         0.07 Connection dropped, no sess/wrk
client_req           38298994       238.61 Client requests received
cache_hit            32513762       202.57 Cache hits
cache_hitpass               0         0.00 Cache hits for pass
cache_miss            5784476        36.04 Cache misses
backend_conn          5725540        35.67 Backend conn. success
backend_unhealthy            0         0.00 Backend conn. not attempted
backend_busy                0         0.00 Backend conn. too many
backend_fail             1383         0.01 Backend conn. failures
backend_reuse           60837         0.38 Backend conn. reuses
backend_toolate            33         0.00 Backend conn. was closed
backend_recycle         60870         0.38 Backend conn. recycles
backend_unused              0         0.00 Backend conn. unused
fetch_head                  6         0.00 Fetch head
fetch_length            93631         0.58 Fetch with Length
fetch_chunked         5689433        35.45 Fetch chunked
fetch_eof                   0         0.00 Fetch EOF
fetch_bad                   0         0.00 Fetch had bad headers
fetch_close               107         0.00 Fetch wanted close
fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
fetch_zero                  0         0.00 Fetch zero len
fetch_failed                1         0.00 Fetch failed
n_sess_mem               7138          .   N struct sess_mem
n_sess                   6970          .   N struct sess
n_object              5047123          .   N struct object
n_vampireobject             0          .   N unresurrected objects
n_objectcore          5048435          .   N struct objectcore
n_objecthead          4955641          .   N struct objecthead
n_smf                10139770          .   N struct smf
n_smf_frag             295671          .   N small free smf
n_smf_large                 0          .   N large free smf
n_vbe_conn               2997          .   N struct vbe_conn
n_wrk                    3000          .   N worker threads
n_wrk_create             5739         0.04 N worker threads created
n_wrk_failed                0         0.00 N worker threads not created
n_wrk_max                4063         0.03 N worker threads limited
n_wrk_queue              2861         0.02 N queued work requests
n_wrk_overflow          83534         0.52 N overflowed work requests
n_wrk_drop              10980         0.07 N dropped work requests
n_backend                   2          .   N backends
n_expired                2179          .   N expired objects
n_lru_nuked            862615          .   N LRU nuked objects
n_lru_saved                 0          .   N LRU saved objects
n_lru_moved          27156180          .   N LRU moved objects
n_deathrow                  0          .   N objects on deathrow
losthdr                     0         0.00 HTTP header overflows
n_objsendfile               0         0.00 Objects sent with sendfile
n_objwrite           37294888       232.35 Objects sent with write
n_objoverflow               0         0.00 Objects overflowing workspace
s_sess               38566049       240.27 Total Sessions
s_req                38298994       238.61 Total Requests
s_pipe                      0         0.00 Total pipe
s_pass                    266         0.00 Total pass
s_fetch               5783176        36.03 Total fetch
s_hdrbytes        12570989864     78319.53 Total header bytes
s_bodybytes      151327304604    942796.38 Total body bytes
sess_closed          34673984       216.03 Session Closed
sess_pipeline             187         0.00 Session Pipeline
sess_readahead            321         0.00 Session Read Ahead
sess_linger           3929378        24.48 Session Linger
sess_herd             3929559        24.48 Session herd
shm_records        2025645664     12620.14 SHM records
shm_writes          169640580      1056.89 SHM writes
shm_flushes                41         0.00 SHM flushes due to overflow
shm_cont               580515         3.62 SHM MTX contention
shm_cycles                933         0.01 SHM cycles through buffer
sm_nreq              12431620        77.45 allocator requests
sm_nobj               9844099          .   outstanding allocations
sm_balloc         43855261696          .   bytes allocated
sm_bfree           7684345856          .   bytes free
sma_nreq                    0         0.00 SMA allocator requests
sma_nobj                    0          .   SMA outstanding allocations
sma_nbytes                  0          .   SMA outstanding bytes
sma_balloc                  0          .   SMA bytes allocated
sma_bfree                   0          .   SMA bytes free
sms_nreq                 1566         0.01 SMS allocator requests
sms_nobj                    0          .   SMS outstanding allocations
sms_nbytes                  0          .   SMS outstanding bytes
sms_balloc             656154          .   SMS bytes allocated
sms_bfree              656154          .   SMS bytes freed
backend_req           5786381        36.05 Backend requests made
n_vcl                       1         0.00 N vcl total
n_vcl_avail                 1         0.00 N vcl available
n_vcl_discard               0         0.00 N vcl discarded
n_purge                   218          .   N total active purges
n_purge_add               218         0.00 N new purges added
n_purge_retire              0         0.00 N old purges deleted
n_purge_obj_test       588742         3.67 N objects tested
n_purge_re_test     120444323       750.39 N regexps tested against
n_purge_dups                0         0.00 N duplicate purges removed
hcb_nolock           38301670       238.63 HCB Lookups without lock
hcb_lock              5786309        36.05 HCB Lookups with lock
hcb_insert            5786305        36.05 HCB Inserts
esi_parse                   0         0.00 Objects ESI parsed (unlock)
esi_errors                  0         0.00 ESI parse errors (unlock)
accept_fail                 0         0.00 Accept failures
client_drop_late           30         0.00 Connection dropped late
uptime                 160509         1.00 Client uptime
backend_retry              25         0.00 Backend conn. retry
dir_dns_lookups             0         0.00 DNS director lookups
dir_dns_failed              0         0.00 DNS director failed lookups
dir_dns_hit                 0         0.00 DNS director cached lookups hit
dir_dns_cache_full            0         0.00 DNS director full dnscache
fetch_1xx                   0         0.00 Fetch no body (1xx)
fetch_204                   0         0.00 Fetch no body (204)
fetch_304                   0         0.00 Fetch no body (304)

Even though I have removed the server from our load balancer, there are still a lot of requests going to the backend. Maybe these are all queued-up requests that varnish is trying to fulfill? Here is some output from varnishlog -c when I try to connect with curl:

root at mvp14:~# varnishlog -c
   26 SessionOpen  c 41942<>
   26 ReqStart     c 41942 2108342803
   26 RxRequest    c GET
   26 RxURL        c /
   26 RxProtocol   c HTTP/1.1
   26 RxHeader     c User-Agent: curl/7.21.4 (x86_64-unknown-linux-gnu) libcurl/7.21.4 OpenSSL/0.9.8n zlib/1.2.5 libidn/1.19
   26 RxHeader     c Host:<>
   26 RxHeader     c Accept: */*
   26 VCL_call     c recv
   26 VCL_return   c lookup
   26 VCL_call     c hash
   26 VCL_return   c hash

The connection just hangs here until it times out.

Any help would be appreciated. We are trying to replace our squid caching layer with varnish. However, if I can't resolve this issue, we will have to go back to squid.


Matt Schurenko
Systems Administrator

varnish-misc mailing list
varnish-misc at
