lru nuke causes varnish to stop responding to client requests
Matt Schurenko
MSchurenko at airg.com
Fri Sep 30 20:06:40 CEST 2011
Hey Damon,
Our backends are ok. I have two other varnish servers in front of the same backends, varnish1 and varnish2. varnish1 is a VM with 32GB of RAM that is using a file storage backend of 40GB. varnish2 is a physical machine with 48GB of RAM and a file storage backend of 48GB (this is identical to varnish3, the problem server). varnish1 and varnish2 seem to start to lru nuke much earlier than varnish3. To me it seems like the problem is that varnish3 lets the cache get as full as possible, then starts to kind of panic and lru nuke as much as it can, which causes the worker threads to skyrocket. That's just my interpretation of it anyway. I could be completely wrong. I also noticed that the n_object counter gets way higher on varnish3. Both varnish1 and varnish2 keep ~1.5 million objects in the cache. For varnish3 the number of objects kept climbing past 6 million, at which point it started to lru nuke and stopped responding to clients. I observed this happen three consecutive times on varnish3. The number of objects keeps climbing, with no lru activity whatsoever, until it hits ~6 million objects, and then it starts to nuke. This usually takes about 48 hours to occur.
I originally tried using the malloc storage type but kept having problems determining how big I should make the cache. On a server with 32GB I tried setting malloc to 30GB. I noticed that the server used a lot of swap space, so I decreased malloc to 28GB. I still saw a lot of swap usage; in fact it used up to 13GB of swap. Since I read that mmap performs better than swapping, I decided to use a storage type of file.
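For what it's worth, here is a rough sizing sketch along those lines; the 80% figure follows the suggestion further down the thread, and the specific numbers and the swap check are only illustrative, not something I actually ran:

# Hypothetical sizing pass on a 32 GB box: keep malloc well below total RAM
# to leave room for per-object overhead and the OS, then watch swap as the
# cache fills.
/usr/local/sbin/varnishd -s malloc,25G -T 127.0.0.1:2000 -a 0.0.0.0:80 \
    -f /usr/local/etc/varnish/default.vcl
# Sustained swap-in/swap-out here means the cache is still too large for the box.
vmstat 5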
Matt Schurenko
Systems Administrator
airG(r) Share Your World
Suite 710, 1133 Melville Street
Vancouver, BC V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: MSchurenko at airg.com
W: www.airg.com
From: Damon Snyder [mailto:damon at huddler-inc.com]
Sent: September-30-11 9:56 AM
To: Matt Schurenko
Cc: varnish-misc at varnish-cache.org
Subject: Re: lru nuke causes varnish to stop responding to client requests
Hi Matt,
Oh, sorry, I didn't notice the worker thread count at 3000. I would suggest switching to -s malloc,38G (total mem * 0.8). If you have a lot of small objects or objects of the same size, you could be encountering some excessive nuking.
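For example, taking the varnishd invocation you posted earlier and changing only the storage argument (a sketch; everything else is assumed to stay as it is now):

# Same options as the current command line, with file storage swapped for
# malloc at roughly 80% of the 48 GB of RAM.
/usr/local/sbin/varnishd -s malloc,38G -T 127.0.0.1:2000 -a 0.0.0.0:80 \
    -t 604800 -f /usr/local/etc/varnish/default.vcl \
    -p http_headers 384 -p connect_timeout 4.0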
Also, what are your backends doing when this happens? Is the nuking a coincidence or is there an issue further down the stack?
Damon
On Thu, Sep 29, 2011 at 5:42 PM, Matt Schurenko <MSchurenko at airg.com> wrote:
Sorry, I forgot to mention that I already adjusted thread behaviour via varnishadm while varnish was running. I had it set to use min 50, max 3000 and a thread timeout of 120s. I think the reason why n_wrk_overflow and n_wrk_drop are so high is due to this problem. Before the lru nuke happens the number of worker threads is ~100. As soon as it starts to nuke, the number of threads jumps to the max. I am monitoring some stats with MRTG. I seem to remember that on the other varnish server it would begin to lru nuke long before the cache got full. For this one there is no lru nuke activity until it reaches a certain point, and then boom: 3000 threads are used up and no more new clients can connect.
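In concrete terms that meant param.set calls against the management port, roughly like this (the exact commands aren't preserved, so treat this as an approximation):

# Approximate reconstruction of the runtime tuning; the -T address is the
# one from the varnishd command line further down.
varnishadm -T 127.0.0.1:2000 param.set thread_pool_min 50
varnishadm -T 127.0.0.1:2000 param.set thread_pool_max 3000
varnishadm -T 127.0.0.1:2000 param.set thread_pool_timeout 120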
Matt Schurenko
Systems Administrator
airG(r) Share Your World
Suite 710, 1133 Melville Street
Vancouver, BC V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: MSchurenko at airg.com
W: www.airg.com
From: Damon Snyder [mailto:damon at huddler-inc.com]
Sent: September-29-11 5:30 PM
To: Matt Schurenko
Cc: varnish-misc at varnish-cache.org
Subject: Re: lru nuke causes varnish to stop responding to client requests
Hi Matt,
It looks like you really need to bump up the number of worker threads. From your stats:
n_wrk_queue 2861 0.02 N queued work requests
n_wrk_overflow 83534 0.52 N overflowed work requests
n_wrk_drop 10980 0.07 N dropped work requests
You have a lot of requests sitting on the queue waiting for a worker, and a lot of requests that varnish has given up trying to fulfill because no worker was available. You can bump the number of workers up using the -w command line option to varnishd. I would suggest something like -w 400,1000,120 to start with (the default is -w 2,500,300). This says use 400 at a minimum, 1000 as the maximum, and set the timeout to 120s. According to the stats explanation doc <https://www.varnish-cache.org/trac/wiki/StatsExplained> your n_wrk_queue and n_wrk_drop should be 0. If you see these numbers going up again, use -w 500,2000,120 or something like that.
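Concretely, that would just be your current command line with -w added (a sketch; -w min,max,timeout is shorthand for the thread_pool_min, thread_pool_max and thread_pool_timeout parameters):

# Hypothetical: the original invocation with worker thread limits added via -w.
/usr/local/sbin/varnishd -s file,/tmp/varnish-cache,48g -T 127.0.0.1:2000 -a 0.0.0.0:80 \
    -t 604800 -f /usr/local/etc/varnish/default.vcl \
    -p http_headers 384 -p connect_timeout 4.0 \
    -w 400,1000,120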
Hope this helps,
Damon
On Thu, Sep 29, 2011 at 4:34 PM, Matt Schurenko <MSchurenko at airg.com> wrote:
I've been having this problem for a couple of weeks now on one of our varnish servers. I have posted a couple of times already. What happens is that the server in question runs fine until the cache gets full. When it starts to lru nuke, the number of worker threads jumps up to thread_pool_max and varnish stops responding to any client requests. I have tried this with CentOS 5.4, 5.7 and now Slackware (all 64-bit) and the behaviour is the same.
I am using varnish version 2.1.5 on a Dell C6105 with 48GB of RAM.
Here is my varnishd command line:
/usr/local/sbin/varnishd -s file,/tmp/varnish-cache,48g -T 127.0.0.1:2000 -a 0.0.0.0:80 -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers 384 -p connect_timeout 4.0
Here is the output from varnishstat -1:
client_conn 38582763 240.38 Client connections accepted
client_drop 10950 0.07 Connection dropped, no sess/wrk
client_req 38298994 238.61 Client requests received
cache_hit 32513762 202.57 Cache hits
cache_hitpass 0 0.00 Cache hits for pass
cache_miss 5784476 36.04 Cache misses
backend_conn 5725540 35.67 Backend conn. success
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
backend_fail 1383 0.01 Backend conn. failures
backend_reuse 60837 0.38 Backend conn. reuses
backend_toolate 33 0.00 Backend conn. was closed
backend_recycle 60870 0.38 Backend conn. recycles
backend_unused 0 0.00 Backend conn. unused
fetch_head 6 0.00 Fetch head
fetch_length 93631 0.58 Fetch with Length
fetch_chunked 5689433 35.45 Fetch chunked
fetch_eof 0 0.00 Fetch EOF
fetch_bad 0 0.00 Fetch had bad headers
fetch_close 107 0.00 Fetch wanted close
fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
fetch_zero 0 0.00 Fetch zero len
fetch_failed 1 0.00 Fetch failed
n_sess_mem 7138 . N struct sess_mem
n_sess 6970 . N struct sess
n_object 5047123 . N struct object
n_vampireobject 0 . N unresurrected objects
n_objectcore 5048435 . N struct objectcore
n_objecthead 4955641 . N struct objecthead
n_smf 10139770 . N struct smf
n_smf_frag 295671 . N small free smf
n_smf_large 0 . N large free smf
n_vbe_conn 2997 . N struct vbe_conn
n_wrk 3000 . N worker threads
n_wrk_create 5739 0.04 N worker threads created
n_wrk_failed 0 0.00 N worker threads not created
n_wrk_max 4063 0.03 N worker threads limited
n_wrk_queue 2861 0.02 N queued work requests
n_wrk_overflow 83534 0.52 N overflowed work requests
n_wrk_drop 10980 0.07 N dropped work requests
n_backend 2 . N backends
n_expired 2179 . N expired objects
n_lru_nuked 862615 . N LRU nuked objects
n_lru_saved 0 . N LRU saved objects
n_lru_moved 27156180 . N LRU moved objects
n_deathrow 0 . N objects on deathrow
losthdr 0 0.00 HTTP header overflows
n_objsendfile 0 0.00 Objects sent with sendfile
n_objwrite 37294888 232.35 Objects sent with write
n_objoverflow 0 0.00 Objects overflowing workspace
s_sess 38566049 240.27 Total Sessions
s_req 38298994 238.61 Total Requests
s_pipe 0 0.00 Total pipe
s_pass 266 0.00 Total pass
s_fetch 5783176 36.03 Total fetch
s_hdrbytes 12570989864 78319.53 Total header bytes
s_bodybytes 151327304604 942796.38 Total body bytes
sess_closed 34673984 216.03 Session Closed
sess_pipeline 187 0.00 Session Pipeline
sess_readahead 321 0.00 Session Read Ahead
sess_linger 3929378 24.48 Session Linger
sess_herd 3929559 24.48 Session herd
shm_records 2025645664 12620.14 SHM records
shm_writes 169640580 1056.89 SHM writes
shm_flushes 41 0.00 SHM flushes due to overflow
shm_cont 580515 3.62 SHM MTX contention
shm_cycles 933 0.01 SHM cycles through buffer
sm_nreq 12431620 77.45 allocator requests
sm_nobj 9844099 . outstanding allocations
sm_balloc 43855261696 . bytes allocated
sm_bfree 7684345856 . bytes free
sma_nreq 0 0.00 SMA allocator requests
sma_nobj 0 . SMA outstanding allocations
sma_nbytes 0 . SMA outstanding bytes
sma_balloc 0 . SMA bytes allocated
sma_bfree 0 . SMA bytes free
sms_nreq 1566 0.01 SMS allocator requests
sms_nobj 0 . SMS outstanding allocations
sms_nbytes 0 . SMS outstanding bytes
sms_balloc 656154 . SMS bytes allocated
sms_bfree 656154 . SMS bytes freed
backend_req 5786381 36.05 Backend requests made
n_vcl 1 0.00 N vcl total
n_vcl_avail 1 0.00 N vcl available
n_vcl_discard 0 0.00 N vcl discarded
n_purge 218 . N total active purges
n_purge_add 218 0.00 N new purges added
n_purge_retire 0 0.00 N old purges deleted
n_purge_obj_test 588742 3.67 N objects tested
n_purge_re_test 120444323 750.39 N regexps tested against
n_purge_dups 0 0.00 N duplicate purges removed
hcb_nolock 38301670 238.63 HCB Lookups without lock
hcb_lock 5786309 36.05 HCB Lookups with lock
hcb_insert 5786305 36.05 HCB Inserts
esi_parse 0 0.00 Objects ESI parsed (unlock)
esi_errors 0 0.00 ESI parse errors (unlock)
accept_fail 0 0.00 Accept failures
client_drop_late 30 0.00 Connection dropped late
uptime 160509 1.00 Client uptime
backend_retry 25 0.00 Backend conn. retry
dir_dns_lookups 0 0.00 DNS director lookups
dir_dns_failed 0 0.00 DNS director failed lookups
dir_dns_hit 0 0.00 DNS director cached lookups hit
dir_dns_cache_full 0 0.00 DNS director full dnscache
fetch_1xx 0 0.00 Fetch no body (1xx)
fetch_204 0 0.00 Fetch no body (204)
fetch_304 0 0.00 Fetch no body (304)
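(As an aside, and only as a back-of-the-envelope estimate that is not part of the original report: dividing the storage bytes allocated by the object count gives a rough average object size, which bears on the small-object/nuking question raised above.)

# Rough average cached object size from the counters above:
#   sm_balloc / n_object = 43855261696 / 5047123 ~= 8.7 KB per object
#   sm_balloc / sm_nobj  = 43855261696 / 9844099 ~= 4.5 KB per allocation
varnishstat -1 | awk '/^sm_balloc/ {b=$2} /^n_object / {o=$2} END {if (o) print b/o, "bytes per object"}'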
Even though I have removed the server from our load balancer there are still a lot of requests going to the backend. Maybe these are all queued up requests that varnish is trying to fulfill? Here is some output from varnishlog -c when I try to connect with curl:
root at mvp14:~# varnishlog -c
26 SessionOpen c 192.168.8.41 41942 0.0.0.0:80
26 ReqStart c 192.168.8.41 41942 2108342803
26 RxRequest c GET
26 RxURL c /
26 RxProtocol c HTTP/1.1
26 RxHeader c User-Agent: curl/7.21.4 (x86_64-unknown-linux-gnu) libcurl/7.21.4 OpenSSL/0.9.8n zlib/1.2.5 libidn/1.19
26 RxHeader c Host: mvp14.airg.com
26 RxHeader c Accept: */*
26 VCL_call c recv
26 VCL_return c lookup
26 VCL_call c hash
26 VCL_return c hash
The connection just hangs here until it times out.
Any help would be appreciated. We are trying to replace our squid caching layer with varnish. However, if I can't resolve this issue, we will have to go back to squid.
Thanks,
Matt Schurenko
Systems Administrator
_______________________________________________
varnish-misc mailing list
varnish-misc at varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc