varnish stops responding and threads do not decrease
Matt Schurenko
MSchurenko at airg.com
Thu Sep 22 19:05:26 CEST 2011
Thanks Damon. I'll put the cache size down. On the other varnish server the cache size is equal to physical RAM and we haven't had this problem. I have read Kristian's blog and was using malloc before. It was on a server with 32 GB of RAM. I set malloc to 28 GB but for some reason it ended up using ~ 15 GB of swap space. From what I understand varnish performs better using mmap than swap, so that's why I changed to the file storage type. I'll also try adjusting the thread configuration on all our varnish nodes.
On 2011-09-22, at 9:47 AM, "Damon Snyder" <damon at huddler-inc.com> wrote:
Hi Matt,
We had some instability when running varnish with a file-backed cache that was larger than RAM. This was a year or more ago, and I don't recall the exact details of the issue. We resolved it by reducing the size of the file cache to be smaller than memory, and the problems went away.
It looks like you can also increase your thread pool size. We are using something like -w 100,1600,120 at startup and have a similar connection rate. I would also suggest moving to malloc storage type (we are migrating from file to malloc). Kristian is quoted as saying:
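As a rough illustration of the -w value quoted above, here is a minimal sketch that splits it into named settings. It assumes the min,max,timeout form and that the three numbers correspond to thread_pool_min, thread_pool_max and thread_pool_timeout; the function name parse_worker_option is made up for this example.

```python
# Sketch: map a varnishd -w value onto the thread parameters it is assumed to set.
# Assumption: the value uses the min,max or min,max,timeout form.
def parse_worker_option(value):
    names = ("thread_pool_min", "thread_pool_max", "thread_pool_timeout")
    parts = [int(p) for p in value.split(",")]
    if len(parts) not in (2, 3):
        raise ValueError("expected min,max or min,max,timeout")
    # zip stops at the shorter sequence, so a two-part value omits the timeout
    return dict(zip(names, parts))

settings = parse_worker_option("100,1600,120")
for name, number in settings.items():
    print(f"{name} = {number}")
```

Compared with the default maximum of 500 mentioned later in the thread, the example raises the ceiling to 1600 threads.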
If you can not contain your cache in memory, first consider if you really need that big of a cache. Then consider buying more memory. Then sleep on it.
http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/
If you do switch to malloc on this box, try something like -smalloc,40G. You don't want to allocate up to the full amount of RAM, as varnish needs some additional space for other data and you need to make sure your OS and other processes have some margin as well.
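The sizing advice above amounts to simple arithmetic, sketched here. The 8 GB reserve is only what the 40G suggestion implies on a box with roughly 48 GB of RAM; it is not a recommended value, and malloc_storage_arg is a hypothetical helper.

```python
def malloc_storage_arg(total_ram_gb, reserve_gb):
    """Build a -smalloc argument that leaves reserve_gb of RAM outside the cache."""
    if reserve_gb <= 0 or reserve_gb >= total_ram_gb:
        raise ValueError("reserve must be positive and smaller than total RAM")
    return f"-smalloc,{total_ram_gb - reserve_gb}G"

# Assumed inputs: about 48 GB of RAM, 8 GB kept for varnish overhead, the OS and other processes
print(malloc_storage_arg(48, 8))
```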
For my part, I wish there was an interface into the size of the objects being cached so that you can better estimate the size of the cache you need. I understand varnishsizes can do something like this, but it doesn't tell you what is currently stored in aggregate or allow you to traverse the objects and their sizes.
In any case, you can monitor your hit rate and see what impact dropping the extra 12 GB has. I suppose you could argue that your hit rate is moot if you are not responding to connections so just getting rid of the no-response issue should be net positive.
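For reference, a minimal sketch of the hit-rate calculation, using the cache_hit and cache_miss counters from the varnishstat output quoted further down. It treats hits plus misses as the total number of lookups, which ignores passes; that is a simplification made for this example.

```python
def hit_rate(cache_hit, cache_miss):
    """Fraction of cache lookups that were hits."""
    lookups = cache_hit + cache_miss
    return cache_hit / lookups if lookups else 0.0

# Counter values copied from the varnishstat -1 output in the quoted message
rate = hit_rate(27525134, 5005404)
print(f"hit rate: {rate:.1%}")
```

Running the same calculation before and after shrinking the cache gives a direct measure of what the smaller size costs.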
Hope this helps,
Damon
On Thu, Sep 22, 2011 at 9:14 AM, Matt Schurenko <MSchurenko at airg.com> wrote:
I posted yesterday regarding this issue. It has happened again; however, this time I have not restarted the problem varnish node. I am using the default configuration with regard to threads. It seems that once the number of worker threads reaches thread_pool_max (500), the server stops responding to any requests. I stopped all connections from the load balancer to the server; however, the threads do not decrease. They remain stuck at ~ 509:
[root@mvp14 16328]# grep Threads /proc/`pgrep varnishd|tail -n1`/status
Threads: 509
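The same check can be scripted, for example to log the thread count over time. This is a minimal sketch that parses the text of a /proc/<pid>/status file; the sample string stands in for the real file, and thread_count is a name chosen for this example.

```python
def thread_count(status_text):
    """Return the value of the Threads field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no Threads field found")

# Stand-in for open(f"/proc/{pid}/status").read(), where pid is the varnishd child
sample = "Name:\tvarnishd\nThreads:\t509\n"
print(thread_count(sample))
```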
The server has been idle for ~ 45 minutes now and there are only a couple of established connections:
[root@mvp14 16328]# netstat -ant | grep -w .*:80 | grep EST
tcp      254      0 204.92.101.119:80       192.168.105.32:37554    ESTABLISHED
tcp      532      0 204.92.101.119:80       192.168.100.153:57722   ESTABLISHED
tcp        0      0 192.168.100.56:38818    204.92.101.124:80       ESTABLISHED
There are however quite a number of connections in CLOSE_WAIT:
[root@mvp14 16328]# netstat -ant | grep -w .*:80 | grep CLOSE_WAIT | wc -l
1118
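Both netstat pipelines above can be folded into one pass that tallies every TCP state for port 80. The sketch below assumes the usual netstat -ant column order (protocol, Recv-Q, Send-Q, local address, foreign address, state); the sample lines are the three established connections shown above, and count_states is a hypothetical helper.

```python
from collections import Counter

def count_states(netstat_text, port=80):
    """Tally TCP states for connections whose local or foreign port matches."""
    suffix = f":{port}"
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a full tcp connection line
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or remote.endswith(suffix):
            states[state] += 1
    return states

sample = """\
tcp 254 0 204.92.101.119:80 192.168.105.32:37554 ESTABLISHED
tcp 532 0 204.92.101.119:80 192.168.100.153:57722 ESTABLISHED
tcp 0 0 192.168.100.56:38818 204.92.101.124:80 ESTABLISHED
"""
counts = count_states(sample)
print(counts["ESTABLISHED"], counts["CLOSE_WAIT"])
```

A large CLOSE_WAIT count means the peer has closed its side but the local process has not yet closed the socket.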
Here is my varnishd version:
[root@mvp14 16328]# varnishd -V
varnishd (varnish-2.1.5 SVN )
Copyright (c) 2006-2009 Linpro AS / Verdens Gang AS
Here are the system limits for varnishd:
[root@mvp14 16328]# cat /proc/`pgrep varnishd|tail -n1`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             397312               397312               processes
Max open files            131072               131072               files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       397312               397312               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Here is some memory info on the server:
[root@mvp14 16328]# free -m
             total       used       free     shared    buffers     cached
Mem:         48299      48170        129          0         92      42815
-/+ buffers/cache:       5262      43037
Swap:        15147          0      15147
Here is my varnishd command line:
/usr/local/sbin/varnishd -s file,/tmp/varnish-cache,60G -T 127.0.0.1:2000 -a 0.0.0.0:80 -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers 384 -p connect_timeout 4.0
Here is the output from 'varnishstat -1':
client_conn 32772547 265.09 Client connections accepted
client_drop 13103 0.11 Connection dropped, no sess/wrk
client_req 32531681 263.14 Client requests received
cache_hit 27525134 222.64 Cache hits
cache_hitpass 0 0.00 Cache hits for pass
cache_miss 5005404 40.49 Cache misses
backend_conn 4954451 40.07 Backend conn. success
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
backend_fail 853 0.01 Backend conn. failures
backend_reuse 51728 0.42 Backend conn. reuses
backend_toolate 13 0.00 Backend conn. was closed
backend_recycle 51742 0.42 Backend conn. recycles
backend_unused 0 0.00 Backend conn. unused
fetch_head 5 0.00 Fetch head
fetch_length 81316 0.66 Fetch with Length
fetch_chunked 4924086 39.83 Fetch chunked
fetch_eof 0 0.00 Fetch EOF
fetch_bad 0 0.00 Fetch had bad headers
fetch_close 186 0.00 Fetch wanted close
fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
fetch_zero 0 0.00 Fetch zero len
fetch_failed 0 0.00 Fetch failed
n_sess_mem 1268 . N struct sess_mem
n_sess 1174 . N struct sess
n_object 4922732 . N struct object
n_vampireobject 0 . N unresurrected objects
n_objectcore 4923144 . N struct objectcore
n_objecthead 4642001 . N struct objecthead
n_smf 9639146 . N struct smf
n_smf_frag 394705 . N small free smf
n_smf_large 0 . N large free smf
n_vbe_conn 501 . N struct vbe_conn
n_wrk 500 . N worker threads
n_wrk_create 3622 0.03 N worker threads created
n_wrk_failed 0 0.00 N worker threads not created
n_wrk_max 4079 0.03 N worker threads limited
n_wrk_queue 502 0.00 N queued work requests
n_wrk_overflow 65305 0.53 N overflowed work requests
n_wrk_drop 13102 0.11 N dropped work requests
n_backend 2 . N backends
n_expired 1347 . N expired objects
n_lru_nuked 381454 . N LRU nuked objects
n_lru_saved 0 . N LRU saved objects
n_lru_moved 23327252 . N LRU moved objects
n_deathrow 0 . N objects on deathrow
losthdr 0 0.00 HTTP header overflows
n_objsendfile 0 0.00 Objects sent with sendfile
n_objwrite 31912510 258.13 Objects sent with write
n_objoverflow 0 0.00 Objects overflowing workspace
s_sess 32758443 264.97 Total Sessions
s_req 32531681 263.14 Total Requests
s_pipe 0 0.00 Total pipe
s_pass 1134 0.01 Total pass
s_fetch 5005593 40.49 Total fetch
s_hdrbytes 10659824012 86223.60 Total header bytes
s_bodybytes 129812627152 1050009.12 Total body bytes
sess_closed 29276120 236.80 Session Closed
sess_pipeline 17 0.00 Session Pipeline
sess_readahead 32 0.00 Session Read Ahead
sess_linger 3510104 28.39 Session Linger
sess_herd 3554241 28.75 Session herd
shm_records 1725429324 13956.40 SHM records
shm_writes 144491896 1168.74 SHM writes
shm_flushes 750 0.01 SHM flushes due to overflow
shm_cont 494654 4.00 SHM MTX contention
shm_cycles 794 0.01 SHM cycles through buffer
sm_nreq 10391973 84.06 allocator requests
sm_nobj 9244441 . outstanding allocations
sm_balloc 41184530432 . bytes allocated
sm_bfree 23239979008 . bytes free
sma_nreq 0 0.00 SMA allocator requests
sma_nobj 0 . SMA outstanding allocations
sma_nbytes 0 . SMA outstanding bytes
sma_balloc 0 . SMA bytes allocated
sma_bfree 0 . SMA bytes free
sms_nreq 945 0.01 SMS allocator requests
sms_nobj 0 . SMS outstanding allocations
sms_nbytes 0 . SMS outstanding bytes
sms_balloc 395010 . SMS bytes allocated
sms_bfree 395010 . SMS bytes freed
backend_req 5006185 40.49 Backend requests made
n_vcl 1 0.00 N vcl total
n_vcl_avail 1 0.00 N vcl available
n_vcl_discard 0 0.00 N vcl discarded
n_purge 1 . N total active purges
n_purge_add 1 0.00 N new purges added
n_purge_retire 0 0.00 N old purges deleted
n_purge_obj_test 0 0.00 N objects tested
n_purge_re_test 0 0.00 N regexps tested against
n_purge_dups 0 0.00 N duplicate purges removed
hcb_nolock 32531033 263.13 HCB Lookups without lock
hcb_lock 5005369 40.49 HCB Lookups with lock
hcb_insert 5005363 40.49 HCB Inserts
esi_parse 0 0.00 Objects ESI parsed (unlock)
esi_errors 0 0.00 ESI parse errors (unlock)
accept_fail 0 0.00 Accept failures
client_drop_late 1 0.00 Connection dropped late
uptime 123630 1.00 Client uptime
backend_retry 0 0.00 Backend conn. retry
dir_dns_lookups 0 0.00 DNS director lookups
dir_dns_failed 0 0.00 DNS director failed lookups
dir_dns_hit 0 0.00 DNS director cached lookups hit
dir_dns_cache_full 0 0.00 DNS director full dnscache
fetch_1xx 0 0.00 Fetch no body (1xx)
fetch_204 0 0.00 Fetch no body (204)
fetch_304 0 0.00 Fetch no body (304)
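The worker-thread counters in the output above are the ones most relevant to the hang. Below is a minimal sketch that parses varnishstat -1 text and flags signs of thread exhaustion. The checks and their wording are an interpretation made for this example, not an official diagnostic, and the threshold of 500 is the thread_pool_max value mentioned earlier in this message.

```python
def parse_varnishstat(text):
    """Map counter names to integer values from varnishstat -1 output."""
    stats = {}
    for line in text.splitlines():
        fields = line.split(None, 3)  # name, value, rate, description
        if len(fields) >= 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats

def thread_pressure(stats, thread_pool_max=500):
    """Return human-readable findings that suggest worker thread exhaustion."""
    findings = []
    if stats.get("n_wrk", 0) >= thread_pool_max:
        findings.append("worker threads at thread_pool_max")
    if stats.get("n_wrk_queue", 0) > 0:
        findings.append("requests queued waiting for a worker")
    if stats.get("n_wrk_drop", 0) > 0:
        findings.append("requests dropped for lack of a worker")
    return findings

# Lines copied from the output above
sample = """\
n_wrk 500 . N worker threads
n_wrk_queue 502 0.00 N queued work requests
n_wrk_overflow 65305 0.53 N overflowed work requests
n_wrk_drop 13102 0.11 N dropped work requests
"""
stats = parse_varnishstat(sample)
for finding in thread_pressure(stats):
    print(finding)
```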
Here is some output from varnishlog (Is this normal?):
1442 TTL - 216036249 RFC 604800 1316705246 0 0 0 0
1442 VCL_call - fetch
1442 VCL_return - deliver
1442 ObjProtocol - HTTP/1.1
1442 ObjStatus - 200
1442 ObjResponse - OK
1442 ObjHeader - Date: Thu, 22 Sep 2011 15:27:25 GMT
1442 ObjHeader - Server: Apache/1.3.41 (Unix) mod_perl/1.30
1442 ObjHeader - x-airg-hasbinary: 1
1442 ObjHeader - x-airg-return-contentType: image%2Fjpeg
1442 ObjHeader - x-airg-interfacestatus: 200
1442 ObjHeader - Content-Type: image/jpeg
0 ExpKill - 184966508 LRU
0 ExpKill - 184967370 LRU
0 ExpKill - 184969553 LRU
0 ExpKill - 184970764 LRU
0 ExpKill - 184971732 LRU
0 ExpKill - 184976538 LRU
0 ExpKill - 184977972 LRU
0 ExpKill - 184988825 LRU
0 ExpKill - 184917719 LRU
0 ExpKill - 184997163 LRU
0 ExpKill - 184940621 LRU
0 ExpKill - 185000270 LRU
0 ExpKill - 185001314 LRU
0 ExpKill - 185003793 LRU
0 ExpKill - 185004913 LRU
0 ExpKill - 183651304 LRU
0 ExpKill - 185010145 LRU
0 ExpKill - 185012162 LRU
I also noticed in MRTG that when this happens there is a sudden spike in LRU nuke activity. It looks like it went from 0 nukes/sec to ~ 200.
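To put the spike in context, this sketch computes the lifetime average nuke rate from the n_lru_nuked and uptime counters in the varnishstat output above and compares it with the ~200/sec figure. The 200 is the approximate value read off the MRTG graph, not a measured counter.

```python
def average_rate(counter, uptime_seconds):
    """Average events per second over the whole uptime."""
    return counter / uptime_seconds

avg = average_rate(381454, 123630)  # n_lru_nuked and uptime from varnishstat -1
spike = 200                         # approximate rate seen in MRTG
print(f"lifetime average: {avg:.2f}/s, spike is about {spike / avg:.0f}x that")
```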
Do I have something misconfigured? Is varnish running into some kind of resource limitation (memory, file descriptors) which is causing it to hang? Did I set the cache size too large compared to the amount of physical RAM I have? I am running the same version of varnish on another, identical server and this has not happened there. The only difference is that it uses '-s file,/tmp/varnish-cache,48G' instead of '-s file,/tmp/varnish-cache,60G'.
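The two cache sizes can be compared against the total memory reported by free -m (48299 MB) with a small calculation. This sketch assumes that the G suffix in the -s argument means multiples of 1024 MB; size_to_mb and overshoot_mb are hypothetical helpers.

```python
UNITS_MB = {"M": 1, "G": 1024, "T": 1024 * 1024}

def size_to_mb(size):
    """Convert a size such as '60G' to megabytes (assumes binary multiples)."""
    number, unit = size[:-1], size[-1].upper()
    return int(number) * UNITS_MB[unit]

def overshoot_mb(cache_size, ram_total_mb):
    """How far the cache size exceeds RAM, in MB (negative if it fits)."""
    return size_to_mb(cache_size) - ram_total_mb

# 48299 is the "total" column of the Mem: line in the free -m output above
for size in ("60G", "48G"):
    print(size, overshoot_mb(size, 48299))
```

Under these assumptions the 60G cache exceeds RAM by about 13 GB, while the 48G cache on the other node is within 1 GB of it.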
Any help here would be most appreciated. This is in production right now.
Matt Schurenko
Systems Administrator
airG share your world
Suite 710, 1133 Melville Street
Vancouver, BC V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: MSchurenko at airg.com
W: www.airg.com
_______________________________________________
varnish-misc mailing list
varnish-misc at varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc