varnish stops responding and threads do not decrease
Damon Snyder
damon at huddler-inc.com
Thu Sep 22 18:47:22 CEST 2011
Hi Matt,
We had some instability when running varnish using a file backed cache that
was larger than RAM. This was a year or more ago, and I don't recall the
exact details of the issue. We resolved it by reducing the size of the file
cache to be smaller than memory and the problems went away.
It looks like you can also increase your thread pool size. We are using
something like -w 100,1600,120 at startup and have a similar connection
rate. I would also suggest moving to malloc storage type (we are migrating
from file to malloc). Kristian is quoted as saying:
If you can not contain your cache in memory, first consider if you really
> need that big of a cache. Then consider buying more memory. Then sleep on
> it.
http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/
If you do switch to malloc on this box, try something like -smalloc,40G. You
don't want to allocate up to the amount of RAM as varnish needs some
additional space for other data and you need to make sure your OS and other
processes has some margin as well.
For my part, I wish there was an interface into the size of the objects
being cached so that you can better estimate the size of the cache you need.
I understand varnishsizes can do something like but it doesn't tell you what
is currently stored in aggregate or allow you to traverse the objects and
their sizes.
In any case, you can monitor your hit rate and see what impact dropping the
extra 12 GB has. I suppose you could argue that your hit rate is moot if you
are not responding to connections so just getting rid of the no-response
issue should be net positive.
Hope this helps,
Damon
On Thu, Sep 22, 2011 at 9:14 AM, Matt Schurenko <MSchurenko at airg.com> wrote:
> I posted yesterday regarding this issue. It has happened again; however
> this time I have not restarted the problem varnish node. I am using the
> default configuration with regards to threads. It seems that once
> thread_pool_max hits 500 that the server stops responding to any requests. I
> stopped all connections from the load balancer to the server; however the
> threads do not decrease. They remain stuck at ~ 509:
>
> [root at mvp14 16328]# grep Threads /proc/`pgrep varnishd|tail -n1`/status
> Threads: 509
>
> The server has been idle for ~ 45 minutes now and there are only a couple
> of established connections:
> [root at mvp14 16328]# netstat -ant | grep -w .*:80 | grep EST
> tcp 254 0 204.92.101.119:80 192.168.105.32:37554
> ESTABLISHED
> tcp 532 0 204.92.101.119:80 192.168.100.153:57722
> ESTABLISHED
> tcp 0 0 192.168.100.56:38818 204.92.101.124:80
> ESTABLISHED
>
> There are however quite a number of connections in CLOSE_WAIT:
>
> [root at mvp14 16328]# netstat -ant | grep -w .*:80 | grep CLOSE_WAIT | wc -l
> 1118
>
> Here is my varnishd version:
>
> [root at mvp14 16328]# varnishd -V
> varnishd (varnish-2.1.5 SVN )
> Copyright (c) 2006-2009 Linpro AS / Verdens Gang ASrsion:
>
> Here are the system limits for varnishd:
>
> [root at mvp14 16328]# cat /proc/`pgrep varnishd|tail -n1`/limits
> Limit Soft Limit Hard Limit Units
> Max cpu time unlimited unlimited seconds
> Max file size unlimited unlimited bytes
> Max data size unlimited unlimited bytes
> Max stack size 10485760 unlimited bytes
> Max core file size 0 unlimited bytes
> Max resident set unlimited unlimited bytes
> Max processes 397312 397312
> processes
> Max open files 131072 131072 files
> Max locked memory 32768 32768 bytes
> Max address space unlimited unlimited bytes
> Max file locks unlimited unlimited locks
> Max pending signals 397312 397312 signals
> Max msgqueue size 819200 819200 bytes
> Max nice priority 0 0
> Max realtime priority 0 0
>
> Here is some memory info on the server:
>
> [root at mvp14 16328]# free -m
> total used free shared buffers cached
> Mem: 48299 48170 129 0 92 42815
> -/+ buffers/cache: 5262 43037
> Swap: 15147 0 15147
>
> Here is my varnishd command line:
>
> /usr/local/sbin/varnishd -s file,/tmp/varnish-cache,60G -T 127.0.0.1:2000-a
> 0.0.0.0:80 -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers
> 384 -p connect_timeout 4.0
>
> Here is the output from 'varnishstat -1':
>
> client_conn 32772547 265.09 Client connections accepted
> client_drop 13103 0.11 Connection dropped, no sess/wrk
> client_req 32531681 263.14 Client requests received
> cache_hit 27525134 222.64 Cache hits
> cache_hitpass 0 0.00 Cache hits for pass
> cache_miss 5005404 40.49 Cache misses
> backend_conn 4954451 40.07 Backend conn. success
> backend_unhealthy 0 0.00 Backend conn. not attempted
> backend_busy 0 0.00 Backend conn. too many
> backend_fail 853 0.01 Backend conn. failures
> backend_reuse 51728 0.42 Backend conn. reuses
> backend_toolate 13 0.00 Backend conn. was closed
> backend_recycle 51742 0.42 Backend conn. recycles
> backend_unused 0 0.00 Backend conn. unused
> fetch_head 5 0.00 Fetch head
> fetch_length 81316 0.66 Fetch with Length
> fetch_chunked 4924086 39.83 Fetch chunked
> fetch_eof 0 0.00 Fetch EOF
> fetch_bad 0 0.00 Fetch had bad headers
> fetch_close 186 0.00 Fetch wanted close
> fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
> fetch_zero 0 0.00 Fetch zero len
> fetch_failed 0 0.00 Fetch failed
> n_sess_mem 1268 . N struct sess_mem
> n_sess 1174 . N struct sess
> n_object 4922732 . N struct object
> n_vampireobject 0 . N unresurrected objects
> n_objectcore 4923144 . N struct objectcore
> n_objecthead 4642001 . N struct objecthead
> n_smf 9639146 . N struct smf
> n_smf_frag 394705 . N small free smf
> n_smf_large 0 . N large free smf
> n_vbe_conn 501 . N struct vbe_conn
> n_wrk 500 . N worker threads
> n_wrk_create 3622 0.03 N worker threads created
> n_wrk_failed 0 0.00 N worker threads not created
> n_wrk_max 4079 0.03 N worker threads limited
> n_wrk_queue 502 0.00 N queued work requests
> n_wrk_overflow 65305 0.53 N overflowed work requests
> n_wrk_drop 13102 0.11 N dropped work requests
> n_backend 2 . N backends
> n_expired 1347 . N expired objects
> n_lru_nuked 381454 . N LRU nuked objects
> n_lru_saved 0 . N LRU saved objects
> n_lru_moved 23327252 . N LRU moved objects
> n_deathrow 0 . N objects on deathrow
> losthdr 0 0.00 HTTP header overflows
> n_objsendfile 0 0.00 Objects sent with sendfile
> n_objwrite 31912510 258.13 Objects sent with write
> n_objoverflow 0 0.00 Objects overflowing workspace
> s_sess 32758443 264.97 Total Sessions
> s_req 32531681 263.14 Total Requests
> s_pipe 0 0.00 Total pipe
> s_pass 1134 0.01 Total pass
> s_fetch 5005593 40.49 Total fetch
> s_hdrbytes 10659824012 86223.60 Total header bytes
> s_bodybytes 129812627152 1050009.12 Total body bytes
> sess_closed 29276120 236.80 Session Closed
> sess_pipeline 17 0.00 Session Pipeline
> sess_readahead 32 0.00 Session Read Ahead
> sess_linger 3510104 28.39 Session Linger
> sess_herd 3554241 28.75 Session herd
> shm_records 1725429324 13956.40 SHM records
> shm_writes 144491896 1168.74 SHM writes
> shm_flushes 750 0.01 SHM flushes due to overflow
> shm_cont 494654 4.00 SHM MTX contention
> shm_cycles 794 0.01 SHM cycles through buffer
> sm_nreq 10391973 84.06 allocator requests
> sm_nobj 9244441 . outstanding allocations
> sm_balloc 41184530432 . bytes allocated
> sm_bfree 23239979008 . bytes free
> sma_nreq 0 0.00 SMA allocator requests
> sma_nobj 0 . SMA outstanding allocations
> sma_nbytes 0 . SMA outstanding bytes
> sma_balloc 0 . SMA bytes allocated
> sma_bfree 0 . SMA bytes free
> sms_nreq 945 0.01 SMS allocator requests
> sms_nobj 0 . SMS outstanding allocations
> sms_nbytes 0 . SMS outstanding bytes
> sms_balloc 395010 . SMS bytes allocated
> sms_bfree 395010 . SMS bytes freed
> backend_req 5006185 40.49 Backend requests made
> n_vcl 1 0.00 N vcl total
> n_vcl_avail 1 0.00 N vcl available
> n_vcl_discard 0 0.00 N vcl discarded
> n_purge 1 . N total active purges
> n_purge_add 1 0.00 N new purges added
> n_purge_retire 0 0.00 N old purges deleted
> n_purge_obj_test 0 0.00 N objects tested
> n_purge_re_test 0 0.00 N regexps tested against
> n_purge_dups 0 0.00 N duplicate purges removed
> hcb_nolock 32531033 263.13 HCB Lookups without lock
> hcb_lock 5005369 40.49 HCB Lookups with lock
> hcb_insert 5005363 40.49 HCB Inserts
> esi_parse 0 0.00 Objects ESI parsed (unlock)
> esi_errors 0 0.00 ESI parse errors (unlock)
> accept_fail 0 0.00 Accept failures
> client_drop_late 1 0.00 Connection dropped late
> uptime 123630 1.00 Client uptime
> backend_retry 0 0.00 Backend conn. retry
> dir_dns_lookups 0 0.00 DNS director lookups
> dir_dns_failed 0 0.00 DNS director failed lookups
> dir_dns_hit 0 0.00 DNS director cached lookups hit
> dir_dns_cache_full 0 0.00 DNS director full dnscache
> fetch_1xx 0 0.00 Fetch no body (1xx)
> fetch_204 0 0.00 Fetch no body (204)
> fetch_304 0 0.00 Fetch no body (304)
>
> Here is some output from varnishlog (Is this normal?):
> 1442 TTL - 216036249 RFC 604800 1316705246 0 0 0 0
> 1442 VCL_call - fetch
> 1442 VCL_return - deliver
> 1442 ObjProtocol - HTTP/1.1
> 1442 ObjStatus - 200
> 1442 ObjResponse - OK
> 1442 ObjHeader - Date: Thu, 22 Sep 2011 15:27:25 GMT
> 1442 ObjHeader - Server: Apache/1.3.41 (Unix) mod_perl/1.30
> 1442 ObjHeader - x-airg-hasbinary: 1
> 1442 ObjHeader - x-airg-return-contentType: image%2Fjpeg
> 1442 ObjHeader - x-airg-interfacestatus: 200
> 1442 ObjHeader - Content-Type: image/jpeg
> 0 ExpKill - 184966508 LRU
> 0 ExpKill - 184967370 LRU
> 0 ExpKill - 184969553 LRU
> 0 ExpKill - 184970764 LRU
> 0 ExpKill - 184971732 LRU
> 0 ExpKill - 184976538 LRU
> 0 ExpKill - 184977972 LRU
> 0 ExpKill - 184988825 LRU
> 0 ExpKill - 184917719 LRU
> 0 ExpKill - 184997163 LRU
> 0 ExpKill - 184940621 LRU
> 0 ExpKill - 185000270 LRU
> 0 ExpKill - 185001314 LRU
> 0 ExpKill - 185003793 LRU
> 0 ExpKill - 185004913 LRU
> 0 ExpKill - 183651304 LRU
> 0 ExpKill - 185010145 LRU
> 0 ExpKill - 185012162 LRU
>
>
> I also noticed in MRTG that when this happens there is a sudden spike in
> lru nuked activity. It looks like it went from 0 nukes/sec to ~ 200.
>
> Do I have something misconfigured? Is varnish running into some kind of
> resource limitation (memory, file descriptors) which is causing it to hang?
> Did I set the cache size to large compared to the amount of physcial RAM I
> have? I am running the same version of varnish on the exact same server and
> this has not happened. The only difference is that I am using '-s
> file,/tmp/varnish-cache,48G ' instead of '-s file,/tmp/varnish-cache,60G'.
>
> Any help here would be most appreciated. This is in production right now.
>
>
>
> Matt Schurenko
> Systems Administrator
>
> airG share your world
> Suite 710, 1133 Melville Street
> Vancouver, BC V6E 4E5
> P: +1.604.408.2228
> F: +1.866.874.8136
> E: MSchurenko at airg.com
> W: www.airg.com
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20110922/f22c8e2e/attachment-0003.html>
More information about the varnish-misc
mailing list