varnish stops responding and threads do not decrease

Matt Schurenko MSchurenko at airg.com
Thu Sep 22 19:05:26 CEST 2011


Thanks Damon. I'll put the cache size down. On the other varnish server cache size is equal to physical ram and we haven't had this problem. I have read Kristian's blog and was using malloc before. It was on a server with 32 GB of ram. I set malloc to 28 GB but for some reason it ended up using ~ 15 GB of swap space. From what I understand varnish performs better using mmap then swap so that's why I changed to the file storage type. I'll also try adjusting the thread configuration on all our varnish nodes.

Sent from my iPhone

On 2011-09-22, at 9:47 AM, "Damon Snyder" <damon at huddler-inc.com<mailto:damon at huddler-inc.com>> wrote:

Hi Matt,
We had some instability when running varnish using a file backed cache that was larger than RAM. This was a year or more ago, and I don't recall the exact details of the issue. We resolved it by reducing the size of the file cache to be smaller than memory and the problems went away.

It looks like you can also increase your thread pool size. We are using something like -w 100,1600,120 at startup and have a similar connection rate. I would also suggest moving to malloc storage type (we are migrating from file to malloc). Kristian is quoted as saying:

If you can not contain your cache in memory, first consider if you really need that big of a cache. Then consider buying more memory. Then sleep on it.

<http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/>http://kristianlyng.wordpress.com/2010/01/26/varnish-best-practices/

If you do switch to malloc on this box, try something like -smalloc,40G. You don't want to allocate up to the amount of RAM as varnish needs some additional space for other data and you need to make sure your OS and other processes has some margin as well.

For my part, I wish there was an interface into the size of the objects being cached so that you can better estimate the size of the cache you need. I understand varnishsizes can do something like but it doesn't tell you what is currently stored in aggregate or allow you to traverse the objects and their sizes.

In any case, you can monitor your hit rate and see what impact dropping the extra 12 GB has. I suppose you could argue that your hit rate is moot if you are not responding to connections so just getting rid of the no-response issue should be net positive.

Hope this helps,
Damon

On Thu, Sep 22, 2011 at 9:14 AM, Matt Schurenko <<mailto:MSchurenko at airg.com>MSchurenko at airg.com<mailto:MSchurenko at airg.com>> wrote:
I posted yesterday regarding this issue. It has happened again; however this time I have not restarted the problem varnish node. I am using the default configuration with regards to threads. It seems that once thread_pool_max hits 500 that the server stops responding to any requests. I stopped all connections from the load balancer to the server; however the threads do not decrease. They remain stuck at ~ 509:

 [root at mvp14 16328]# grep Threads /proc/`pgrep varnishd|tail -n1`/status
Threads:        509

The server has been idle for ~ 45 minutes now and there are only a couple of established connections:
[root at mvp14 16328]# netstat -ant | grep -w .*:80 | grep EST
tcp      254      0 204.92.101.119:80<http://204.92.101.119:80>           192.168.105.32:37554<http://192.168.105.32:37554>        ESTABLISHED
tcp      532      0 204.92.101.119:80<http://204.92.101.119:80>           192.168.100.153:57722<http://192.168.100.153:57722>       ESTABLISHED
tcp        0      0 192.168.100.56:38818<http://192.168.100.56:38818>        204.92.101.124:80<http://204.92.101.124:80>           ESTABLISHED

There are however quite a number of connections in CLOSE_WAIT:

[root at mvp14 16328]# netstat -ant | grep -w .*:80 | grep CLOSE_WAIT | wc -l
1118

Here is my varnishd version:

[root at mvp14 16328]# varnishd -V
varnishd (varnish-2.1.5 SVN )
Copyright (c) 2006-2009 Linpro AS / Verdens Gang ASrsion:

Here are the system limits for varnishd:

[root at mvp14 16328]# cat /proc/`pgrep varnishd|tail -n1`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             397312               397312               processes
Max open files            131072               131072               files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       397312               397312               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0

Here is some memory info on the server:

[root at mvp14 16328]# free -m
            total       used       free     shared    buffers     cached
Mem:         48299      48170        129          0         92      42815
-/+ buffers/cache:       5262      43037
Swap:        15147          0      15147

Here is my varnishd command line:

/usr/local/sbin/varnishd -s file,/tmp/varnish-cache,60G -T 127.0.0.1:2000<http://127.0.0.1:2000> -a 0.0.0.0:80<http://0.0.0.0:80> -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers 384 -p connect_timeout 4.0

Here is the output from 'varnishstat -1':

client_conn          32772547       265.09 Client connections accepted
client_drop             13103         0.11 Connection dropped, no sess/wrk
client_req           32531681       263.14 Client requests received
cache_hit            27525134       222.64 Cache hits
cache_hitpass               0         0.00 Cache hits for pass
cache_miss            5005404        40.49 Cache misses
backend_conn          4954451        40.07 Backend conn. success
backend_unhealthy            0         0.00 Backend conn. not attempted
backend_busy                0         0.00 Backend conn. too many
backend_fail              853         0.01 Backend conn. failures
backend_reuse           51728         0.42 Backend conn. reuses
backend_toolate            13         0.00 Backend conn. was closed
backend_recycle         51742         0.42 Backend conn. recycles
backend_unused              0         0.00 Backend conn. unused
fetch_head                  5         0.00 Fetch head
fetch_length            81316         0.66 Fetch with Length
fetch_chunked         4924086        39.83 Fetch chunked
fetch_eof                   0         0.00 Fetch EOF
fetch_bad                   0         0.00 Fetch had bad headers
fetch_close               186         0.00 Fetch wanted close
fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
fetch_zero                  0         0.00 Fetch zero len
fetch_failed                0         0.00 Fetch failed
n_sess_mem               1268          .   N struct sess_mem
n_sess                   1174          .   N struct sess
n_object              4922732          .   N struct object
n_vampireobject             0          .   N unresurrected objects
n_objectcore          4923144          .   N struct objectcore
n_objecthead          4642001          .   N struct objecthead
n_smf                 9639146          .   N struct smf
n_smf_frag             394705          .   N small free smf
n_smf_large                 0          .   N large free smf
n_vbe_conn                501          .   N struct vbe_conn
n_wrk                     500          .   N worker threads
n_wrk_create             3622         0.03 N worker threads created
n_wrk_failed                0         0.00 N worker threads not created
n_wrk_max                4079         0.03 N worker threads limited
n_wrk_queue               502         0.00 N queued work requests
n_wrk_overflow          65305         0.53 N overflowed work requests
n_wrk_drop              13102         0.11 N dropped work requests
n_backend                   2          .   N backends
n_expired                1347          .   N expired objects
n_lru_nuked            381454          .   N LRU nuked objects
n_lru_saved                 0          .   N LRU saved objects
n_lru_moved          23327252          .   N LRU moved objects
n_deathrow                  0          .   N objects on deathrow
losthdr                     0         0.00 HTTP header overflows
n_objsendfile               0         0.00 Objects sent with sendfile
n_objwrite           31912510       258.13 Objects sent with write
n_objoverflow               0         0.00 Objects overflowing workspace
s_sess               32758443       264.97 Total Sessions
s_req                32531681       263.14 Total Requests
s_pipe                      0         0.00 Total pipe
s_pass                   1134         0.01 Total pass
s_fetch               5005593        40.49 Total fetch
s_hdrbytes        10659824012     86223.60 Total header bytes
s_bodybytes      129812627152   1050009.12 Total body bytes
sess_closed          29276120       236.80 Session Closed
sess_pipeline              17         0.00 Session Pipeline
sess_readahead             32         0.00 Session Read Ahead
sess_linger           3510104        28.39 Session Linger
sess_herd             3554241        28.75 Session herd
shm_records        1725429324     13956.40 SHM records
shm_writes          144491896      1168.74 SHM writes
shm_flushes               750         0.01 SHM flushes due to overflow
shm_cont               494654         4.00 SHM MTX contention
shm_cycles                794         0.01 SHM cycles through buffer
sm_nreq              10391973        84.06 allocator requests
sm_nobj               9244441          .   outstanding allocations
sm_balloc         41184530432          .   bytes allocated
sm_bfree          23239979008          .   bytes free
sma_nreq                    0         0.00 SMA allocator requests
sma_nobj                    0          .   SMA outstanding allocations
sma_nbytes                  0          .   SMA outstanding bytes
sma_balloc                  0          .   SMA bytes allocated
sma_bfree                   0          .   SMA bytes free
sms_nreq                  945         0.01 SMS allocator requests
sms_nobj                    0          .   SMS outstanding allocations
sms_nbytes                  0          .   SMS outstanding bytes
sms_balloc             395010          .   SMS bytes allocated
sms_bfree              395010          .   SMS bytes freed
backend_req           5006185        40.49 Backend requests made
n_vcl                       1         0.00 N vcl total
n_vcl_avail                 1         0.00 N vcl available
n_vcl_discard               0         0.00 N vcl discarded
n_purge                     1          .   N total active purges
n_purge_add                 1         0.00 N new purges added
n_purge_retire              0         0.00 N old purges deleted
n_purge_obj_test            0         0.00 N objects tested
n_purge_re_test             0         0.00 N regexps tested against
n_purge_dups                0         0.00 N duplicate purges removed
hcb_nolock           32531033       263.13 HCB Lookups without lock
hcb_lock              5005369        40.49 HCB Lookups with lock
hcb_insert            5005363        40.49 HCB Inserts
esi_parse                   0         0.00 Objects ESI parsed (unlock)
esi_errors                  0         0.00 ESI parse errors (unlock)
accept_fail                 0         0.00 Accept failures
client_drop_late            1         0.00 Connection dropped late
uptime                 123630         1.00 Client uptime
backend_retry               0         0.00 Backend conn. retry
dir_dns_lookups             0         0.00 DNS director lookups
dir_dns_failed              0         0.00 DNS director failed lookups
dir_dns_hit                 0         0.00 DNS director cached lookups hit
dir_dns_cache_full            0         0.00 DNS director full dnscache
fetch_1xx                   0         0.00 Fetch no body (1xx)
fetch_204                   0         0.00 Fetch no body (204)
fetch_304                   0         0.00 Fetch no body (304)

Here is some output from varnishlog (Is this normal?):
1442 TTL          - 216036249 RFC 604800 1316705246 0 0 0 0
 1442 VCL_call     - fetch
 1442 VCL_return   - deliver
 1442 ObjProtocol  - HTTP/1.1
 1442 ObjStatus    - 200
 1442 ObjResponse  - OK
 1442 ObjHeader    - Date: Thu, 22 Sep 2011 15:27:25 GMT
 1442 ObjHeader    - Server: Apache/1.3.41 (Unix) mod_perl/1.30
 1442 ObjHeader    - x-airg-hasbinary: 1
 1442 ObjHeader    - x-airg-return-contentType: image%2Fjpeg
 1442 ObjHeader    - x-airg-interfacestatus: 200
 1442 ObjHeader    - Content-Type: image/jpeg
   0 ExpKill      - 184966508 LRU
   0 ExpKill      - 184967370 LRU
   0 ExpKill      - 184969553 LRU
   0 ExpKill      - 184970764 LRU
   0 ExpKill      - 184971732 LRU
   0 ExpKill      - 184976538 LRU
   0 ExpKill      - 184977972 LRU
   0 ExpKill      - 184988825 LRU
   0 ExpKill      - 184917719 LRU
   0 ExpKill      - 184997163 LRU
   0 ExpKill      - 184940621 LRU
   0 ExpKill      - 185000270 LRU
   0 ExpKill      - 185001314 LRU
   0 ExpKill      - 185003793 LRU
   0 ExpKill      - 185004913 LRU
   0 ExpKill      - 183651304 LRU
   0 ExpKill      - 185010145 LRU
   0 ExpKill      - 185012162 LRU


I also noticed in MRTG that when this happens there is a sudden spike in lru nuked activity. It looks like it went from 0 nukes/sec to ~ 200.

Do I have something misconfigured? Is varnish running into some kind of resource limitation (memory, file descriptors) which is causing it to hang? Did I set the cache size to large compared to the amount of physcial RAM I have? I am running the same version of varnish on the exact same server and this has not happened. The only difference is that I am using '-s file,/tmp/varnish-cache,48G ' instead of '-s file,/tmp/varnish-cache,60G'.

Any help here would be most appreciated. This is in production right now.



Matt Schurenko
Systems Administrator

airG share your world
Suite 710, 1133 Melville Street
Vancouver, BC  V6E 4E5
P: +1.604.408.2228
F: +1.866.874.8136
E: <mailto:MSchurenko at airg.com> MSchurenko at airg.com<mailto:MSchurenko at airg.com>
W: <http://www.airg.com> www.airg.com<http://www.airg.com>
_______________________________________________
varnish-misc mailing list
<mailto:varnish-misc at varnish-cache.org>varnish-misc at varnish-cache.org<mailto:varnish-misc at varnish-cache.org>
<https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc>https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc



More information about the varnish-misc mailing list