lru nuke causes varnish to stop responding to client requests
Damon Snyder
damon at huddler-inc.com
Fri Sep 30 18:55:45 CEST 2011
Hi Matt,
Oh, sorry, I didn't notice the worker thread count at 3000. I would suggest
switching to -s malloc,38G (total mem * 0.8). If you have a lot of small
objects or objects of the same size, you could be encountering some
excessive nuking.
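For example, keeping everything else from your current command line the same,
that would look roughly like this (options copied from your earlier mail, so
adjust as needed):

/usr/local/sbin/varnishd -s malloc,38G -T 127.0.0.1:2000 -a 0.0.0.0:80 \
    -t 604800 -f /usr/local/etc/varnish/default.vcl \
    -p http_headers 384 -p connect_timeout 4.0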
Also, what are your backends doing when this happens? Is the nuking a
coincidence or is there an issue further down the stack?
Damon
On Thu, Sep 29, 2011 at 5:42 PM, Matt Schurenko <MSchurenko at airg.com> wrote:
> Sorry. I forgot to mention that I already adjusted thread behaviour via
> varnishadm while varnish was running. I had it set to use min 50, max 3000
> and a thread timeout of 120s. I think the reason why n_wrk_overflow and
> n_wrk_drop are so high is due to this problem. Before the lru nuke happens
> the number of worker threads is ~ 100. As soon as it starts to nuke the
> number of threads jumps to the max. I am monitoring some stats with MRTG. I
> seem to remember that on the other varnish server it would begin to lru nuke
> long before the cache got full. For this one there is no lru nuke activity
> until it reaches a certain point and then boom: 3000 threads are used up and
> no more new clients can connect.
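>
> For reference, I set those at runtime through the management interface with
> something roughly like the following (from memory, so the exact invocations
> may differ slightly):
>
> varnishadm -T 127.0.0.1:2000 param.set thread_pool_min 50
> varnishadm -T 127.0.0.1:2000 param.set thread_pool_max 3000
> varnishadm -T 127.0.0.1:2000 param.set thread_pool_timeout 120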
>
>
>
> Matt Schurenko
> Systems Administrator
>
> airG® Share Your World
> Suite 710, 1133 Melville Street
> Vancouver, BC V6E 4E5
>
> P: +1.604.408.2228
> F: +1.866.874.8136
> E: MSchurenko at airg.com
> W: www.airg.com
>
> From: Damon Snyder [mailto:damon at huddler-inc.com]
> Sent: September-29-11 5:30 PM
> To: Matt Schurenko
> Cc: varnish-misc at varnish-cache.org
> Subject: Re: lru nuke causes varnish to stop responding to client requests
>
> Hi Matt,
>
> It looks like you really need to bump up the number of worker threads. From
> your stats:
>
> n_wrk_queue 2861 0.02 N queued work requests
> n_wrk_overflow 83534 0.52 N overflowed work requests
> n_wrk_drop 10980 0.07 N dropped work requests
>
> You have a lot of requests that are on the queue waiting for a worker and
> a lot of requests that varnish has given up trying to fulfill with a
> worker. You can bump the number of workers up using the -w command line
> option to varnishd. I would suggest something like -w 400,1000,120 to start
> with (the default is -w 2,500,300). This says use 400 at a minimum, 1000 as
> the maximum, and set the timeout to 120s. According to the stats
> explanation doc <https://www.varnish-cache.org/trac/wiki/StatsExplained> your
> n_wrk_queue and n_wrk_drop should be 0. If you see these numbers going up
> again, use -w 500,2000,120 or something like that.
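>
> To illustrate, with the command line from your earlier mail that would be
> something along these lines (everything else unchanged):
>
> /usr/local/sbin/varnishd -s file,/tmp/varnish-cache,48g -T 127.0.0.1:2000 \
>     -a 0.0.0.0:80 -t 604800 -f /usr/local/etc/varnish/default.vcl \
>     -p http_headers 384 -p connect_timeout 4.0 -w 400,1000,120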
>
> Hope this helps,
> Damon
>
> On Thu, Sep 29, 2011 at 4:34 PM, Matt Schurenko <MSchurenko at airg.com>
> wrote:
>
> I’ve been having this problem for a couple weeks now on one of our varnish
> servers. I have posted a couple times already. What happens is that the
> server in question runs fine until the cache gets full. When it starts to
> lru nuke, the number of worker threads jumps up to thread_pool_max and
> varnish stops responding to any client requests. I have tried this with
> CentOS 5.4, 5.7 and now Slackware (all 64-bit) and the behaviour is the
> same.
>
> I am using varnish version 2.1.5 on a Dell C6105 with 48GB of RAM.
>
> Here is my varnishd command line:
>
> /usr/local/sbin/varnishd -s file,/tmp/varnish-cache,48g -T 127.0.0.1:2000 -a
> 0.0.0.0:80 -t 604800 -f /usr/local/etc/varnish/default.vcl -p http_headers
> 384 -p connect_timeout 4.0
>
> Here is the output from varnishstat -1:
>
> client_conn 38582763 240.38 Client connections accepted
> client_drop 10950 0.07 Connection dropped, no sess/wrk
> client_req 38298994 238.61 Client requests received
> cache_hit 32513762 202.57 Cache hits
> cache_hitpass 0 0.00 Cache hits for pass
> cache_miss 5784476 36.04 Cache misses
> backend_conn 5725540 35.67 Backend conn. success
> backend_unhealthy 0 0.00 Backend conn. not attempted
> backend_busy 0 0.00 Backend conn. too many
> backend_fail 1383 0.01 Backend conn. failures
> backend_reuse 60837 0.38 Backend conn. reuses
> backend_toolate 33 0.00 Backend conn. was closed
> backend_recycle 60870 0.38 Backend conn. recycles
> backend_unused 0 0.00 Backend conn. unused
> fetch_head 6 0.00 Fetch head
> fetch_length 93631 0.58 Fetch with Length
> fetch_chunked 5689433 35.45 Fetch chunked
> fetch_eof 0 0.00 Fetch EOF
> fetch_bad 0 0.00 Fetch had bad headers
> fetch_close 107 0.00 Fetch wanted close
> fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
> fetch_zero 0 0.00 Fetch zero len
> fetch_failed 1 0.00 Fetch failed
> n_sess_mem 7138 . N struct sess_mem
> n_sess 6970 . N struct sess
> n_object 5047123 . N struct object
> n_vampireobject 0 . N unresurrected objects
> n_objectcore 5048435 . N struct objectcore
> n_objecthead 4955641 . N struct objecthead
> n_smf 10139770 . N struct smf
> n_smf_frag 295671 . N small free smf
> n_smf_large 0 . N large free smf
> n_vbe_conn 2997 . N struct vbe_conn
> n_wrk 3000 . N worker threads
> n_wrk_create 5739 0.04 N worker threads created
> n_wrk_failed 0 0.00 N worker threads not created
> n_wrk_max 4063 0.03 N worker threads limited
> n_wrk_queue 2861 0.02 N queued work requests
> n_wrk_overflow 83534 0.52 N overflowed work requests
> n_wrk_drop 10980 0.07 N dropped work requests
> n_backend 2 . N backends
> n_expired 2179 . N expired objects
> n_lru_nuked 862615 . N LRU nuked objects
> n_lru_saved 0 . N LRU saved objects
> n_lru_moved 27156180 . N LRU moved objects
> n_deathrow 0 . N objects on deathrow
> losthdr 0 0.00 HTTP header overflows
> n_objsendfile 0 0.00 Objects sent with sendfile
> n_objwrite 37294888 232.35 Objects sent with write
> n_objoverflow 0 0.00 Objects overflowing workspace
> s_sess 38566049 240.27 Total Sessions
> s_req 38298994 238.61 Total Requests
> s_pipe 0 0.00 Total pipe
> s_pass 266 0.00 Total pass
> s_fetch 5783176 36.03 Total fetch
> s_hdrbytes 12570989864 78319.53 Total header bytes
> s_bodybytes 151327304604 942796.38 Total body bytes
> sess_closed 34673984 216.03 Session Closed
> sess_pipeline 187 0.00 Session Pipeline
> sess_readahead 321 0.00 Session Read Ahead
> sess_linger 3929378 24.48 Session Linger
> sess_herd 3929559 24.48 Session herd
> shm_records 2025645664 12620.14 SHM records
> shm_writes 169640580 1056.89 SHM writes
> shm_flushes 41 0.00 SHM flushes due to overflow
> shm_cont 580515 3.62 SHM MTX contention
> shm_cycles 933 0.01 SHM cycles through buffer
> sm_nreq 12431620 77.45 allocator requests
> sm_nobj 9844099 . outstanding allocations
> sm_balloc 43855261696 . bytes allocated
> sm_bfree 7684345856 . bytes free
> sma_nreq 0 0.00 SMA allocator requests
> sma_nobj 0 . SMA outstanding allocations
> sma_nbytes 0 . SMA outstanding bytes
> sma_balloc 0 . SMA bytes allocated
> sma_bfree 0 . SMA bytes free
> sms_nreq 1566 0.01 SMS allocator requests
> sms_nobj 0 . SMS outstanding allocations
> sms_nbytes 0 . SMS outstanding bytes
> sms_balloc 656154 . SMS bytes allocated
> sms_bfree 656154 . SMS bytes freed
> backend_req 5786381 36.05 Backend requests made
> n_vcl 1 0.00 N vcl total
> n_vcl_avail 1 0.00 N vcl available
> n_vcl_discard 0 0.00 N vcl discarded
> n_purge 218 . N total active purges
> n_purge_add 218 0.00 N new purges added
> n_purge_retire 0 0.00 N old purges deleted
> n_purge_obj_test 588742 3.67 N objects tested
> n_purge_re_test 120444323 750.39 N regexps tested against
> n_purge_dups 0 0.00 N duplicate purges removed
> hcb_nolock 38301670 238.63 HCB Lookups without lock
> hcb_lock 5786309 36.05 HCB Lookups with lock
> hcb_insert 5786305 36.05 HCB Inserts
> esi_parse 0 0.00 Objects ESI parsed (unlock)
> esi_errors 0 0.00 ESI parse errors (unlock)
> accept_fail 0 0.00 Accept failures
> client_drop_late 30 0.00 Connection dropped late
> uptime 160509 1.00 Client uptime
> backend_retry 25 0.00 Backend conn. retry
> dir_dns_lookups 0 0.00 DNS director lookups
> dir_dns_failed 0 0.00 DNS director failed lookups
> dir_dns_hit 0 0.00 DNS director cached lookups hit
> dir_dns_cache_full 0 0.00 DNS director full dnscache
> fetch_1xx 0 0.00 Fetch no body (1xx)
> fetch_204 0 0.00 Fetch no body (204)
> fetch_304 0 0.00 Fetch no body (304)
>
> Even though I have removed the server from our load balancer there are
> still a lot of requests going to the backend. Maybe these are all queued up
> requests that varnish is trying to fulfill? Here is some output from
> varnishlog -c when I try to connect with curl:
>
> root at mvp14:~# varnishlog -c
> 26 SessionOpen c 192.168.8.41 41942 0.0.0.0:80
> 26 ReqStart c 192.168.8.41 41942 2108342803
> 26 RxRequest c GET
> 26 RxURL c /
> 26 RxProtocol c HTTP/1.1
> 26 RxHeader c User-Agent: curl/7.21.4 (x86_64-unknown-linux-gnu)
> libcurl/7.21.4 OpenSSL/0.9.8n zlib/1.2.5 libidn/1.19
> 26 RxHeader c Host: mvp14.airg.com
> 26 RxHeader c Accept: */*
> 26 VCL_call c recv
> 26 VCL_return c lookup
> 26 VCL_call c hash
> 26 VCL_return c hash
>
> The connection just hangs here until it times out.
>
> Any help would be appreciated. We are trying to replace our squid caching
> layer with varnish. However, if I can’t resolve this issue we will have to go
> back to squid.
>
> Thanks,
>
> Matt Schurenko
> Systems Administrator
>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>