Does the poll() on shared pipe() high context switching issue affect varnish? :: was Feedback on session_linger (high load/n_wrk_overflow issues)

Damon Snyder damon at huddler-inc.com
Thu Feb 14 19:37:14 CET 2013


My apologies for spamming the mailing list. I hope I save someone some time
by providing this monolog.

I'm not sure exactly what is causing the high load and context switching.
Adding s-maxage=0 to the Cache-Control header of multiple esi includes on
hundreds or thousands of high-traffic cached pages appears to introduce it.

The apparent fix for this issue is to return (pass) in vcl_fetch when
Cache-Control contains s-maxage=0. Alternatively, you can match the
specific URLs of the esi includes that you need to bypass. We have been
doing the latter for a few days in production without issue; I'm testing
the former now. A sketch of both is below.
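
Here is a minimal VCL sketch of both approaches, in varnish 2.1-style
syntax (the /esi/ URL pattern is just a placeholder for your own include
paths):

    sub vcl_fetch {
        # Option 1: bypass the cache whenever the backend marks the
        # response with s-maxage=0.
        if (beresp.http.Cache-Control ~ "s-maxage=0") {
            return (pass);
        }
        # Option 2: match the specific esi include URLs instead.
        # if (req.url ~ "^/esi/") {
        #     return (pass);
        # }
    }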

Adding this eliminated the persistently high ~100k context switches per
second and brought the csw/s down to a steady state of less than 4k. The
load is also a lot more stable. n_wrk_overflow is now stable and very low.
Previously, when the csw/s exceeded ~100k we would start seeing
n_wrk_overflow increment and timeouts.

As part of this process, I created a simple varnish testbed to experiment
with different varnish and VCL configurations to try and drive up the csw
in a controlled environment and test the changes that might bring it back
down. It's not fun (or profitable) trying to experiment with the varnish
configuration in production on a high traffic site.

The testbed is here: https://github.com/drsnyder/varnish-testbed. I used
sinatra + unicorn for the backend and siege to generate the traffic. For
the test, I generate a medium-sized 10k+ page and insert three esi tags
that generate text. I cache the page but not the esi includes. I also
introduce a random sleep to simulate a "real" backend request. Other
backends or experiments should be easy to add. YMMV.
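
The corresponding VCL is roughly the following (a sketch in 2.1-style
syntax with placeholder URLs, not the testbed's exact config):

    sub vcl_fetch {
        # Parse the cached page for <esi:include> tags.
        if (req.url == "/page") {
            esi;
        }
        # The three includes themselves are never cached, mirroring
        # the s-maxage=0 case from production.
        if (req.url ~ "^/esi/") {
            return (pass);
        }
    }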

Damon




On Thu, Feb 7, 2013 at 6:09 PM, Damon Snyder <damon at huddler-inc.com> wrote:

> I think the issue we have been seeing *may* be related to the polling on a
> shared pipe() issue that was first mentioned here:
> http://lkml.indiana.edu/hypermail/linux/kernel/1012.1/03515.html.
>
> I think I narrowed the cause of the high csw/s down to multiple esi
> includes (2 most of the time) with s-maxage=0 on a 1M+ pageviews per month
> site (~60 conn/s according to the first line of varnishstat). Most of the
> page views have these esi includes that aren't being cached (though not
> all of the 60/s). If we set s-maxage to something > 0 (say 20s) in the
> headers for the esi module coming from the backend to varnish, the
> context switching issue goes away -- the csw/s decays linearly back to
> the steady state of about 10k/s.
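>
> For concreteness, the header change amounts to having the backend send
> something like this on each include (20s is just an example value):
>
>     Cache-Control: s-maxage=20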
>
> I created a threaded version of the example code from the lkml email (
> https://gist.github.com/drsnyder/4735889). When I compile and run this
> code on a 2.6.32-279 or earlier kernel, the csw/s goes to ~200k/s, which
> defies any attempt I have made to reason about why that would be the
> case. If I run the same code on a 3.0+ Linux kernel, I see about 12k
> csw/s. Still high, but much more reasonable.
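>
> In case the gist link rots, the shape of the test is roughly this (a
> condensed sketch, not the gist verbatim; the thread count matches our
> test):
>
>     /* NTHREADS threads poll() the read end of one shared pipe while
>      * main() writes single bytes to it. On pre-3.0 kernels every write
>      * can wake all of the sleeping pollers, inflating csw/s.
>      * Build: gcc -O2 -pthread poll-pipe.c -o poll-pipe */
>     #include <poll.h>
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     #define NTHREADS 400
>
>     static int pipefd[2];
>
>     static void *poller(void *arg)
>     {
>         char c;
>         struct pollfd pfd;
>
>         (void)arg;
>         pfd.fd = pipefd[0];
>         pfd.events = POLLIN;
>         for (;;) {
>             if (poll(&pfd, 1, -1) > 0)
>                 (void)read(pipefd[0], &c, 1); /* one thread wins the byte */
>         }
>         return NULL;
>     }
>
>     int main(void)
>     {
>         pthread_t tid;
>         int i;
>
>         if (pipe(pipefd) != 0) {
>             perror("pipe");
>             return 1;
>         }
>         for (i = 0; i < NTHREADS; i++)
>             pthread_create(&tid, NULL, poller, NULL);
>         for (;;) { /* watch csw/s with dstat or sar -w while this runs */
>             (void)write(pipefd[1], "x", 1);
>             usleep(10000);
>         }
>     }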
>
> We see the same csw/s behavior when we set s-maxage=0 for the esi modules
> on the production site, with about the same number of threads (400).
>
> So does this polling on a shared pipe() issue affect varnish 2.1.5? It
> would appear so, but I'm looking for confirmation. I saw a commit
> referencing context switching here (
> https://www.varnish-cache.org/trac/changeset/05f816345a329ef52644a0239892f8be6314a3fa)
> but I also can't imagine that this would trigger 200k/s in our environment.
> If it is an issue, is there anything we can do to mitigate it in the
> varnish configuration? Are there versions where this has been fixed?
>
> We have tested sending production traffic with s-maxage=0 to an Ubuntu
> system running 3.2.0-27 and saw a spike in context switching, but it
> peaked at about 20k/s with no observed impact on stability or delivery.
> So switching from CentOS to Ubuntu is a possible option.
>
> Thanks,
> Damon
>
>
> ---------- Forwarded message ----------
> From: Damon Snyder <damon at huddler-inc.com>
> Date: Thu, Jan 31, 2013 at 11:10 PM
> Subject: Re: Feedback on session_linger (high load/n_wrk_overflow issues)
> To: "varnish-misc at varnish-cache.org" <varnish-misc at varnish-cache.org>
>
>
> Well, I spoke too soon. Bumping session_linger brought us stability for
> about a day and then the high context switching returned. See the attached
> plot for an illustration of the problem. The context switching (pulled
> from dstat and plotted alongside the varnish metrics) starts to ramp up,
> and as soon as it crosses the ~50k mark we start seeing stability and
> latency issues (10k/s is more "normal"). So now I'm at a loss as to how
> to proceed.
>
> Below is the current varnishstat -1 output from when the system is well
> behaved. I wasn't able to capture it when the context switching peaked;
> n_wrk_overflow spiked to about 1500 at the time and the 5-minute load
> average was ~15.
>
> This is running on a dual hex-core Intel Xeon E5-2620 (24 hardware
> threads with hyperthreading) with 64GB of memory. The hitrate is about
> 0.65 and we start nuking (n_lru_nuked incrementing) once the system has
> been running for a few hours. We have long-tail objects: we have a lot of
> content, but only some of it is hot, so it's difficult to predict our
> precise cache size requirements at any given time. We are using
> varnish-2.1.5 on CentOS 5.6 with kernel 2.6.18-238.
>
> I haven't found anything in syslog that would be of interest.
>
> Thanks,
> Damon
>
> # currently running command line
> /usr/local/varnish-2.1.5/sbin/varnishd -P /var/run/varnish.pid -a
> 10.16.50.150:6081,127.0.0.1:6081 -f /etc/varnish/default.vcl -T
> 127.0.0.1:6082 -t 120 -w 100,2000,120 -u varnish -g varnish -s malloc,48G
> -p thread_pools 8 -p thread_pool_add_delay 2 -p listen_depth 1024 -p
> session_linger 110 -p lru_interval 60 -p sess_workspace 524288
>
>
> # varnishstat -1
> client_conn           5021009       442.11 Client connections accepted
> client_drop                 0         0.00 Connection dropped, no sess/wrk
> client_req            5020366       442.05 Client requests received
> cache_hit             4597754       404.84 Cache hits
> cache_hitpass           24775         2.18 Cache hits for pass
> cache_miss            2664516       234.61 Cache misses
> backend_conn          3101882       273.13 Backend conn. success
> backend_unhealthy            0         0.00 Backend conn. not attempted
> backend_busy                0         0.00 Backend conn. too many
> backend_fail                0         0.00 Backend conn. failures
> backend_reuse              99         0.01 Backend conn. reuses
> backend_toolate         11777         1.04 Backend conn. was closed
> backend_recycle         11877         1.05 Backend conn. recycles
> backend_unused              0         0.00 Backend conn. unused
> fetch_head                 33         0.00 Fetch head
> fetch_length          2134117       187.91 Fetch with Length
> fetch_chunked          811172        71.42 Fetch chunked
> fetch_eof                   0         0.00 Fetch EOF
> fetch_bad                   0         0.00 Fetch had bad headers
> fetch_close            156333        13.77 Fetch wanted close
> fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
> fetch_zero                  0         0.00 Fetch zero len
> fetch_failed                1         0.00 Fetch failed
> n_sess_mem               3867          .   N struct sess_mem
> n_sess                   3841          .   N struct sess
> n_object               757929          .   N struct object
> n_vampireobject             0          .   N unresurrected objects
> n_objectcore           758158          .   N struct objectcore
> n_objecthead           760076          .   N struct objecthead
> n_smf                       0          .   N struct smf
> n_smf_frag                  0          .   N small free smf
> n_smf_large                 0          .   N large free smf
> n_vbe_conn                 69          .   N struct vbe_conn
> n_wrk                     800          .   N worker threads
> n_wrk_create              800         0.07 N worker threads created
> n_wrk_failed                0         0.00 N worker threads not created
> n_wrk_max                   0         0.00 N worker threads limited
> n_wrk_queue                 0         0.00 N queued work requests
> n_wrk_overflow            171         0.02 N overflowed work requests
> n_wrk_drop                  0         0.00 N dropped work requests
> n_backend                  21          .   N backends
> n_expired             1879027          .   N expired objects
> n_lru_nuked                 0          .   N LRU nuked objects
> n_lru_saved                 0          .   N LRU saved objects
> n_lru_moved            813472          .   N LRU moved objects
> n_deathrow                  0          .   N objects on deathrow
> losthdr                     0         0.00 HTTP header overflows
> n_objsendfile               0         0.00 Objects sent with sendfile
> n_objwrite            5195402       457.46 Objects sent with write
> n_objoverflow               0         0.00 Objects overflowing workspace
> s_sess                5020819       442.09 Total Sessions
> s_req                 5020366       442.05 Total Requests
> s_pipe                      0         0.00 Total pipe
> s_pass                 437300        38.50 Total pass
> s_fetch               3101668       273.11 Total fetch
> s_hdrbytes         2575156242    226746.17 Total header bytes
> s_bodybytes      168406835436  14828461.34 Total body bytes
> sess_closed           5020817       442.09 Session Closed
> sess_pipeline               0         0.00 Session Pipeline
> sess_readahead              0         0.00 Session Read Ahead
> sess_linger                 0         0.00 Session Linger
> sess_herd                   0         0.00 Session herd
> shm_records         430740934     37927.35 SHM records
> shm_writes           29264523      2576.78 SHM writes
> shm_flushes             51411         4.53 SHM flushes due to overflow
> shm_cont                78056         6.87 SHM MTX contention
> shm_cycles                186         0.02 SHM cycles through buffer
> sm_nreq                     0         0.00 allocator requests
> sm_nobj                     0          .   outstanding allocations
> sm_balloc                   0          .   bytes allocated
> sm_bfree                    0          .   bytes free
> sma_nreq              5570153       490.46 SMA allocator requests
> sma_nobj              1810351          .   SMA outstanding allocations
> sma_nbytes        35306367656          .   SMA outstanding bytes
> sma_balloc       154583709612          .   SMA bytes allocated
> sma_bfree        119277341956          .   SMA bytes free
> sms_nreq                32127         2.83 SMS allocator requests
> sms_nobj                    0          .   SMS outstanding allocations
> sms_nbytes                  0          .   SMS outstanding bytes
> sms_balloc           29394423          .   SMS bytes allocated
> sms_bfree            29394423          .   SMS bytes freed
> backend_req           3101977       273.13 Backend requests made
> n_vcl                        1         0.00 N vcl total
> n_vcl_avail                 1         0.00 N vcl available
> n_vcl_discard               0         0.00 N vcl discarded
> n_purge                 31972          .   N total active purges
> n_purge_add             31972         2.82 N new purges added
> n_purge_retire              0         0.00 N old purges deleted
> n_purge_obj_test      3317282       292.09 N objects tested
> n_purge_re_test     530623465     46722.15 N regexps tested against
> n_purge_dups            28047         2.47 N duplicate purges removed
> hcb_nolock            7285152       641.47 HCB Lookups without lock
> hcb_lock              2543066       223.92 HCB Lookups with lock
> hcb_insert            2542964       223.91 HCB Inserts
> esi_parse              741690        65.31 Objects ESI parsed (unlock)
> esi_errors                  0         0.00 ESI parse errors (unlock)
> accept_fail                 0         0.00 Accept failures
> client_drop_late            0         0.00 Connection dropped late
> uptime                  11357         1.00 Client uptime
> backend_retry              99         0.01 Backend conn. retry
> dir_dns_lookups             0         0.00 DNS director lookups
> dir_dns_failed              0         0.00 DNS director failed lookups
> dir_dns_hit                 0         0.00 DNS director cached lookups hit
> dir_dns_cache_full            0         0.00 DNS director full dnscache
> fetch_1xx                   0         0.00 Fetch no body (1xx)
> fetch_204                   0         0.00 Fetch no body (204)
> fetch_304                  14         0.00 Fetch no body (304)
>
>
>
> On Wed, Jan 30, 2013 at 3:45 PM, Damon Snyder <damon at huddler-inc.com> wrote:
>
>> When you 'varnishadm -T localhost:port param.show session_linger' it
>> indicates at the bottom that "we don't know if this is a good idea... and
>> feedback is welcome."
>>
>> We found that setting session_linger pulled us out of a bind. I wanted to
>> add my feedback to the list in the hope that someone else might benefit
>> from what we experienced.
>>
>> We recently increased the number of esi includes on pages that get ~60-70
>> req/s on our platform. Some of those modules were being rendered with
>> s-maxage set to zero so that they would be refreshed on every page load
>> (so that we could insert a non-cached partial into the page), which
>> further increased the request load on varnish.
>>
>> What we found is that after a few hours the load on a varnish box went
>> from < 1 to > 10, and n_wrk_overflow started incrementing. After
>> investigating further we noticed that the context switching went from
>> ~10k/s to > 100k/s. We are running Linux, specifically CentOS.
>>
>> No adjusting of threads or thread pools had any impact on the thrashing.
>> After reading Kristian's post about high-end varnish tuning
>> <http://kly.no/posts/2009_10_19__High_end_Varnish_tuning__.html>
>> we decided to try out session_linger. We started by doubling the default
>> from 50 to 100 to test the theory ('varnishadm -T localhost:port
>> param.set session_linger 100'). Once we did that we saw a gradual
>> settling of the context switching (using dstat or sar -w) and a
>> stabilizing of the load.
>>
>> It's such a great feature to be able to change this parameter via the
>> admin interface. We have 50GB malloc'ed and some nuking on our boxes, so
>> restarting varnish doesn't come without some impact to the platform.
>>
>> Intuitively, increasing session_linger makes sense. If you have several
>> esi modules rendered within a page and the gap between them is > 50ms
>> (the default linger), the session gets handed back and has to be
>> rescheduled onto another worker thread.
>>
>> What is not clear to me is how we should tune session_linger. We started
>> by setting it to the third quartile of render times for the esi module,
>> taken from a sampling of backend requests. This turned out to be 110ms.
>>
>> Damon
>>
>
>
>