Poll() on shared pipe() high context switching issue affect varnish? :: was Feedback on session_linger (high load/n_wrk_overflow issues)

Fri Feb 8 03:09:23 CET 2013

I think the issue we have been seeing *may* be related to the polling on a
shared pipe() issue that was first mentioned here:
http://lkml.indiana.edu/hypermail/linux/kernel/1012.1/03515.html.

I think I narrowed the cause of the high csw/s down to multiple esi
includes (2 most of the time) with s-maxage=0 on a 1M+ pageview per month
site (~60conn/s according to the first line of varnishstat). Most of the
page views have these esi includes that aren't being cached (not all of the
60/s). If we set s-maxage to something > 0 (say 20s) in the headers for the
esi module coming from the backend to varnish the context switching issue
goes away-- the csw/s degrades linearly back to the steady state at about
10k/s.

I created a threaded version of the example code from the lkml email (
https://gist.github.com/drsnyder/4735889). When I compile and run this code
on a 2.6.32-279 or below kernel the csw/s goes to ~200k/s which defies any
attempt I have made to reason about why that would be the case. If I run
the same code on a 3.0+ Linux kernel, I will see about 12k csw/s. Still
high, but much more reasonable.

We see the same csw/s behavior when we set the s-maxage=0 for the esi
modules on the production site with about the same number of threads (400).

So does this polling on a shared pipe() issue affect varnish 2.1.5? It
would appear so, but I'm looking for confirmation. I saw a commit
referencing context switching here (
https://www.varnish-cache.org/trac/changeset/05f816345a329ef52644a0239892f8be6314a3fa)
but I also can't imagine that this would trigger 200k/s in our environment.
 If it is an issue, is there anything we can do to mitigate it in the
varnish configuration? Are there versions where this has been fixed?

We have tested sending production traffic with s-maxage=0 to an Ubuntu
system running 3.2.0-27 and saw a spike in context switching but it peaked
at about 20k/s with no observed impact on stability or delivery. So,
switching from Centos to Ubuntu is a possible option.

Thanks,
Damon

---------- Forwarded message ----------
From: Damon Snyder <damon at huddler-inc.com>
Date: Thu, Jan 31, 2013 at 11:10 PM
Subject: Re: Feedback on session_linger (high load/n_wrk_overflow issues)
To: "varnish-misc at varnish-cache.org" <varnish-misc at varnish-cache.org>

Well, I spoke too soon. Bumping session_linger brought us stability for
about a day and then the high context switching returned. See the attached
plot to illustrate the problem. The context switching (pulled from dstat
and added to varnish) starts to ramp up. As soon it crosses the ~50k mark
we start start seeing stability and latency issues (10k/s is more
"normal"). So now I'm at a loss as to how to proceed.

Below is the current varnishstat -1 output when the system is well behaved.
I wasn't able to capture it when the context switching peaked.
n_wrk_overflow spiked to about 1500 at the time and the load average over
the past 5 min was ~15.

This is running on a dual hex core Intel Xeon E5-2620 (24 contexts or
cores) with 64GB of memory. The hitrate is about 0.65 and we are nuking
(n_lru_nuked incrementing) once the system has been running for a few
hours. We have long tail objects in that we have a lot of content, but only
some of it is hot so it's difficult to predict our precise cache size
requirements at any given time. We are using varnish-2.1.5 on Centos 5.6
with kernel 2.6.18-238.

I haven't found anything in syslog that would be of interest.

Thanks,
Damon

# currently running command line
/usr/local/varnish-2.1.5/sbin/varnishd -P /var/run/varnish.pid -a
10.16.50.150:6081,127.0.0.1:6081 -f /etc/varnish/default.vcl -T
127.0.0.1:6082 -t 120 -w 100,2000,120 -u varnish -g varnish -s malloc,48G
-p thread_pools 8 -p thread_pool_add_delay 2 -p listen_depth 1024 -p
session_linger 110 -p lru_interval 60 -p sess_workspace 524288

# varnishstat -1
client_conn           5021009       442.11 Client connections accepted
client_drop                 0         0.00 Connection dropped, no sess/wrk
client_req            5020366       442.05 Client requests received
cache_hit             4597754       404.84 Cache hits
cache_hitpass           24775         2.18 Cache hits for pass
cache_miss            2664516       234.61 Cache misses
backend_conn          3101882       273.13 Backend conn. success
backend_unhealthy            0         0.00 Backend conn. not attempted
backend_busy                0         0.00 Backend conn. too many
backend_fail                0         0.00 Backend conn. failures
backend_reuse              99         0.01 Backend conn. reuses
backend_toolate         11777         1.04 Backend conn. was closed
backend_recycle         11877         1.05 Backend conn. recycles
backend_unused              0         0.00 Backend conn. unused
fetch_head                 33         0.00 Fetch head
fetch_length          2134117       187.91 Fetch with Length
fetch_chunked          811172        71.42 Fetch chunked
fetch_eof                   0         0.00 Fetch EOF
fetch_bad                   0         0.00 Fetch had bad headers
fetch_close            156333        13.77 Fetch wanted close
fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
fetch_zero                  0         0.00 Fetch zero len
fetch_failed                1         0.00 Fetch failed
n_sess_mem               3867          .   N struct sess_mem
n_sess                   3841          .   N struct sess
n_object               757929          .   N struct object
n_vampireobject             0          .   N unresurrected objects
n_objectcore           758158          .   N struct objectcore
n_objecthead           760076          .   N struct objecthead
n_smf                       0          .   N struct smf
n_smf_frag                  0          .   N small free smf
n_smf_large                 0          .   N large free smf
n_vbe_conn                 69          .   N struct vbe_conn
n_wrk                     800          .   N worker threads
n_wrk_create              800         0.07 N worker threads created
n_wrk_failed                0         0.00 N worker threads not created
n_wrk_max                   0         0.00 N worker threads limited
n_wrk_queue                 0         0.00 N queued work requests
n_wrk_overflow            171         0.02 N overflowed work requests
n_wrk_drop                  0         0.00 N dropped work requests
n_backend                  21          .   N backends
n_expired             1879027          .   N expired objects
n_lru_nuked                 0          .   N LRU nuked objects
n_lru_saved                 0          .   N LRU saved objects
n_lru_moved            813472          .   N LRU moved objects
n_deathrow                  0          .   N objects on deathrow
losthdr                     0         0.00 HTTP header overflows
n_objsendfile               0         0.00 Objects sent with sendfile
n_objwrite            5195402       457.46 Objects sent with write
n_objoverflow               0         0.00 Objects overflowing workspace
s_sess                5020819       442.09 Total Sessions
s_req                 5020366       442.05 Total Requests
s_pipe                      0         0.00 Total pipe
s_pass                 437300        38.50 Total pass
s_fetch               3101668       273.11 Total fetch
s_hdrbytes         2575156242    226746.17 Total header bytes
s_bodybytes      168406835436  14828461.34 Total body bytes
sess_closed           5020817       442.09 Session Closed
sess_pipeline               0         0.00 Session Pipeline
sess_readahead              0         0.00 Session Read Ahead
sess_linger                 0         0.00 Session Linger
sess_herd                   0         0.00 Session herd
shm_records         430740934     37927.35 SHM records
shm_writes           29264523      2576.78 SHM writes
shm_flushes             51411         4.53 SHM flushes due to overflow
shm_cont                78056         6.87 SHM MTX contention
shm_cycles                186         0.02 SHM cycles through buffer
sm_nreq                     0         0.00 allocator requests
sm_nobj                     0          .   outstanding allocations
sm_balloc                   0          .   bytes allocated
sm_bfree                    0          .   bytes free
sma_nreq              5570153       490.46 SMA allocator requests
sma_nobj              1810351          .   SMA outstanding allocations
sma_nbytes        35306367656          .   SMA outstanding bytes
sma_balloc       154583709612          .   SMA bytes allocated
sma_bfree        119277341956          .   SMA bytes free
sms_nreq                32127         2.83 SMS allocator requests
sms_nobj                    0          .   SMS outstanding allocations
sms_nbytes                  0          .   SMS outstanding bytes
sms_balloc           29394423          .   SMS bytes allocated
sms_bfree            29394423          .   SMS bytes freed
backend_req           3101977       273.13 Backend requests made
n_vcl                       1         0.00 N vcl total
n_vcl_avail                 1         0.00 N vcl available
n_vcl_discard               0         0.00 N vcl discarded
n_purge                 31972          .   N total active purges
n_purge_add             31972         2.82 N new purges added
n_purge_retire              0         0.00 N old purges deleted
n_purge_obj_test      3317282       292.09 N objects tested
n_purge_re_test     530623465     46722.15 N regexps tested against
n_purge_dups            28047         2.47 N duplicate purges removed
hcb_nolock            7285152       641.47 HCB Lookups without lock
hcb_lock              2543066       223.92 HCB Lookups with lock
hcb_insert            2542964       223.91 HCB Inserts
esi_parse              741690        65.31 Objects ESI parsed (unlock)
esi_errors                  0         0.00 ESI parse errors (unlock)
accept_fail                 0         0.00 Accept failures
client_drop_late            0         0.00 Connection dropped late
uptime                  11357         1.00 Client uptime
backend_retry              99         0.01 Backend conn. retry
dir_dns_lookups             0         0.00 DNS director lookups
dir_dns_failed              0         0.00 DNS director failed lookups
dir_dns_hit                 0         0.00 DNS director cached lookups hit
dir_dns_cache_full            0         0.00 DNS director full dnscache
fetch_1xx                   0         0.00 Fetch no body (1xx)
fetch_204                   0         0.00 Fetch no body (204)
fetch_304                  14         0.00 Fetch no body (304)

On Wed, Jan 30, 2013 at 3:45 PM, Damon Snyder <damon at huddler-inc.com> wrote:

> When you 'varnishadm -T localhost:port param.show session_linger' it
> indicates at the bottom that "we don't know if this is a good idea... and
> feeback is welcome."
>
> We found that setting session_linger pulled us out of a bind. I wanted to
> add my feedback to the list in the hope that someone else might benefit
> from what we experienced.
>
> We recently increased the number of esi includes on pages that get ~60-70
> req/s on our platform. Some of those modules were being rendered with
> s-maxage set to zero so that they would be refreshed on every page load
> (this is so we could insert a non-cached partial into the page) which
> further increased the request load on varnish.
>
> What we found is that after a few hours the load on a varnish box went
> from < 1 to > 10 or more and n_wkr_overflow started incrementing. After
> investigating further we noticed that the context switching went from
> ~10k/s to > 100k/s. We are running Linux specifically Centos.
>
> No adjusting of threads or thread pools had any impact on the thrashing.
> After reading Kristian's post<http://kly.no/posts/2009_10_19__High_end_Varnish_tuning__.html> about
> high-end varnish tuning we decided to try out session_linger. We started by
> doubling the default from 50 to 100 to test the theory ('varnishadm -T
> localhost:port param.set session_linger 100'). Once we did that we saw a
> gradual settling of the context switching (using dstat or sar -w) and
> a stabilizing of the load.
>
> It's such a great feature to be able to change this parameter via the
> admin interface. We have 50GB malloc'ed and some nuking on our boxes so
> restarting varnish doesn't come without some impact to the platform.
>
> Intuitively increasing session_linger makes sense. If you have several esi
> modules rendered within a page and the gap between them is > 50ms then
> they'll be reallocated elsewhere.
>
> What is not clear to me is how we should tune session_linger. We started
> by setting it to the 3rd quantile of render times for the esi module taken
> from a sampling of backend requests. This turned out to be 110ms.
>
> Damon
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20130207/f002c79a/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: varnish-context-switching.png
Type: image/png
Size: 55752 bytes
Desc: not available
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20130207/f002c79a/attachment-0001.png>