push button lru nuking

Sat Jan 16 23:46:11 CET 2010

I'm trying to hack my way to a push-button LRU nuking feature.  A
short description of how I'm doing it follows; I'll explain why
further down.

I have a job that watches sm_bfree / (sm_bfree + sm_balloc), the free
fraction of the storage files.  Once storage file utilization is past
some percentage (yet to be determined), I connect to the upstream load
balancers and slowly drain traffic away from varnish.
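
A rough sketch of what that watcher job could compute, in Python.  It
assumes `varnishstat -1` prints one counter per line with the name in
the first column and the value in the second; the 0.90 threshold is
only a placeholder, since the real percentage is yet to be determined.
Utilization here is simply 1 minus the free fraction above.

```python
import subprocess


def parse_counters(text):
    """Turn `varnishstat -1` style output into a {name: int} dict.

    Assumed line layout: "<name> <value> <other columns...>".
    Lines that do not fit that layout are skipped.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            continue
    return counters


def utilization(counters):
    """Used fraction of the storage files: balloc / (bfree + balloc)."""
    free = counters["sm_bfree"]
    used = counters["sm_balloc"]
    total = free + used
    if total == 0:
        return 0.0
    return used / float(total)


def should_drain(counters, threshold=0.90):
    """True once utilization reaches the (placeholder) threshold."""
    return utilization(counters) >= threshold


def read_counters():
    """Read live counters; only works on a box where varnishstat runs."""
    out = subprocess.check_output(["varnishstat", "-1"])
    return parse_counters(out.decode("ascii", "replace"))
```

The job would call should_drain(read_counters()) on a timer and talk
to the load balancers when it returns True.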

Once traffic is off and I can beat the hell out of that box, it's time
to free up some space.  In the past this has been done with restarts.
After a restart the cache hit ratio is destroyed, but the box can keep
up and rebuild the cache in a stable way.  What I'd like to do instead
is dump everything in the storage files that has a very low obj.hits.
LRU nuking seems on the surface like the best thing to initiate, but
it usually only kicks in pretty late and puts the machine into a state
that is unstable while serving.  While not serving, I don't know how
to kick it off; furthermore, I want it to run hard and free up a lot
more space than it usually does.

i.e. the cache file is ~200GB; I'd like it to run until sm_bfree is around 50GB.

My idea is to load balance as I've described above, pull 50GB of
trash files through the cache (plus enough to kick off LRU), purge the
trash files, monitor sm_bfree, and once it's high enough instruct the
upstream load balancers to start sending traffic gently for a warm-up
period.  Rinse and repeat into infinity, replacing the SSD storage
drives as they fail.  Is this crazy?  Am I uninformed of a better way?
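
The cycle above, sketched as one Python function.  Every callable
passed in (drain, fetch_trash, purge_trash, free_bytes, undrain) is a
hypothetical hook for whatever actually talks to the load balancers
and to varnish; none of them is an existing API.

```python
import time


def scrub_cycle(drain, fetch_trash, purge_trash, free_bytes, undrain,
                trash_count, target_free, max_polls=60, poll_seconds=5.0):
    """Run one drain / fill / purge / warm-up cycle.

    Returns True if free_bytes() reached target_free before we gave up.
    Traffic is always handed back via undrain(), even on failure.
    """
    drain()  # take the box out of rotation first
    try:
        # Pull trash objects through the cache to force LRU eviction.
        for i in range(trash_count):
            fetch_trash(i)
        # Drop the trash again so its space becomes free.
        purge_trash()
        # Wait until enough space has actually been released.
        for _ in range(max_polls):
            if free_bytes() >= target_free:
                return True
            time.sleep(poll_seconds)
        return False
    finally:
        undrain()  # start sending traffic gently again
```

Putting undrain() in the finally clause means a crashed fetch does not
leave the instance drained forever.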

Also, I've had to keep making my trash files smaller and smaller.  I
started with 10 and 1G files, which crashed varnish immediately, then
reduced to 500MB files and successfully pulled 200 through, then
crashed both my python interpreter (libcurl) and varnish:
varnishd[2664]: Child (14772) Panic message: Assert error in STV_alloc(), stevedore.c line 183:
  Condition((st) != NULL) not true.
thread = (cache-worker)
Backtrace:
  0x421f95: pan_ic+85
  0x4369e5: STV_alloc+125
  0x41a1b6: FetchBody+496
  0x4114dd: cnt_fetch+63d
  0x412a3d: CNT_Session+35d
  0x424273: wrk_do_cnt_sess+93
  0x42362e: wrk_thread_real+26e
  0x7f2cf51b83da: _end+7f2cf4b47c1a
  0x7f2cf4a862bd: _end+7f2cf4415afd
sp = 0x7f2ced387008 {
  fd = 58, id = 58, xid = 1454039386,
  client = 127.0.0.1:7057,
  step = STP_FETCH,
  handling = deliver,
  err_code = 200, err_reason = (null),
  restarts = 0, esis = 0
  ws = 0x7f2ced387078 {
    id = "sess",
    {s,f,r,e} = {0x7f2ced387800,+144,(nil),+4096},
  },
  http[req] = {
    ws = 0x7f2ced387078[sess]
      "GET",
      "/lru.10.cache.buster.80.12994",
      "HTTP/1.1",
      "User-Agent: PycURL/7.18.2",
      "Host: localhost:6081",
      "Accept: */*",
  },
  worker = 0x7ef439f06390 {
    ws = 0x7ef439f068f0 {
      id = "wrk",
      {s,f,r,e} = {0x7ef439f03350,+2143,(nil),+4096},
    },
    http[bereq] = {
      ws = 0x7ef439f068f0[wrk]
        "GET",
        "/lru.10.cache.buster.80.12994",
        "HTTP/1.1",
        "User-Agent: PycURL/7.18.2",
        "Host: localhost:6081",
        "Accept: */*",
        "X-Varnish: 1454039386",
        "X-Forwarded-For: 127.0.0.1",
    },
    http[beresp] = {
      ws = 0x7ef439f068f0[wrk]
        "HTTP/1.1",
        "200",
        "OK",
        "Server: nginx/0.7.64",
        "Date: Sat, 16 Jan 2010 21:11:09 GMT",
        "Content-Type: application/octet-stream",
        "Content-Length: 524288000",
        "Last-Modified: Sat, 16 Jan 2010 21:08:11 GMT",
        "Connection: keep-alive",
        "Accept-Ranges: bytes",
        "X-Varnish-IP: 127.0.0.1",
        "X-Varnish-Port: 6081",
    },
    },

Are big files bad?  I expect that I'll normally have to close a pretty
big gap, given that my 4 storage files are 75GB each (SSD).  I'd like
to start this process before LRU nuking happens on its own, while
varnish is still loaded by the upstream load balancers.  My guess,
based on loose recollection, is that varnish will start LRU nuking at
90% capacity.  It may just prove infeasible, given that I'll have to
pull roughly 60GB through to achieve the goal.  Perhaps freeing up a
smaller percentage would be acceptable too, though.  I'm still playing
with this, but wanted to share my uber-hacky idea and let you guys
tear it apart if it's a dumb idea.
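
For a sense of scale, the number of trash requests needed is just a
ceiling division.  The figures in the usage note (roughly 60GB to
pull, 524288000-byte files as in the Content-Length above) come from
this post; treating "GB" as GiB is an assumption.

```python
def trash_files_needed(bytes_to_pull, file_size):
    """Smallest number of files of file_size covering bytes_to_pull."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if bytes_to_pull <= 0:
        return 0
    # Integer ceiling division, avoids float rounding on large sizes.
    return -(-bytes_to_pull // file_size)


GIB = 2 ** 30
```

With 500MB files, trash_files_needed(60 * GIB, 524288000) comes out to
123 requests per scrub, so smaller trash files quickly mean many more
fetches.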

Why:
Identifying the working set has been difficult.  It's large, and the
long tail is very long.  I've tried adaptive ttls to constantly expire
objects that shouldn't be in cache:

  in vcl_fetch: set every new object to a 2hr ttl.
  in vcl_hit: if obj.hits == N, then obj.ttl = 36 hours, where N is
  some number that is high enough to be worth caching.
  another permutation: update the vcl every 30 mins such that obj.ttl
  was set to expire exactly at the trough of traffic (2300 - 2350 PST).
  in vcl_hit: if obj.hits == N, then obj.ttl = 12h, 10h, or 3h
  (depending on time of day).
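
To make the ttl rules above concrete, here is the same logic modelled
in Python rather than VCL.  It is only an illustration of the
arithmetic: promote_at stands in for N, and trough_hour=23 reflects
the 2300 trough mentioned above.  The real thing lived in vcl_fetch
and vcl_hit.

```python
from datetime import datetime, timedelta

BASE_TTL = 2 * 3600        # every new object: 2 hours
PROMOTED_TTL = 36 * 3600   # objects that reach N hits: 36 hours


def ttl_on_fetch():
    """TTL given to every freshly fetched object."""
    return BASE_TTL


def ttl_on_hit(hits, current_ttl, promote_at):
    """Bump the TTL once, at exactly promote_at hits; else keep it."""
    if hits == promote_at:
        return PROMOTED_TTL
    return current_ttl


def ttl_until_trough(now, trough_hour=23):
    """Seconds from `now` until the next occurrence of trough_hour:00."""
    trough = now.replace(hour=trough_hour, minute=0, second=0,
                         microsecond=0)
    if trough <= now:
        trough += timedelta(days=1)
    return int((trough - now).total_seconds())
```

The trough variant is why the vcl had to be regenerated every 30
minutes: the ttl that lands on 2300 shrinks as the day goes on.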

This just ended up affecting the cache hit ratio such that it was
never favorable, and the box was just busier because it was constantly
expiring objects over the day.  Restarts were still better than this.

Setup:

3 haproxy load balancer machines consistently hashing to 6 varnish
instances.  It's a prototype and will be scaled to a larger pool, so
the impact of the downtime of a single varnish instance while it goes
through a cache storage scrubbing will be greatly reduced.
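
Why draining one instance is tolerable with consistent hashing: only
the keys that lived on the drained instance move, everything else
stays put.  The toy ring below illustrates that property; it is not
haproxy's actual algorithm, and the node names and replica count are
made up.

```python
import bisect
import hashlib


def _point(label):
    """Map a string to a position on the ring (md5 as a plain hash)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class Ring(object):
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        points = []
        for node in nodes:
            for i in range(replicas):
                points.append((_point("%s#%d" % (node, i)), node))
        points.sort()
        self._points = points
        self._keys = [p for p, _ in points]

    def node_for(self, key):
        """First ring point clockwise from the key's hash, wrapping."""
        i = bisect.bisect(self._keys, _point(key)) % len(self._keys)
        return self._points[i][1]


def moved_keys(nodes, drained, keys):
    """Keys whose node changes when `drained` leaves the pool."""
    full = Ring(nodes)
    reduced = Ring([n for n in nodes if n != drained])
    return [k for k in keys if full.node_for(k) != reduced.node_for(k)]
```

With 6 instances, moved_keys(...) returns only the keys that were on
the drained instance; the larger the pool, the smaller that share is.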