is 2.0.2 not as efficient as 1.1.2 was?

Wed Feb 4 16:44:29 CET 2009

On Nov 25, 2008, at 5:37 PM, Demitrious Kelly wrote:

> Hello,
>
> We run Gravatar.com and use varnish to cache avatar responses.  There
> are a ton of very small objects and lots of requests per second. Last
> week we were using 1.1.2 compiled against tcmalloc (-t 600 -w 1,4000,5
> -h classic,500009 -p thread_pools 10 -p listen_depth 4096 -s
> malloc,16G). This used an nginx load balancer on a separate host as its
> back end which distributed varnish's requests to our pool of webs.
> All was well.
>
> This week we upgraded to 2.0.2 and are using varnish's back end &
> director configuration for the same work.  What we are seeing is that
> 2.0.2 holds about 60% of the objects in the same amount of cache space
> as 1.1.2 did (we tried tcmalloc, jemalloc, and mmap.)  This caused us
> quite a few problems after the upgrade as varnish would start spiking
> the load on the boxes into the hundreds.  We attempted tuning the
> lru_interval (up) and obj_workspace (down) but we couldn't get varnish
> to hold the same data that it used to on the same machines.
>
> Right now we've reduced the time that we keep cached objects
> drastically, bringing our cache hit rate down to 92% from 96% which
> roughly doubled the requests (and load) on the web servers.  It is,
> however, stable at this point.  Obviously the idea of not keeping up
> with the latest versions of varnish is not what we want to do, however
> effectively doubling requirements for scaling the service is just as
> unappealing.
>
> So, what we're asking is... how do we get varnish 2 to be as efficient
> as varnish 1 was?  We're glad to try things...  It takes a while to
> fill up the cache to the point that it can cause problems so testing and
> reporting back will take some time, but we'd like this fixed and will
> put in some work. We're currently running the following cli options:
>
> -a 0.0.0.0:80 -f ... -P ... -T 10.1.94.43:6969 -t 600 -w 1,4000,5 -h
> classic,500009 -p thread_pools 10 -p listen_depth 4096 -s malloc,16G
>
> And our VCL looks like this (with most of the webs taken out for
> brevity since they're repeated verbatim with only numbers changed)
>
> backend web11 { .host = "xxx"; .port = "8088"; .probe =
>                { .url = "xxx"; .timeout = 50 ms; .interval = 5s;
> .window = 2; .threshold = 1; }
> }
> backend web12 { .host = "xxx"; .port = "8088"; .probe =
>                { .url = "xxx"; .timeout = 50 ms; .interval = 5s;
> .window = 2; .threshold = 1; }
> }
>
> director default random {
>        .retries = 3;
>        { .backend = web11; .weight = 1; }
>        { .backend = web12; .weight = 1; }
> }
>
> sub vcl_recv {
>  set req.backend = default;
>  set req.grace = 30s;
>  if ( req.url ~ "^/(avatar|userimage)" && req.http.cookie )  {
>    lookup;
>  }
> }
>
> sub vcl_fetch {
>  if (obj.ttl < 600s) {
>    set obj.ttl = 600s;
>  }
>  if (obj.status == 404) {
>    set obj.ttl = 30s;
>  }
>  if (obj.status == 500 || obj.status == 503 ) {
>    pass;
>  }
>  set obj.grace = 30s;
>  deliver;
> }
>
> sub vcl_deliver {
>  remove resp.http.Expires;
>  remove resp.http.Cache-Control;
>  set resp.http.Cache-Control = "public, max-age=600, proxy-revalidate";
>  deliver;
> }
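
As a sanity check on the hit-rate figures quoted above: going from a 96%
to a 92% hit rate means the miss rate goes from 4% to 8%, which is where
the "roughly doubled" backend load comes from. A minimal sketch of that
arithmetic in Python (the variable names are only illustrative, nothing
here comes from varnish itself):

```python
# Hit rates in percent, taken from the quoted message.
hit_rate_before = 96
hit_rate_after = 92

# Every request that is not a cache hit goes to the web servers.
miss_rate_before = 100 - hit_rate_before
miss_rate_after = 100 - hit_rate_after

# Relative growth in backend requests for the same client traffic.
backend_load_factor = miss_rate_after / miss_rate_before

print(f"miss rate: {miss_rate_before}% -> {miss_rate_after}%")
print(f"backend requests multiplied by {backend_load_factor:.1f}")
```

This assumes client traffic stayed the same across the two measurements.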

Bump :)  Is anyone else seeing the same thing?  I think it may be a
result of the fact that a lot of the cached responses are just headers
(302 redirects) and don't have any actual content.  That is the only
reason I can think of why we would be seeing this issue and others
wouldn't.  I suspect most people using varnish don't have stats that
look like this:

  10094887744    960644.65    847668.80 Total header bytes
  22230934332   2174908.58   1866733.93 Total body bytes
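
To put those two counters in proportion, here is a rough sketch in
Python that works out what share of the bytes is headers. It uses only
the first column of the two lines above; the variable names are mine.
Note that these counters describe bytes sent to clients, so this only
illustrates how header-heavy the traffic is and says nothing directly
about per-object storage overhead.

```python
# First column of the two stats lines quoted above.
header_bytes = 10094887744
body_bytes = 22230934332

total_bytes = header_bytes + body_bytes

# Fraction of all bytes that were headers rather than body.
header_share = header_bytes / total_bytes

# How many body bytes there are for each header byte.
body_per_header = body_bytes / header_bytes

print(f"header share of total bytes: {header_share:.1%}")
print(f"body bytes per header byte:  {body_per_header:.2f}")
```

So headers are close to a third of everything we send, which I would
guess is much higher than on a typical site.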

I don't really want to revert to 1.1.2 because I like the general  
stability and features of 2.x, but I don't have any real ideas on how  
to troubleshoot why this would be happening.  Any ideas would be  
appreciated.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com