Weird situation with unexpected 206 from the backend, hit-for-pass and default_ttl

Sat Feb 27 16:51:13 CET 2010

Hi,

I'm having quite a hard time trying to figure out what is going on
exactly here, sorry for the length but I prefer to give as much detail
as possible.

I'm using varnish on several varnish servers with the same VCL
configuration. I have varnish 2.0.4 and 2.0.6, Debian GNU/Linux on amd64
hardware (etch and lenny). All servers happen to have the same weird
behaviour regardless of those differences :

All is going well, great performances and cache hit ratio, everyone is
happy, unless some servers randomly start to make too many backend
connexions. After eliminating the usual suspects (cookies,
accept-encoding), I started investigating what kind of requests were
there, and I realized that somehow most of the requests where hitpass
for objects that were previously successfully cached on the same varnish
server. But the objects impacted are not necessarily the same on every
varnish server, and are not always the same, nor at the same time.

After some help from the irc channel (thanks everybody), I started to
track some "culprits" with "varnishtop -b -i TxURL", then searching the
last "varnishlog -c -o" entry for this object, and finally searching the
request id referenced in its HitPass entry.

On every case I tracked like this, the scenario was the same, and if I
understand things correctly (I'll give commented varnishlog just after),
it is like this:

  - a request comes from the object, it's a miss so we ask the backend
    for it. It is a pretty simple GET request, WITHOUT "Range" header

  - the backend, I don't know why, sometimes (and sometimes only)
    answers with 206 Partial Content, wich I believe is totally wrong
    and will be tracked down by them, but let's keep this to the varnish
    side

  - so because of that 206 answer, a hit for pass cache entry is
    created, with a certain TTL which in my case seems to be way too
    long

  - later on, for each simple (no Range header) request for this object,
    we hit the hitpass entry and pass to the backend

I think I have different problems here, one on the backend obviously,
but the others with varnish :

  - seeing as wrong as this backend answer is, I think it shouldn't be
    dealt with "normally" by varnish, it would me more interesting to
    let the next request have its chance to populate the cache with a
    valid (200) answer, in order to keep the following requests being
    able to be served from the cache. Is this something that should be
    changed in varnish, or in the vcl (and how) ?

  - the TTL used for the HitPass object seems to be the default_ttl (-t
    86400 in my setup); I tried to keep it low (0s or 10s) from the vcl
    config, like this :

    sub vcl_fetch {
        if (!obj.cacheable) {
                set obj.ttl = 0s;
                return (pass);
        }
	/* ... */
    }

    but I doesn't seem to work. I have the feeling that the HitPass TTL
    is somehow forced to the default_ttl value (which I think is not
    what the documentation implied). What is going on exactly here and
    how can this be changed from the vcl ?

  - I also tried to purge the hit-for-pass culprit with the CLI (purge
    req.host == the.host && req.url == /the/url), but it didn't work
    (the object was still there after). Any idea how to purge such
    objects ? Is this a bug ? 

Here are the varnishlog outputs. Firt, here is a pass because of the
HitPass object, with a correct (200) answer :

   18 ReqStart     c 89.80.252.34 1069 1643273915
   18 RxRequest    c GET
   18 RxURL        c /elements/css/v7/styles_services.css
   18 RxProtocol   c HTTP/1.1
   18 RxHeader     c Accept: */*
   18 RxHeader     c Referer: http://www.sfr.fr/fr/adsl.jsp?version=6004&contener=2
   18 RxHeader     c Accept-Language: fr
   18 RxHeader     c Accept-Encoding: gzip, deflate
   18 RxHeader     c User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
   18 RxHeader     c Host: img.s-sfr.fr
   18 RxHeader     c Connection: Keep-Alive
   18 VCL_call     c recv lookup
   18 VCL_call     c hash hash
   18 HitPass      c 1641180098
   18 VCL_call     c pass pass
   18 Backend      c 30 s_sfr s_sfr
   18 ObjProtocol  c HTTP/1.1
   18 ObjStatus    c 200
   18 ObjResponse  c OK
   18 ObjHeader    c Date: Sat, 27 Feb 2010 14:27:59 GMT
   18 ObjHeader    c Server: Apache/2.2.13 (Fedora)
   18 ObjHeader    c Last-Modified: Tue, 09 Feb 2010 09:59:57 GMT
   18 ObjHeader    c Via: http2
   18 ObjHeader    c Cache-Control: public
   18 ObjHeader    c Expires: Sat, 27 Feb 2010 16:20:53 GMT
   18 ObjHeader    c Vary: Accept-Encoding
   18 ObjHeader    c Content-Type: text/css;charset=ISO-8859-15
   18 TTL          c 1643273915 RFC 6773 1267280879 1267280879 1267287653 0 0
   18 VCL_call     c fetch deliver
   18 Length       c 3284
   18 VCL_call     c deliver deliver
   18 TxProtocol   c HTTP/1.1
   18 TxStatus     c 200
   18 TxResponse   c OK
   18 TxHeader     c Server: Apache/2.2.13 (Fedora)
   18 TxHeader     c Last-Modified: Tue, 09 Feb 2010 09:59:57 GMT
   18 TxHeader     c Via: http2
   18 TxHeader     c Cache-Control: public
   18 TxHeader     c Expires: Sat, 27 Feb 2010 16:20:53 GMT
   18 TxHeader     c Vary: Accept-Encoding
   18 TxHeader     c Content-Type: text/css;charset=ISO-8859-15
   18 TxHeader     c Content-Length: 3284
   18 TxHeader     c Date: Sat, 27 Feb 2010 14:27:59 GMT
   18 TxHeader     c X-Varnish: 1643273915
   18 TxHeader     c Age: 0
   18 TxHeader     c Via: 1.1 varnish
   18 TxHeader     c Connection: keep-alive
   18 ReqEnd       c 1643273915 1267280879.342154026 1267280879.351441145 0.075042009 0.009259224 0.000027895

=> so, the HitPass was created by request id 1641180098. Let's see that
request :

   59 ReqStart     c 77.205.33.140 52401 1641180098
   59 RxRequest    c GET
   59 RxURL        c /elements/css/v7/styles_services.css
   59 RxProtocol   c HTTP/1.1
   59 RxHeader     c Accept: */*
   59 RxHeader     c Referer: http://www.sfr.fr/fr/adsl.jsp
   59 RxHeader     c Accept-Language: fr
   59 RxHeader     c User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; GTB6.4; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30729)
   59 RxHeader     c Accept-Encoding: gzip, deflate
   59 RxHeader     c Host: img.s-sfr.fr
   59 RxHeader     c Connection: Keep-Alive
   59 VCL_call     c recv lookup
   59 VCL_call     c hash hash
   59 VCL_call     c miss fetch
   59 Backend      c 21 s_sfr s_sfr
   59 ObjProtocol  c HTTP/1.1
   59 ObjStatus    c 206
   59 ObjResponse  c Partial Content
   59 ObjHeader    c Date: Fri, 26 Feb 2010 22:01:08 GMT
   59 ObjHeader    c Server: Apache/2.2.13 (Fedora)
   59 ObjHeader    c Last-Modified: Tue, 09 Feb 2010 09:59:57 GMT
   59 ObjHeader    c Via: http4
   59 ObjHeader    c Cache-Control: public
   59 ObjHeader    c Expires: Fri, 26 Feb 2010 23:08:27 GMT
   59 ObjHeader    c Content-Type: text/css;charset=ISO-8859-15
   59 TTL          c 1641180098 RFC 4038 1267221668 1267221668 1267225707 0 0
   59 VCL_call     c fetch
   59 TTL          c 1641180098 VCL 0 1267221668
   59 VCL_return   c pass
   59 Length       c 2211
   59 VCL_call     c deliver deliver
   59 TxProtocol   c HTTP/1.1
   59 TxStatus     c 206
   59 TxResponse   c Partial Content
   59 TxHeader     c Server: Apache/2.2.13 (Fedora)
   59 TxHeader     c Last-Modified: Tue, 09 Feb 2010 09:59:57 GMT
   59 TxHeader     c Via: http4
   59 TxHeader     c Cache-Control: public
   59 TxHeader     c Expires: Fri, 26 Feb 2010 23:08:27 GMT
   59 TxHeader     c Content-Type: text/css;charset=ISO-8859-15
   59 TxHeader     c Content-Length: 2211
   59 TxHeader     c Date: Fri, 26 Feb 2010 22:01:08 GMT
   59 TxHeader     c X-Varnish: 1641180098
   59 TxHeader     c Age: 0
   59 TxHeader     c Via: 1.1 varnish
   59 TxHeader     c Connection: keep-alive
   59 ReqEnd       c 1641180098 1267221668.117854595 1267221668.127019882 2.395729065 0.009134293 0.000030994

=> Here is the "wrong" 206 answer. No Range header was in the request so
200 and the full content was expected. Also no Content-Range in the
backend answer, that's also very strange. Anyway, its TTL would be 4038
calculated from the HTTP headers (RFC 4038), then it is forced to 0 from
VCL (VCL 0), then we go to pass.

Now pay attention to the Date: headers from the backend answers (yes,
the backend time is correct ;)). The HitPass object was created "Fri, 26
Feb 2010 22:01:08 GMT" with a calculated TTL of 4038 (forced to 0 from
VCL but it doesn't work), so it should be purged about 4038s later, but
it is still there for a request made "Sat, 27 Feb 2010 14:27:59 GMT",
thats much, much more !!

That should not be possible, unless default_ttl (-t 86400 in my setup)
is used instead of the computed TTL...

Many thanks in advance for any explanation or advice on this situation !

Best regards,
Tom

-- 
Thomas Parmelan  -+-  Thomas.Parmelan at free.fr  -+-  tom at ankh.fr.EU.org