Unexplained Cache MISSes

Fri Jun 16 20:27:15 CEST 2017

Sorry for the delay on working on this. I've read your email a few times 
now and am still confused! I need to read the man pages suggested but 
haven't got to it yet. Will let you know when I make some progress on it.

I'm fixing the issue in the interim here by issuing another GET request 
in my cache refresh scripts for any PURGE requests that come back with a 
HIT.

Nigel

On 02/06/2017 18:08, Dridi Boukelmoune wrote:
>>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill
>>> could give us a VXID to then check against the hit. If only SomeoneElse(tm)
>>> could spare me the time and look at it themselves and tell us (wink wink=).
>>
>>
>> I'm very happy to help in any way I can. Please let me know anything I can
>> do or information I can provide. I'm no C programmer (web developer/server
>> admin), so can't help out with coding/patching/debugging[3], but anything
>> else I can do, please let me know what you need.
> 
> Well, luckily I didn't write any C code to find out what purge logs
> look like. I'm certainly not going to debug code I'm not familiar with ;)
> 
> I wrote a dummy test case instead:
> 
>      varnishtest "purge logs"
> 
>      server s1 {
>          rxreq
>          expect req.url == "/to-be-purged"
>          txresp
>      } -start
> 
>      varnish v1 -vcl+backend {
>          sub vcl_recv {
>              if (req.method == "PURGE") {
>                  return (purge);
>              }
>          }
>      } -start
> 
>      client c1 {
>          txreq -url "/to-be-purged"
>          rxresp
> 
>          txreq -req PURGE -url "/to-be-purged"
>          rxresp
> 
>          txreq -req PURGE -url "/unknown"
>          rxresp
>      } -run
> 
> And then looked at the logs manually:
> 
>      varnishtest test.vtc -v | grep vsl | less
> 
> Here's a sample:
> 
>      [...]
>      **** v1    0.4 vsl|       1002 VCL_return      b deliver
>      **** v1    0.4 vsl|       1002 Storage         b malloc s0
>      [...]
>      **** v1    0.4 vsl|          0 ExpKill         - EXP_When
> p=0x7f420b027000 e=1496443420.703764200 f=0xe
>      **** v1    0.4 vsl|          0 ExpKill         - EXP_expire
> p=0x7f420b027000 e=-0.000092268 f=0x0
>      **** v1    0.4 vsl|          0 ExpKill         - EXP_Expired x=1002 t=-0
>      [...]
>      **** v1    0.4 vsl|       1003 ReqMethod       c PURGE
>      **** v1    0.4 vsl|       1003 ReqURL          c /to-be-purged
>      [...]
>      **** v1    0.4 vsl|       1003 VCL_return      c purge
>      **** v1    0.4 vsl|       1003 VCL_call        c HASH
>      **** v1    0.4 vsl|       1003 VCL_return      c lookup
>      **** v1    0.4 vsl|       1003 VCL_call        c PURGE
>      **** v1    0.4 vsl|       1003 VCL_return      c synth
>      [...]
>      **** v1    0.4 vsl|       1004 ReqMethod       c PURGE
>      **** v1    0.4 vsl|       1004 ReqURL          c /unknown
>      [...]
>      **** v1    0.4 vsl|       1004 VCL_return      c purge
>      **** v1    0.4 vsl|       1004 VCL_call        c HASH
>      **** v1    0.4 vsl|       1004 VCL_return      c lookup
>      **** v1    0.4 vsl|       1004 VCL_call        c PURGE
>      **** v1    0.4 vsl|       1004 VCL_return      c synth
>      [...]
> 
> The interesting transaction id (VXID) is 1002.
> 
> So 1) purge-related logs will only show up with raw grouping in
> varnishlog (which I find unfortunate but I should have remembered the
> expiry thread would have been involved) and 2) we don't see in a
> transaction log how many objects were actually purged (moved to the
> expiry inbox).
> 
> The ExpKill records appear before because transactions commit their
> logs when they finish by default.
> 
>> Would a cleanly installed server and absolute minimum VCL to reproduce this
>> be useful? You would be welcome to have access to that server, if useful,
>> once I've got it set up and producing the same problem.
> 
> Not yet, at this point we know that we were looking at an incomplete
> picture so what you need to do is capture raw logs and we will be able
> to get both a VXID and a timestamp from the ExpKill records (although
> the timestamp for EXP_expire puzzles me).
> 
> See man varnishlog to see how to write (-w) and then read (-r) logs
> to/from a file. When you notice the alleged bug, note the transaction
> id and write the current logs (with the -d option) so that you can
> pick up all the interesting bits at rest (instead of doing it on live
> traffic).
> 
>> I can say that in my case there is definitely no Age header coming from the
>> back-end. Also as shown in the example I sent it is the 7th HIT on that
>> object.
> 
> Yes, smells like a bug. But before capturing logs, make sure to remove
> Hash records from the vsl_mask (man varnishd) so we can confirm what's
> being purged too.
> 
> I have a theory, a long shot that will only prove how unfamiliar I am
> with this part of Varnish. Since the purge moves the object to the
> expiry inbox, it could be that under load the restart may happen
> before the expiry thread marks it as expired, thus creating a race
> with the next lookup.
> 
> Cheers,
> Dridi
>