Unexplained Cache MISSes

Sat Jun 3 01:08:17 CEST 2017

>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill
>> could give us a VXID to then check against the hit. If only SomeoneElse(tm)
>> could spare me the time and look at it themselves and tell us (wink wink=).
>
>
> I'm very happy to help in any way I can. Please let me know anything I can
> do or information I can provide. I'm no C programmer (web developer/server
> admin), so can't help out with coding/patching/debugging[3], but anything
> else I can do, please let me know what you need.

Well, luckily I didn't write any C code to find out what purge logs
look like. I'm certainly not going to debug code I'm not familiar with ;)

I wrote a dummy test case instead:

    varnishtest "purge logs"

    server s1 {
        rxreq
        expect req.url == "/to-be-purged"
        txresp
    } -start

    varnish v1 -vcl+backend {
        sub vcl_recv {
            if (req.method == "PURGE") {
                return (purge);
            }
        }
    } -start

    client c1 {
        txreq -url "/to-be-purged"
        rxresp

        txreq -req PURGE -url "/to-be-purged"
        rxresp

        txreq -req PURGE -url "/unknown"
        rxresp
    } -run

And then looked at the logs manually:

    varnishtest test.vtc -v | grep vsl | less

Here's a sample:

    [...]
    **** v1    0.4 vsl|       1002 VCL_return      b deliver
    **** v1    0.4 vsl|       1002 Storage         b malloc s0
    [...]
    **** v1    0.4 vsl|          0 ExpKill         - EXP_When
p=0x7f420b027000 e=1496443420.703764200 f=0xe
    **** v1    0.4 vsl|          0 ExpKill         - EXP_expire
p=0x7f420b027000 e=-0.000092268 f=0x0
    **** v1    0.4 vsl|          0 ExpKill         - EXP_Expired x=1002 t=-0
    [...]
    **** v1    0.4 vsl|       1003 ReqMethod       c PURGE
    **** v1    0.4 vsl|       1003 ReqURL          c /to-be-purged
    [...]
    **** v1    0.4 vsl|       1003 VCL_return      c purge
    **** v1    0.4 vsl|       1003 VCL_call        c HASH
    **** v1    0.4 vsl|       1003 VCL_return      c lookup
    **** v1    0.4 vsl|       1003 VCL_call        c PURGE
    **** v1    0.4 vsl|       1003 VCL_return      c synth
    [...]
    **** v1    0.4 vsl|       1004 ReqMethod       c PURGE
    **** v1    0.4 vsl|       1004 ReqURL          c /unknown
    [...]
    **** v1    0.4 vsl|       1004 VCL_return      c purge
    **** v1    0.4 vsl|       1004 VCL_call        c HASH
    **** v1    0.4 vsl|       1004 VCL_return      c lookup
    **** v1    0.4 vsl|       1004 VCL_call        c PURGE
    **** v1    0.4 vsl|       1004 VCL_return      c synth
    [...]

The interesting transaction id (VXID) is 1002.

So 1) purge-related logs will only show up with raw grouping in
varnishlog (which I find unfortunate but I should have remembered the
expiry thread would have been involved) and 2) we don't see in a
transaction log how many objects were actually purged (moved to the
expiry inbox).

The ExpKill records appear before because transactions commit their
logs when they finish by default.

> Would a cleanly installed server and absolute minimum VCL to reproduce this
> be useful? You would be welcome to have access to that server, if useful,
> once I've got it set up and producing the same problem.

Not yet, at this point we know that we were looking at an incomplete
picture so what you need to do is capture raw logs and we will be able
to get both a VXID and a timestamp from the ExpKill records (although
the timestamp for EXP_expire puzzles me).

See man varnishlog to see how to write (-w) and then read (-r) logs
to/from a file. When you notice the alleged bug, note the transaction
id and write the current logs (with the -d option) so that you can
pick up all the interesting bits at rest (instead of doing it on live
traffic).

> I can say that in my case there is definitely no Age header coming from the
> back-end. Also as shown in the example I sent it is the 7th HIT on that
> object.

Yes, smells like a bug. But before capturing logs, make sure to remove
Hash records from the vsl_mask (man varnishd) so we can confirm what's
being purged too.

I have a theory, a long shot that will only prove how unfamiliar I am
with this part of Varnish. Since the purge moves the object to the
expiry inbox, it could be that under load the restart may happen
before the expiry thread marks it as expired, thus creating a race
with the next lookup.

Cheers,
Dridi