Unexplained Cache MISSes
Nigel Peck
np.lists at sharphosting.uk
Fri Jun 16 21:09:40 CEST 2017
Here's an interesting thing about this. When I refreshed the cache just
now (PURGE) for 204 URLs, 78 of them were a HIT instead of a MISS. All
had been in the cache for 9 hours at least. (a re-issued GET request
received a MISS for all 78)
When I immediately issued a PURGE again a few seconds later for all 204
URLs, every one of them was a MISS and purged successfully. I did it
again a few seconds after that, and again all good. Same again a few
minutes after that. No HITs.
So this seems to be in some way related to how long the objects have
been in the cache.
Nigel
On 16/06/2017 13:27, Nigel Peck wrote:
>
> Sorry for the delay on working on this. I've read your email a few times
> now and am still confused! I need to read the man pages suggested but
> haven't got to it yet. Will let you know when I make some progress on it.
>
> I'm fixing the issue in the interim here by issuing another GET request
> in my cache refresh scripts for any PURGE requests that come back with a
> HIT.
>
> Nigel
>
> On 02/06/2017 18:08, Dridi Boukelmoune wrote:
>>>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill
>>>> could give us a VXID to then check against the hit. If only
>>>> SomeoneElse(tm)
>>>> could spare me the time and look at it themselves and tell us (wink
>>>> wink=).
>>>
>>>
>>> I'm very happy to help in any way I can. Please let me know anything
>>> I can
>>> do or information I can provide. I'm no C programmer (web
>>> developer/server
>>> admin), so can't help out with coding/patching/debugging[3], but
>>> anything
>>> else I can do, please let me know what you need.
>>
>> Well, luckily I didn't write any C code to find out what purge logs
>> look like. I'm certainly not going to debug code I'm not familiar with ;)
>>
>> I wrote a dummy test case instead:
>>
>> varnishtest "purge logs"
>>
>> server s1 {
>> rxreq
>> expect req.url == "/to-be-purged"
>> txresp
>> } -start
>>
>> varnish v1 -vcl+backend {
>> sub vcl_recv {
>> if (req.method == "PURGE") {
>> return (purge);
>> }
>> }
>> } -start
>>
>> client c1 {
>> txreq -url "/to-be-purged"
>> rxresp
>>
>> txreq -req PURGE -url "/to-be-purged"
>> rxresp
>>
>> txreq -req PURGE -url "/unknown"
>> rxresp
>> } -run
>>
>> And then looked at the logs manually:
>>
>> varnishtest test.vtc -v | grep vsl | less
>>
>> Here's a sample:
>>
>> [...]
>> **** v1 0.4 vsl| 1002 VCL_return b deliver
>> **** v1 0.4 vsl| 1002 Storage b malloc s0
>> [...]
>> **** v1 0.4 vsl| 0 ExpKill - EXP_When
>> p=0x7f420b027000 e=1496443420.703764200 f=0xe
>> **** v1 0.4 vsl| 0 ExpKill - EXP_expire
>> p=0x7f420b027000 e=-0.000092268 f=0x0
>> **** v1 0.4 vsl| 0 ExpKill - EXP_Expired
>> x=1002 t=-0
>> [...]
>> **** v1 0.4 vsl| 1003 ReqMethod c PURGE
>> **** v1 0.4 vsl| 1003 ReqURL c /to-be-purged
>> [...]
>> **** v1 0.4 vsl| 1003 VCL_return c purge
>> **** v1 0.4 vsl| 1003 VCL_call c HASH
>> **** v1 0.4 vsl| 1003 VCL_return c lookup
>> **** v1 0.4 vsl| 1003 VCL_call c PURGE
>> **** v1 0.4 vsl| 1003 VCL_return c synth
>> [...]
>> **** v1 0.4 vsl| 1004 ReqMethod c PURGE
>> **** v1 0.4 vsl| 1004 ReqURL c /unknown
>> [...]
>> **** v1 0.4 vsl| 1004 VCL_return c purge
>> **** v1 0.4 vsl| 1004 VCL_call c HASH
>> **** v1 0.4 vsl| 1004 VCL_return c lookup
>> **** v1 0.4 vsl| 1004 VCL_call c PURGE
>> **** v1 0.4 vsl| 1004 VCL_return c synth
>> [...]
>>
>> The interesting transaction id (VXID) is 1002.
>>
>> So 1) purge-related logs will only show up with raw grouping in
>> varnishlog (which I find unfortunate but I should have remembered the
>> expiry thread would have been involved) and 2) we don't see in a
>> transaction log how many objects were actually purged (moved to the
>> expiry inbox).
>>
>> The ExpKill records appear before because transactions commit their
>> logs when they finish by default.
>>
>>> Would a cleanly installed server and absolute minimum VCL to
>>> reproduce this
>>> be useful? You would be welcome to have access to that server, if
>>> useful,
>>> once I've got it set up and producing the same problem.
>>
>> Not yet, at this point we know that we were looking at an incomplete
>> picture so what you need to do is capture raw logs and we will be able
>> to get both a VXID and a timestamp from the ExpKill records (although
>> the timestamp for EXP_expire puzzles me).
>>
>> See man varnishlog to see how to write (-w) and then read (-r) logs
>> to/from a file. When you notice the alleged bug, note the transaction
>> id and write the current logs (with the -d option) so that you can
>> pick up all the interesting bits at rest (instead of doing it on live
>> traffic).
>>
>>> I can say that in my case there is definitely no Age header coming
>>> from the
>>> back-end. Also as shown in the example I sent it is the 7th HIT on that
>>> object.
>>
>> Yes, smells like a bug. But before capturing logs, make sure to remove
>> Hash records from the vsl_mask (man varnishd) so we can confirm what's
>> being purged too.
>>
>> I have a theory, a long shot that will only prove how unfamiliar I am
>> with this part of Varnish. Since the purge moves the object to the
>> expiry inbox, it could be that under load the restart may happen
>> before the expiry thread marks it as expired, thus creating a race
>> with the next lookup.
>>
>> Cheers,
>> Dridi
>>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
More information about the varnish-misc
mailing list