Grace and misbehaving servers

Fri Mar 20 22:11:58 UTC 2020

On Thu , Mar 19, 2020 at 11:12 AM Dridi Boukelmoune <dridi at varni.sh> wrote:
>
> Not quite!
>
> ttl+grace+keep defines how long an object may stay in the cache
> (barring any form of invalidation).
>
> The grace I'm referring to is beresp.grace,

Well, when I wrote "if ttl + grace + keep is a low value set in vcl_backend_response", I was talking about beresp.grace, as in beresp.ttl + beresp.grace + beresp.keep.

> it defines how long we might serve a stale object while a background fetch is in progress.

I'm not really seeing how that is different from what I said. If beresp.ttl + beresp.grace + beresp.keep is 10s in total, then a req.grace of say 24h wouldn't do much good, right? Or maybe I just misunderstood what you were saying here.

> As always in such cases it's not black or white. Depending on the
> nature of your web traffic you may want to put the cursor on always
> serving something, or never serving something stale. For example, live
> "real time" traffic may favor failing some requests over serving stale
> data.

Well, I was thinking of the typical "regular" small/medium website, like blogs, corporate profile, small town news etc.

> I agree that on paper it sounds simple, but in practice it might be
> harder to get right.

OK. But what if I implemented it in this way, in my VCL?

* In vcl_backend_response, set beresp.grace to 72h if status < 400
* In vcl_backend_error and vcl_backend_response (when status >= 500), return (abandon)
* In vcl_synth, restart the request, with a special req header set
* In vcl_recv, if this req header is present, set req.grace to 72h

Wouldn't this work? If no, why? If yes, would you say there is something else problematic with it? Of course I would have to handle some special cases, and maybe check req.restarts and such, but I'm talking about the thought process as a whole here. I might be missing something, but I think I would need someone to point it out to me because I just don't get why this would be wrong.

> Is it hurting you that less frequently requested contents don't stay
> in the cache?

If it results in people seeing error pages when a stale content would be perfectly fine for them, then yes.

And these less frequently requested pages might still be part of a group of pages that all result in an error in the backend (while the health probe still return 200 OK). So while one individual page might be visited infrequently, the total number of visits on these kind of pages might be high.

Lets say that there are 3.000 unique (and cachable) pages that are visited during an average weekend. And all of these are in the Varnish cache, but 2.000 of these have stale content. Now lets say that 50% of all pages start returning 500 errors from the backend, on a Friday evening. That would mean that about ~1000 of these stale pages would result in the error displayed to the end users during that weekend. I would much more prefer if it were to still serve them stale content, and then I could look into the problem on Monday morning.

> Another option is to give Varnish a high TTL (and give clients a lower
> TTL) and trigger a form of invalidation directly from the backend when
> you know a resource changed.

Well, that is perfectly fine for pages that have a one-to-one mapping between the page (ie the URL) and the content updated. But most pages in our setup contain a mix of multiple contents, and it is not possible to know beforehand if a specific content will contribute to the result of a specific page. That is especially true for new content that might be included in multiple pages already in the cache.

The only way to handle that in a foolproof way, as far as I can tell, is to invalidate all pages (since any page can contain this kind of content) the moment any object is updated. But that would pretty much clear the cache constantly. And we would still have to handle the case where the cache is invalidated for a page that gives a 500 error when Varnish tries to fetch it.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20200320/5617174f/attachment.html>