Grace and misbehaving servers

Mon Mar 23 10:00:18 UTC 2020

Hi,

On Fri, Mar 20, 2020 at 10:14 PM Batanun B <batanun at hotmail.com> wrote:
>
> On Thu, Mar 19, 2020 at 11:12 AM Dridi Boukelmoune <dridi at varni.sh> wrote:
> >
> > Not quite!
> >
> > ttl+grace+keep defines how long an object may stay in the cache
> > (barring any form of invalidation).
> >
> > The grace I'm referring to is beresp.grace,
>
> Well, when I wrote "if ttl + grace + keep is a low value set in vcl_backend_response", I was talking about beresp.grace, as in beresp.ttl + beresp.grace + beresp.keep.
>
>
> > it defines how long we might serve a stale object while a background fetch is in progress.
>
> I'm not really seeing how that is different from what I said. If beresp.ttl + beresp.grace + beresp.keep is 10s in total, then a req.grace of say 24h wouldn't do much good, right? Or maybe I just misunderstood what you were saying here.

Or maybe *I* just misunderstood your understanding :)
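
To make the two knobs concrete, here is a minimal sketch of the case
discussed above. The backend address and all durations are made-up
values for illustration only.

```vcl
vcl 4.0;

# Placeholder backend, not taken from this thread.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # Stored with the object: ttl + grace + keep = 10s in total,
    # after which the object is no longer kept in the cache.
    set beresp.ttl = 5s;
    set beresp.grace = 3s;
    set beresp.keep = 2s;
}

sub vcl_recv {
    # req.grace is only an upper limit on the object's own grace.
    # With beresp.grace at 3s, a 24h limit cannot extend it.
    set req.grace = 24h;
}
```

With these values a stale object can be served for at most 3s past its
TTL, whatever req.grace says.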

> > As always in such cases it's not black or white. Depending on the
> > nature of your web traffic you may want to put the cursor on always
> > serving something, or never serving something stale. For example, live
> > "real time" traffic may favor failing some requests over serving stale
> > data.
>
> Well, I was thinking of the typical "regular" small/medium website, like blogs, corporate profile, small town news etc.
>
>
> > I agree that on paper it sounds simple, but in practice it might be
> > harder to get right.
>
> OK. But what if I implemented it in this way, in my VCL?
>
> * In vcl_backend_response, set beresp.grace to 72h if status < 400
> * In vcl_backend_error and vcl_backend_response (when status >= 500), return (abandon)
> * In vcl_synth, restart the request, with a special req header set
> * In vcl_recv, if this req header is present, set req.grace to 72h
>
> Wouldn't this work? If no, why? If yes, would you say there is something else problematic with it? Of course I would have to handle some special cases, and maybe check req.restarts and such, but I'm talking about the thought process as a whole here. I might be missing something, but I think I would need someone to point it out to me because I just don't get why this would be wrong.

For starters, there is currently no way to know for sure that you
entered vcl_synth because of a return(abandon) transition. There are
plans to make it possible, but for now you can only infer it, with
less than 100% confidence.

A problem with the restart logic is the race it opens since you now
have two lookups, but overall, that's the kind of convoluted VCL that
should work. The devil might be in the details.
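
As a rough sketch of the four steps quoted above, and not a tested
configuration: the X-Stale-Ok header name, the 10s first-pass req.grace
and the backend address are assumptions made up for illustration. The
check in vcl_synth is exactly the guess mentioned above, since a 503
there may have another origin than return(abandon).

```vcl
vcl 4.0;

# Placeholder backend, not taken from this thread.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.restarts == 0) {
        # Never trust a client-supplied copy of the marker header.
        unset req.http.X-Stale-Ok;
        # Assumed short grace for the first pass.
        set req.grace = 10s;
    } else if (req.http.X-Stale-Ok) {
        # Second pass after a failed fetch: accept old stale objects.
        set req.grace = 72h;
    }
}

sub vcl_backend_response {
    if (beresp.status >= 500) {
        # Do not cache the error, keep the stale object instead.
        return (abandon);
    }
    if (beresp.status < 400) {
        set beresp.grace = 72h;
    }
}

sub vcl_backend_error {
    return (abandon);
}

sub vcl_synth {
    # Guess: a 503 on the first pass is assumed to come from an
    # abandoned fetch. Restart once, with the marker header set.
    if (resp.status == 503 && req.restarts == 0) {
        set req.http.X-Stale-Ok = "1";
        return (restart);
    }
}
```

The second lookup after the restart is where the race comes from: the
stale object may have changed or disappeared between the two passes.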

> > Is it hurting you that less frequently requested contents don't stay
> > in the cache?
>
> If it results in people seeing error pages when a stale content would be perfectly fine for them, then yes.
>
> And these less frequently requested pages might still be part of a group of pages that all result in an error in the backend (while the health probe still returns 200 OK). So while one individual page might be visited infrequently, the total number of visits on these kinds of pages might be high.
>
> Let's say that there are 3,000 unique (and cacheable) pages that are visited during an average weekend. All of these are in the Varnish cache, but 2,000 of them have stale content. Now let's say that 50% of all pages start returning 500 errors from the backend on a Friday evening. That would mean that about 1,000 of these stale pages would result in an error being displayed to end users during that weekend. I would much prefer that Varnish still serve them stale content, and then I could look into the problem on Monday morning.

In this case you might want to combine your VCL restart logic with
vmod_saintmode.

https://github.com/varnish/varnish-modules/blob/6.0-lts/docs/vmod_saintmode.rst#vmod_saintmode

This VMOD allows you to create circuit breakers for individual
resources for a given backend. That will result in more complicated
VCL, but it will help you mark individual resources as sick, making the
need for a "special req header" redundant. And since vmod_saintmode marks
resources sick for a given time, it means that NOT ALL individual
clients will go through the complete restart dance during that window.

I think you may still have to do a restart in vcl_miss because only
then will you know the saint-mode health (you need both a backend and
a hash).
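
A minimal sketch of the circuit-breaker part only, using the calls
shown in the 6.0-lts documentation linked above (newer releases of the
VMOD may name them differently). The backend address, the threshold of
10 and the 1h duration are assumptions for illustration. The restart
from vcl_miss is deliberately left out here.

```vcl
vcl 4.0;

import saintmode;

# Placeholder backend, not taken from this thread.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_init {
    # Wrap the backend. The second argument is the number of
    # blacklisted objects after which the whole backend is
    # considered sick (assumed value).
    new sm = saintmode.saintmode(default, 10);
}

sub vcl_backend_fetch {
    # Fetch through the saintmode wrapper instead of the raw backend.
    set bereq.backend = sm.backend();
}

sub vcl_backend_response {
    if (beresp.status >= 500) {
        # Mark this one resource sick for this backend for 1h,
        # then drop the error response.
        saintmode.blacklist(1h);
        return (abandon);
    }
    set beresp.grace = 72h;
}
```

With many resources failing at once, as in the example quoted above,
the threshold decides how soon the whole backend is treated as sick,
so it needs some thought.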

> > Another option is to give Varnish a high TTL (and give clients a lower
> > TTL) and trigger a form of invalidation directly from the backend when
> > you know a resource changed.
>
> Well, that is perfectly fine for pages that have a one-to-one mapping between the page (i.e. the URL) and the updated content. But most pages in our setup contain a mix of multiple contents, and it is not possible to know beforehand if a specific content will contribute to the result of a specific page. That is especially true for new content that might be included in multiple pages already in the cache.
>
> The only way to handle that in a foolproof way, as far as I can tell, is to invalidate all pages (since any page can contain this kind of content) the moment any object is updated. But that would pretty much clear the cache constantly. And we would still have to handle the case where the cache is invalidated for a page that gives a 500 error when Varnish tries to fetch it.

And you might solve this problem with vmod_xkey!

https://github.com/varnish/varnish-modules/blob/6.0-lts/docs/vmod_xkey.rst#vmod_xkey

You need help from the backend to communicate a list of "abstract
identifiers" of "things" that contribute to a response. This way if a
change in your backend spans multiple responses you can still perform
a single invalidation to affect them all.
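
A minimal sketch, closely following the example in the vmod_xkey
documentation linked above. It assumes the backend tags each response
with an "xkey" response header listing its identifiers, for example
"xkey: article-42 author-7" (made-up keys). The purgers ACL, the use
of a PURGE request and the backend address are also assumptions.

```vcl
vcl 4.0;

import xkey;

# Placeholder backend, not taken from this thread.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Assumed: only the local machine may invalidate.
acl purgers {
    "127.0.0.1";
}

sub vcl_recv {
    if (req.method == "PURGE") {
        if (client.ip !~ purgers) {
            return (synth(403, "Forbidden"));
        }
        if (!req.http.xkey) {
            return (synth(400, "Missing xkey header"));
        }
        # Soft purge: expire the TTL of every object tagged with the
        # given key(s), but keep the objects around for grace.
        set req.http.n-gone = xkey.softpurge(req.http.xkey);
        return (synth(200, "Invalidated " + req.http.n-gone + " objects"));
    }
}
```

A soft purge rather than a hard one fits this thread: the invalidated
objects remain available as stale content if the next fetch fails.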

Dridi