Grace and misbehaving servers

Thu Mar 19 10:12:11 UTC 2020

On Tue, Mar 17, 2020 at 8:06 PM Batanun B <batanun at hotmail.com> wrote:
>
> Hi Dridi,
>
> On Monday, March 16, 2020 9:58 AM Dridi Boukelmoune <dridi at varni.sh> wrote:
>
> > Not really, it's actually the other way around. The beresp.grace
> > variable defines how long you may serve an object past its TTL once it
> > enters the cache.
> >
> > Subsequent requests can then limit grace mode, so think of req.grace
> > as a req.max_grace variable (which maybe hints that it should have
> > been called that in the first place).
>
> OK. So beresp.grace mainly affects how long the object can stay in the cache? And if ttl + grace + keep is a low value set in vcl_backend_response, then vcl_recv is limited in how high the grace can be?

Not quite!

ttl+grace+keep defines how long an object may stay in the cache
(barring any form of invalidation).

The grace I'm referring to is beresp.grace: it defines how long we
may serve a stale object while a background fetch is in progress.
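
As a rough sketch of where these knobs live (the durations and the
backend address below are made up for illustration, not a
recommendation):

```vcl
vcl 4.1;

# Placeholder backend, only here to make the sketch self-contained.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # Fresh for one minute.
    set beresp.ttl = 1m;
    # May be served stale for up to one more hour while a
    # background fetch is in progress.
    set beresp.grace = 1h;
    # Kept one more day after that, only as a candidate for
    # conditional (revalidation) fetches.
    set beresp.keep = 1d;
}
```

With these assumed values the object may stay in the cache for up to
1m + 1h + 1d, barring invalidation.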

> And req.grace doesn't affect the time that the object is in the cache? Even if req.grace is set to a low value on the very first request (i.e. the same request that triggers the call to the backend)?

Right, req.grace only defines the maximum staleness tolerated by a
client. So if backend selection happens on the client side (in
vcl_recv), you can, for example, adjust that maximum based on the
health of the backend.
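
A minimal sketch of that idea, assuming the backend is chosen in
vcl_recv so that req.backend_hint is meaningful there, and using
std.healthy() from vmod_std. The 10s and 24h values and the backend
definition are illustrative assumptions:

```vcl
vcl 4.1;

import std;

# Placeholder backend with a probe, so that it has a health state.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";
        .interval = 5s;
    }
}

sub vcl_recv {
    # Healthy backend: tolerate only a little staleness.
    # Sick backend: req.grace is left alone, so the full
    # beresp.grace stored with the object applies.
    if (std.healthy(req.backend_hint)) {
        set req.grace = 10s;
    }
}

sub vcl_backend_response {
    # Upper bound on how stale an object may ever be served.
    set beresp.grace = 24h;
}
```

Note that this only limits what a given request accepts; it does not
change how long the object stays in the cache.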

> > What you are describing is stale-if-error, something we don't support
> > but could be approximated with somewhat convoluted VCL. It used to be
> > easier when Varnish had saint mode built-in because it generally
> > resulted in less convoluted VCL.
> >
> > It's not something I would recommend attempting today.
>
> That's strange. This stale-if-error sounds like something pretty much everyone would want, right? I mean, if there is stale content available, why show an error page to the end user?

As always in such cases, it's not black or white. Depending on the
nature of your web traffic, you may want to lean toward always
serving something, or toward never serving anything stale. For
example, live "real time" traffic may favor failing some requests
over serving stale data.

Many users want stale-if-error, but it's not trivial, and it needs to
be balanced against other aspects like performance.

> But maybe it was my wish to "cache/remember" previous failed fetches that made it complicated? So if I loosen the requirements/wish-list a bit, into this:
>
> Assuming that:
> * A request comes in to Varnish
> * The content is stale, but still in the cache
> * The backend is considered healthy
> * The short (10s) grace has expired
> * Varnish triggers a synchronous fetch to the backend
> * This fetch fails (timeout or 5xx error)
>
> I would then like Varnish to:
> * Return the stale content

I agree that on paper it sounds simple, but in practice it might be
harder to get right.

For example, "add HTTP/3 support" is a simple statement, but the work
it implies can be orders of magnitude more complicated. And
stale-if-error is one of those tricky features: it is tricky for
performance, it must not break existing VCL, etc.

> Would this be possible using basic Varnish community edition, without a "convoluted VCL", as you put it? Is it possible without triggering a restart of the request? Either way, I am interested in hearing about how it can be achieved. Is there any documentation or blog post that mentions this? Or can you give me some example code perhaps? Even a convoluted example would be OK by me.

I wouldn't recommend stale-if-error at all today, as I said in my first reply.

> Increasing the req.grace value for every request is not an option, since we only want to serve old content if Varnish can't get hold of new content. And some of our pages are visited very rarely, so we can't rely on a constant stream of visitors keeping the content fresh in the cache.

Is it hurting you that less frequently requested content doesn't stay
in the cache?

Another option is to give Varnish a high TTL (and give clients a lower
TTL) and trigger a form of invalidation directly from the backend when
you know a resource changed.
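
A rough sketch of what that could look like, assuming the backend
sends an HTTP PURGE request to Varnish when a resource changes. The
ACL name, the allowed address, the backend definition and the
durations are all made up for illustration:

```vcl
vcl 4.1;

# Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical ACL: who may invalidate (here, only localhost).
acl invalidators {
    "127.0.0.1";
}

sub vcl_recv {
    if (req.method == "PURGE") {
        if (client.ip !~ invalidators) {
            return (synth(405, "Not allowed"));
        }
        # Drop the cached object for this URL and Host.
        return (purge);
    }
}

sub vcl_backend_response {
    # High TTL inside Varnish; freshness relies on purges.
    set beresp.ttl = 1d;
}

sub vcl_deliver {
    # Lower TTL advertised to clients. The Age header still
    # reflects time spent in Varnish, so depending on your
    # clients you may want to adjust it as well.
    set resp.http.Cache-Control = "max-age=60";
}
```

The backend could also express the split itself with a Cache-Control
header carrying both s-maxage (honored by Varnish) and max-age (for
clients), instead of overriding things in VCL.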

Dridi