Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic

Thu Dec 16 20:23:15 UTC 2021

Could be a vary issue, can you post the output of `varnishlog -d -g request
-q 'RespStaus eq 500'? Please anonymize the bits you don't want us to see
by replacing them with "XXXXXX", but don't simply remove them.

In the meantime, here's a cheat sheet to get started with varnishlog:
https://docs.varnish-software.com/tutorials/vsl-query/

-- 
Guillaume Quintard

On Thu, Dec 16, 2021 at 12:15 PM Batanun B <batanun at hotmail.com> wrote:

> Hi,
>
> One of our websites usually has quite a low but steady stream of visitors,
> but occationally we get a sudden surge of requests over a very short time
> period (about 1-2 minutes). Varnish seems to handle the load fine, but the
> backends struggle with this. But I have noticed that Varnish doesn't serve
> the stale cached data, but instead shows our synthetic 500 page. This is
> true even for the start page, that definitely existed in the cache. And we
> have a grace period of 10 days, so I it's quite annoying that we can't
> simply serve the stale cached data during this short period.
>
> I have tried picturing the entire flow, following the logic in the vcl,
> but I can't see what we do wrong. And annoyingly I can't reproduce the
> problem locally by simply shutting down the backends (or setting them to
> unhealthy), because whenever I do that I get the stale content served just
> as intended. Could the sheer volume itself cause this, making it impossible
> to reproduce by simply fetching the page a few times in the browser before
> and after disabling the backends? Or is there some edge case that I haven't
> thought of that is causing this?
>
> A simpliedfied version of our vcl is included below, with only the
> relevant parts. But unless I have some blatent problem with the vcl, I
> think it would be good if I learned how to troubleshoot this using the
> Varnish tools, like varnishlog. So that next time this happens, I can use
> varnishlog etc to see what's happening.
>
> Is it possible using varnishlog to find the very first request for a
> specific path ("/" in our case) where it returned the synthetic 500 and put
> it in the cache? And is it also possible to find the request just before
> that one, for the same path. If I could extract those two requests
> (including all the metadata in those transactions) from the djungle of
> thousands of requests, then maybe I can find some explanation why the
> second request doesn't return the stale data.
>
> -----------------------------
>
> sub vcl_hit {
>     if (obj.ttl > 0s) {
>         // Regular cache hit
>         return (deliver);
>     } else if (req.restarts == 0 && std.healthy(req.backend_hint)) {
>         // Graced cache hit, first attempt.
>         // Force cache miss to trigger fetch in foreground (ie synchronous
> fetch).
>         set req.http.X-forced-miss = "true";
>         return (miss);
>     } else {
>         // Graced cache hit, previous attempts failed (or backend
> unhealthy). Let the fetch happen in the background (ie asynchronous fetch),
> and return the cached value.
>         return (deliver);
>     }
> }
>
> sub vcl_recv {
>     if (req.http.X-cache-pass == "true") {
>         return(pass);
>     }
>
>     set req.grace = 240h; // 10 day grace
> }
>
> sub vcl_backend_response {
>     if (bereq.http.X-cache-pass != "true") {
>   if (beresp.status < 400) {
>      set beresp.grace = 240h;
>      set beresp.ttl = 30m; // Cache invalidation in the form of xkey
> softpurge can put objects into grace before the TTL is past.
>         } else if (beresp.status < 500) {
>             set beresp.ttl = 1m;
>             return (deliver);
>         } else if (beresp.status >= 500) {
>             // In some cases we want to abandon the backend request on
> 500-errors, since it otherwise would overwrite the cached object that still
> is usefull for grace.
>             // This will make it jump to vcl_synth with a 503 status.
> There it will restart the request.
>             if (bereq.is_bgfetch) {
>                 // In grace period. Abandoning 5xx request, since it
> otherwise would overwrite the cached object that still is usefull for grace
>                 return (abandon);
>             } else if (bereq.http.X-forced-miss == "true") {
>                 return (abandon);
>             }
>             // Non background fetch, ie no grace period (and no stale
> content available). Cache the error page for a few seconds.
>             set beresp.ttl = 15s;
>             return (deliver);
>         }
>     }
> }
>
> sub vcl_synth {
>     if (req.http.X-forced-miss == "true" && resp.status >= 500) {
>         return (restart);
>     }
> }
>
> sub vcl_backend_error {
>     if (bereq.is_bgfetch || bereq.http.X-forced-miss == "true") {
>         // By calling abandon, it will jump to vcl_synth with a 503
> status. There it will restart the request.
>         return (abandon);
>     }
>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>     set beresp.ttl = 15s;
>     synthetic(std.fileread("/etc/varnish/500.html"));
>     return (deliver);
> }
>
> -----------------------------
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20211216/57851d97/attachment-0001.html>