Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic

Batanun B batanun at hotmail.com
Thu Dec 16 20:12:21 UTC 2021


Hi,

One of our websites usually has quite a low but steady stream of visitors, but occationally we get a sudden surge of requests over a very short time period (about 1-2 minutes). Varnish seems to handle the load fine, but the backends struggle with this. But I have noticed that Varnish doesn't serve the stale cached data, but instead shows our synthetic 500 page. This is true even for the start page, that definitely existed in the cache. And we have a grace period of 10 days, so I it's quite annoying that we can't simply serve the stale cached data during this short period.

I have tried picturing the entire flow, following the logic in the vcl, but I can't see what we do wrong. And annoyingly I can't reproduce the problem locally by simply shutting down the backends (or setting them to unhealthy), because whenever I do that I get the stale content served just as intended. Could the sheer volume itself cause this, making it impossible to reproduce by simply fetching the page a few times in the browser before and after disabling the backends? Or is there some edge case that I haven't thought of that is causing this?

A simpliedfied version of our vcl is included below, with only the relevant parts. But unless I have some blatent problem with the vcl, I think it would be good if I learned how to troubleshoot this using the Varnish tools, like varnishlog. So that next time this happens, I can use varnishlog etc to see what's happening.

Is it possible using varnishlog to find the very first request for a specific path ("/" in our case) where it returned the synthetic 500 and put it in the cache? And is it also possible to find the request just before that one, for the same path. If I could extract those two requests (including all the metadata in those transactions) from the djungle of thousands of requests, then maybe I can find some explanation why the second request doesn't return the stale data.

-----------------------------

sub vcl_hit {
    if (obj.ttl > 0s) {
        // Regular cache hit
        return (deliver);
    } else if (req.restarts == 0 && std.healthy(req.backend_hint)) {
        // Graced cache hit, first attempt.
        // Force cache miss to trigger fetch in foreground (ie synchronous fetch).
        set req.http.X-forced-miss = "true";
        return (miss);
    } else {
        // Graced cache hit, previous attempts failed (or backend unhealthy). Let the fetch happen in the background (ie asynchronous fetch), and return the cached value.
        return (deliver);
    }
}

sub vcl_recv {
    if (req.http.X-cache-pass == "true") {
        return(pass);
    }

    set req.grace = 240h; // 10 day grace
}

sub vcl_backend_response {
    if (bereq.http.X-cache-pass != "true") {
  if (beresp.status < 400) {
     set beresp.grace = 240h;
     set beresp.ttl = 30m; // Cache invalidation in the form of xkey softpurge can put objects into grace before the TTL is past.
        } else if (beresp.status < 500) {
            set beresp.ttl = 1m;
            return (deliver);
        } else if (beresp.status >= 500) {
            // In some cases we want to abandon the backend request on 500-errors, since it otherwise would overwrite the cached object that still is usefull for grace.
            // This will make it jump to vcl_synth with a 503 status. There it will restart the request.
            if (bereq.is_bgfetch) {
                // In grace period. Abandoning 5xx request, since it otherwise would overwrite the cached object that still is usefull for grace
                return (abandon);
            } else if (bereq.http.X-forced-miss == "true") {
                return (abandon);
            }
            // Non background fetch, ie no grace period (and no stale content available). Cache the error page for a few seconds.
            set beresp.ttl = 15s;
            return (deliver);
        }
    }
}

sub vcl_synth {
    if (req.http.X-forced-miss == "true" && resp.status >= 500) {
        return (restart);
    }
}

sub vcl_backend_error {
    if (bereq.is_bgfetch || bereq.http.X-forced-miss == "true") {
        // By calling abandon, it will jump to vcl_synth with a 503 status. There it will restart the request.
        return (abandon);
    }
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    set beresp.ttl = 15s;
    synthetic(std.fileread("/etc/varnish/500.html"));
    return (deliver);
}

-----------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20211216/77e5bbd9/attachment.html>


More information about the varnish-misc mailing list