Varnish sometimes delays requests for minutes

Wed Sep 19 02:36:04 CEST 2012

I've only recently begun using Varnish, so let me first and foremost apologize
for anything obvious that I've overlooked or misunderstood.

We've just started using Varnish on a small part of our infrastructure, and for
the most part, it's actually beautifully.  However, periodically a page request
will take around two minutes to return.

When a request acts like this, all other requests to the same URL act the same;
that is, they all hang and get back responses at the same time.  My guess is
that the hang is exactly 2 minutes, but I never make the first request wrapped
in a time(1) call.  During this time, other requests passing through the same
VCL logic and going to the same backend execute successfully - both those
served from the cache and those not.

The hung requests only happen on pages that have expired caches, but not every
request to that page while it is expired will trigger a hang.  That is, I have
a difficult time reproducing it, and kinda have to just watch requests in the
corner of my eye while going about other duties.

I am fairly certain that these hangs are not the fault of the backend, but not
completely sure - mark me down at 95% certainty.

Here are the parts of my configuration that seem potentially relevant:

========

backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .connect_timeout = 30s;       # Default: 0.7s
    .first_byte_timeout = 3m;     # Default: 60s
    .between_bytes_timeout = 2m;  # Default: 60s
    # Always feed requests to the backend, even if Varnish thinks it's down.
    .saintmode_threshold = 0;
}

sub vcl_recv {
    # Set X-Forwarded-For so we get the client IP on the backend.
    if (req.restarts == 0) {
        if (req.http.X-Forwarded-For) {
            set req.http.X-Forwarded-For =
             req.http.X-Forwarded-For + ", " + client.ip;
        } else {
            set req.http.X-Forwarded-For = client.ip;
        }
    }

    # Ignore non-RFC2616 or CONNECT requests entirely.
    if (req.request != "GET"
     && req.request != "HEAD"
     && req.request != "PUT"
     && req.request != "POST"
     && req.request != "TRACE"
     && req.request != "OPTIONS"
     && req.request != "DELETE") {
        return (pipe);
    }
    # And don't cache anything not GET or HEAD, since POST and friends usually
    # modify state.
    if (req.request != "GET" && req.request != "HEAD") {
        return (pass);
    }

    # Turn on rules a little bit at a time.
    if (req.http.host ~ "^my.example.com$") {
        # [snip] force non-caching on a few pages

        set req.grace = 30s;

        return (lookup);
    }

    # Don't cache anything we haven't explicitly told Varnish to cache.
    return (pass);
}

sub vcl_fetch {
    if (req.http.host ~ "^my.example.com$") {
        # Cache permanent redirects for a while, since... they're permanent.
        if (beresp.status == 301) {
            set beresp.ttl = 30m;
        }

        # Sometimes 404s are the result of us not getting assets built quite in
        # time, or similar things.  So, let us recover quickly.
        if (beresp.status == 404) {
            set beresp.ttl = 0s;
            return (hit_for_pass);
        }

        # Hopefully refreshing an error page will cause it to not error.
        if (beresp.status == 500
         || beresp.status == 502
         || beresp.status == 503
         || beresp.status == 504) {
            set beresp.ttl = 0s;
            return (hit_for_pass);
        }

        # Allow objects to live a bit longer than their expiry time, in case we
        # want to serve up expired objects while waiting for the backend to
        # respond.  This should never be shorter than req.grace, as set in
        # vcl_recv.
        set beresp.grace = 1m;

        return (deliver);
    }
}

========

While I would appreciate any solutions, what I'm really looking for are
pointers on how to go about debugging this sort of problem - does varnishlog
provide information that's useful?  It spits out so much information that I'm
not quite sure what I need to filter out to see what's going on here.

Any help appreciated.  Thanks.
 - P