Backend Fetch failed

Geoff Simmons geoff at uplex.de
Thu Apr 6 19:38:22 CEST 2017



For problems like this, *always look for the FetchError entry in the
backend logs*.
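If the log is busy, you can pick out just the transactions with fetch
failures using a log query; -b restricts the output to the backend
side (assuming the standard varnishlog query syntax):

$ varnishlog -b -q 'FetchError'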

> << BeReq    >> 65547
[...]
> -   FetchError     no backend connection
> -   Timestamp      Beresp: 1491485655.912871 0.000051 0.000051
> -   Timestamp      Error: 1491485655.912878 0.000059 0.000007
[...]
> -   End

The client-side logs, on the other hand, frankly don't matter -- not
for the purposes of diagnosing the problem with the backend fetch. So
I'll just ignore them altogether.

> *   << Request  >> 65546
[...]
> -   End

> *   << Request  >> 5
[...]
> -   End

> *   << BeReq    >> 6
[...]
> -   FetchError     no backend connection
> -   Timestamp      Beresp: 1491485659.606340 0.000056 0.000056
> -   Timestamp      Error: 1491485659.606347 0.000062 0.000006
[...]

FetchError "no backend connection" very likely means, in this case,
that your backend is failing its health checks, so that Varnish
determines that there is no healthy backend to which it can direct the
requests.

There is one other possibility for "no backend connection": Varnish
attempted to initiate a network connection to the backend, but the
connection could not be established before connect_timeout expired. In
that case, the timestamps would show that almost exactly
connect_timeout had elapsed, which for your config would be very
obvious (more about that further down). But as you can see in the
Timestamp entries, Varnish determined the error after about 50
microseconds -- near-certain proof that the health checks failed,
since that is about enough time for Varnish to check its record that
the backend is unhealthy.
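For reference, the three numbers in a Timestamp entry are the absolute
time of the event, the time since the start of the task (here, the
backend request), and the time since the previous timestamp. Reading
the first transaction above:

    Timestamp Error: 1491485655.912878 0.000059 0.000007
                     absolute time     since    since last
                                       start    timestamp

The second number, 0.000059, is the roughly 50 microseconds from task
start to the error.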

You can see the results of the health checks in the log, but for that
you need raw grouping, since health checks are not transactional (they
are not part of requests/responses that Varnish serves):

$ varnishlog -g raw -i Backend_health
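The output looks something like this -- an illustration, not from your
system. The flag string shows what happened on the last probe (an 'H'
at the end means the probe was judged good; its absence means it
failed); then come the number of good probes in the window, the
threshold, the window size, the response times, and the first line of
the probe response:

0 Backend_health - boot.drupal Still sick 4--X-R- 0 3 5 0.001761 0.000000 HTTP/1.1 404 Not Found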

Your health checks are probably failing because you've written the
probes incorrectly:

backend drupal {
[...]
    .probe = {
        .url = "drupal.miat.com<http://drupal.miat.com>";
[...]
     }
}

This is a very common misunderstanding: "url" in the conceptual world
of Varnish only ever refers to the *path*; the domain should not
appear there. So your probes should say something like:

backend drupal {
[...]
    .probe = {
        .url = "/"; # or whatever path should be used for probes
[...]
     }
}
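The reason this matters: the .url string becomes the request target of
the probe request, verbatim. With your config the probe's request line
is (roughly) the first line below, which is not valid HTTP, so the
backend can only reject it:

    GET drupal.miat.com HTTP/1.1    (your config: not a path)
    GET / HTTP/1.1                  (with .url = "/")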

Even after you fix that, you're really taking chances with the short
timeout for the probes:

    .probe = {
[...]
        .timeout = 60ms;
[...]
     }

Are you sure that your backends will always respond to the health
probes within 60 milliseconds? Set it to 1s and give them a chance.
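Putting the two fixes together, the probe could look something like
this (a sketch: the interval, window and threshold are assumptions I
made up, not values from your config -- adapt them to your setup):

backend drupal {
[...]
    .probe = {
        .url = "/";          # a path, not a domain
        .timeout = 1s;       # give the backend a chance to answer
        .interval = 5s;      # assumed: send a probe every 5 seconds
        .window = 5;         # assumed: judge health over the last 5 probes
        .threshold = 3;      # assumed: at least 3 of the 5 must succeed
    }
}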

That, I think, is the cause of your 503 problem, but I also have to
say something about the timeouts you have set for all of your
backends:

    .connect_timeout = 6000s;
    .first_byte_timeout = 6000s;
    .between_bytes_timeout = 6000s;

Those timeouts are astonishingly, appallingly, gobsmackingly too long.
Just looking at that is almost making my head explode.

This is another common mistake: setting the Varnish timeouts to
"forever and ever and ever". On the contrary, you're much better off
if the timeouts are set to *fail fast*.

Setting your timeouts to 100 minutes helps absolutely no one at all --
it means that a worker thread in Varnish will sit idle for 100
minutes, waiting for something to happen. Worker threads are a limited
resource in Varnish; you want them to keep doing useful work, and give
up relatively soon if a backend is not responding. If there is a
serious problem in your system, so that many backends are not
responding, then your worker threads will all go idle waiting for the
timeouts to expire, and Varnish will have to start new threads.
Eventually the maximum number of threads will be reached, and when
that happens, Varnish will start to refuse new requests, which usually
means that your site goes down altogether. It's a recipe for disaster.
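You can watch for that death spiral with varnishstat. If MAIN.threads
keeps climbing toward the configured maximum and MAIN.threads_limited
starts counting up, you are in exactly this situation. For example:

$ varnishstat -1 -f MAIN.threads -f MAIN.threads_limited -f MAIN.sess_dropped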

Rest assured that if your backend has not responded for 5999 seconds,
then it's not going to respond in the 6000th second either. It's not
responding at all.

Consider just going with the default timeouts, or with something on
the order of 6 seconds, rather than 6000 seconds. Or maybe 60 seconds,
but that's already getting too long. If your backend developers can't
get their apps to respond within a few seconds, then go yell at them
until they do. As the Varnish admin, you *cannot* solve that problem
for them by setting your timeouts to "until the end of time".
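
For example, a sketch using the 6-second figure above (leaving the
attributes out entirely falls back to the Varnish defaults, which are
also sensible):

backend drupal {
[...]
    .connect_timeout = 6s;         # fail fast if we can't even connect
    .first_byte_timeout = 6s;      # the app must start responding quickly
    .between_bytes_timeout = 6s;   # and the response must keep flowing
}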


HTH,
Geoff
-- 
** * * UPLEX - Nils Goroll Systemoptimierung

Scheffelstraße 32
22301 Hamburg

Tel +49 40 2880 5731
Mob +49 176 636 90917
Fax +49 40 42949753

http://uplex.de



