Random "Data not received" with Varnish behind ELB

Wed Oct 22 18:59:55 CEST 2014

At 2014-10-22T12:23-0400, Greg Taylor wrote:
> We managed to find what appears to be one of these failed requests:
> 2014-10-22T15:47:56.092374Z littlepeople 68.115.251.182:45368 - -1
> -1 -1 408 0 0 0 "GET http://littlepeople.pathwright.com:80/
> HTTP/1.1"

What does varnishlog show at this point?
If you don't see anything from varnishlog during a broken request, I
would break out tshark to dump packets on the Varnish box. Also check
whether the logs (or packets) differ between Chrome and Firefox
connections. It could be something like SPDY not interacting properly
with some aspect of the system.

> Says request timeout. I've got my ELB timeout set to 6 seconds, and
> varnish's default idle_timeout is 10s. I have also added this:

What's the backend webserver timeout?
What happens if you set them all to have the same timeout?

Given the generosity of your other timeouts, why is connect_timeout only
1 second? Especially combined with forcing all backend connections to be
closed, this seems like a prime candidate for errors.
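
As a rough sketch of a less aggressive setting, a backend definition
might look something like the following. The host, port, and all of the
timeout values are placeholder assumptions, not taken from your config;
pick values that fit your backend and the ELB timeout.

```vcl
# Hypothetical backend; host, port, and values are illustrative only.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    # Time allowed to establish the TCP connection to the backend.
    .connect_timeout = 5s;
    # Time to wait for the first byte of the backend response.
    .first_byte_timeout = 60s;
    # Maximum gap between bytes while the response is being received.
    .between_bytes_timeout = 60s;
}
```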

Can you post a gist of "varnishstat -1"?

On the ELB monitoring page, what do the statistics look like?
Especially for the following:
* Surge Queue Length
* Spillover Count
* Backend Connection Errors
* Average Latency (tells you how close you're coming to those timeouts)

You might find this useful reading:
http://reinvent.kinvey.com/h/i/6206548-key-aws-elb-monitoring-metrics

>    sub vcl_pipe {
>      # Don't re-use backend connections.
>      set bereq.http.Connection = "close";
>    }

Don't do this. Re-opening connections is expensive.

Paul