Does a return(restart) prevent first_byte_timeout from terminating a connection?

Sat Jan 11 01:39:45 CET 2014

I am experiencing some strange behavior. Periodically requests will take 2x
the time specified in first_byte_timeout before they actually are killed
and the server returns a 503. This is quite annoying and it means that our
users receive a 504 gateway timeout instead of our error message.

My gut tells me that it has something to do with the way return(restart)
works. The response code returned from the backend is a 503.

Let me describe my vcl config briefly. The backend definitions look like
this:

backend test1 {
  .host = "10.0.0.1";
  .port = "80";
  .connect_timeout = 1s;
  .first_byte_timeout = 65s;
  .between_bytes_timeout = 1s;
  .saintmode_threshold = 5;
  .probe = {
    .request =
      "GET /status HTTP/1.1"
      "Host: test.com"
      "Connection: close";
    .interval = 60s;
    .timeout = 10s;
    .window = 5;
    .threshold = 3;
  }
}

(similar for backend test2)

I'm using the round-robin directory. Under vcl_fetch I have this:

sub vcl_fetch {
  if (beresp.status == 503) {
    set beresp.saintmode = 60s;
    if (req.request != "POST") {
      return(restart);
    } else {
      error 500 "Failed";
    }
  }

  set beresp.grace = 10s;
}

Lastly, here's me catching the varnish server in the act of essentially
making two requests and not timing out.

  169 BackendXID   b 1547011740
  169 TxRequest    b GET
  169 TxURL        b test?
  169 TxProtocol   b HTTP/1.1
  169 TxHeader     b User-Agent: curl/7.30.0
  169 TxHeader     b Host: xxx.xxx.xxx
  169 TxHeader     b Accept: */*
  169 TxHeader     b X-Varnish: 1547011740
  169 TxHeader     b Accept-Encoding: gzip
  169 BackendClose b test1

  107 BackendXID   b 1547011740
  107 TxRequest    b GET
  107 TxURL        b /test?
  107 TxProtocol   b HTTP/1.1
  107 TxHeader     b User-Agent: curl/7.30.0
  107 TxHeader     b Host: xxx.xxx.xxx
  107 TxHeader     b Accept: */*
  107 TxHeader     b X-Varnish: 1547011740
  107 TxHeader     b Accept-Encoding: gzip
  107 BackendClose b test2

   47 SessionOpen  c 162.208.42.130 59288 :80
   47 ReqStart     c 162.208.42.130 59288 1547011740
   47 RxRequest    c GET
   47 RxURL        c /test?
   47 RxProtocol   c HTTP/1.1
   47 RxHeader     c User-Agent: curl/7.30.0
    47 RxHeader     c Accept: */*
   47 RxHeader     c Connection: keep-alive
   47 VCL_call     c recv lookup
   47 VCL_call     c hash
   47 Hash         c
   47 Hash         c /test?
   47 Hash         c
   47 VCL_return   c hash
   47 VCL_call     c miss fetch
   47 Backend      c 169 main_cluster test1
   47 FetchError   c http first read error: -1 11 (Resource temporarily
unavailable)
   47 Backend      c 107 main_cluster test2
   47 FetchError   c http first read error: -1 11 (Resource temporarily
unavailable)
   47 VCL_call     c error deliver
   47 VCL_call     c deliver deliver
   47 TxProtocol   c HTTP/1.1
   47 TxStatus     c 503
   47 TxResponse   c Service Unavailable
   47 TxHeader     c Server: Varnish
   47 TxHeader     c Content-Type: application/json; charset=utf-8
   47 TxHeader     c Content-Length: 65
   47 TxHeader     c Accept-Ranges: bytes
   47 TxHeader     c Date: Fri, 10 Jan 2014 23:24:52 GMT
   47 TxHeader     c X-Varnish: 1547011740
   47 TxHeader     c Age: 130
   47 TxHeader     c Via: 1.1 varnish
   47 TxHeader     c Connection: close
   47 Length       c 65
   47 ReqEnd       c 1547011740 1389396162.194909096 1389396292.196509361
0.000040293 130.000771999 0.000828266

What it looks to me like what's happening is the first server is returning
a bad status code after 60 seconds (the first_byte_timeout is 65), which
satisfies the requirement of first_byte_timeout. The request triggers a
return(restart) and it's tried again on another host, which takes another
60 or so seconds. Finally the request just dies by only after it's been
hanging around for twice the time allotted in first_byte_timeout

Does anyone know what I'm doing wrong? I do want to trigger a
return(restart) for spurious status codes, but I want first_byte_timeout to
serve as the time I'm willing to let a request go through. Is there a
better parameter I can use than this one?

It looks very similar to this
bug<https://www.varnish-cache.org/trac/ticket/1156> but
it looks like that's been fixed. I'm using varnish 3.0.5.

Any help is greatly appreciated.
-- 
Stephen Wood
www.heystephenwood.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20140110/d1659e4a/attachment.html>