Backend Polling + connection timeouts = weird issue

Thu Apr 22 08:38:52 CEST 2010

So I'm having a weird sporadic issue, and I'm wondering if varnish can
mark a backend down on timed out connections that are not the backend
probe. I ask since I'm not getting failed backend messages in varnish
log, yet i'm getting burst of 503's.

My backend probe looks like this.

backend backend_XXX {
.host = "10.100.118.238";
.port = "80";
.probe = {
	.url = "/robots.txt?__site=XXXXX";
	.timeout = 5000ms;
	.interval = 5s;
	.window = 10;
	.threshold = 4;
	}
}

After one of these timed out connections...
 201 ReqStart     c 10.100.114.196 60540 1416455775
 201 RxRequest    c GET
 201 RxURL        c /XXXXsomeXXXXurl
 201 RxProtocol   c HTTP/1.0
 201 RxHeader     c User-Agent: Wget/1.10.2 (Red Hat modified)
 201 RxHeader     c Accept: */*
 201 RxHeader     c Host: xxxx.xxxxx.com
 201 RxHeader     c Connection: Keep-Alive
 201 VCL_call     c recv
 201 VCL_return   c lookup
 201 VCL_call     c hash
 201 VCL_return   c hash
 201 VCL_call     c miss
 201 VCL_return   c fetch
---->   201 FetchError   c no backend connection <---------
 201 VCL_call     c error
 201 VCL_return   c deliver
 201 Length       c 839
 201 VCL_call     c deliver
 201 VCL_return   c deliver
 201 TxProtocol   c HTTP/1.1
 201 TxStatus     c 503
 201 TxResponse   c Service Unavailable
 201 TxHeader     c Server: Varnish
 201 TxHeader     c Retry-After: 0
 201 TxHeader     c Content-Length: 839
 201 TxHeader     c Date: Thu, 22 Apr 2010 02:03:46 GMT
 201 TxHeader     c X-Varnish: 1416455775
 201 TxHeader     c Age: 0
 201 TxHeader     c Via: 1.1 varnish
 201 TxHeader     c Connection: close
 201 ReqEnd       c 1416455775 1271901825.787379980
1271901826.187721968 0.000025034 0.400313139 0.000028849
 201 SessionClose c error
 201 StatSess     c 10.100.114.196 60540 0 1 1 0 0 0 195 839

a typical timed out connection, which i'm verifying in my backend as
having taken close to 80 seconds looks like this..

825637170 WorkThread   - 924 1271901801.804220915 0.000025988
1.712481976 0.000046015
  64 SessionOpen  c 10.100.114.196 60537 0.0.0.0:80
  64 ReqStart     c 10.100.114.196 60537 1416454378
  64 RxRequest    c GET
  64 RxURL        c /XXXsomeXXXsimilarXXXurl
  64 RxProtocol   c HTTP/1.0
  64 RxHeader     c User-Agent: Wget/1.10.2 (Red Hat modified)
  64 RxHeader     c Accept: */*
  64 RxHeader     c Host: sec.floridatoday.com
  64 RxHeader     c Connection: Keep-Alive
  64 VCL_call     c recv
  64 VCL_return   c lookup
  64 VCL_call     c hash
  64 VCL_return   c hash
  64 VCL_call     c miss
  64 VCL_return   c fetch
 211 BackendOpen  b backend_XXX 10.100.118.149 12967 10.100.118.238 80
  64 Backend      c 211 backend_XXX backend_XXX
 211 TxRequest    b GET
 211 TxURL        b /XXXsomeXXXsimilarXXXurl
 211 TxProtocol   b HTTP/1.1
 211 TxHeader     b User-Agent: Wget/1.10.2 (Red Hat modified)
 211 TxHeader     b Accept: */*
 211 TxHeader     b Host: xxxx.xxxxxx.com
 211 TxHeader     b X-Forwarded-For: 10.100.114.196
 211 TxHeader     b X-Varnish: 1416454378
 117 Debug        c "herding"
 202 Debug        c "herding"

.... never comes back...

( again, going back to the backend, shows this request as having taken
80 seconds, so i'm assuming varnish gave up after X seconds)

Has anyone seen something like this? It does recover... but it seems
like a time out connection is causing the whole backend to be marked
as failed.. which sucks big time for me since my backends are behind a
load balancer.

Should I try connection restarts? Would they even help in this case?

-Javier

PS: Unrelated, but i tried registering on trac, so submit my
varnishncsa extended logging patch, and it seems the registration is
not working on varnish-cache.org