503 service unavailable error

Tim Dunphy bluethundr at gmail.com
Thu Jul 9 05:19:58 CEST 2015


>
> that interval and window on your web server is scary..... what you're
> saying is 'check each web server every 10 minutes, and only fail it
> after 3 failures'


Hah!! Agreed. I was just trying to rule the connect timeouts out of the
picture as to why the failures were happening!
I plan to set them to more normal intervals once I'm finished testing and
I've been able to get this to work.
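
For reference, a probe with more conventional timing might look like the sketch below. The interval, timeout, window and threshold values are illustrative assumptions, not tested settings; only the .request lines come from my current config. With a window of 5 and a threshold of 3, a backend counts as healthy when at least 3 of its last 5 probes were good.

```vcl
backend web1 {
  .host = "10.10.10.25";
  .port = "80";
  .probe = {
    # Same health-check request as in the current config.
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: wiki.example.com"
      "Connection: close";
    .interval = 5s;    # assumed: poll every 5 seconds instead of every 10 minutes
    .timeout = 2s;     # assumed: give up on a single probe after 2 seconds
    .window = 5;       # assumed: consider the last 5 probes
    .threshold = 3;    # assumed: at least 3 of those 5 must be good
  }
}
```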


>
> next time you see the issue, look at:
> varnishadm -n <varnish_name> debug.health


Hmm you may have a point as to the back ends. Varnish is indeed seeing them
as 'sick' when I encounter the 503 error:


[root@varnish1:~] # varnishadm -n varnish1 debug.health
Backend web1 is Sick
Current states  good:  0 threshold:  2 window:  3
Average responsetime of good probes: 0.000000
Oldest                                                    Newest
================================================================
------------------------------------------------------4444444444 Good IPv4
------------------------------------------------------XXXXXXXXXX Good Xmit
------------------------------------------------------RRRRRRRRRR Good Recv
----------------------------------------------------HH---------- Happy
Backend web2 is Sick
Current states  good:  0 threshold:  2 window:  3
Average responsetime of good probes: 0.000000
Oldest                                                    Newest
================================================================
------------------------------------------------------4444444444 Good IPv4
------------------------------------------------------XXXXXXXXXX Good Xmit
------------------------------------------------------RRRRRRRRRR Good Recv
----------------------------------------------------HH---------- Happy


>
> I'd be willing to bet that varnish is just failing the backends.  Try
> running the healthcheck manually from the varnish boxes:
> curl -H "Host:kiki.example.com" -v "http://10.10.10.26/healthcheck.php"
> And see if you're actually getting good healthchecks.  If you're not,
> then you need to look at your backends (specifically healthcheck.php)


But if I perform the curl you're suggesting, I am able to retrieve the
healthcheck.php file!!

#curl --user admin:somepass -H "Host:wiki.example.com" -v "http://10.10.10.25/healthcheck.php"
* About to connect() to 52.5.117.61 port 80 (#0)
*   Trying 52.5.117.61... connected
* Connected to 52.5.117.61 (52.5.117.61) port 80 (#0)
* Server auth using Basic with user 'admin'
> GET /healthcheck.php HTTP/1.1
> Authorization: Basic SomeBase64Hash==
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Accept: */*
> Host:wiki.example.com
>
< HTTP/1.1 200 OK
< Date: Thu, 09 Jul 2015 02:10:35 GMT
< Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 PHP/5.4.42 SVN/1.7.14 mod_wsgi/3.4 Python/2.7.5
< X-Powered-By: PHP/5.4.42
< Content-Length: 5
< Content-Type: text/html; charset=UTF-8
<
good
* Connection #0 to host 52.5.117.61 left intact
* Closing connection #0

But in the curl I just did I was specifying the user auth. Which got me
thinking: maybe I'm handling Apache basic auth the wrong way in my VCL
file?

To test this idea out, I commented out the basic auth lines in my apache
config. Then cycled the services on both apache servers and both varnish
servers.

When I ran the test you gave me again, this is the result I got back:

#varnishadm -n  varnish1   debug.health
Backend web1 is Healthy
Current states  good:  3 threshold:  2 window:  3
Average responsetime of good probes: 0.032781
Oldest                                                    Newest
================================================================
---------------------------------------------------------------4 Good IPv4
---------------------------------------------------------------X Good Xmit
---------------------------------------------------------------R Good Recv
-------------------------------------------------------------HHH Happy
Backend web2 is Healthy
Current states  good:  3 threshold:  2 window:  3
Average responsetime of good probes: 0.032889
Oldest                                                    Newest
================================================================
---------------------------------------------------------------4 Good IPv4
---------------------------------------------------------------X Good Xmit
---------------------------------------------------------------R Good Recv
-------------------------------------------------------------HHH Happy

Everybody's happy again!!

And I tried browsing around the wiki for quite a long time. And there were
NO 503 errors the entire time I was using it. Which tells me that I am,
indeed, not handling auth correctly in my VCL.

The way I thought I solved the problem was by adding a .request to the web
server definitions that specified the headers to do a GET on the health
check:

.request =
   "GET /healthcheck.php HTTP/1.1"
   "Host: wiki.example.com"
   "Connection: close";

The reason I thought this worked was because, after I'd restarted varnish
with that change in place I was able to log into the wiki with basic auth
in the web browser. And then I'd be able to use it for a while before the
back-end would come up as 'sick' in varnish again which would cause the 503
error.

I then tried following this advice again, which I had also tried earlier
without much luck:

http://blog.tenya.me/blog/2011/12/14/varnish-http-authentication/

Which tells you to add this section to your VCL file:

 if (! req.http.Authorization ~ "Basic SomeBase64Hash==")
      {
       error 401 "Restricted";
      }

And then add this vcl_error subroutine:

sub vcl_error {

  if (obj.status == 401) {
  set obj.http.Content-Type = "text/html; charset=utf-8";
  set obj.http.WWW-Authenticate = "Basic realm=Secured";
  synthetic {"

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">

    <HTML>
    <HEAD>
    <TITLE>Error</TITLE>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html;'>
    </HEAD>
    <BODY><H1>401 Unauthorized (varnish)</H1></BODY>
    </HTML>
    "};
     return (deliver);
    }
}

And after restarting varnish again on both nodes, with authentication in
place in the VHOST configs on the web servers I was able to log into the
wiki site again and browse around for a while.

But then after some browsing around the back ends would go sick again and
you would see the 503:

#varnishadm -n  varnish1   debug.health
Backend web1 is Sick
Current states  good:  1 threshold:  2 window:  3
Average responsetime of good probes: 0.000000
Oldest                                                    Newest
================================================================
--------------------------------------------------------------44 Good IPv4
--------------------------------------------------------------XX Good Xmit
--------------------------------------------------------------RR Good Recv
------------------------------------------------------------HH-- Happy
Backend web2 is Sick
Current states  good:  1 threshold:  2 window:  3
Average responsetime of good probes: 0.000000
Oldest                                                    Newest
================================================================
--------------------------------------------------------------44 Good IPv4
--------------------------------------------------------------XX Good Xmit
--------------------------------------------------------------RR Good Recv
------------------------------------------------------------HH-- Happy

So SOMETHING must still be off with how I'm handling authentication in my
VCL config. The next step I'm thinking of trying involves passing the
authentication headers to the .request section of my web server definition.
Although I'm not sure if it'll work. I'll let you guys know if it does.
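
A minimal sketch of that next step, assuming Varnish 3.x syntax as in the rest of my config. The debug.health output above shows the probes being sent and answered (Good Xmit, Good Recv) but not counted as Happy, which would be consistent with Apache answering the unauthenticated probe with a 401 instead of the 200 that a probe expects by default. SomeBase64Hash== is the same placeholder used earlier and stands for the Base64 encoding of user:password. This is untested, and web2 would need the same change.

```vcl
backend web1 {
  .host = "10.10.10.25";
  .port = "80";
  .probe = {
    # Varnish sends each string as one header line and adds the line ending itself.
    # The Authorization value is a placeholder, not a real credential.
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: wiki.example.com"
      "Authorization: Basic SomeBase64Hash=="
      "Connection: close";
    .interval = 10m;
    .timeout = 60s;
    .window = 3;
    .threshold = 2;
  }
}
```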

But I'd like to present the current state of my VCL again in case anyone
has any insight or knowledge to share that may help.

backend web1 {
  .host = "10.10.10.25";
  .port = "80";
  .connect_timeout = 3600s;
  .first_byte_timeout = 3600s;
  .between_bytes_timeout = 3600s;
  .max_connections = 70;
  .probe = {
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: wiki.example.com"
      "Connection: close";
    .interval = 10m;
    .timeout = 60s;
    .window = 3;
    .threshold = 2;
  }
}

backend web2 {
  .host = "10.10.10.26";
  .port = "80";
  .connect_timeout = 3600s;
  .first_byte_timeout = 3600s;
  .between_bytes_timeout = 3600s;
  .max_connections = 70;
  .probe = {
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: wiki.example.com"
      "Connection: close";
    .interval = 10m;
    .timeout = 60s;
    .window = 3;
    .threshold = 2;
  }
}

director www round-robin {
  { .backend = web1; }
  { .backend = web2; }
}

sub vcl_recv {
    if (! req.http.Authorization ~ "Basic Base64Hash==") {
        error 401 "Restricted";
    }

    if (req.url ~ "&action=submit($|/)") {
        return (pass);
    }

    set req.backend = www;
    return (lookup);
}

sub vcl_fetch {
    set beresp.ttl = 3600s;
    set beresp.grace = 4h;
    return (deliver);
}

sub vcl_error {
    if (obj.status == 401) {
        set obj.http.Content-Type = "text/html; charset=utf-8";
        set obj.http.WWW-Authenticate = "Basic realm=Secured";
        synthetic {"

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">

    <HTML>
    <HEAD>
    <TITLE>Error</TITLE>
    <META HTTP-EQUIV='Content-Type' CONTENT='text/html;'>
    </HEAD>
    <BODY><H1>401 Unauthorized (varnish)</H1></BODY>
    </HTML>
    "};
        return (deliver);
    }
}

sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }
}

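
Since both backends use an identical probe, the definition could be written once as a named probe and referenced from each backend, which keeps any later change to the timing or the request headers in one place. This is only a sketch, assuming Varnish 3.x, where named probes are available. The name wiki_healthcheck is made up for illustration, and the timeout and max_connections settings are left out for brevity.

```vcl
# Shared health check; wiki_healthcheck is a hypothetical name.
probe wiki_healthcheck {
  .request =
    "GET /healthcheck.php HTTP/1.1"
    "Host: wiki.example.com"
    "Connection: close";
  .interval = 10m;
  .timeout = 60s;
  .window = 3;
  .threshold = 2;
}

backend web1 {
  .host = "10.10.10.25";
  .port = "80";
  .probe = wiki_healthcheck;   # reference the shared probe by name
}

backend web2 {
  .host = "10.10.10.26";
  .port = "80";
  .probe = wiki_healthcheck;
}
```
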
Once again I genuinely appreciate the help of this list, and hope I haven't
worn out my welcome! ;)

Thanks,
Tim


On Wed, Jul 8, 2015 at 9:31 PM, Jason Price <japrice at gmail.com> wrote:

> that interval and window on your web server is scary..... what you're
> saying is 'check each web server every 10 minutes, and only fail it
> after 3 failures'
>
> next time you see the issue, look at:
>
> varnishadm -n <varnish_name> debug.health
>
> I'd be willing to bet that varnish is just failing the backends.  Try
> running the healthcheck manually from the varnish boxes:
>
> curl -H "Host:kiki.example.com" -v "http://10.10.10.26/healthcheck.php"
>
> And see if you're actually getting good healthchecks.  If you're not,
> then you need to look at your backends (specifically healthcheck.php)
>
> On Wed, Jul 8, 2015 at 12:14 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
> > Hi guys,
> >
> >
> >  I'm having an issue where my varnish server will stop working after a
> > while of browsing around the site I'm using it with and throw a 503 server
> > unavailable error.
> >
> > In my varnish logs I'm getting a 'no backend connection error':
> >
> >    10 FetchError   c no backend connection
> >    10 VCL_call     c error deliver
> >    10 VCL_call     c deliver deliver
> >    10 TxProtocol   c HTTP/1.1
> >    10 TxStatus     c 503
> >    10 TxResponse   c Service Unavailable
> >    10 TxHeader     c Server: Varnish
> >
> >
> > And if I do a GET on the healthcheck from the command line on the varnish
> > server, I get a 503 response from varnish:
> >
> > #GET http://wiki.example.com/healthcheck.php
> >
> > <?xml version="1.0" encoding="utf-8"?>
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> >  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> > <html>
> >   <head>
> >     <title>503 Service Unavailable</title>
> >   </head>
> >   <body>
> >     <h1>Error 503 Service Unavailable</h1>
> >     <p>Service Unavailable</p>
> >     <h3>Guru Meditation:</h3>
> >     <p>XID: 2107225059</p>
> >     <hr>
> >     <p>Varnish cache server</p>
> >   </body>
> > </html>
> >
> > But if I do another GET on the healthcheck file from the varnish server
> > to another apache VHOST on the same server as the wiki site that responds
> > to the IP of the web server instead of the IP for the varnish server, the
> > GET works:
> >
> > #GET http://ops1.example.com/healthcheck.php
> > good
> >
> >
> > So I'm not sure why varnish is having trouble reaching the HC file. The
> > web server is a little far from the varnish server. The varnish machines
> > are in NYC and the web servers are in northern Virginia.
> >
> > So I tried setting the timeouts in the varnish config to a really high
> > number. And that was working for a while. But today I noticed that it
> > stopped working. I'll have to restart the varnish service and browse the
> > site for a while. Then it'll stop working again and produce the 503
> > error. It's pretty annoying!
> >
> > I was wondering if there might be something in my VCL I could tweak to
> > make this work? Or if the fact is that the web servers are simply too far
> > from varnish for this to be practical.
> >
> > Here's my VCL file. It's pretty basic:
> >
> > backend web1 {
> >   .host = "10.10.10.25";
> >   .port = "80";
> >   .connect_timeout = 1200s;
> >   .first_byte_timeout = 1200s;
> >   .between_bytes_timeout = 1200s;
> >   .max_connections = 70;
> >   .probe = {
> >   .request =
> >    "GET /healthcheck.php HTTP/1.1"
> >    "Host: wiki.example.com"
> >    "Connection: close";
> >    .interval = 10m;
> >    .timeout = 60s;
> >    .window = 3;
> >    .threshold = 2;
> >    }
> > }
> >
> > backend web2 {
> >   .host = "10.10.10.26";
> >   .port = "80";
> >   .connect_timeout = 1200s;
> >   .first_byte_timeout = 1200s;
> >   .between_bytes_timeout = 1200s;
> >   .max_connections = 70;
> >   .probe = {
> >   .request =
> >    "GET /healthcheck.php HTTP/1.1"
> >    "Host: wiki.example.com"
> >    "Connection: close";
> >    .interval = 10m;
> >    .timeout = 60s;
> >    .window = 3;
> >    .threshold = 2;
> >    }
> > }
> >
> > director www round-robin {
> >   { .backend = web1;   }
> >   { .backend = web2;  }
> >  }
> >
> > sub vcl_recv {
> >
> >     if (req.url ~ "&action=submit($|/)") {
> >         return (pass);
> >     }
> >
> >     set req.backend = www;
> >     return (lookup);
> > }
> >
> > sub vcl_fetch {
> >       set beresp.ttl = 3600s;
> >       set beresp.grace = 4h;
> >       return (deliver);
> > }
> >
> >
> > sub vcl_deliver {
> >      if (obj.hits> 0) {
> >       set resp.http.X-Cache = "HIT";
> >      } else {
> >         set resp.http.X-Cache = "MISS";
> >      }
> >  }
> >
> > Thanks,
> > Tim
> >
> >
> >
> > --
> > GPG me!!
> >
> > gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
> >
> >
> > _______________________________________________
> > varnish-misc mailing list
> > varnish-misc at varnish-cache.org
> > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>



-- 
GPG me!!

gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B