site goes down if one of two varnish nodes is stopped
Tim Dunphy
bluethundr at gmail.com
Wed May 14 05:30:14 CEST 2014
Hi Jason and others,
I just wanted to let you know that this pretty much solved my problem. The
problem was most definitely related to the F5 load balancers. Basically
the F5 guy I was dealing with had to change an option called "Reconnects"
from a value of 0 to a value of 3. Apparently if this option is set to 0,
the F5 cannot load balance the pool at all. This was a valuable lesson and
something to keep in mind for the future.
Thanks again!
Tim
On Sun, May 11, 2014 at 8:33 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
>> At first glance, it sounds like you are using a TCP monitor on your F5.
>> If varnish goes down but the system stays up, that monitor won't remove
>> the node from the pool and will keep sending connections to it. You want
>> to use a custom monitor like the one in the attached image, in
>> combination with this in your VCL. You can test this by stopping one of
>> your varnish nodes and seeing if it is marked down in the pool.
>> // add ping url to test Varnish status
>> if (req.request == "GET" && req.url ~ "/varnish-ping") {
>> error 200 "OK";
>> }
>
>
>
> Hi Jason,
>
> Thank you very much for your reply. And yeah, I sort of think this is an
> issue with the F5 and not with Varnish, mainly because both Varnish
> instances are identically installed and configured. I don't actually have
> direct access to the F5 at work, but one of the F5 guys that I deal with
> is very easygoing and I'm sure he'll be ready to help.
>
> I'll ping him with this scenario tomorrow. Thank you for confirming my
> suspicion that this issue is likely on the F5 end and not the Varnish end.
>
> Thanks
> Tim
>
>
> On Sun, May 11, 2014 at 8:08 PM, Jason Heffner <jdh132 at psu.edu> wrote:
>
>> At first glance, it sounds like you are using a TCP monitor on your F5.
>> If varnish goes down but the system stays up, that monitor won't remove
>> the node from the pool and will keep sending connections to it. You want
>> to use a custom monitor like the one in the attached image, in
>> combination with this in your VCL. You can test this by stopping one of
>> your varnish nodes and seeing if it is marked down in the pool.
>>
>> // add ping url to test Varnish status
>> if (req.request == "GET" && req.url ~ "/varnish-ping") {
>> error 200 "OK";
>> }
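>>
>> To sanity-check that endpoint before pointing the F5 monitor at it, you
>> can hit the ping URL on each cache and make sure it answers 200. A
>> minimal sketch, with placeholder hostnames for your two nodes:
>>
>> # expect "200" from each varnish node (hostnames are placeholders)
>> curl -s -o /dev/null -w "%{http_code}\n" http://node-a.example.com/varnish-ping
>> curl -s -o /dev/null -w "%{http_code}\n" http://node-b.example.com/varnish-ping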
>>
>>
>> Jason
>>
>> p: (814) 865-1840, c: (814) 777-7665
>> Systems Administrator
>> Teaching and Learning with Technology, Information Technology Services
>> The Pennsylvania State University
>>
>> On May 11, 2014, at 7:15 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
>>
>> Hey guys,
>>
>> One more interesting thing about my situation is that if I run
>> varnishstat on both node A (which seems to control the site) and node B
>> (which does not seem to), I get further evidence that both nodes are
>> serving the site.
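>>
>> (The figures below are from the interactive varnishstat display. For
>> diffing the two nodes side by side, the one-shot form is easier; a rough
>> sketch, assuming the short counter names that varnishstat -1 prints:)
>>
>> # dump the counters once on each node and compare the request counts
>> varnishstat -1 | egrep 'client_req|cache_hit|cache_miss'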
>>
>>
>> 0+02:10:12
>> uszmpwsls014la
>>
>> Hitrate ratio: 4 4 4
>>
>> Hitrate avg: 0.9977 0.9977 0.9977
>>
>>
>> 3139 1.00 0.40 Client connections accepted
>>
>> 3149 1.00 0.40 Client requests received
>>
>> 3120 1.00 0.40 Cache hits
>>
>> 29 0.00 0.00 Cache misses
>>
>> 25 0.00 0.00 Backend conn. success
>>
>> 4 0.00 0.00 Backend conn. reuses
>>
>> 20 0.00 0.00 Backend conn. was closed
>>
>> 26 0.00 0.00 Backend conn. recycles
>>
>> 29 0.00 0.00 Fetch with Length
>>
>> 16 . . N struct sess_mem
>>
>> 26 . . N struct object
>>
>> 36 . . N struct objectcore
>>
>> 25 . . N struct objecthead
>>
>> 2 . . N struct vbe_conn
>>
>> 500 . . N worker threads
>>
>> 500 0.00 0.06 N worker threads created
>>
>> 3 . . N backends
>>
>> 1563 . . N LRU moved objects
>>
>> 3128 1.00 0.40 Objects sent with write
>>
>> 3139 1.00 0.40 Total Sessions
>>
>>
>> 0+03:04:56
>> uszmpwsls014lb
>>
>> Hitrate ratio: 10 21 21
>>
>> Hitrate avg: 0.9999 0.9998 0.9998
>>
>>
>> 4440 2.00 0.40 Client connections accepted
>>
>> 4440 2.00 0.40 Client requests received
>>
>> 4421 2.00 0.40 Cache hits
>>
>> 19 0.00 0.00 Cache misses
>>
>> 19 0.00 0.00 Backend conn. success
>>
>> 16 0.00 0.00 Backend conn. was closed
>>
>> 19 0.00 0.00 Backend conn. recycles
>>
>> 19 0.00 0.00 Fetch with Length
>>
>> 10 . . N struct sess_mem
>>
>> 19 . . N struct object
>>
>> 29 . . N struct objectcore
>>
>> 11 . . N struct objecthead
>>
>> 3 . . N struct vbe_conn
>>
>> 500 . . N worker threads
>>
>> 500 0.00 0.05 N worker threads created
>>
>> 3 . . N backends
>>
>> 2209 . . N LRU moved objects
>>
>> 4440 2.00 0.40 Objects sent with write
>>
>> 4440 2.00 0.40 Total Sessions
>>
>> 4440 2.00 0.40 Total Requests
>>
>> So I am still a little puzzled as to why the entire site goes down when I
>> bring down node A but leave node B up. Unless the explanation is that the
>> F5 is NOT balancing the two varnish nodes in quite the way I appear to
>> think. But if that is the case, then why do we see almost identical stats
>> coming out of both hosts?
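>>
>> (One thing I suppose I could still check is whether the hits on node B
>> are real client traffic or just the F5's own monitor probes. Watching the
>> client address on each incoming session should show that, since the
>> ReqStart log tag carries the client IP. A sketch:)
>>
>> # show the source IP of each incoming client session (run on each node)
>> varnishlog -c -i ReqStart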
>>
>>
>> Thanks
>>
>> Tim
>>
>>
>> On Sun, May 11, 2014 at 6:20 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I have two varnish nodes being balanced by an F5 load balancer. Both
>>> were installed in exactly the same manner, with yum installing local
>>> rpms of varnish 2.1.5 (the version the client requested).
>>>
>>> Both share the exact same default.vcl file. But if you take node A down
>>> with node B running, the whole site goes down; if you take node B down
>>> with node A running, the site stays up. I need to determine why node B
>>> isn't supporting the site. Each varnish node needs to be balancing 3 web
>>> servers, and it looks like node A does, since the site goes down when
>>> you take down node A and leave node B running.
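>>>
>>> (For reference, the way I am testing this is to stop varnishd on one
>>> node and then hit the site through the F5 VIP. A sketch -- the init
>>> script name and the URL are whatever yours happen to be:)
>>>
>>> # on node A: stop varnish, then check the site through the VIP
>>> service varnish stop    # 'varnish' is the usual rpm init script name
>>> curl -sI http://stage.mywebsite.com/ | head -1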
>>>
>>> I had a look at varnishlog for both and both nodes appear to be getting
>>> hit.
>>>
>>> Node A:
>>>
>>> 3 VCL_return c deliver
>>>
>>> 3 TxProtocol c HTTP/1.1
>>>
>>> 3 TxStatus c 200
>>>
>>> 3 TxResponse c OK
>>>
>>> 3 TxHeader c Server: Apache
>>>
>>> 3 TxHeader c X-Powered-By: PHP/5.2.8
>>>
>>> 3 TxHeader c Content-Type: text/html
>>>
>>> 3 TxHeader c Cache-Control: max-age = 600
>>>
>>> 3 TxHeader c Content-Length: 4
>>>
>>> 3 TxHeader c Date: Sun, 11 May 2014 22:11:02 GMT
>>>
>>> 3 TxHeader c X-Varnish: 1578371599 1578371564
>>>
>>> 3 TxHeader c Age: 86
>>>
>>> 3 TxHeader c Via: 1.1 varnish
>>>
>>> 3 TxHeader c Connection: close
>>>
>>> 3 TxHeader c Varnish-X-Cache: HIT
>>>
>>> 3 TxHeader c Varnish-X-Cache-Hits: 35
>>>
>>> 3 Length c 4
>>>
>>> 3 ReqEnd c 1578371599 1399846262.156239033
>>> 1399846262.156332970 0.000054121 0.000056028 0.000037909
>>>
>>>
>>> Node B:
>>>
>>> 9 VCL_return c deliver
>>>
>>> 9 TxProtocol c HTTP/1.1
>>>
>>> 9 TxStatus c 200
>>>
>>> 9 TxResponse c OK
>>>
>>> 9 TxHeader c Server: Apache
>>>
>>> 9 TxHeader c X-Powered-By: PHP/5.2.17
>>>
>>> 9 TxHeader c Content-Type: text/html
>>>
>>> 9 TxHeader c Cache-Control: max-age = 600
>>>
>>> 9 TxHeader c Content-Length: 4
>>>
>>> 9 TxHeader c Date: Sun, 11 May 2014 22:11:33 GMT
>>>
>>> 9 TxHeader c X-Varnish: 1525629213 1525629076
>>>
>>> 9 TxHeader c Age: 341
>>>
>>> 9 TxHeader c Via: 1.1 varnish
>>>
>>> 9 TxHeader c Connection: close
>>>
>>> 9 TxHeader c Varnish-X-Cache: HIT
>>>
>>> 9 TxHeader c Varnish-X-Cache-Hits: 137
>>>
>>> 9 Length c 4
>>>
>>> 9 ReqEnd c 1525629213 1399846293.098695993
>>> 1399846293.098922968 0.000057936 0.000181913 0.000045061
>>>
>>> So I'm not sure why this is the case.
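>>>
>>> (From the client side, one crude check is to request the page through
>>> the F5 a few times and watch the X-Varnish id and Age headers. If both
>>> caches were really taking turns, repeated requests should bounce between
>>> two distinct sequences of values. A sketch, with the URL as a
>>> placeholder:)
>>>
>>> # repeat a HEAD request through the VIP and compare the cache headers
>>> for i in 1 2 3 4; do curl -sI http://stage.mywebsite.com/ | egrep 'X-Varnish|Age'; done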
>>>
>>> Here's the VCL file that I'm using, in case it might shed any light. I
>>> apologize that I'm still too much of a newb to ferret out the most
>>> relevant parts, but I hope the context may yield some clues.
>>>
>>> backend web1 {
>>>
>>> .host = "10.10.1.104";
>>>
>>> .port = "80";
>>>
>>> .connect_timeout = 45s;
>>>
>>> .first_byte_timeout = 45s;
>>>
>>> .between_bytes_timeout = 45s;
>>>
>>> .max_connections = 70;
>>>
>>> .probe = {
>>>
>>> .url = "/healthcheck.php";
>>>
>>> .timeout = 5s;
>>>
>>> .interval = 30s;
>>>
>>> .window = 10;
>>>
>>> .threshold = 1;
>>>
>>> }
>>>
>>> }
>>>
>>> backend web2 {
>>>
>>> .host = "10.10.1.105";
>>>
>>> .port = "80";
>>>
>>> .connect_timeout = 45s;
>>>
>>> .first_byte_timeout = 45s;
>>>
>>> .between_bytes_timeout = 45s;
>>>
>>> .max_connections = 70;
>>>
>>> .probe = {
>>>
>>> .url = "/healthcheck.php";
>>>
>>> .timeout = 5s;
>>>
>>> .interval = 30s;
>>>
>>> .window = 10;
>>>
>>> .threshold = 1;
>>>
>>> }
>>>
>>> }
>>>
>>> backend web3 {
>>>
>>> .host = "10.10.1.106";
>>>
>>> .port = "80";
>>>
>>> .connect_timeout = 45s;
>>>
>>> .first_byte_timeout = 45s;
>>>
>>> .between_bytes_timeout = 45s;
>>>
>>> .max_connections = 70;
>>>
>>> .probe = {
>>>
>>> .url = "/healthcheck.php";
>>>
>>> .timeout = 5s;
>>>
>>> .interval = 30s;
>>>
>>> .window = 10;
>>>
>>> .threshold = 1;
>>>
>>> }
>>>
>>> }
>>>
>>> acl purge {
>>>
>>> "localhost";
>>>
>>> "127.0.0.1";
>>>
>>> "10.10.1.102";
>>>
>>> "10.10.1.103";
>>>
>>> }
>>>
>>> director www round-robin {
>>>
>>> { .backend = web1; }
>>>
>>> { .backend = web2; }
>>>
>>> { .backend = web3; }
>>>
>>>
>>> }
>>>
>>> sub vcl_recv {
>>>
>>> set req.backend = www;
>>>
>>> set req.grace = 6h;
>>>
>>> if (!req.backend.healthy) {
>>>
>>> set req.grace = 24h;
>>>
>>> }
>>>
>>> set req.http.X-Forwarded-For = req.http.X-Forwarded-For ", "
>>> client.ip;
>>>
>>> if (req.http.host ~ "^origin\.test(.+\.|)mywebsite\.com$") {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.http.host ~ ".*\.mywebsite.com|mywebsite.com") {
>>>
>>> /* allow (origin.)stage.m.mywebsite.com to be a separate host */
>>>
>>> if (req.http.host != "stage.m.mywebsite.com") {
>>>
>>> set req.http.host = "stage.mywebsite.com";
>>>
>>> }
>>>
>>> } else {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.request == "PURGE") {
>>>
>>> if (!client.ip ~ purge) {
>>>
>>> error 405 "Not allowed.";
>>>
>>> }
>>>
>>> return (lookup);
>>>
>>> }
>>>
>>> if (req.request != "GET" &&
>>>
>>> req.request != "HEAD" &&
>>>
>>> req.request != "PUT" &&
>>>
>>> req.request != "POST" &&
>>>
>>> req.request != "TRACE" &&
>>>
>>> req.request != "OPTIONS" &&
>>>
>>> req.request != "DELETE") {
>>>
>>> return (pipe);
>>>
>>> }
>>>
>>> if (req.request != "GET" && req.request != "HEAD") {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.url ~ "sites/all/modules/custom/bravo_ad/ads.html\?.*") {
>>>
>>> set req.url = "/sites/all/modules/custom/bravo_ad/ads.html";
>>>
>>> }
>>>
>>> if (req.url ~ "eyeblaster/addineyeV2.html\?.*") {
>>>
>>> set req.url = "/eyeblaster/addineyeV2.html";
>>>
>>> }
>>>
>>> if (req.url ~
>>> "ahah_helper\.php|bravo_points\.php|install\.php|update\.php|cron\.php|/json(:?\?.*)?$")
>>> {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.http.Authorization) {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.url ~ "login" || req.url ~ "logout") {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.url ~ "^/admin/" || req.url ~ "^/node/add/") {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.http.Cache-Control ~ "no-cache") {
>>>
>>> // return (pass);
>>>
>>> }
>>>
>>> if (req.http.Cookie ~
>>> "(VARNISH|DRUPAL_UID|LOGGED_IN|SESS|_twitter_sess)") {
>>>
>>> set req.http.Cookie = regsuball(req.http.Cookie,
>>> "(^|;\s*)(__[a-z]+|has_js)=[^;]*", "");
>>>
>>> set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
>>>
>>> } else {
>>>
>>> unset req.http.Cookie;
>>>
>>> }
>>>
>>> /* removed varnish cache backend logic */
>>>
>>> if (req.restarts == 0) {
>>>
>>> set req.backend = www;
>>>
>>> } elsif (req.restarts >= 2) {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.restarts >= 2) {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (req.url ~
>>> "\.(ico|jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|ICO|JPG|JPEG|PNG|GIF|GZ|TGZ|BZ2|TBZ|MP3|OOG|SWF)")
>>> {
>>>
>>> unset req.http.Accept-Encoding;
>>>
>>> }
>>>
>>> if (req.url ~
>>> "^/(sites/all/modules/mywebsite_admanager/includes/ads.php|doubleclick/DARTIframe.html)(\?.*|)$")
>>> {
>>>
>>> set req.url = regsub(req.url, "\?.*$", "");
>>>
>>> }
>>>
>>> if (req.http.Accept-Encoding ~ "gzip") {
>>>
>>> set req.http.Accept-Encoding = "gzip";
>>>
>>> } elsif (req.http.Accept-Encoding ~ "deflate") {
>>>
>>> set req.http.Accept-Encoding = "deflate";
>>>
>>> } else {
>>>
>>> unset req.http.Accept-Encoding;
>>>
>>> }
>>>
>>> return (lookup);
>>>
>>> }
>>>
>>> sub vcl_pipe {
>>>
>>> set bereq.http.connection = "close";
>>>
>>> return (pipe);
>>>
>>> }
>>>
>>> sub vcl_pass {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> sub vcl_hash {
>>>
>>> set req.hash += req.url;
>>>
>>> set req.hash += req.http.host;
>>>
>>> if (req.http.Cookie ~ "VARNISH|DRUPAL_UID|LOGGED_IN") {
>>>
>>> set req.hash += req.http.Cookie;
>>>
>>> }
>>>
>>> return (hash);
>>>
>>> }
>>>
>>> sub vcl_hit {
>>>
>>> if (req.request == "PURGE") {
>>>
>>> set obj.ttl = 0s;
>>>
>>> error 200 "Purged.";
>>>
>>> }
>>>
>>> }
>>>
>>> sub vcl_fetch {
>>>
>>> if (beresp.status == 500) {
>>>
>>> set req.http.X-Varnish-Error = "1";
>>>
>>> restart;
>>>
>>> }
>>>
>>> set beresp.grace = 6h;
>>>
>>> # Set a short circuit cache lifetime for resp codes above 302
>>>
>>> if (beresp.status > 302) {
>>>
>>> set beresp.ttl = 60s;
>>>
>>> set beresp.http.Cache-Control = "max-age = 60";
>>>
>>> }
>>>
>>> if (beresp.http.Edge-control ~ "no-store") {
>>>
>>> set beresp.http.storage = "1";
>>>
>>> set beresp.cacheable = false;
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (beresp.status >= 300 || !beresp.cacheable) {
>>>
>>> set beresp.http.Varnish-X-Cacheable = "Not Cacheable";
>>>
>>> set beresp.http.storage = "1";
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (beresp.http.Set-Cookie) {
>>>
>>> return (pass);
>>>
>>> }
>>>
>>> if (beresp.cacheable) {
>>>
>>> unset beresp.http.expires;
>>>
>>> set beresp.ttl = 600s;
>>>
>>> set beresp.http.Cache-Control = "max-age = 600";
>>>
>>> if (req.url ~
>>> "\.(ico|jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|ICO|JPG|JPEG|PNG|GIF|GZ|TGZ|BZ2|TBZ|MP3|OOG|SWF)")
>>> {
>>>
>>> set beresp.ttl = 43829m;
>>>
>>> set beresp.http.Cache-Control = "max-age = 1000000";
>>>
>>> }
>>>
>>> }
>>>
>>> return (deliver);
>>>
>>> }
>>>
>>>
>>> sub vcl_deliver {
>>>
>>> if (obj.hits > 0) {
>>>
>>> set resp.http.Varnish-X-Cache = "HIT";
>>>
>>> set resp.http.Varnish-X-Cache-Hits = obj.hits;
>>>
>>> } else {
>>>
>>> set resp.http.Varnish-X-Cache = "MISS";
>>>
>>> }
>>>
>>> return (deliver);
>>>
>>> }
>>>
>>> sub vcl_error {
>>>
>>> if (req.restarts == 0) {
>>>
>>> return (restart);
>>>
>>> }
>>>
>>> if (req.http.X-Varnish-Error != "1") {
>>>
>>> set req.http.X-Varnish-Error = "1";
>>>
>>> return (restart);
>>>
>>> }
>>>
>>> }
>>>
>>> The only part that I omitted was the one pointing to the error page. Can
>>> anyone offer any advice on how to troubleshoot this?
>>>
>>> I'm enclosing the full VCL above in case the extra info is helpful; I
>>> didn't omit much, though.
>>>
>>> Thank you!
>>>
>>> Tim
>>>
>>
>>
>> _______________________________________________
>> varnish-misc mailing list
>> varnish-misc at varnish-cache.org
>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
>
--
GPG me!!
gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
-------------- next part --------------
A non-text attachment was scrubbed...
Name: varnish-ping.png
Type: image/png
Size: 128851 bytes
Desc: not available
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20140513/3030ca34/attachment-0001.png>