site goes down if the same one of two varnish nodes is stopped

Tim Dunphy bluethundr at gmail.com
Wed May 14 05:30:14 CEST 2014


Hi Jason and others,

I just wanted to let you know that this pretty much solved my problem. The
problem was most definitely related to the F5 load balancers. Basically, the
F5 guy I was dealing with had to change an option called "Reconnects" from a
value of 0 to a value of 3. Apparently, if this option is set to 0, the F5
cannot load balance whatever it is trying to balance. This was a valuable
lesson and something to keep in mind for the future.

Thanks again!
Tim


On Sun, May 11, 2014 at 8:33 PM, Tim Dunphy <bluethundr at gmail.com> wrote:

>> At first glance, it sounds like you are using a TCP monitor on your F5. If
>> Varnish goes down but the system stays up, that monitor won't remove the
>> node from the pool and will keep sending connections to it. You want to use
>> a custom monitor like the one in the attached image, in combination with
>> this in your VCL. You can test this by stopping one of your Varnish nodes
>> and seeing whether it is marked down in the pool.
>>   // add ping url to test Varnish status
>>   if (req.request == "GET" && req.url ~ "/varnish-ping") {
>>       error 200 "OK";
>>   }
>
>
>
> Hi Jason,
>
> Thank you very much for your reply. And yeah, I tend to think this is an
> issue with the F5 and not with Varnish, mainly because both Varnish
> instances are installed and configured identically. I don't actually have
> direct access to the F5 at work, but one of the F5 guys that I deal with is
> very easygoing and I'm sure he'll be ready to help.
>
> I'll ping him with this scenario tomorrow. Thank you for confirming my
> suspicion that this issue is likely on the F5 end and not the Varnish end.
>
> Thanks
> Tim
>
>
> On Sun, May 11, 2014 at 8:08 PM, Jason Heffner <jdh132 at psu.edu> wrote:
>
>> At first glance, it sounds like you are using a TCP monitor on your F5. If
>> Varnish goes down but the system stays up, that monitor won't remove the
>> node from the pool and will keep sending connections to it. You want to use
>> a custom monitor like the one in the attached image, in combination with
>> this in your VCL. You can test this by stopping one of your Varnish nodes
>> and seeing whether it is marked down in the pool.
>>
>>   // add ping url to test Varnish status
>>   if (req.request == "GET" && req.url ~ "/varnish-ping") {
>>       error 200 "OK";
>>   }
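To make the intent of that check concrete, here is a minimal sketch of how it
would typically sit in a Varnish 2.1-style vcl_recv. The anchored
/varnish-ping pattern and the standalone sub vcl_recv wrapper are assumptions
for illustration, not part of the original configuration:

    sub vcl_recv {
        # Answer the health-check URL directly from Varnish with a synthetic
        # 200, so the F5's HTTP monitor only sees success while varnishd
        # itself is up and answering requests.
        if (req.request == "GET" && req.url ~ "^/varnish-ping$") {
            error 200 "OK";
        }
    }

You can check the behaviour by requesting /varnish-ping on each node directly:
it should come back 200 OK while varnishd is running and fail outright once
the service is stopped, which is exactly the signal the custom F5 monitor
keys on.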
>>
>>
>> Jason
>>
>> p: (814) 865-1840, c: (814) 777-7665
>> Systems Administrator
>> Teaching and Learning with Technology, Information Technology Services
>> The Pennsylvania State University
>>
>> On May 11, 2014, at 7:15 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
>>
>> Hey guys,
>>
>> One more interesting thing about my situation is that if I run varnishstat
>> on both node A (which seems to control the site) and node B (which does not
>> seem to), I get further evidence that both nodes are supporting the site.
>>
>>
>> 0+02:10:12
>> uszmpwsls014la
>>
>> Hitrate ratio:        4        4        4
>> Hitrate avg:     0.9977   0.9977   0.9977
>>
>>         3139         1.00         0.40 Client connections accepted
>>         3149         1.00         0.40 Client requests received
>>         3120         1.00         0.40 Cache hits
>>           29         0.00         0.00 Cache misses
>>           25         0.00         0.00 Backend conn. success
>>            4         0.00         0.00 Backend conn. reuses
>>           20         0.00         0.00 Backend conn. was closed
>>           26         0.00         0.00 Backend conn. recycles
>>           29         0.00         0.00 Fetch with Length
>>           16          .            .   N struct sess_mem
>>           26          .            .   N struct object
>>           36          .            .   N struct objectcore
>>           25          .            .   N struct objecthead
>>            2          .            .   N struct vbe_conn
>>          500          .            .   N worker threads
>>          500         0.00         0.06 N worker threads created
>>            3          .            .   N backends
>>         1563          .            .   N LRU moved objects
>>         3128         1.00         0.40 Objects sent with write
>>         3139         1.00         0.40 Total Sessions
>>
>>
>> 0+03:04:56
>> uszmpwsls014lb
>>
>> Hitrate ratio:       10       21       21
>> Hitrate avg:     0.9999   0.9998   0.9998
>>
>>         4440         2.00         0.40 Client connections accepted
>>         4440         2.00         0.40 Client requests received
>>         4421         2.00         0.40 Cache hits
>>           19         0.00         0.00 Cache misses
>>           19         0.00         0.00 Backend conn. success
>>           16         0.00         0.00 Backend conn. was closed
>>           19         0.00         0.00 Backend conn. recycles
>>           19         0.00         0.00 Fetch with Length
>>           10          .            .   N struct sess_mem
>>           19          .            .   N struct object
>>           29          .            .   N struct objectcore
>>           11          .            .   N struct objecthead
>>            3          .            .   N struct vbe_conn
>>          500          .            .   N worker threads
>>          500         0.00         0.05 N worker threads created
>>            3          .            .   N backends
>>         2209          .            .   N LRU moved objects
>>         4440         2.00         0.40 Objects sent with write
>>         4440         2.00         0.40 Total Sessions
>>         4440         2.00         0.40 Total Requests
>>
>> So I am still a little puzzled about why the entire site goes down when I
>> bring down node A but leave node B up, unless the explanation is that the
>> F5 is NOT balancing the two varnish nodes in quite the way I think it is.
>> But if that is the case, why do we see almost identical stats coming out of
>> both hosts?
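One way to take the guesswork out of which cache is answering would be to have
each node stamp its own name on every response it delivers. A minimal sketch
in Varnish 2.1-style VCL follows; the X-Served-By header name is just
illustrative, and if server.hostname is not available in your build, a
hard-coded per-node string works the same way:

    sub vcl_deliver {
        # Mark every response with the Varnish host that delivered it, so a
        # client going through the F5 can see whether node B ever answers.
        set resp.http.X-Served-By = server.hostname;
    }

Hitting the site through the F5 a few times and watching that header would
show directly whether requests ever reach node B, independently of the
varnishstat counters above.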
>>
>>
>> Thanks
>>
>> Tim
>>
>>
>> On Sun, May 11, 2014 at 6:20 PM, Tim Dunphy <bluethundr at gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I have two varnish nodes being balanced by an F5 load balancer. Both were
>>> installed in exactly the same manner, with yum installing local RPMs of
>>> Varnish 2.1.5 (the version requested by the client).
>>>
>>> Both share the exact same default.vcl file. But if you take node A down
>>> with node B running, the whole site goes down; if you take node B down
>>> with node A running, the site stays up. I need to determine why node B
>>> isn't supporting the site. Each Varnish node needs to be balancing 3 web
>>> servers, and it looks like the A node does, since the site goes down when
>>> you take down node A and leave node B running.
>>>
>>> I had a look at varnishlog for both and both nodes appear to be getting
>>> hit.
>>>
>>> Node A:
>>>
>>>     3 VCL_return   c deliver
>>>     3 TxProtocol   c HTTP/1.1
>>>     3 TxStatus     c 200
>>>     3 TxResponse   c OK
>>>     3 TxHeader     c Server: Apache
>>>     3 TxHeader     c X-Powered-By: PHP/5.2.8
>>>     3 TxHeader     c Content-Type: text/html
>>>     3 TxHeader     c Cache-Control: max-age = 600
>>>     3 TxHeader     c Content-Length: 4
>>>     3 TxHeader     c Date: Sun, 11 May 2014 22:11:02 GMT
>>>     3 TxHeader     c X-Varnish: 1578371599 1578371564
>>>     3 TxHeader     c Age: 86
>>>     3 TxHeader     c Via: 1.1 varnish
>>>     3 TxHeader     c Connection: close
>>>     3 TxHeader     c Varnish-X-Cache: HIT
>>>     3 TxHeader     c Varnish-X-Cache-Hits: 35
>>>     3 Length       c 4
>>>     3 ReqEnd       c 1578371599 1399846262.156239033 1399846262.156332970 0.000054121 0.000056028 0.000037909
>>>
>>>
>>> Node B:
>>>
>>>     9 VCL_return   c deliver
>>>     9 TxProtocol   c HTTP/1.1
>>>     9 TxStatus     c 200
>>>     9 TxResponse   c OK
>>>     9 TxHeader     c Server: Apache
>>>     9 TxHeader     c X-Powered-By: PHP/5.2.17
>>>     9 TxHeader     c Content-Type: text/html
>>>     9 TxHeader     c Cache-Control: max-age = 600
>>>     9 TxHeader     c Content-Length: 4
>>>     9 TxHeader     c Date: Sun, 11 May 2014 22:11:33 GMT
>>>     9 TxHeader     c X-Varnish: 1525629213 1525629076
>>>     9 TxHeader     c Age: 341
>>>     9 TxHeader     c Via: 1.1 varnish
>>>     9 TxHeader     c Connection: close
>>>     9 TxHeader     c Varnish-X-Cache: HIT
>>>     9 TxHeader     c Varnish-X-Cache-Hits: 137
>>>     9 Length       c 4
>>>     9 ReqEnd       c 1525629213 1399846293.098695993 1399846293.098922968 0.000057936 0.000181913 0.000045061
>>>
>>> So I'm not sure why this is the case.
>>>
>>> Here's the VCL file that I'm using, in case it sheds any light. I
>>> apologize that I'm still too much of a newb to ferret out the most
>>> relevant parts, but I hope the full context yields some clues.
>>>
>>> backend web1 {
>>>     .host = "10.10.1.104";
>>>     .port = "80";
>>>     .connect_timeout = 45s;
>>>     .first_byte_timeout = 45s;
>>>     .between_bytes_timeout = 45s;
>>>     .max_connections = 70;
>>>     .probe = {
>>>         .url = "/healthcheck.php";
>>>         .timeout = 5s;
>>>         .interval = 30s;
>>>         .window = 10;
>>>         .threshold = 1;
>>>     }
>>> }
>>>
>>> backend web2 {
>>>     .host = "10.10.1.105";
>>>     .port = "80";
>>>     .connect_timeout = 45s;
>>>     .first_byte_timeout = 45s;
>>>     .between_bytes_timeout = 45s;
>>>     .max_connections = 70;
>>>     .probe = {
>>>         .url = "/healthcheck.php";
>>>         .timeout = 5s;
>>>         .interval = 30s;
>>>         .window = 10;
>>>         .threshold = 1;
>>>     }
>>> }
>>>
>>> backend web3 {
>>>     .host = "10.10.1.106";
>>>     .port = "80";
>>>     .connect_timeout = 45s;
>>>     .first_byte_timeout = 45s;
>>>     .between_bytes_timeout = 45s;
>>>     .max_connections = 70;
>>>     .probe = {
>>>         .url = "/healthcheck.php";
>>>         .timeout = 5s;
>>>         .interval = 30s;
>>>         .window = 10;
>>>         .threshold = 1;
>>>     }
>>> }
>>>
>>> acl purge {
>>>     "localhost";
>>>     "127.0.0.1";
>>>     "10.10.1.102";
>>>     "10.10.1.103";
>>> }
>>>
>>> director www round-robin {
>>>     { .backend = web1; }
>>>     { .backend = web2; }
>>>     { .backend = web3; }
>>> }
>>>
>>> sub vcl_recv {
>>>     set req.backend = www;
>>>     set req.grace = 6h;
>>>     if (!req.backend.healthy) {
>>>         set req.grace = 24h;
>>>     }
>>>
>>>     set req.http.X-Forwarded-For = req.http.X-Forwarded-For ", " client.ip;
>>>
>>>     if (req.http.host ~ "^origin\.test(.+\.|)mywebsite\.com$") {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.http.host ~ ".*\.mywebsite.com|mywebsite.com") {
>>>         /* allow (origin.)stage.m.mywebsite.com to be a separate host */
>>>         if (req.http.host != "stage.m.mywebsite.com") {
>>>             set req.http.host = "stage.mywebsite.com";
>>>         }
>>>     } else {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.request == "PURGE") {
>>>         if (!client.ip ~ purge) {
>>>             error 405 "Not allowed.";
>>>         }
>>>         return (lookup);
>>>     }
>>>
>>>     if (req.request != "GET" &&
>>>         req.request != "HEAD" &&
>>>         req.request != "PUT" &&
>>>         req.request != "POST" &&
>>>         req.request != "TRACE" &&
>>>         req.request != "OPTIONS" &&
>>>         req.request != "DELETE") {
>>>             return (pipe);
>>>     }
>>>
>>>     if (req.request != "GET" && req.request != "HEAD") {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.url ~ "sites/all/modules/custom/bravo_ad/ads.html\?.*") {
>>>         set req.url = "/sites/all/modules/custom/bravo_ad/ads.html";
>>>     }
>>>
>>>     if (req.url ~ "eyeblaster/addineyeV2.html\?.*") {
>>>         set req.url = "/eyeblaster/addineyeV2.html";
>>>     }
>>>
>>>     if (req.url ~ "ahah_helper\.php|bravo_points\.php|install\.php|update\.php|cron\.php|/json(:?\?.*)?$") {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.http.Authorization) {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.url ~ "login" || req.url ~ "logout") {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.url ~ "^/admin/" || req.url ~ "^/node/add/") {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.http.Cache-Control ~ "no-cache") {
>>>         // return (pass);
>>>     }
>>>
>>>     if (req.http.Cookie ~ "(VARNISH|DRUPAL_UID|LOGGED_IN|SESS|_twitter_sess)") {
>>>         set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(__[a-z]+|has_js)=[^;]*", "");
>>>         set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
>>>     } else {
>>>         unset req.http.Cookie;
>>>     }
>>>
>>>     /* removed varnish cache backend logic */
>>>     if (req.restarts == 0) {
>>>         set req.backend = www;
>>>     } elsif (req.restarts >= 2) {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.restarts >= 2) {
>>>         return (pass);
>>>     }
>>>
>>>     if (req.url ~ "\.(ico|jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|ICO|JPG|JPEG|PNG|GIF|GZ|TGZ|BZ2|TBZ|MP3|OOG|SWF)") {
>>>         unset req.http.Accept-Encoding;
>>>     }
>>>
>>>     if (req.url ~ "^/(sites/all/modules/mywebsite_admanager/includes/ads.php|doubleclick/DARTIframe.html)(\?.*|)$") {
>>>         set req.url = regsub(req.url, "\?.*$", "");
>>>     }
>>>
>>>     if (req.http.Accept-Encoding ~ "gzip") {
>>>         set req.http.Accept-Encoding = "gzip";
>>>     } elsif (req.http.Accept-Encoding ~ "deflate") {
>>>         set req.http.Accept-Encoding = "deflate";
>>>     } else {
>>>         unset req.http.Accept-Encoding;
>>>     }
>>>
>>>     return (lookup);
>>> }
>>>
>>> sub vcl_pipe {
>>>     set bereq.http.connection = "close";
>>>     return (pipe);
>>> }
>>>
>>> sub vcl_pass {
>>>     return (pass);
>>> }
>>>
>>> sub vcl_hash {
>>>     set req.hash += req.url;
>>>     set req.hash += req.http.host;
>>>     if (req.http.Cookie ~ "VARNISH|DRUPAL_UID|LOGGED_IN") {
>>>         set req.hash += req.http.Cookie;
>>>     }
>>>     return (hash);
>>> }
>>>
>>> sub vcl_hit {
>>>     if (req.request == "PURGE") {
>>>         set obj.ttl = 0s;
>>>         error 200 "Purged.";
>>>     }
>>> }
>>>
>>> sub vcl_fetch {
>>>     if (beresp.status == 500) {
>>>         set req.http.X-Varnish-Error = "1";
>>>         restart;
>>>     }
>>>
>>>     set beresp.grace = 6h;
>>>
>>>     # Set a short circuit cache lifetime for resp codes above 302
>>>     if (beresp.status > 302) {
>>>         set beresp.ttl = 60s;
>>>         set beresp.http.Cache-Control = "max-age = 60";
>>>     }
>>>
>>>     if (beresp.http.Edge-control ~ "no-store") {
>>>         set beresp.http.storage = "1";
>>>         set beresp.cacheable = false;
>>>         return (pass);
>>>     }
>>>
>>>     if (beresp.status >= 300 || !beresp.cacheable) {
>>>         set beresp.http.Varnish-X-Cacheable = "Not Cacheable";
>>>         set beresp.http.storage = "1";
>>>         return (pass);
>>>     }
>>>
>>>     if (beresp.http.Set-Cookie) {
>>>         return (pass);
>>>     }
>>>
>>>     if (beresp.cacheable) {
>>>         unset beresp.http.expires;
>>>         set beresp.ttl = 600s;
>>>         set beresp.http.Cache-Control = "max-age = 600";
>>>         if (req.url ~ "\.(ico|jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|ICO|JPG|JPEG|PNG|GIF|GZ|TGZ|BZ2|TBZ|MP3|OOG|SWF)") {
>>>             set beresp.ttl = 43829m;
>>>             set beresp.http.Cache-Control = "max-age = 1000000";
>>>         }
>>>     }
>>>
>>>     return (deliver);
>>> }
>>>
>>>
>>> sub vcl_deliver {
>>>     if (obj.hits > 0) {
>>>         set resp.http.Varnish-X-Cache = "HIT";
>>>         set resp.http.Varnish-X-Cache-Hits = obj.hits;
>>>     } else {
>>>         set resp.http.Varnish-X-Cache = "MISS";
>>>     }
>>>     return (deliver);
>>> }
>>>
>>> sub vcl_error {
>>>     if (req.restarts == 0) {
>>>         return (restart);
>>>     }
>>>
>>>     if (req.http.X-Varnish-Error != "1") {
>>>         set req.http.X-Varnish-Error = "1";
>>>         return (restart);
>>>     }
>>> }
>>>
>>> The only part that I omitted was the section pointing to the error page.
>>> Can anyone offer any advice on how to troubleshoot this?
>>>
>>> I'm enclosing the VCL nearly in full in case that extra info is helpful;
>>> I didn't omit much, though.
>>>
>>> Thank you!
>>>
>>> Tim
>>>
>>> --
>>> GPG me!!
>>>
>>> gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
>>>
>>>
>>
>>
>> --
>> GPG me!!
>>
>> gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
>>
>
>
>
> --
> GPG me!!
>
> gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
>
>


-- 
GPG me!!

gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
-------------- next part --------------
A non-text attachment was scrubbed...
Name: varnish-ping.png
Type: image/png
Size: 128851 bytes
Desc: not available
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20140513/3030ca34/attachment-0001.png>


More information about the varnish-misc mailing list