[Varnish] #845: Health checks get duplicated when loading a new config

Mon Jan 17 11:38:04 CET 2011

#845: Health checks get duplicated when loading a new config
----------------------+-----------------------------------------------------
 Reporter:  johnnyh   |       Owner:  martin                
     Type:  defect    |      Status:  new                   
 Priority:  normal    |   Milestone:                        
Component:  varnishd  |     Version:  trunk                 
 Severity:  normal    |    Keywords:  health check duplicate
----------------------+-----------------------------------------------------
Changes (by martin):

  * owner:  => martin

New description:

 Summary: When you reload a config and then do a vcl.discard on the old
 config, the health checks sometimes break. This is related to ticket 834
 ( http://www.varnish-cache.org/trac/ticket/834 ), but the final word on
 834 was that doing a vcl.discard fixes the problem. Apparently it does not
 always do so. A workaround is to restart varnish, but that is a really
 nasty solution because the cache gets flushed, and it is also not as
 'safe' as just doing a 'reload' of Varnish.

 Now first some system details:

 Varnish v2.1.3
 Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (in a VMware virtual machine)
 64-bit
 4G RAM
 Linux kernel 2.6.18-194.26.1.el5
 RHEL 5.5 completely up-to-date
 Custom VCL (described below)

 Here is how to try and reproduce it:

 - Start with a working config, with a single backend, with a health check
 that returns 'healthy' all the time.
 - Now change the IP address of the backend to something that is certainly
 not a healthy backend, like 1.2.3.4.
 - Load the new config and start using it:

 # DATE=`date +%s` ; varnishadm -T 127.0.0.1:6082 vcl.load reload${DATE} \
     /etc/varnish/test-config.vcl ; varnishadm -T 127.0.0.1:6082 vcl.use \
     reload${DATE}

 - We now have a situation where varnishlog shows our backend as healthy
 and sick at the same time:

 # varnishlog
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.012475 0.012311 HTTP/1.1 200 OK
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.008161 0.011274 HTTP/1.1 200 OK
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.011735 0.011389 HTTP/1.1 200 OK

 - Let's check what configs varnish thinks it knows about:

 # varnishadm -T 127.0.0.1:6082 vcl.list
 available     105 boot
 active          1 reload1294928381

 - According to ticket 834 we must now discard the old configurations,
 which is only one in this case:

 # varnishadm -T 127.0.0.1:6082 vcl.discard boot

 - This is where the problem starts: sometimes the discarded configuration
 does not disappear from the list of available configurations, but remains
 there in the state 'discarded':

 # varnishadm -T 127.0.0.1:6082 vcl.list
 discarded     105 boot
 active          1 reload1294928381
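
A lingering entry like the one above can be spotted mechanically. The
following is only a minimal sketch, assuming the three-column vcl.list
layout shown in this ticket (state, a number, name); the function name
lingering_discarded is made up for illustration and is not part of Varnish.

```shell
# Hypothetical helper (not part of Varnish): reads vcl.list output on stdin
# and prints the name of every configuration still in the "discarded" state.
# Assumes the layout seen in this ticket: state, number, name.
lingering_discarded() {
    awk '$1 == "discarded" { print $3 }'
}
```

It could be used as
"varnishadm -T 127.0.0.1:6082 vcl.list | lingering_discarded"; empty output
would mean that no discarded configuration is still listed.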

 - The real problem lies here: the backend checks are now kaput.
 varnishlog shows the backend as healthy and sick at the same time:

 # varnishlog
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.007989 0.010539 HTTP/1.1 200 OK
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.014861 0.011620 HTTP/1.1 200 OK
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.011745 0.011651 HTTP/1.1 200 OK
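
The symptom in the log above (one backend name reporting both states) can
be checked for in a captured log sample. This is a rough sketch, assuming
the Backend_health line layout shown in this ticket, where the fourth field
is the backend name and the sixth is the state; mixed_health_backends is a
hypothetical name, not a Varnish tool.

```shell
# Hypothetical helper (not part of Varnish): reads varnishlog text on stdin
# and prints each backend that shows both "healthy" and "sick" probe results
# within the same sample. Lines that are not Backend_health records, such as
# wrapped continuation lines, are ignored.
mixed_health_backends() {
    awk '$2 == "Backend_health" {
             # Field 4 is the backend name, field 6 the reported state.
             if ($6 == "healthy") healthy[$4] = 1
             if ($6 == "sick")    sick[$4] = 1
         }
         END {
             for (b in healthy) if (b in sick) print b
         }'
}
```

For example, with the log saved to a file (sample.log is a placeholder
name), "mixed_health_backends < sample.log" would list the affected
backends. A backend that really changes state during the sample would also
be listed, so the sample should be short and taken after the reload.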

 - There appears to be no way to fix this situation other than restarting
 varnishd.

 - I have been able to reproduce this problem a few times, but not
 consistently. It seems to show up when you run vcl.load, vcl.use and
 vcl.discard in rapid succession. If you work really slowly while doing the
 reload/discard cycle, you will probably not hit this bug. I reload and
 discard my configs with the following functions in my init.d script, so
 that all I have to do is call "/etc/init.d/varnish reload ;
 /etc/init.d/varnish discard". Here is the code I use in the init script:

 vcl_reload() {
     echo "Reloading Varnish VCL..."
     # The timestamp gives each loaded config a unique name.
     DATE=$(date +%s)
     varnishadm -T "$HOSTPORT" vcl.load "reload${DATE}" "$VARNISH_VCL_CONF" \
         || vcl_exit 1 "Error compiling config $VARNISH_VCL_CONF"
     varnishadm -T "$HOSTPORT" vcl.use "reload${DATE}" \
         || vcl_exit 1 "Error loading config $VARNISH_VCL_CONF"
     vcl_exit 0 "VCL reloaded successfully."
 }

 vcl_discard_all() {
     echo "Discarding old configurations..."
     # Count the active line plus the line before it; 1 means nothing older.
     COUNT=$(varnishadm -T "$HOSTPORT" vcl.list | grep -v '^$' | grep -B1 active | wc -l)
     if [ "$COUNT" -le 1 ] ; then
         vcl_exit 1 "Error: There are no old configurations to discard."
     fi
     # Discard each leading "available" entry; stop at the first other state.
     varnishadm -T "$HOSTPORT" vcl.list | grep -v '^$' | while read CONFIG ; do
         if [ "$(echo "$CONFIG" | awk '{print $1}')" = "available" ] ; then
             varnishadm -T "$HOSTPORT" vcl.discard "$(echo "$CONFIG" | awk '{print $3}')"
         else
             break
         fi
     done
     vcl_exit 0 "Old configurations were successfully discarded."
 }
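
The discard loop in vcl_discard_all stops at the first vcl.list line that
is not 'available', so a lingering 'discarded' entry at the top of the list
ends it before any later entries are reached. A possible variant that
filters on the state column instead is sketched below. It is not a fix for
the bug itself. discard_available and the VARNISHADM override are
hypothetical names introduced here so the function can be dry-run, and
HOSTPORT is the variable from the init script above.

```shell
# Sketch only (names are assumptions, not from the init script):
# discard every configuration that vcl.list reports as "available",
# regardless of where it appears in the list.
# VARNISHADM is a hypothetical override; it defaults to the real varnishadm.
discard_available() {
    ${VARNISHADM:-varnishadm} -T "$HOSTPORT" vcl.list |
    awk '$1 == "available" { print $3 }' |
    while read -r name ; do
        # stdin is redirected so the command cannot consume the name list.
        ${VARNISHADM:-varnishadm} -T "$HOSTPORT" vcl.discard "$name" < /dev/null
    done
}
```

Setting VARNISHADM to a stub command makes it possible to see which
discards would be issued without touching a running varnishd.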

--

-- 
Ticket URL: <http://www.varnish-cache.org/trac/ticket/845#comment:2>
Varnish <http://varnish-cache.org/>
The Varnish HTTP Accelerator