[Varnish] #845: Health checks get duplicated when loading a new config

Thu Jan 13 16:17:52 CET 2011

#845: Health checks get duplicated when loading a new config
------------------------------------+---------------------------------------
 Reporter:  johnnyh                 |        Type:  defect  
   Status:  new                     |    Priority:  normal  
Milestone:                          |   Component:  varnishd
  Version:  trunk                   |    Severity:  normal  
 Keywords:  health check duplicate  |  
------------------------------------+---------------------------------------
 Summary: When you reload a config and then run vcl.discard on the old
 config, the health checks sometimes break. This is related to ticket 834
 ( http://www.varnish-cache.org/trac/ticket/834 ), but the final word on
 834 was that doing a vcl.discard fixes the problem. Apparently it does
 not always do so. A workaround is to restart Varnish, but that is a
 really nasty solution because the cache gets flushed, and it is also not
 as 'safe' as just doing a 'reload' of Varnish.

 Now first some system details:

 Varnish v2.1.3
 Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (in a VMware virtual machine)
 64-bit
 4G RAM
 Linux kernel 2.6.18-194.26.1.el5
 RHEL 5.5 completely up-to-date
 Custom VCL (described below)


 Here is how to try and reproduce it:

 - Start with a working config, with a single backend, with a health check
 that returns 'healthy' all the time.
 - Now change the IP address of the backend to something that is certainly
 not a healthy backend, like 1.2.3.4.
 - Load the new config and start using it:

 # DATE=`date +%s`
 # varnishadm -T 127.0.0.1:6082 vcl.load reload${DATE} /etc/varnish/test-config.vcl
 # varnishadm -T 127.0.0.1:6082 vcl.use reload${DATE}

 - We now have a situation where Varnishlog shows our backend as healthy
 and sick at the same time:

 # varnishlog
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.012475 0.012311 HTTP/1.1 200 OK
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.008161 0.011274 HTTP/1.1 200 OK
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.011735 0.011389 HTTP/1.1 200 OK

 - Let's check what configs varnish thinks it knows about:

 # varnishadm -T 127.0.0.1:6082 vcl.list
 available     105 boot
 active          1 reload1294928381

 - According to ticket 834 we must now discard the old configurations,
 which is only one in this case:

 # varnishadm -T 127.0.0.1:6082 vcl.discard boot

 - This is where the problem starts: sometimes the discarded configuration
 does not disappear from the list of available configurations, but remains
 there in the state 'discarded':

 # varnishadm -T 127.0.0.1:6082 vcl.list
 discarded     105 boot
 active          1 reload1294928381
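 To spot this state from a script rather than by eye, the vcl.list output
 can be filtered for entries still marked 'discarded'. The following is
 only a minimal sketch: the function name lingering_vcls is made up for
 illustration, it assumes the three-column layout shown above, and it
 prints the second column as-is without interpreting it.

```shell
# Hypothetical helper: reads `vcl.list` output on stdin and prints the
# name and second-column number of every config still marked "discarded".
# Assumes the column layout shown in this ticket: status, number, name.
lingering_vcls() {
    awk '$1 == "discarded" { print $3, $2 }'
}
```

 Usage would be along the lines of
 "varnishadm -T 127.0.0.1:6082 vcl.list | lingering_vcls"; empty output
 means no discarded config is lingering.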

 - The real problem lies here: the backend checks are now broken.
 varnishlog shows the backend as healthy and sick at the same time:

 # varnishlog
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.007989 0.010539 HTTP/1.1 200 OK
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.014861 0.011620 HTTP/1.1 200 OK
     0 Backend_health - test_site Still sick ------- 0 4 5 0.000000 0.000000
     0 Backend_health - test_site Still healthy 4--X-RH 5 4 5 0.011745 0.011651 HTTP/1.1 200 OK
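 The contradictory output can also be detected mechanically. The sketch
 below reads captured varnishlog lines on stdin and prints every backend
 name that is reported both healthy and sick within the capture. The
 function name health_conflicts is hypothetical, and the field positions
 are taken from the log lines in this ticket only. A backend that is
 genuinely flapping would also be listed, so treat the result as a hint
 rather than proof of duplicated probes.

```shell
# Hypothetical helper: reads varnishlog lines on stdin (one entry per
# line) and prints backends seen as both "healthy" and "sick".
# Assumed fields, per the log lines above: $2 = tag, $4 = backend name,
# $6 = state ("healthy" or "sick").
health_conflicts() {
    awk '$2 == "Backend_health" {
             if ($6 == "healthy") healthy[$4] = 1
             if ($6 == "sick")    sick[$4] = 1
         }
         END {
             for (b in healthy) if (b in sick) print b
         }'
}
```

 Feeding it a few seconds of saved varnishlog output from the broken
 state above would print test_site.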

 - There appears to be no way to fix this situation other than restarting
 varnishd.

 - I have been able to reproduce this problem a few times, but not
 consistently. It seems to show up when you run vcl.load, vcl.use and
 vcl.discard in rapid succession. If you work really slowly through the
 reload/discard cycle, you will probably not hit this bug. I reload and
 discard my configs with the following functions in my init.d script, so
 that all I have to do is call "/etc/init.d/varnish reload ;
 /etc/init.d/varnish discard". Here is the code I use in the init script:

 vcl_reload() {
     echo "Reloading Varnish VCL..."
     DATE=`date +%s`
     # Compile the new config under a unique, timestamped name.
     varnishadm -T $HOSTPORT vcl.load reload${DATE} $VARNISH_VCL_CONF ||
         vcl_exit 1 "Error compiling config $VARNISH_VCL_CONF"
     # Make the freshly loaded config the active one.
     varnishadm -T $HOSTPORT vcl.use reload${DATE} ||
         vcl_exit 1 "Error loading config $VARNISH_VCL_CONF"
     vcl_exit 0 "VCL reloaded successfully."
 }

 vcl_discard_all() {
     echo "Discarding old configurations..."
     # The active line plus one line of context before it; a count of 1
     # means nothing precedes the active config in the list.
     COUNT=`varnishadm -T $HOSTPORT vcl.list | grep -v ^$ | grep active -B1 | wc -l`
     if [ $COUNT -le 1 ] ; then
         vcl_exit 1 "Error: There are no old configurations to discard."
     fi
     # Discard entries in list order, stopping at the first line whose
     # status is not "available".
     varnishadm -T $HOSTPORT vcl.list | grep -v ^$ | while read CONFIG ; do
         if [ `echo "$CONFIG" | awk '{print $1}'` = "available" ] ; then
             varnishadm -T $HOSTPORT vcl.discard `echo "$CONFIG" | awk '{print $3}'`
         else
             break
         fi
     done
     vcl_exit 0 "Old configurations were successfully discarded."
 }
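
 Since the bug seems to depend on running load, use and discard back to
 back, a small loop may make it easier to trigger. What follows is a
 rough sketch, not the init script itself: rapid_cycle, ADM and VCL are
 names invented for illustration, and the defaults simply reuse the
 address and VCL path from the commands earlier in this ticket. It uses
 only the vcl.load, vcl.use, vcl.list and vcl.discard commands shown
 above.

```shell
# Hypothetical repro helper. ADM and VCL are assumptions; adjust them to
# your own admin address and config path.
ADM=${ADM:-"varnishadm -T 127.0.0.1:6082"}
VCL=${VCL:-/etc/varnish/test-config.vcl}

rapid_cycle() {
    # $1 = number of load/use/discard rounds to run without pausing
    rounds=$1
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Timestamp plus round number keeps names unique within a second.
        name="repro$(date +%s)_$i"
        $ADM vcl.load "$name" "$VCL" || return 1
        $ADM vcl.use "$name" || return 1
        # Discard every config still listed as "available" (not active).
        $ADM vcl.list | awk '$1 == "available" { print $3 }' |
        while read -r old; do
            $ADM vcl.discard "$old" </dev/null
        done
        i=$((i + 1))
    done
}
```

 Calling "rapid_cycle 20" and then checking vcl.list and varnishlog would
 show whether a discarded config lingers and whether the health checks
 are duplicated.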

-- 
Ticket URL: <http://www.varnish-cache.org/trac/ticket/845>
Varnish <http://varnish-cache.org/>
The Varnish HTTP Accelerator