Adjusting the parent's watchdog timer

Tue Feb 26 23:58:39 CET 2008

Dave,

For what it's worth, I've also seen the condition you described. My
solution was simply to disable the watchdog timer. I did not have the
energy to debug it, because it only seemed to happen when I was
pushing rather ridiculously high test loads. As an experiment, it
might be interesting to dedicate one additional thread to responding
to a second check that works just like the existing PING/PONG, and to
use a separate client to generate those checks. For the sake of
discussion, call this PING2. You'll also want to adjust PING/PONG so
that the parent logs a failure instead of actually restarting the
child. If PING/PONG fails but PING2 continues to work, that points to
an internal issue within varnishd. However, if they both quit
working, that would indicate an external condition as the more likely
cause.
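
To make the reasoning concrete, here is a minimal sketch of how the two
results could be read together. The function name and the returned
labels are made up for illustration, and the case where only PING2
fails is not something I've thought through, so it is marked
inconclusive.

```python
def classify(ping_ok: bool, ping2_ok: bool) -> str:
    """Interpret one round of the two checks (hypothetical helper).

    ping_ok  -- did the parent's existing PING/PONG get an answer?
    ping2_ok -- did the separate PING2 client get an answer?
    """
    if ping_ok and ping2_ok:
        return "healthy"
    if not ping_ok and ping2_ok:
        # The child still answers on its dedicated thread, so the
        # problem is more likely inside varnishd itself.
        return "internal"
    if not ping_ok and not ping2_ok:
        # Nothing answers at all, so an external condition is the
        # more likely cause.
        return "external"
    # Only PING2 failed: not covered by the experiment (assumption).
    return "inconclusive"


if __name__ == "__main__":
    print(classify(ping_ok=False, ping2_ok=True))
```

Logging this label next to each failure, rather than restarting the
child, should show fairly quickly which of the two cases you are in.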

One way to catch it is to change the parent PING code to start gdb  
and attach it to your varnishd process when it fails to respond,  
rather than killing it. That way you can stop and look at exactly  
what's happening inside varnishd at the time of the condition. I'd  
first look at things like the thread count, and get a backtrace of  
all the threads to see if something is obviously stuck somewhere. If  
you don't see it right away, consider just running gcore inside gdb  
to generate a core file, and perhaps one of us could take a look.
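
If you would rather not sit at an interactive gdb prompt, the same
steps can be scripted. This is only a sketch: the helper name and the
core file path are made up, and it assumes gdb is installed and that
you are allowed to attach to the process. It uses gdb's -p, -batch and
-ex options, so gdb runs the commands, detaches and exits, which is a
non-interactive variation on what I described above.

```python
from typing import List


def gdb_snapshot_argv(pid: int, core_path: str) -> List[str]:
    """Build a gdb command line that attaches to pid, lists the
    threads, dumps a backtrace of every thread and writes a core file.

    Hypothetical helper; it only builds the argument list.
    """
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "info threads",          # thread count and states
        "-ex", "thread apply all bt",   # backtrace of all threads
        "-ex", "gcore " + core_path,    # core file for later analysis
    ]


if __name__ == "__main__":
    # 12345 and the path below are placeholders, not real values.
    print(" ".join(gdb_snapshot_argv(12345, "/tmp/varnishd.core")))
```

The resulting list can be handed to subprocess.run() from whatever
replaces the kill in your test setup, with the output captured to a
log file.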

Regards,

Adrian

On Feb 26, 2008, at 2:42 PM, Dave Cheney wrote:

> [root@rado ~]# uname -a
> Linux rado.redbubble.com 2.6.23.14-64.fc7 #1 SMP Sun Jan 20 22:20:19
> EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> I'm using cacti to graph the usual system stats with a 5 minute
> resolution. There is some indication of a loadavg jump in concert with
> each child restart which may indicate thread pileup.
>
> I haven't looked at the PING code but I'm guessing it's a connect, a
> dummy request and a close, so thread starvation or mutex contention
> could cause a blockage.
>
> I'll reduce the timeout to 5 seconds and see what happens.
>
> Cheers
>
> Dave
>
> On 27/02/2008, at 9:33 AM, Poul-Henning Kamp wrote:
>
>> The default value is very much a number pulled out of thin air, and
>> it is certainly not inconceivable that it needs tweaking.
>>
>> In general I would expect 5 seconds to be enough, and I would like to
>> find out why they are not.
>>
>> It may help if you plot (using munin?) the activity on the machine
>> and see if you can correlate the hiccups with activity.