Adjusting the parents watchdog timer
Adrian Otto
aotto at mosso.com
Tue Feb 26 23:58:39 CET 2008
Dave,
For what it's worth, I've also seen the condition you described. My
solution was to simply disable the watchdog timer. I did not have the
energy to debug it, because it only seemed to happen when I was
pushing rather ridiculously high test loads. As an experiment, it
might be interesting to dedicate one additional thread to responding
to something just like PING/PONG, in addition to the PING/PONG already
there, and use a separate client to generate checks. For the sake of
discussion, call this PING2. You'll also want to adjust PING/PONG so
that the parent logs a failure instead of actually restarting the
child. If PING/PONG fails but PING2 continues to work, then it points
to an internal issue within varnishd. However, if they both quit
working, that would indicate an external condition as the more likely
cause.
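
A rough sketch of what the separate PING2 client could look like, in
Python. PING2 does not exist in varnishd; the wire format here (send
"PING2", expect a reply starting with "PONG"), the function names, and
the result strings are all assumptions made up for illustration.

```python
import socket


def ping2(host, port, timeout=5.0):
    """Return True if the (hypothetical) PING2 responder answers in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(b"PING2\n")  # assumed request format
            return s.recv(16).startswith(b"PONG")  # assumed reply format
    except OSError:
        # Covers connection refused, timeouts, and resets.
        return False


def diagnose(ping_ok, ping2_ok):
    """Interpret one pair of results from the parent's PING and from PING2."""
    if ping_ok and ping2_ok:
        return "healthy"
    if not ping_ok and ping2_ok:
        return "internal varnishd issue suspected"
    if not ping_ok and not ping2_ok:
        return "external condition more likely"
    # PING works but PING2 does not: not covered above, so no conclusion.
    return "inconclusive"
```

The client only reports a result, it never restarts anything, so the
failing state stays around long enough to be inspected.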
One way to catch it is to change the parent PING code to start gdb
and attach it to your varnishd process when it fails to respond,
rather than killing it. That way you can stop and look at exactly
what's happening inside varnishd at the time of the condition. I'd
first look at things like the thread count, and get a backtrace of
all the threads to see if something is obviously stuck somewhere. If
you don't see it right away, consider just running gcore inside gdb
to generate a core file, and perhaps one of us could take a look.
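
For an unattended variant of the above, here is a minimal sketch of a
helper the parent (or an external monitor) could call instead of
killing the child. It runs gdb in batch mode against the given pid,
lists the threads, takes a backtrace of every thread, and writes a
core with gcore. The function names and the file paths are
placeholders, not anything varnishd provides.

```python
import subprocess


def gdb_command(pid, core_path="/tmp/varnishd.core"):
    """Build a gdb batch invocation that inspects a live process."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "info threads",          # thread count and current frames
        "-ex", "thread apply all bt",   # backtrace of every thread
        "-ex", f"gcore {core_path}",    # core file for later analysis
    ]


def capture_state(pid, log_path="/tmp/varnishd-gdb.log"):
    """Run gdb against pid, saving its output; returns gdb's exit status."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            gdb_command(pid), stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Note that batch mode detaches when it finishes, so the child resumes
afterwards. To keep it stopped for interactive poking, attach by hand
with gdb -p instead.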
Regards,
Adrian
On Feb 26, 2008, at 2:42 PM, Dave Cheney wrote:
> [root at rado ~]# uname -a
> Linux rado.redbubble.com 2.6.23.14-64.fc7 #1 SMP Sun Jan 20 22:20:19
> EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> I'm using cacti to graph the usual system stats with a 5 minute
> resolution. There is some indication of a loadavg jump in concert with
> each child restart which may indicate thread pileup.
>
> I haven't looked at the PING code but I'm guessing it's a connect, a
> dummy request and a close, so thread starvation or mutex contention
> could cause a blockage.
>
> I'll reduce the timeout to 5 seconds and see what happens.
>
> Cheers
>
> Dave
>
> On 27/02/2008, at 9:33 AM, Poul-Henning Kamp wrote:
>
>> The default value is very much a number pulled out of thin air, and
>> it is certainly not inconceivable that it needs tweaking.
>>
>> In general I would expect 5 seconds to be enough, and I would like to
>> find out why it is not.
>>
>> It may help if you plot (using munin?) the activity on the machine
>> and see if you can correlate the hiccups with activity.
>
> _______________________________________________
> varnish-dev mailing list
> varnish-dev at projects.linpro.no
> http://projects.linpro.no/mailman/listinfo/varnish-dev