Bug: Child not responding to ping, killing it.

Thu Dec 18 11:57:58 CET 2008

Hi all (again ;) if I talk too much, tell me and I will stop),

I am continuing to investigate this problem. It seems that varnish really
is keeping ESTABLISHED connections to the backends for a very, very long
time:

cache1b# netstat -apnt | grep ESTABLISHED | awk '{print $5}' | cut -f 1 -d ':' | sort | uniq -c | sort -g
      6 client1
      8 client2
     10 client3
    > total 24 open connections
     43 backend1
     50 backend2
     74 backend3
    > total 167 open connections !!!

The strange thing in this situation is that, on the backend side, the
number of ESTABLISHED connections is quite low:

for i in be1b be2b be3b ; do ssh $i netstat -apnt | grep :30000 | grep ESTABLISHED ; done | wc -l
20
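
To make the two counts directly comparable, the same filter can be applied
on both sides. This is only a minimal sketch: it assumes the Linux
`netstat -ant` column layout (local address in column 4, foreign address in
column 5, state in column 6), `count_established` is a hypothetical helper
name, and 30000 is the backend port taken from the command above.

```shell
# count_established PORT FIELD   (hypothetical helper, not part of varnish)
# Reads `netstat -ant` output on stdin and prints how many ESTABLISHED TCP
# connections have an address ending in :PORT in column FIELD.
# FIELD is 5 (foreign address) on the varnish host and
# 4 (local address) on a backend.
count_established() {
    awk -v port="$1" -v field="$2" '
        $6 == "ESTABLISHED" && $field ~ (":" port "$") { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

On cache1b this would be fed with `netstat -ant | count_established 30000 5`,
and on the backends with the combined `ssh $i netstat -ant` output and field
4. A large gap between the two totals would point at connections that only
one side still considers open.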

Maybe the problem is in the backend reuse code?
Maybe it is in the probe code?

Maybe there is not really any problem on the varnish side: I have another
idea about this, which comes from the fact that the backends are
behind an ipvs load-balancer (yes, our config is quite complex...).

This ipvs load-balancer is in NAT mode, so there is a NAT (and
therefore a connection tracking table) somewhere between varnish and the
backends.

Maybe the connection between varnish and its backend uses HTTP
keepalive, so the TCP channel is not closed at the end of a request, and
maybe it is only closed some time AFTER the NAT connection tracking
timeout has expired.

In that case, varnish never receives the TCP packet that closes the
connection (the NAT no longer has a matching entry for it), and thus keeps
the connection open until ... it fills up its connection stack.
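
One way to look for such half-open connections from the varnish host would
be to check the kernel timer on each backend socket. This is a sketch under
assumptions, not a confirmed diagnosis: it assumes Linux net-tools
`netstat -anto` (without -p), where column 7 names the socket timer
(`keepalive`, `off`, ...), and `count_no_keepalive` is a hypothetical helper
name. An idle ESTABLISHED socket whose timer is `off` has no TCP keepalive
armed, so the kernel will not probe the peer and cannot notice that the path
through the NAT is gone.

```shell
# count_no_keepalive PORT   (hypothetical helper, not part of varnish)
# Reads `netstat -anto` output on stdin and prints how many ESTABLISHED
# connections to remote port PORT have no timer running (column 7 is "off").
count_no_keepalive() {
    awk -v port="$1" '
        $6 == "ESTABLISHED" && $5 ~ (":" port "$") && $7 == "off" { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

It would be used as `netstat -anto | count_no_keepalive 30000` on cache1b.
If most backend connections show up here, comparing how long they sit idle
with the TCP timeout configured on the load-balancer (which `ipvsadm -L
--timeout` should report) would be the next thing to check.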

There are so many scenarios that I don't think I will be able to test all
of them before my client (the user of this big cluster) kicks varnish
out ;) but I will try them in order to find a solution,

to be continued...

Regards,

B.