Child panics on OpenSolaris

Thu Feb 11 16:39:10 CET 2010

On 10 February 2010 13:53, Poul-Henning Kamp <phk at phk.freebsd.dk> wrote:
> In message <282e72051002100615x701a37a8o416c9af6d1d7fd38 at mail.gmail.com>, Paul
> Wright writes:
>
>>Thanks for the explanation of what's going on.  Looking at those
>>tickets there are suggestions to try the poll waiter which we're
>>already using - are there any further tests we could try to help
>>narrow down this issue?  I'm happy to assist trying out patches.
>
> I can see three ways to nail this issue:
>
> 1. Catch a tcpdump, when it happens, showing that the client side
>   did close, and Solaris (incorrectly) returns EBADF.
>
> 2. Catch a ktrace/systrace/dtrace, when it happens, that show
>   that Varnish incorrectly closes the fd.
>
> 3. Setup some synthetic test to show that solaris returns EBADF
>   when it shouldn't
>
> If either of those are in your reach, by all means go for it...

I've had a go at 1.) and have two verbose `snoop` traces captured
during child panics.  I used sp.client from the backtrace to find the
port number and then looked at just the matching packets.  From my
(limited) Wireshark comprehension, they show the client establishing a
connection to Varnish, issuing a GET and receiving the response
(200 OK).  The client then sends a RST packet, after which the
connection disappears.  Would this cause the child to panic?
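
For option 3.) in the quoted list, the sequence seen in the traces
(connect, GET, 200 OK, client RST) can be replayed on loopback to see
which errno the server side gets back afterwards.  What follows is only
a minimal sketch in Python, not Varnish code: probe_reset is a
hypothetical helper name, the payloads are placeholders, and the choice
of setsockopt() followed by recv() as the calls made on the session fd
is an assumption, since the call that actually fails in Varnish is not
identified here.

```python
import errno
import socket
import struct
import time


def probe_reset(payload=b"GET / HTTP/1.0\r\n\r\n"):
    """Replay connect/GET/response/RST on loopback.

    Returns the errno the server side sees after the client's RST,
    or None if no error is reported.  Hypothetical helper, for
    illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        # Normal request/response exchange, as in the traces.
        client.sendall(payload)
        server.recv(4096)
        server.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
        client.recv(4096)

        # SO_LINGER enabled with a zero timeout makes close() send a RST
        # rather than a FIN.
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.pack("ii", 1, 0))
        client.close()
        time.sleep(0.2)  # give the RST time to reach the server socket

        try:
            # Assumed stand-ins for whatever the server does on the fd.
            server.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                              struct.pack("ii", 0, 0))
            server.recv(4096)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        client.close()  # harmless if already closed
        server.close()
        listener.close()


if __name__ == "__main__":
    result = probe_reset()
    name = errno.errorcode.get(result, "no error") if result else "no error"
    print("server side saw:", name)
```

The server fd is still open in the process when those calls are made,
so a result of EBADF here would point at the kernel rather than at the
application closing the descriptor early.  A connection-reset style
error would be the unsurprising outcome.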

I can't post these traces publicly, but are there any other details
from them that would help?

I've been racking my brains to think if there is anything special in our
setup and the only thing that springs to mind is the firewall.  We
have an OpenBSD firewall using PF to redirect HTTP traffic from the
public IP address to the internal web servers which has worked without
issue for a number of years.  During testing the only firewall change
I've made redirects this traffic to Varnish instead.

Paul.