varnish lock-up

Jeff Potter jpotter-varnish at codepuppy.com
Wed Jun 12 15:02:03 CEST 2013


Hi --

Haven't heard anything on this, so thought I'd try to post it into a bug tracking system. However, I can't seem to figure out where that'd be -- https://www.varnish-cache.org/trac ?

Thanks,
Jeff

On Jun 5, 2013, at 9:16 PM, Jeff Potter <jpotter-varnish at codepuppy.com> wrote:

> 
> Hi Varnish Folk,
> 
> We had an OOM condition today on a dev box that I thought was worth passing along, as it could've happened in production or may happen to others.
> 
> This is on Varnish 3.0.3 / Ubuntu 10.04.
> 
> It looks like the parent varnish process is not handling unexpected child exits correctly while inside a signal, leading to a situation where the parent process locks up because of a race condition where the signal handlers stomp on the global values of child_state and child_pid.
> 
> Here's a snippet of log -- the OOM killer kills the child processes each time that varnish attempts to run them (28004, 15403, 15640/-1).
> 
> [7270495.691960] Out of memory: kill process 27086 (varnishd) score 9908734 or a child
> [7270495.691971] Killed process 28004 (varnishd)
> 
> 1370466543.3  Child (28004) died signal=9
> 1370466543.3  Child cleanup complete
> 1370466543.4  child (15403) Started
> 1370466545.4  Child (15403) said Not running as root, no priv-sep
> 1370466545.4  Child (15403) said Child starts
> 1370466545.4  Child (15403) said SMF.s0 mmap'ed 268435456 bytes of 268435456
> 
> [7270511.916123] Out of memory: kill process 27086 (varnishd) score 9724109 or a child
> [7270511.916134] Killed process 15403 (varnishd)
> 
> 1370466558.8  Child (15403) not responding to CLI, killing it.
> 1370466559.6  Child (15403) died signal=9
> 1370466559.6  Child cleanup complete
> 1370466559.6  child (15640) Started
> 
> [7270520.798539] Out of memory: kill process 27086 (varnishd) score 6705406 or a child
> [7270520.798551] Killed process 15640 (varnishd)
> 
> 1370466568.1  Pushing vcls failed:
> 1370466568.1  CLI communication error (hdr)
> 1370466568.1  Stopping Child
> 1370466568.1  Child (15640) died signal=9
> 
> # (The -1 value is "interesting" given the check for >0; child_pid is a global and it's being set to -1 elsewhere -- so presumably start_child is being called from somewhere else; i.e. start_child is running, mgt_sigchild gets triggered, then flow returns to start_child right after and assigns -1 to child_pid)
> 
> 1370466568.1  Child (-1) said Not running as root, no priv-sep
> 1370466568.1  Child (-1) said Child starts
> 1370466568.1  Child (-1) said SMF.s0 mmap'ed 268435456 bytes of 268435456
> 1370466568.1  Child cleanup complete
> 
> 
> Reading through mgt_child.c, here's what I think is going on:
> 
> 1) The signal handler is getting invoked when the child process is killed by the OOM killer
> 2) Then the parent process, inside the sig handler (mgt_sigchld), calls fork (inside mgt_sigchld's call to start_child), which causes a new child process to start
> 3) The OOM killer kills that second child process right away, before the sig handler has a chance to return
> 
> Does it maybe make sense to set the signal to be blocked for sigchild so that the events can be processed synchronously without losing them?
> 
> -Jeff
> 
> 
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc




More information about the varnish-misc mailing list