varnish lock-up
Jeff Potter
jpotter-varnish at codepuppy.com
Thu Jun 6 03:16:39 CEST 2013
Hi Varnish Folk,
We had an OOM condition today on a dev box that I thought was worth passing along, as it could've happened in production or may happen to others.
This is on Varnish 3.0.3 / Ubuntu 10.04.
It looks like the parent varnish process is not handling unexpected child exits correctly while inside a signal, leading to a situation where the parent process locks up because of a race condition where the signal handlers stomp on the global values of child_state and child_pid.
Here's a snippet of log -- the OOM killer kills the child processes each time that varnish attempts to run them (28004, 15403, 15640/-1).
[7270495.691960] Out of memory: kill process 27086 (varnishd) score 9908734 or a child
[7270495.691971] Killed process 28004 (varnishd)
1370466543.3 Child (28004) died signal=9
1370466543.3 Child cleanup complete
1370466543.4 child (15403) Started
1370466545.4 Child (15403) said Not running as root, no priv-sep
1370466545.4 Child (15403) said Child starts
1370466545.4 Child (15403) said SMF.s0 mmap'ed 268435456 bytes of 268435456
[7270511.916123] Out of memory: kill process 27086 (varnishd) score 9724109 or a child
[7270511.916134] Killed process 15403 (varnishd)
1370466558.8 Child (15403) not responding to CLI, killing it.
1370466559.6 Child (15403) died signal=9
1370466559.6 Child cleanup complete
1370466559.6 child (15640) Started
[7270520.798539] Out of memory: kill process 27086 (varnishd) score 6705406 or a child
[7270520.798551] Killed process 15640 (varnishd)
1370466568.1 Pushing vcls failed:
1370466568.1 CLI communication error (hdr)
1370466568.1 Stopping Child
1370466568.1 Child (15640) died signal=9
# (The -1 value is "interesting" given the check for >0; child_pid is a global and it's being set to -1 elsewhere -- so presumably start_child is being called from somewhere else; i.e. start_child is running, mgt_sigchild gets triggered, then flow returns to start_child right after and assigns -1 to child_pid)
1370466568.1 Child (-1) said Not running as root, no priv-sep
1370466568.1 Child (-1) said Child starts
1370466568.1 Child (-1) said SMF.s0 mmap'ed 268435456 bytes of 268435456
1370466568.1 Child cleanup complete
Reading through mgt_child.c, here's what I think is going on:
1) The signal handler is getting invoked when the child process is killed by the OOM killer
2) Then the parent process, inside the sig handler (mgt_sigchld), calls fork (inside mgt_sigchld's call to start_child), which causes a new child process to start
3) The OOM killer kills that second child process right away, before the sig handler has a chance to return
Does it maybe make sense to set the signal to be blocked for sigchild so that the events can be processed synchronously without losing them?
-Jeff
More information about the varnish-misc
mailing list