PID stored in _.vsl doesn't appear to be correct in 2.1

Thu Apr 15 01:25:21 CEST 2010

This could easily be a misconfiguration, but I've been playing with
2.1 lately with minor modifications to a long-running 2.0.5/2.0.6
setup. My monit has an annoying tendency to lose track of varnishd and
try to start up a new instance, when one is already running. When this
happens, the original varnishd keeps running just fine but the _.vsl
is trashed, so things like varnishstat and varnishlog are just idle,
though I can see varnishd serving up traffic.

From strace'ing a subsequent run of varnishd while one is already
running, I see it doing the kill( <master PID>, 0 ) but it's not using
the PID that varnishd is running as. The /var/run/varnishd PID is
correct for the master process prior to the second run (the file gets
overwritten by each subsequent start-up). The PID that kill() is
trying is the original parent of the master process: the parent
clone()s the eventual master process and then exits, but the master
PID recorded in the _.vsl is that of the now-exited parent. Just a
handful of strace lines above the clone() in the master's parent
process, I can see it doing the open() on _.vsl and mmap()'ing it --
though I'm not ambitious enough to sift through gigs of ltrace to see
what/when it's writing the PID :)

With actual PIDs, it looks like this:
25084 (parent) -> 25099 (master; 25084 immediately calls exit_group())
-> 25106 (child process)

For this example, the subsequent varnishd is calling kill() on 25084,
which has already exited.
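
To make the suspected sequence concrete, here is a minimal sketch in
Python (not varnishd's code). It assumes a launcher that records its
own PID before forking the long-lived master and then exits, which is
only my guess at what is happening. The names run_demo and
pid_is_alive are made up for the illustration.

```python
import os
import signal
import struct
import time


def pid_is_alive(pid):
    """Probe a PID with signal 0, as kill(pid, 0) does: nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def run_demo():
    """Return (recorded_pid, actual_master_pid, (recorded_alive, actual_alive))."""
    r, w = os.pipe()
    launcher = os.fork()
    if launcher == 0:
        os.close(r)
        # Assumed bug: the PID is recorded before the fork, so it is the
        # launcher's PID rather than the master's.
        recorded = os.getpid()
        if os.fork() == 0:
            # This process plays the long-lived master.
            os.write(w, struct.pack("ii", recorded, os.getpid()))
            os.close(w)
            time.sleep(30)
            os._exit(0)
        # The launcher exits right away, like 25084 in the report.
        os._exit(0)
    os.close(w)
    data = os.read(r, 8)
    os.close(r)
    os.waitpid(launcher, 0)  # reap the launcher so its PID is really gone
    recorded, actual = struct.unpack("ii", data)
    result = (pid_is_alive(recorded), pid_is_alive(actual))
    os.kill(actual, signal.SIGTERM)  # clean up the stand-in master
    return recorded, actual, result


if __name__ == "__main__":
    recorded, actual, (recorded_alive, actual_alive) = run_demo()
    print("recorded pid", recorded, "alive:", recorded_alive)
    print("actual master pid", actual, "alive:", actual_alive)
```

If the real code does something similar, a later kill(<recorded PID>, 0)
would always report the master as gone, even while it is serving
traffic.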

On the command line, this error is generated by the 2nd run:

storage_file: filename: /var/cache/varnish/cache size 15000 MB.
SHMFILE used by orphan varnishd child process (pid=25106)
(We assume that process is busy dying.)
Creating new SHMFILE

Presumably it's due to this change:

"Try to detect the case of two running varnishes with the same shmlog
and storage by writing the master and child ids to the shmlog and
refusing to start if they are still running."
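
As I read that changelog entry and the messages above, the start-up
check is probably something along these lines. This is a hedged sketch
in Python, not the actual varnishd logic: classify_shmfile, its return
strings and the injectable probe argument are all hypothetical.

```python
import os


def pid_is_alive(pid):
    """Probe a PID with signal 0, as kill(pid, 0) does: nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def classify_shmfile(master_pid, child_pid, probe=pid_is_alive):
    """Decide what to do with an existing shmlog from the PIDs stored in it."""
    # A PID of 0 is treated as "not recorded"; kill(0, 0) would target
    # the caller's own process group instead.
    if master_pid != 0 and probe(master_pid):
        return "in-use"        # a master is still running: refuse to start
    if child_pid != 0 and probe(child_pid):
        return "orphan-child"  # master gone, child still there: recreate
    return "stale"             # nobody left: safe to reuse or recreate


if __name__ == "__main__":
    # PIDs from the report; the probe is faked so the example is repeatable.
    alive = {25099, 25106}
    print(classify_shmfile(25084, 25106, lambda pid: pid in alive))
    print(classify_shmfile(25099, 25106, lambda pid: pid in alive))
```

With the launcher's PID (25084) stored as the master, the first call
lands in the orphan-child branch, which matches the message I get; with
the real master PID (25099) the second run would have refused to start.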

If you guys would like me to try something or send along something,
let me know. My varnishd invocation is (ps output doesn't show the ='s
for the -p's for some reason):

varnishd -a :8099 -T :8100 -f /etc/varnish/my.vcl -t 0 -l 80m -s
file,/var/cache/varnish/cache,15000M -u nobody -P /var/run/varnishd -p
listen_depth 4096 -p thread_pools 6 -p thread_pool_max 800 -p
thread_pool_min 200 -p lru_interval 60 -p cli_timeout 20 -p
default_grace 120 -p thread_pool_stack 1048576