Child crashes when the backend is down

Dag-Erling Smørgrav des at linpro.no
Tue May 1 20:50:24 CEST 2007


This is probably what's killing my customer up north:

> % varnishd -a localhost:8080 -b localhost:80 -sfile,/tmp,4m -d -d

[note that there is no server at localhost:80]

> file /tmp/varnish.TIGksM (unlinked) size 4194304 bytes (1024 fs-blocks, 1024 pages)
> Using old SHMFILE
> rolling(1)...
> rolling(2)...
> start
> CLI <start>
> start child pid 7023
> 200 0
>
> Child said (2, 7023): <<Child starts
> managed to mmap 4194304 bytes of 4194304
> Ready
> CLI ready
> >>

[now point firefox at http://localhost:8080/]

> Child said (2, 7023): <<socktest: linger=0 sndtimeo=0 rcvtimeo=0
> >>
> Child not responding to ping

[because it is busy dumping core]

> Cache child died pid=7023 status=0x8b

[0x8b = SIGSEGV]

> Clean child
> Child cleaned
> start child pid 7039
> Child said (2, 7039): <<Child starts
> managed to mmap 4194304 bytes of 4194304
> Ready
> CLI ready
> >>
> Manager got SIGINT
> Clean child
> Child stopping
> unlink /tmp/vcl.XX9MHO5n

[let's look at that core file]

> % gdb =varnishd ~/core
> GNU gdb 6.6-debian
> Copyright (C) 2006 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu"...
> Using host libthread_db library "/lib/libthread_db.so.1".
> Error while mapping shared library sections:
> /tmp/vcl.XX9MHO5n: No such file or directory.
> Reading symbols from /opt/varnish/lib/libvarnish.so.0...done.
> Loaded symbols for /opt/varnish/lib/libvarnish.so.0
> Reading symbols from /opt/varnish/lib/libvcl.so.0...done.
> Loaded symbols for /opt/varnish/lib/libvcl.so.0
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /lib/librt.so.1...done.
> Loaded symbols for /lib/librt.so.1
> Reading symbols from /lib/libpthread.so.0...done.
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Symbol file not found for /tmp/vcl.XX9MHO5n
> Core was generated by `varnishd -a localhost:8080 -b localhost:80 -sfile,/tmp,4m -d -d'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002afef513d3d9 in pthread_mutex_lock () from /lib/libpthread.so.0
> (gdb) where
> #0  0x00002afef513d3d9 in pthread_mutex_lock () from /lib/libpthread.so.0
> #1  0x000000000040cfa3 in HSH_Unbusy (o=0x64e840) at cache_hash.c:195
> #2  0x00000000004095fc in cnt_fetch (sp=0x64c308) at cache_center.c:292
> #3  0x000000000040a428 in CNT_Session (sp=0x64c308) at steps.h:40
> #4  0x0000000000411c2b in wrk_do_one (w=0x43002c50) at cache_pool.c:191
> #5  0x0000000000411fe7 in wrk_thread (priv=0x632340) at cache_pool.c:231
> #6  0x00002afef513b2a5 in start_thread () from /lib/libpthread.so.0
> #7  0x00002afef541f61d in clone () from /lib/libc.so.6
> #8  0x0000000000000000 in ?? ()
> (gdb) up
> #1  0x000000000040cfa3 in HSH_Unbusy (o=0x64e840) at cache_hash.c:195
> 195             LOCK(&o->objhead->mtx);
> (gdb) l
> 190             CHECK_OBJ_NOTNULL(o, OBJECT_MAGIC);
> 191             assert(o->busy);
> 192             assert(o->refcnt > 0);
> 193             if (o->cacheable)
> 194                     EXP_Insert(o);
> 195             LOCK(&o->objhead->mtx);
> 196             o->busy = 0;
> 197             UNLOCK(&o->objhead->mtx);
> 198             while (1) {
> 199                     sp = TAILQ_FIRST(&o->waitinglist);
> (gdb) p *o
> $1 = {magic = 847584578, refcnt = 1, xid = 780407261, objhead = 0x0,
>   heap_idx = 0, ban_seq = 0, pass = 0, response = 0, valid = 0, cacheable = 0,
>   busy = 1, len = 0, age = 0, entered = 0, ttl = 0, last_modified = 0, http = {
>     magic = 1680389577, s = 0x0, t = 0x0, v = 0x0, f = 0x0, e = 0x0,
>     conds = 0 '\0', logtag = HTTP_Rx, hd = {{b = 0x0,
>         e = 0x0} <repeats 32 times>}, hdf = '\0' <repeats 31 times>, nhd = 0},
>   list = {tqe_next = 0x0, tqe_prev = 0x0}, deathrow = {tqe_next = 0x0,
>     tqe_prev = 0x0}, store = {tqh_first = 0x0, tqh_last = 0x64eb18},
>   waitinglist = {tqh_first = 0x0, tqh_last = 0x64eb28}}

There is no o->objhead because no object was ever retrieved and
Fetch() returned early.

Why is o->busy protected by o->objhead->mtx?  Why not have a mutex
directly in struct object instead?

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no



More information about the varnish-dev mailing list