[Varnish] #610: rushing too often may overflow session workspace
Varnish
varnish-bugs at projects.linpro.no
Tue Dec 29 17:44:25 CET 2009
#610: rushing too often may overflow session workspace
----------------------+-----------------------------------------------------
Reporter: slink | Owner: phk
Type: defect | Status: new
Priority: high | Milestone:
Component: varnishd | Version: 2.0
Severity: major | Keywords:
----------------------+-----------------------------------------------------
(see also the discussion in
http://projects.linpro.no/pipermail/varnish-dev/2009-December/002348.html
)
When very many sessions are waiting for a busy object from a slow
backend, varnish may panic in HSH_prepare with a message similar to
the following
{{{
varnishd[7518]: [ID 733861 local0.error] Child (8742) Panic message:
Missing errorhandling code in HSH_Prepare(), cache_hash.c line 187:
Condition((p) != 0) not true. errno = 32 (Broken pipe) thread =
(cache-worker)sp = 10bc7038 {
fd = 2322, id = 2322, xid = 441755642,
client = X.X.X.X:61981,
step = STP_LOOKUP,
handling = hash,
ws = 10bc70a8 { overflow
id = "sess",
{s,f,r,e} = {10bc7938,,+16340,0,+16384},
},
worker = fffffd7e5d504e00 {
},
vcl = {
srcname = {
"input",
"Default",
},
},
},
}}}
This panic message is, in many cases, a symptom of too small a
session workspace (see also #472, #551, #560). For the issue
documented here to apply, the session workspace should be inspected:
{{{
(gdb) x /52x sp->ws->e-1000
0x10bcb550: 0x00000000 0x00000000 0x10bc794f 0x00000000
0x10bcb560: 0x10bc7950 0x00000000 0x10bc7bcb 0x00000000
0x10bcb570: 0x10bc7bd6 0x00000000 0x00000000 0x00000000
0x10bcb580: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb590: 0x10bc794f 0x00000000 0x10bc7950 0x00000000
0x10bcb5a0: 0x10bc7bcb 0x00000000 0x10bc7bd6 0x00000000
0x10bcb5b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb5c0: 0x00000000 0x00000000 0x10bc794f 0x00000000
0x10bcb5d0: 0x10bc7950 0x00000000 0x10bc7bcb 0x00000000
0x10bcb5e0: 0x10bc7bd6 0x00000000 0x00000000 0x00000000
0x10bcb5f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb600: 0x10bc794f 0x00000000 0x10bc7950 0x00000000
0x10bcb610: 0x10bc7bcb 0x00000000 0x10bc7bd6 0x00000000
}}}
If this issue applies, the workspace will be clobbered with a sequence
of hash pointers:
{{{
(gdb) print sp->nhashptr
$8 = 6
(gdb) print *sp->hashptr
$14 = 0x10bc794f "/"
(gdb) print *(sp->hashptr+1)
$15 = 0x10bc7950 ""
(gdb) print *(sp->hashptr+2)
$16 = 0x10bc7bcb "www.XXXX.de"
(gdb) print *(sp->hashptr+3)
$17 = 0x10bc7bd6 ""
(gdb) print *(sp->hashptr+4)
$18 = 0x0
(gdb) print *(sp->hashptr+5)
$19 = 0x0
(gdb) print *(sp->hashptr+6)
$20 = 0x0
}}}
In the case, the reason for the panic is that the session repeatedly
got rushed (scheduled onto a thread) while the object it was waiting
for was still busy. While this should, in principle, only add
additional computational overhead for checking if the object being
waited for has become available in the cache, a side effect of the
particular (efficient) memory management used for workspaces causes
session workspace exhaustion:
When we determine in cnt_lookup that the requested object is busy,
we've already allocated memory from the session workspace, so if we
enter cnt_lookup very often, we will eventually clobber the session
workspace and panic:
{{{
#!c
if (sp->obj == NULL) {
HSH_Prepare(sp, sp->vcl->nhashcount);
VCL_hash_method(sp);
assert(sp->handling == VCL_RET_HASH);
}
o = HSH_Lookup(sp);
if (o == NULL) {
/*
* We lost the session to a busy object, disembark the
* worker thread. The hash code to restart the session,
* still in STP_LOOKUP, later when the busy object isn't.
*
* XXX When getting here, we have allocated space for
* sp->hashptr on sp->http->ws using HSH_Prepare, which we
never
* return until the whole session object is destroyed. We
would
* want to use WS_Release on the space allocated by
HSH_Prepare,
* but VCL_hash_method(sp) could have allocated space and
used
* (refenced) it anywhere in the session object - so
unless we
* changed memory management into something less
efficient, we
* can't safely return the space.
*
* This imposes a practical limit on the number of
restarts
*
*/
return (1);
}
}}}
The comment explains why I think this cant be fixed easily.
I will attach a suggested fix for varnish 2.0.5 which should reduce
rushes to those cases when they actually make sense.
I haven't looked at the current trunk yet, but as far as I remember,
the relevant functions have been changed siginificantly.
Thank you for this great product,
Nils
--
Ticket URL: <http://varnish.projects.linpro.no/ticket/610>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator
More information about the varnish-bugs
mailing list