[Varnish] #610: rushing too often may overflow session workspace
Varnish
varnish-bugs at varnish-cache.org
Mon Mar 7 12:28:33 CET 2011
#610: rushing too often may overflow session workspace
----------------------+-----------------------------------------------------
Reporter: slink | Owner: kristian
Type: defect | Status: assigned
Priority: high | Milestone: Later
Component: varnishd | Version: trunk
Severity: normal | Keywords:
----------------------+-----------------------------------------------------
Changes (by kristian):
* status: new => assigned
* version: 2.0 => trunk
* milestone: => Later
* owner: phk => kristian
* severity: major => normal
Old description:
> (see also the discussion in
> http://projects.linpro.no/pipermail/varnish-dev/2009-December/002348.html
> )
>
> When very many sessions are waiting for a busy object from a slow
> backend, varnish may panic in HSH_prepare with a message similar to
> the following
>
> {{{
> varnishd[7518]: [ID 733861 local0.error] Child (8742) Panic message:
> Missing errorhandling code in HSH_Prepare(), cache_hash.c line 187:
> Condition((p) != 0) not true. errno = 32 (Broken pipe) thread =
> (cache-worker)sp = 10bc7038 {
> fd = 2322, id = 2322, xid = 441755642,
> client = X.X.X.X:61981,
> step = STP_LOOKUP,
> handling = hash,
> ws = 10bc70a8 { overflow
> id = "sess",
> {s,f,r,e} = {10bc7938,,+16340,0,+16384},
> },
> worker = fffffd7e5d504e00 {
> },
> vcl = {
> srcname = {
> "input",
> "Default",
> },
> },
> },
> }}}
>
> This panic message is, in many cases, a symptom of too small a
> session workspace (see also #472, #551, #560). For the issue
> documented here to apply, the session workspace should be inspected:
>
> {{{
> (gdb) x /52x sp->ws->e-1000
> 0x10bcb550: 0x00000000 0x00000000 0x10bc794f
> 0x00000000
> 0x10bcb560: 0x10bc7950 0x00000000 0x10bc7bcb
> 0x00000000
> 0x10bcb570: 0x10bc7bd6 0x00000000 0x00000000
> 0x00000000
> 0x10bcb580: 0x00000000 0x00000000 0x00000000
> 0x00000000
> 0x10bcb590: 0x10bc794f 0x00000000 0x10bc7950
> 0x00000000
> 0x10bcb5a0: 0x10bc7bcb 0x00000000 0x10bc7bd6
> 0x00000000
> 0x10bcb5b0: 0x00000000 0x00000000 0x00000000
> 0x00000000
> 0x10bcb5c0: 0x00000000 0x00000000 0x10bc794f
> 0x00000000
> 0x10bcb5d0: 0x10bc7950 0x00000000 0x10bc7bcb
> 0x00000000
> 0x10bcb5e0: 0x10bc7bd6 0x00000000 0x00000000
> 0x00000000
> 0x10bcb5f0: 0x00000000 0x00000000 0x00000000
> 0x00000000
> 0x10bcb600: 0x10bc794f 0x00000000 0x10bc7950
> 0x00000000
> 0x10bcb610: 0x10bc7bcb 0x00000000 0x10bc7bd6
> 0x00000000
> }}}
>
> If this issue applies, the workspace will be clobbered with a sequence
> of hash pointers:
>
> {{{
> (gdb) print sp->nhashptr
> $8 = 6
>
> (gdb) print *sp->hashptr
> $14 = 0x10bc794f "/"
> (gdb) print *(sp->hashptr+1)
> $15 = 0x10bc7950 ""
> (gdb) print *(sp->hashptr+2)
> $16 = 0x10bc7bcb "www.XXXX.de"
> (gdb) print *(sp->hashptr+3)
> $17 = 0x10bc7bd6 ""
> (gdb) print *(sp->hashptr+4)
> $18 = 0x0
> (gdb) print *(sp->hashptr+5)
> $19 = 0x0
> (gdb) print *(sp->hashptr+6)
> $20 = 0x0
> }}}
>
> In the case, the reason for the panic is that the session repeatedly
> got rushed (scheduled onto a thread) while the object it was waiting
> for was still busy. While this should, in principle, only add
> additional computational overhead for checking if the object being
> waited for has become available in the cache, a side effect of the
> particular (efficient) memory management used for workspaces causes
> session workspace exhaustion:
>
> When we determine in cnt_lookup that the requested object is busy,
> we've already allocated memory from the session workspace, so if we
> enter cnt_lookup very often, we will eventually clobber the session
> workspace and panic:
>
> {{{
> #!c
> if (sp->obj == NULL) {
> HSH_Prepare(sp, sp->vcl->nhashcount);
> VCL_hash_method(sp);
> assert(sp->handling == VCL_RET_HASH);
> }
>
> o = HSH_Lookup(sp);
>
> if (o == NULL) {
> /*
> * We lost the session to a busy object, disembark the
> * worker thread. The hash code to restart the session,
> * still in STP_LOOKUP, later when the busy object isn't.
> *
> * XXX When getting here, we have allocated space for
> * sp->hashptr on sp->http->ws using HSH_Prepare, which
> we never
> * return until the whole session object is destroyed. We
> would
> * want to use WS_Release on the space allocated by
> HSH_Prepare,
> * but VCL_hash_method(sp) could have allocated space and
> used
> * (refenced) it anywhere in the session object - so
> unless we
> * changed memory management into something less
> efficient, we
> * can't safely return the space.
> *
> * This imposes a practical limit on the number of
> restarts
> *
> */
> return (1);
> }
> }}}
>
> The comment explains why I think this cant be fixed easily.
>
> I will attach a suggested fix for varnish 2.0.5 which should reduce
> rushes to those cases when they actually make sense.
>
> I haven't looked at the current trunk yet, but as far as I remember,
> the relevant functions have been changed siginificantly.
>
> Thank you for this great product,
>
> Nils
New description:
(see also the discussion in
http://projects.linpro.no/pipermail/varnish-dev/2009-December/002348.html
)
When very many sessions are waiting for a busy object from a slow
backend, varnish may panic in HSH_prepare with a message similar to
the following
{{{
varnishd[7518]: [ID 733861 local0.error] Child (8742) Panic message:
Missing errorhandling code in HSH_Prepare(), cache_hash.c line 187:
Condition((p) != 0) not true. errno = 32 (Broken pipe) thread =
(cache-worker)sp = 10bc7038 {
fd = 2322, id = 2322, xid = 441755642,
client = X.X.X.X:61981,
step = STP_LOOKUP,
handling = hash,
ws = 10bc70a8 { overflow
id = "sess",
{s,f,r,e} = {10bc7938,,+16340,0,+16384},
},
worker = fffffd7e5d504e00 {
},
vcl = {
srcname = {
"input",
"Default",
},
},
},
}}}
This panic message is, in many cases, a symptom of too small a
session workspace (see also #472, #551, #560). For the issue
documented here to apply, the session workspace should be inspected:
{{{
(gdb) x /52x sp->ws->e-1000
0x10bcb550: 0x00000000 0x00000000 0x10bc794f 0x00000000
0x10bcb560: 0x10bc7950 0x00000000 0x10bc7bcb 0x00000000
0x10bcb570: 0x10bc7bd6 0x00000000 0x00000000 0x00000000
0x10bcb580: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb590: 0x10bc794f 0x00000000 0x10bc7950 0x00000000
0x10bcb5a0: 0x10bc7bcb 0x00000000 0x10bc7bd6 0x00000000
0x10bcb5b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb5c0: 0x00000000 0x00000000 0x10bc794f 0x00000000
0x10bcb5d0: 0x10bc7950 0x00000000 0x10bc7bcb 0x00000000
0x10bcb5e0: 0x10bc7bd6 0x00000000 0x00000000 0x00000000
0x10bcb5f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x10bcb600: 0x10bc794f 0x00000000 0x10bc7950 0x00000000
0x10bcb610: 0x10bc7bcb 0x00000000 0x10bc7bd6 0x00000000
}}}
If this issue applies, the workspace will be clobbered with a sequence
of hash pointers:
{{{
(gdb) print sp->nhashptr
$8 = 6
(gdb) print *sp->hashptr
$14 = 0x10bc794f "/"
(gdb) print *(sp->hashptr+1)
$15 = 0x10bc7950 ""
(gdb) print *(sp->hashptr+2)
$16 = 0x10bc7bcb "www.XXXX.de"
(gdb) print *(sp->hashptr+3)
$17 = 0x10bc7bd6 ""
(gdb) print *(sp->hashptr+4)
$18 = 0x0
(gdb) print *(sp->hashptr+5)
$19 = 0x0
(gdb) print *(sp->hashptr+6)
$20 = 0x0
}}}
In the case, the reason for the panic is that the session repeatedly
got rushed (scheduled onto a thread) while the object it was waiting
for was still busy. While this should, in principle, only add
additional computational overhead for checking if the object being
waited for has become available in the cache, a side effect of the
particular (efficient) memory management used for workspaces causes
session workspace exhaustion:
When we determine in cnt_lookup that the requested object is busy,
we've already allocated memory from the session workspace, so if we
enter cnt_lookup very often, we will eventually clobber the session
workspace and panic:
{{{
#!c
if (sp->obj == NULL) {
HSH_Prepare(sp, sp->vcl->nhashcount);
VCL_hash_method(sp);
assert(sp->handling == VCL_RET_HASH);
}
o = HSH_Lookup(sp);
if (o == NULL) {
/*
* We lost the session to a busy object, disembark the
* worker thread. The hash code to restart the session,
* still in STP_LOOKUP, later when the busy object isn't.
*
* XXX When getting here, we have allocated space for
* sp->hashptr on sp->http->ws using HSH_Prepare, which we
never
* return until the whole session object is destroyed. We
would
* want to use WS_Release on the space allocated by
HSH_Prepare,
* but VCL_hash_method(sp) could have allocated space and
used
* (refenced) it anywhere in the session object - so
unless we
* changed memory management into something less
efficient, we
* can't safely return the space.
*
* This imposes a practical limit on the number of
restarts
*
*/
return (1);
}
}}}
The comment explains why I think this cant be fixed easily.
I will attach a suggested fix for varnish 2.0.5 which should reduce
rushes to those cases when they actually make sense.
I haven't looked at the current trunk yet, but as far as I remember,
the relevant functions have been changed siginificantly.
Thank you for this great product,
Nils
--
--
Ticket URL: <http://www.varnish-cache.org/trac/ticket/610#comment:5>
Varnish <http://varnish-cache.org/>
The Varnish HTTP Accelerator
More information about the varnish-bugs
mailing list