[Varnish] #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions
Varnish
varnish-bugs at varnish-cache.org
Sun Jul 12 22:50:51 CEST 2015
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
---------------------------------+----------------------
Reporter: geoff | Type: defect
Status: new | Priority: normal
Milestone: Varnish 4.0 release | Component: varnishd
Version: 4.0.3 | Severity: critical
Keywords: |
---------------------------------+----------------------
We have been experiencing a problem with a VSL client -- under certain
conditions the dispatcher callback is no longer called, and the process
resident size increases slowly over a long period of time, until all
system RAM is consumed.
{{{
#6 0x00007f55851be85d in VAS_Fail_default (
func=0x7f55851ddd10 "vtx_synth_rec", file=0x7f55851dd28a
"vsl_dispatch.c",
line=1008, cond=0x7f55851dda96 "(synth) != 0", err=12,
kind=VAS_ASSERT)
at ../libvarnish/vas.c:67
#7 0x00007f55851ce903 in vtx_synth_rec (vtx=0x7d24a0, tag=76,
fmt=0x7f55851ddafc "%s (%u:%s \"%.*s\")") at vsl_dispatch.c:1008
#8 0x00007f55851cee17 in vtx_diag_tag (vtx=0x7d24a0, ptr=0x125823fe80,
reason=0x7f55851dda23 "vxid mismatch") at vsl_dispatch.c:1065
#9 0x00007f55851cdefb in vtx_scan (vslq=0x73d0d0, vtx=0x7d24a0)
at vsl_dispatch.c:863
#10 0x00007f55851ce1c2 in vtx_force (vslq=0x73d0d0, vtx=0x7d24a0,
reason=0x7f55851ddc87 "timeout") at vsl_dispatch.c:911
#11 0x00007f55851cffd4 in VSLQ_Dispatch (vslq=0x73d0d0,
func=0x40a5cd <dispatch>, priv=0x0) at vsl_dispatch.c:1345
}}}
That's err=12==ENOMEM for the ALLOC_OBJ() at the beginning of
vtx_synth_rec(). As you can see, VSL forced out a transaction due to
timeout, and vtx_scan() also found a VXID mismatch. The vtx_scan() call is
the second one in vtx_force().
We don't yet know why the VXID mismatch happened, and it's obviously not
OK, we're still investigating, but I suspect that vtx_scan() ran into an
endless loop here:
{{{
while (!(vtx->flags & VTX_F_COMPLETE) &&
vslc_vtx_next(&vtx->c.cursor) == 1) {
ptr = vtx->c.cursor.rec.ptr;
if (VSL_ID(ptr) != vtx->key.vxid) {
(void)vtx_diag_tag(vtx, ptr, "vxid mismatch");
continue;
}
}}}
I can confirm from the core dump that `VSL_ID(ptr) != vtx->key.vxid` is
true here, so it goes into vtx_diag_tag() to synthesize "vxid mismatch".
There was already a synthetic record added by vtx_force(), and
vtx_diag_tag() causes a new one to be added. Then the while loop goes back
to call vslc_vtx_next(), which advances the cursor to the synth record
that vtx_diag_tag() just added. The VXID mismatch still holds, so once
again vtx_diag_tag() gets called. Again a synth record is added, and again
vslc_vtx_next() advances to that record.
Thus we have an endless loop, and a struct synth gets allocated every
time. That's only 104 bytes, but slowly and surely all of RAM gets filled.
Our VSL client has a statistic that is incremented every time the
dispatcher is called, and we see in the log that this counter stops
increasing when memory size starts to increase. The counter is stopped for
about 10 to 20 minutes while the resident size increases, until all of RAM
is full. The process finally took up about 75 GB of RAM, meaning that it
had to be about 750 million allocations, assuming that it was allocating
the 104 bytes of struct synth every time, which is probably why it takes
so long.
We can see the large mapping is in the heap (low addresses), so it must
have been something that was malloc'ed (we've also seen this in pmap):
{{{
(gdb) info files
[...]
0x0000000000400000 - 0x0000000000401000 is load1a
0x0000000000401000 - 0x0000000000401000 is load1b
0x0000000000613000 - 0x0000000000614000 is load2
0x0000000000614000 - 0x000000000061d000 is load3
0x0000000000703000 - 0x0000000000724000 is load4
0x0000000000724000 - 0x0000001258261000 is load5
[...]
}}}
0x1258261000 - 0x724000 == 75GB.
And we can see that the last pointer of the synth list goes deep into that
large mapping.
{{{
(gdb) p vtx->synth
$87 = {vtqh_first = 0x866840, vtqh_last = 0x125823ff48}
}}}
The synth list apparently has a long cascade of "vxid mismatch" records:
{{{
(gdb) printf "%s\n", vtx->synth.vtqh_first.data[0]
:VSL "vxid mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n", vtx->synth.vtqh_first.list.vtqe_next.data[2]
h (6489548:VSL "vxid mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.data[2]
6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
}}}
We see all of this as evidence that vtx_scan() is trapped in the while
loop.
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator
More information about the varnish-bugs
mailing list