[Varnish] #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions

Sun Jul 12 22:50:51 CEST 2015

#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
---------------------------------+----------------------
 Reporter:  geoff                |       Type:  defect
   Status:  new                  |   Priority:  normal
Milestone:  Varnish 4.0 release  |  Component:  varnishd
  Version:  4.0.3                |   Severity:  critical
 Keywords:                       |
---------------------------------+----------------------
 We have been experiencing a problem with a VSL client -- under certain
 conditions the dispatcher callback is no longer called, and the process
 resident size increases slowly over a long period of time, until all
 system RAM is consumed.

 {{{
 #6  0x00007f55851be85d in VAS_Fail_default (
     func=0x7f55851ddd10 "vtx_synth_rec", file=0x7f55851dd28a
 "vsl_dispatch.c",
     line=1008, cond=0x7f55851dda96 "(synth) != 0", err=12,
 kind=VAS_ASSERT)
     at ../libvarnish/vas.c:67
 #7  0x00007f55851ce903 in vtx_synth_rec (vtx=0x7d24a0, tag=76,
     fmt=0x7f55851ddafc "%s (%u:%s \"%.*s\")") at vsl_dispatch.c:1008
 #8  0x00007f55851cee17 in vtx_diag_tag (vtx=0x7d24a0, ptr=0x125823fe80,
     reason=0x7f55851dda23 "vxid mismatch") at vsl_dispatch.c:1065
 #9  0x00007f55851cdefb in vtx_scan (vslq=0x73d0d0, vtx=0x7d24a0)
     at vsl_dispatch.c:863
 #10 0x00007f55851ce1c2 in vtx_force (vslq=0x73d0d0, vtx=0x7d24a0,
     reason=0x7f55851ddc87 "timeout") at vsl_dispatch.c:911
 #11 0x00007f55851cffd4 in VSLQ_Dispatch (vslq=0x73d0d0,
     func=0x40a5cd <dispatch>, priv=0x0) at vsl_dispatch.c:1345
 }}}

 That's err=12==ENOMEM for the ALLOC_OBJ() at the beginning of
 vtx_synth_rec(). As you can see, VSL forced out a transaction due to
 timeout, and vtx_scan() also found a VXID mismatch. The vtx_scan() call is
 the second one in vtx_force().

 We don't yet know why the VXID mismatch happened, and it's obviously not
 OK, we're still investigating, but I suspect that vtx_scan() ran into an
 endless loop here:

 {{{
         while (!(vtx->flags & VTX_F_COMPLETE) &&
             vslc_vtx_next(&vtx->c.cursor) == 1) {
                 ptr = vtx->c.cursor.rec.ptr;
                 if (VSL_ID(ptr) != vtx->key.vxid) {
                         (void)vtx_diag_tag(vtx, ptr, "vxid mismatch");
                         continue;
                 }
 }}}

 I can confirm from the core dump that `VSL_ID(ptr) != vtx->key.vxid` is
 true here, so it goes into vtx_diag_tag() to synthesize "vxid mismatch".
 There was already a synthetic record added by vtx_force(), and
 vtx_diag_tag() causes a new one to be added. Then the while loop goes back
 to call vslc_vtx_next(), which advances the cursor to the synth record
 that vtx_diag_tag() just added. The VXID mismatch still holds, so once
 again vtx_diag_tag() gets called. Again a synth record is added, and again
 vslc_vtx_next() advances to that record.

 Thus we have an endless loop, and a struct synth gets allocated every
 time. That's only 104 bytes, but slowly and surely all of RAM gets filled.

 Our VSL client has a statistic that is incremented every time the
 dispatcher is called, and we see in the log that this counter stops
 increasing when memory size starts to increase. The counter is stopped for
 about 10 to 20 minutes while the resident size increases, until all of RAM
 is full. The process finally took up about 75 GB of RAM, meaning that it
 had to be about 750 million allocations, assuming that it was allocating
 the 104 bytes of struct synth every time, which is probably why it takes
 so long.

 We can see the large mapping is in the heap (low addresses), so it must
 have been something that was malloc'ed (we've also seen this in pmap):

 {{{
 (gdb) info files
 [...]
         0x0000000000400000 - 0x0000000000401000 is load1a
         0x0000000000401000 - 0x0000000000401000 is load1b
         0x0000000000613000 - 0x0000000000614000 is load2
         0x0000000000614000 - 0x000000000061d000 is load3
         0x0000000000703000 - 0x0000000000724000 is load4
         0x0000000000724000 - 0x0000001258261000 is load5
         [...]
 }}}

 0x1258261000 - 0x724000 == 75GB.

 And we can see that the last pointer of the synth list goes deep into that
 large mapping.

 {{{
 (gdb) p vtx->synth
 $87 = {vtqh_first = 0x866840, vtqh_last = 0x125823ff48}
 }}}

 The synth list apparently has a long cascade of "vxid mismatch" records:

 {{{
 (gdb) printf "%s\n",  vtx->synth.vtqh_first.data[0]
 :VSL "vxid mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n", vtx->synth.vtqh_first.list.vtqe_next.data[2]
 h (6489548:VSL "vxid mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.data[2]
 6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
 id mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
 id mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
 id mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
 id mismatch (6489548:VSL "vxid mi

 (gdb) printf "%s\n",
 vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
 id mismatch (6489548:VSL "vxid mi
 }}}

 We see all of this as evidence that vtx_scan() is trapped in the while
 loop.

-- 
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator