[Varnish] #1743: Workspace exhaustion under load
Varnish
varnish-bugs at varnish-cache.org
Fri May 29 06:43:13 CEST 2015
#1743: Workspace exhaustion under load
----------------------------------+----------------------
 Reporter:  geoff                 |       Type:  defect
   Status:  new                   |   Priority:  normal
Milestone:  Varnish 4.0 release   |  Component:  varnishd
  Version:  4.0.3                 |   Severity:  normal
 Keywords:  workspace, esi, load  |
----------------------------------+----------------------
Following up on this message in varnish-misc:
https://www.varnish-cache.org/lists/pipermail/varnish-misc/2015-May/024426.html
The problem is workspace exhaustion during load tests: LostHeader records
appear in the logs, the losthdr stats counter increases, and VMODs report
insufficient workspace. We have only seen these problems under load. The
current production Varnish 3 setup runs against the same backends without
the problem.
In V3 we have sess_workspace=256KB. In V4 we started with
workspace_client=workspace_backend=256KB and kept doubling the values, up
to 16MB, still getting the problem. At 32MB, varnishd filled up RAM; in V3
we run at well under 50% of available RAM on machines of the same size.
When we captured logs that include LostHeader records, we found that the
offending requests were always ESI includes. The apps have some deep ESI
nesting, up to at least esi_level=7. In some, but not all, cases we can see
that there were backend retries, due to VCL logic that retries requests
after 5xx responses.
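For reference, the relevant logic is roughly of the following shape. This
is a simplified sketch, not our actual VCL; the backend definition, the
Surrogate-Control condition and the retry limit of 1 are placeholders:

vcl 4.0;

backend app { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_backend_response {
    # Enable ESI processing on fragments the application marks for it
    # (the exact condition here is a placeholder).
    if (beresp.http.Surrogate-Control ~ "ESI/1.0") {
        set beresp.do_esi = true;
    }

    # Retry backend requests that come back with a 5xx, as described
    # above (the limit of one retry is a placeholder).
    if (beresp.status >= 500 && bereq.retries < 1) {
        return (retry);
    }
}
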
The losthdr counter increases in bursts when this happens, often at a rate
of about 2K/s and up to about 9K/s. The bursts seem to last about 10-30
seconds, and then the rate of increase drops back to 0. We have 3 proxies
in the cluster, and the error bursts don't necessarily happen on all 3 at
the same time.
The problem may be related to backend availability, but I'm not entirely
sure of that. The backends occasionally redeploy while load tests are
running, and some of the error bursts may have come when that happened. The
errors also tend to increase when the load is high, which may just be due
to the higher load on varnishd, but might also be related to the backends
throwing errors under load. We had one run with no errors at all, during
evening hours when there are no redeployments. On the other hand, we've
also had evening runs with errors, and sometimes the error bursts have come
shortly after the load test starts, when load is still ramping up and is
far from the maximum.
VMODs in use are:
* std and director
* header (V4 version from https://github.com/varnish/libvmod-header)
* urlcode (V4 version from https://github.com/fastly/libvmod-urlcode)
* uuid (as updated for V4 at https://github.com/otto-de/libvmod-uuid)
* re (https://code.uplex.de/uplex-varnish/libvmod-re)
* vtstor (https://code.uplex.de/uplex-varnish/libvmod-vtstor)
We tried working around the use of VMOD re in VCL, since it stores the
subject of a regex match in workspace, and we use it on Cookie headers,
which can be very large. But that didn't solve the problem.
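To illustrate the kind of workaround we tried (simplified, with made-up
header and cookie names, not our actual VCL):

vcl 4.0;

backend app { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_recv {
    # Hypothetical: pull a single value out of a potentially very large
    # Cookie header with the built-in regsub() instead of vmod re, so the
    # whole Cookie string doesn't have to be copied into workspace by the
    # vmod.
    if (req.http.Cookie ~ "(^|;\s*)sessionid=") {
        set req.http.X-Session-Id =
            regsub(req.http.Cookie, "^(.*;\s*)?sessionid=([^;]*).*$", "\2");
    }
}

As far as I can tell, regsub() allocates its result from workspace as well,
so this only avoids the extra copy of the large subject string; in any case
it didn't make the errors go away.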
VMOD vtstor only uses workspace for the size of a VXID rendered as a string
(otherwise it mallocs its own structures), and uuid only uses workspace for
the size of a UUID string.
I'm learning how to read MEMPOOL.* stats, and I've noticed randry > 0,
timeouts > 0 and surplus > 0. But my reading of the code makes me think
that these don't indicate problems (except possibly surplus > 0?), and
that mempools can't help you anyway if workspaces are too small.
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator