[Varnish] #1743: Workspace exhaustion under load
Varnish
varnish-bugs at varnish-cache.org
Fri May 29 06:43:13 CEST 2015
#1743: Workspace exhaustion under load
----------------------------------+----------------------
 Reporter:  geoff                 |       Type:  defect
   Status:  new                   |   Priority:  normal
Milestone:  Varnish 4.0 release   |  Component:  varnishd
  Version:  4.0.3                 |   Severity:  normal
 Keywords:  workspace, esi, load  |
----------------------------------+----------------------
Following up on this message in varnish-misc:
https://www.varnish-cache.org/lists/pipermail/varnish-misc/2015-May/024426.html
The problem is workspace exhaustion during load tests: LostHeader records
appear in the logs, the losthdr stats counter increases, and VMODs report
insufficient workspace. We have only seen these problems under load. The
current production Varnish 3 setup runs against the same backends without
the problem.
In V3 we have sess_workspace=256KB. In V4 we started with
workspace_client=workspace_backend=256KB and kept doubling the values, up
to 16MB, still getting the problem. At 32MB, varnishd filled up RAM; in V3
we run at well under 50% of available RAM on machines of the same size.
When we captured logs that include LostHeader records, we found that the
offending requests were always ESI includes. The apps have some deep ESI
nesting, up to at least esi_level=7. In some, but not all, cases we can see
that there were backend retries, due to VCL logic that retries requests
after 5xx responses.
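For reference, the relevant logic is roughly of the following shape. This
is a simplified sketch, not our actual VCL; the backend definition, the
Surrogate-Control condition and the retry limit of 1 are placeholders:

vcl 4.0;

backend app { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_backend_response {
    # Enable ESI processing on fragments the application marks for it
    # (the exact condition here is a placeholder).
    if (beresp.http.Surrogate-Control ~ "ESI/1.0") {
        set beresp.do_esi = true;
    }

    # Retry backend requests that come back with a 5xx, as described
    # above (the limit of one retry is a placeholder).
    if (beresp.status >= 500 && bereq.retries < 1) {
        return (retry);
    }
}
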
The losthdr counter increases in bursts when this happens, often at a rate
of about 2K/s and up to about 9K/s. The bursts seem to last about 10-30
seconds, and then the rate of increase drops back to 0. We have 3 proxies
in the cluster, and the error bursts don't necessarily happen on all 3 at
the same time.
The problem may be related to backend availability, but I'm not entirely
sure of that. The backends occasionally redeploy while load tests are
running, and some of the error bursts may have come when that happened. The
errors also tend to increase when the load is high, which may just be due
to the higher load on varnishd, but might also be related to the backends
throwing errors under load. We had one run with no errors at all, during
evening hours when there are no redeployments. On the other hand, we've
also had evening runs with errors, and sometimes the error bursts have come
shortly after the load test starts, when load is still ramping up and is
far from the maximum.
VMODs in use are:
* std and director
* header (V4 version from https://github.com/varnish/libvmod-header)
* urlcode (V4 version from https://github.com/fastly/libvmod-urlcode)
* uuid (as updated for V4 at https://github.com/otto-de/libvmod-uuid)
* re (https://code.uplex.de/uplex-varnish/libvmod-re)
* vtstor (https://code.uplex.de/uplex-varnish/libvmod-vtstor)
We tried working around the use of VMOD re in VCL, since it stores the
subject of a regex match in workspace, and we use it on Cookie headers,
which can be very large. But that didn't solve the problem.
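To illustrate the kind of workaround we tried (simplified, with made-up
header and cookie names, not our actual VCL):

vcl 4.0;

backend app { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_recv {
    # Hypothetical: pull a single value out of a potentially very large
    # Cookie header with the built-in regsub() instead of vmod re, so the
    # whole Cookie string doesn't have to be copied into workspace by the
    # vmod.
    if (req.http.Cookie ~ "(^|;\s*)sessionid=") {
        set req.http.X-Session-Id =
            regsub(req.http.Cookie, "^(.*;\s*)?sessionid=([^;]*).*$", "\2");
    }
}

As far as I can tell, regsub() allocates its result from workspace as well,
so this only avoids the extra copy of the large subject string; in any case
it didn't make the errors go away.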
VMOD vtstor only uses workspace for the size of a VXID rendered as a string
(otherwise it mallocs its own structures), and uuid only uses workspace for
the size of a UUID string.
I'm learning how to read MEMPOOL.* stats, and I've noticed randry > 0,
timeouts > 0 and surplus > 0. But my reading of the code makes me think
that these don't indicate problems (except possibly surplus > 0?), and
that mempools can't help you anyway if workspaces are too small.
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator