Workspace exhaustion with Varnish 4 under load

Geoff Simmons geoff at
Thu May 28 20:25:23 CEST 2015


Hello all,

We're getting our first experience testing Varnish 4 under load
(v4.0.3), and we're having severe problems with workspace exhaustion
-- LostHeader in the logs, the lostheader stats counter increasing, and
various VMODs reporting insufficient-workspace errors in the log. The
lostheader stat increased in bursts of thousands per second, mostly
around 2K/s, peaking at about 9K/s. The lost headers and VMOD errors
break the proxies, and we've only seen the problem under load (not at
all with Varnish 3).

On V3 we have sess_workspace=256KB, so on V4 we started out with
workspace_client=workspace_backend=256KB. We tried doubling the value
on each test to address the problem, but even at 16MB (64 times the
value in V3) it still wasn't enough. At 32MB, varnishd filled up the
RAM and crashed. With V3, we run at well under 50% of available RAM on
same-sized machines.
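For reference, this is roughly how we're setting the workspaces (the
VCL path and listen address are placeholders, not our real config; the
parameter names are the V4 replacements for V3's sess_workspace):

```shell
# Start varnishd with the V4 workspace parameters
varnishd -a :80 -f /etc/varnish/default.vcl \
    -p workspace_client=256k \
    -p workspace_backend=256k

# The same parameters can also be changed on a running instance;
# new values only apply to workspaces allocated after the change.
varnishadm param.set workspace_client 256k
varnishadm param.set workspace_backend 256k
```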

Workspace config is different in V4, and memory pools are altogether
new, so I'm wondering what our mistake is.

Do we need to tune the memory pools? My reading of the code makes me
think that it won't help -- if workspace is too small, nothing about
the mempools can change that. Unless there are situations in which a
thread can't get workspace from a mempool?
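In case it matters, this is how I understand the mempool knobs (a
sketch -- I may be misreading the 4.0 parameter format, which as I
read it is min_pool,max_pool,max_age):

```shell
# Inspect the current mempool sizing parameters
varnishadm param.show pool_req
varnishadm param.show pool_sess
varnishadm param.show pool_vbo

# e.g. to grow the request pool (hypothetical values, for illustration)
varnishadm param.set pool_req 100,1000,10
```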

I'm also learning how to read the new MEMPOOL.* stats. The code makes
me think that MEMPOOL.*.randry > 0 and MEMPOOL.*.timeouts > 0 are not
a problem (?). Does MEMPOOL.*.surplus > 0 indicate a problem?
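For the record, this is how I'm pulling those counters (the pool name
req0 is just an example; pool names vary by instance):

```shell
# Dump all MEMPOOL counters once
varnishstat -1 | grep '^MEMPOOL'

# Or select the specific fields in question (-f may be repeated)
varnishstat -1 -f MEMPOOL.req0.randry \
    -f MEMPOOL.req0.timeouts \
    -f MEMPOOL.req0.surplus
```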

It wouldn't surprise me if we need larger workspaces in V4 than in V3
-- just not more than 64 times as much.

We're using new VMODs in V4 (necessarily, since VMODs have to be
ported for V4), and we've tried commenting out or working around the
ones we suspected of using excessive workspace, but so far that hasn't
helped.

The logs show that the offending requests are *always* ESI-included.
We have some deep ESI nesting, up to at least esi_level==7. We also
have some retry/restart logic for error responses, all of which uses
the same workspace, and I see workspace exhaustion on some of those
requests. But none of that differs from our V3 setup -- the backend
apps are the same, so the ESI nesting is as well, and there 256KB of
workspace is enough.
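This is how we're isolating the ESI subrequests in the logs (assuming
the third field of the Begin record carries the reason "esi" for
ESI-included requests):

```shell
# Group by request and show only ESI subrequests
varnishlog -g request -q 'Begin[3] eq "esi"'

# Or just look for the lost headers directly
varnishlog -g raw | grep -F LostHeader
```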

I'd be very grateful for any pointers about where we should be looking.

-- 
** * * UPLEX - Nils Goroll Systemoptimierung

Scheffelstraße 32
22301 Hamburg

Tel +49 40 2880 5731
Mob +49 176 636 90917
Fax +49 40 42949753


More information about the varnish-misc mailing list