Tuning varnish for high load

Fri Feb 29 21:23:34 CET 2008

On Thu, Feb 28, 2008 at 9:52 PM, Mark Smallcombe <mark at funnyordie.com> wrote:

>  What tuning recommendations do you have for varnish to help it handle high load?

Funny you should ask: I've been spending a lot of time with Varnish in
the lab.  Here are a few observations I've made:

(N.B.  We're using 4-CPU Xeon hardware running RHEL 4.5, which runs
the 2.6.9 Linux kernel.  All machines have at least 4GB RAM and run
the 64-bit Varnish build, but our results are equally applicable to
32-bit builds.)

- When the cache hit ratio is very high (i.e. 100%), we discovered
that Varnish's default value for thread_pool_max is too high.
When there are too many worker threads, Varnish spends an inordinate
amount of time in system call space.  We're not sure whether this is
due to some flaw in Varnish, our ancient Linux kernel (we were unable
to test with a modern 2.6.22 or later kernel that apparently has a
better scheduler), or is just a fundamental problem when a threaded
daemon like Varnish tries to service thousands of concurrent
connections.  After much tweaking we determined that, on our hardware,
the optimal ratio of threads per CPU is about 16, or around 48-50
threads on a 4-CPU box.  To eliminate dropping work requests, it is
also advisable to raise overflow_max to a significantly higher ratio
than the default (e.g. 10000%).  This will cause Varnish to consume
somewhat more RAM, but will provide outstanding performance.  With
these tweaks, we were able to get Varnish to serve 10,000 concurrent
connections, flooding a Gigabit Ethernet channel with 5 KB cached
objects.

- Conversely, when the cache hit ratio is 0%, the default of 100
threads is too low. (To create this scenario, we used 2 Varnish boxes:
 the front-end proxy was configured to "pass" all requests to an
optimized backend Varnish instance that served all requests from
cache.)  On the same 4-CPU hardware, we found that the optimal
thread_pool_max value in this situation is about 750.  Again, we were
able to serve 10,000 concurrent connections after optimizing the
settings.
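
To make the two scenarios above concrete, here is a minimal Python
sketch that turns the figures from this post into varnishd "-p
name=value" arguments.  The profile names ("all_hits", "all_passes")
and the helper function are made up for illustration; the numbers are
the ones measured on our 4-CPU boxes and should be treated as starting
points, not as recommended values for other hardware.

```python
# Figures quoted in this post for a 4-CPU box.  The profile names are
# hypothetical labels, not Varnish terminology.
TUNING = {
    # ~100% cache hits: few worker threads, much larger overflow queue.
    "all_hits": {"thread_pool_max": 50, "overflow_max": 10000},
    # 0% cache hits (everything passed to the backend): many threads.
    "all_passes": {"thread_pool_max": 750},
}


def varnishd_param_args(profile):
    """Return varnishd '-p name=value' arguments for the named profile."""
    args = []
    for name, value in TUNING[profile].items():
        args += ["-p", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Listen address, backend and storage options are omitted here and
    # would have to be added for a real invocation.
    print(" ".join(["varnishd"] + varnishd_param_args("all_hits")))
```

The same parameters can also be changed on a running instance with
param.set on the management interface, which is handy when
experimenting with different values under load.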

I find this interesting, because one would expect the system to spend
much more time in the scheduler in the second scenario, where Varnish
is doing significantly less work per request (no lookups, just
handing off connections to the appropriate backend).  I suspect that
there may be some thread-scalability issues with the cache lookup
process.   If someone with a suitably powerful lab setup (i.e. Gigabit
Ethernet, big hardware) can test with a more modern Linux kernel, I'd
be very interested in the results.  Feel free to contact me if you
need assistance with setup/analysis.

Finally: Varnish performance is absolutely atrocious on an 8-CPU RHEL
4.5 system -- so bad that I have to turn down thread_pool_max to 4 or
restrict it to run only on 4 CPUs via taskset(1).  I've heard that
MySQL has similar problems, so I suspect that this is a Linux kernel
issue.
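
As a rough illustration of the taskset(1) workaround, the Python sketch
below builds a command line that pins a process to a given set of CPUs
using taskset's -c (CPU list) option.  The helper name and the bare
"varnishd" command are placeholders; a real invocation needs the usual
varnishd options, and CPUs 0-3 are only an assumption about which four
cores to use.

```python
def taskset_command(cpus, command):
    """Prefix `command` with taskset so it runs only on the given CPUs."""
    # taskset -c takes a comma-separated list of CPU numbers.
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    return ["taskset", "-c", cpu_list] + list(command)


if __name__ == "__main__":
    # Restrict a (placeholder) varnishd invocation to the first 4 CPUs.
    print(" ".join(taskset_command(range(4), ["varnishd"])))
```

The resulting list can be passed to subprocess.run() or joined into a
shell command, whichever suits the init script in use.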

Best regards,

--Michael