Strategies for splitting load across varnish instances? And avoiding single-point-of-failure?

Ken Brownfield kb+varnish at
Fri Jan 15 22:49:40 CET 2010

Lots of good suggestions; I would look to LVS and/or haproxy for going on the cheap; otherwise a NetScaler or F5 would do the trick.

With multiple caches, there are three ways I see to handle it:

1) Duplicate cached data on all Varnish instances.

This is a simple, stateless configuration, but it multiplies your origin miss traffic: 2 servers will double your miss traffic, 3 servers will triple it, and so on, because each cache has to fetch every object from the origin independently.  If your miss rate is low enough that your origin can handle the extra misses, this is simple, easy to implement, and offers good performance (CPU and link scalability).  If you lose a host, you would see essentially no miss-rate increase.
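To make the miss-traffic math concrete, here's a toy Python simulation (not Varnish code; the URLs and counts are just illustrative) of round-robin over N independent, duplicated caches:

```python
import itertools

def origin_fetches(requests, n_caches):
    """Send requests round-robin to n_caches independent caches
    (each cache keeps its own full copy of the working set) and
    count how many fetches reach the origin."""
    caches = [set() for _ in range(n_caches)]
    rr = itertools.cycle(range(n_caches))
    fetches = 0
    for url in requests:
        cache = caches[next(rr)]
        if url not in cache:   # cold miss: fetch from origin
            fetches += 1
            cache.add(url)
    return fetches

# 1,000 requests cycling through 7 unique URLs:
requests = ["/page/%d" % (i % 7) for i in range(1000)]
print(origin_fetches(requests, 1))  # 7
print(origin_fetches(requests, 2))  # 14 -- every object fetched twice
print(origin_fetches(requests, 3))  # 21 -- three times
```

Every cache eventually misses every object once, so origin load scales with the number of caches, not the size of the working set.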

But it won't scale forever.  Servers with 8GB of RAM will only ever provide 8GB of effective cache between them, whether you run 4 machines or 400 (the same applies to disk space).

2) Hash URLs or otherwise bucket incoming requests and send unique traffic to each Varnish instance.

This requires some smarts in your load balancer, but it means you can add as many Varnish instances as you want without losing hit rate, and each server's RAM footprint is additive: 8 servers with 8GB of RAM provide 64GB of RAM overall for caching.

But there are caveats:

2a) If you lose a cache server, you will get a 100% miss rate for all objects that used to be directed to that server.  This might overwhelm your origin.

2b) If you resize the cache server pool, the hash (or other algorithm) will send different objects to different machines, which will increase misses (possibly catastrophically) and may overwhelm your origin.  Naive implementations of URL hashing (e.g. modulo the server count) will cause a full rehashing of URLs if a single host even temporarily goes out of service.  Conversely, a consistent-hash implementation only remaps the keys that belonged to the changed host, which softens both the resize penalty and the effect of server loss.
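A minimal consistent-hash ring in Python shows why resizing is so much gentler than modulo hashing — this is a sketch, the node names and vnode count are made up, and a real deployment would use the load balancer's own hashing:

```python
import bisect
import hashlib

def _h(key):
    # Stable 64-bit hash (Python's built-in hash() is salted per process)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring: resizing the pool only remaps
    the URLs that land on the added/removed node's ring segments."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_h("%s#%d" % (node, i)), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    def node_for(self, url):
        idx = bisect.bisect(self._keys, _h(url)) % len(self._ring)
        return self._ring[idx][1]

# Adding a fourth cache moves only about a quarter of the URLs;
# with hash(url) % n, nearly all of them would move.
urls = ["/obj/%d" % i for i in range(10000)]
before = HashRing(["cache1", "cache2", "cache3"])
after = HashRing(["cache1", "cache2", "cache3", "cache4"])
moved = sum(before.node_for(u) != after.node_for(u) for u in urls)
print("%.0f%% of URLs remapped" % (100.0 * moved / len(urls)))  # roughly 25%
```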

3) Hash/bucket URLs to cache pairs.

Same as 2), but for every hash bucket you would send those hits to two machines (think RAID-10).  This provides redundancy against 2a), and gives essentially infinite scalability for the price of doubling your miss rate once (two machines per bucket caching the same data).  The caveat from 2b) still applies.

With the right hash/bucket algorithm, 3) is probably the best choice.  But they all have drawbacks.

I'd love to hear other ideas too.

Hope it helps,

John Norman wrote:
> Folks,
> A couple more questions:
> (1) Are they any good strategies for splitting load across Varnish
> front-ends? Or is the common practice to have just one Varnish server?
> (2) How do people avoid single-point-of-failure for Varnish? Do people
> run Varnish on two servers, amassing similar local caches, but put
> something in front of the two Varnishes? Or round-robin-DNS?
> John
