Wildly different cache object count, cache size, and cache hit rate on 1 out of 3 servers (Exact same config)

Fri Jul 25 04:56:02 CEST 2014

I'm an engineer on the eng-ops side of my company and I cannot make sense
of recent numbers from our Varnish monitoring. I've got three servers that
run production Solr, each with Varnish on the same node in front. These are
physical servers with plenty of resources.

For the past few days, because of JAR incompatibilities, I've been
downgrading them from Solr 4.8 to Solr 4.7.2. I do not think this would
impact Varnish (since once downgraded they serve the exact same data
again), but ever since work began the numbers for cache object count, cache
size, and cache hit rate have been... weird.

I think I've used the word 'weird' to describe a problem maybe twice in my
career, because usually there's some kind of reason and I can say: well,
that's because of the thingamajig going out of whack.

These machines are all provisioned via Chef. Same CentOS, same physical
specs, same versions down to patch level of all services, Varnish, et al.
Configs have been diffed, with no discrepancies that might explain this.
They are behind a Stingray load balancer and have been verified as getting
pretty much the same division of load consistently.

For the past few days, with no real pattern, one machine has been way above
the others. I mean variances of a million cached objects, 5GB on the cache
size, and a glaring 50% on hit rate. If there's one pattern, it's that one
machine arbitrarily did about twice as much as the other two, which were
roughly equal to each other.
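
To make the hit-rate comparison concrete, here is a minimal sketch of how
the per-machine variance could be measured. It assumes the monitor samples
the cumulative cache_hit and cache_miss counters that varnishstat reports
(names may carry a MAIN. prefix depending on the Varnish version). The
function names and the dict layout are hypothetical, for illustration only.

```python
def interval_hit_rate(before, after):
    """Hit rate over the interval between two counter samples.

    Each sample is a dict holding cumulative 'cache_hit' and 'cache_miss'
    counts (assumed to come from varnishstat). Returns None when no cache
    lookups happened in the interval.
    """
    hits = after["cache_hit"] - before["cache_hit"]
    misses = after["cache_miss"] - before["cache_miss"]
    lookups = hits + misses
    return hits / lookups if lookups > 0 else None


def compare_hosts(samples):
    """Compare hit rates across hosts.

    samples maps a host name to a (before, after) pair of counter samples.
    Returns (rates, spread), where spread is the gap between the highest
    and lowest known hit rate, or None if no host had any lookups.
    """
    rates = {host: interval_hit_rate(b, a) for host, (b, a) in samples.items()}
    known = [r for r in rates.values() if r is not None]
    spread = max(known) - min(known) if known else None
    return rates, spread
```

Deltas between two samples are used rather than raw totals because the
counters are cumulative since varnishd started, so a machine that was
restarted more recently would otherwise look different for that reason
alone.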

I'm really stumped here. We run jobs that query Solr every hour plus, so it
stays in use outside of normal business hours. But here's the other thing:
I just checked the monitoring before posting to get the variances and found
that at about 22:00 EST there was a convergence, and all three are now at
the same low rate.

Could they somehow have been stabilizing their caches, with one just
shooting ahead for no reason? If the variances were small I'd say fine...
But first one machine had massive variances, then another, and they have no
pattern that I can see. They didn't even correspond to the order in which
the machines were brought out of active service and downgraded.

I can post config if it helps. What could possibly cause one of three
identical Varnish machines to so drastically outperform the others?

Thanks! Hope this is the right place to ask. Really appreciate any thoughts
you folks can lend!

Jon Bogaty,
Magnetic, DevOps Engineer