Varnish nightmare after upgrading: need help
raph at futomaki.net
Tue Nov 14 22:24:48 UTC 2017
First of all, despite my mail subject, I really appreciate Varnish.
We use it a lot at work (hundreds of instances) with success and,
unfortunately, some pain these days.
TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
infrastructures brought us serious trouble and instability,
and we are a bit desperate/frustrated.
A bit of context:
This is a very complex platform serving an IPTV service with some traffic
(8k req/s at peak, even more when it works well).
It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
instances for stage 1, 2 for stage 2, so 8 in total) and a lot of
different backends (PHP applications, Node.js apps, remote backends
*sigh*, and even piped ones). This is a big historical spaghetti app.
We plan to rebuild it from scratch in 2018.
The first-stage Varnish instances are split into two pools handling
different client topologies.
A lot of the logic is in Varnish/VCL itself: lots of URL rewriting, lots
of header manipulation, backend selection, and even ESI processing...
The VCL of the stage 1 Varnish instances is almost 3000 lines long.
But for now we have to live/deal with it.
History of the problem:
In the beginning, all Varnish instances were on version 2.x. Things
worked reasonably well.
This summer we needed to upgrade Varnish to handle very long headers
(a product requirement).
So after a short battle porting our VCL to VCL 4.0, we started using
Varnish 4.
Shortly after, things began to go very badly.
The first issue we hit was memory exhaustion on both stages.
We tested a lot of things, and in the battle we upgraded to Varnish 5.
We fixed it by resizing the pool and now using the file backend (from
memory).
Memory is now stable (we have a large pool, 32G, and, strangely, we
never see objects being nuked, which is good or bad, it depends).
We have also fixed a lot of things in our VCL.
The problem we are fighting now is only on the stage 1 Varnish
instances, and specifically on one pool (the busiest one).
When everything goes well, the average CPU usage is 30%, memory
stabilizes around 12G, and the cache hit ratio is around 0.85.
The problem happens randomly (not every day) but during our peaks. The
CPU usage quickly increases to 350% (4 cores) and load > 3/
When the problem occurs, Varnish still delivers requests (we didn't see
dropped or rejected connections), but our application begins to lose
users, including a big chunk of business. I suspect this is because
timeouts are very aggressive on the client side and Varnish is probably
answering slowly.
- First question: how can we see the response time of requests on the
Varnish server? (Something with varnishncsa?)
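
To make the first question concrete, here is a minimal sketch of the
kind of measurement meant here. It assumes varnishncsa is run with a
custom format such as -F '%D %s %U' (%D being the time taken to serve
the request, in microseconds) and its output is saved to a file. The
script and its function names (parse_line, percentile, summarize) are
hypothetical helpers for illustration, not something we run today.

```python
import sys

def parse_line(line):
    """Return the service time (microseconds) from a '%D %s %U' line, or None."""
    fields = line.split()
    if not fields:
        return None
    try:
        return int(fields[0])
    except ValueError:
        # varnishncsa prints '-' when a value is unavailable; skip such lines
        return None

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]

def summarize(lines):
    """Map percentile -> service time in microseconds for the given log lines."""
    times = sorted(t for t in map(parse_line, lines) if t is not None)
    if not times:
        return {}
    return {p: percentile(times, p) for p in (50, 95, 99)}

if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed usage: python3 resp_times.py sample.log
    with open(sys.argv[1]) as handle:
        for pct, usec in summarize(handle).items():
            print("p%d: %.1f ms" % (pct, usec / 1000.0))
```

Comparing the percentiles of a sample taken during a quiet period with
one taken during an incident should show whether Varnish itself is
answering slowly.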
I also suspect some kind of request queuing; stracing Varnish when it
happens also shows a lot of futex waits?!
The frustrating part is that restarting Varnish fixes the problem
immediately, and the CPU usage remains normal afterwards, even if the
traffic peak is not finished.
So there is clearly something stacking up in Varnish which causes our
problem.
- Second question: how can we see the number of stacked connections,
long connections and so on?
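
As a sketch of what could be looked at for the second question: the
output of varnishstat -1 includes counters such as MAIN.sess_queued,
MAIN.thread_queue_len, MAIN.threads_limited and MAIN.busy_sleep. The
script below, whose function names and choice of watched counters are
assumptions made for illustration, compares two saved snapshots of
that output to show which of these counters moved in between.

```python
import sys

# Counters assumed to be relevant to queuing; adjust to the Varnish version.
WATCHED = (
    "MAIN.threads",
    "MAIN.threads_limited",
    "MAIN.thread_queue_len",
    "MAIN.sess_queued",
    "MAIN.sess_dropped",
    "MAIN.busy_sleep",
)

def parse_varnishstat(lines):
    """Parse 'varnishstat -1' text output into a {counter_name: value} dict."""
    counters = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            continue  # not a 'name value ...' line
    return counters

def delta(before, after, names=WATCHED):
    """Difference after - before for every watched counter in both snapshots."""
    return {n: after[n] - before[n] for n in names if n in before and n in after}

if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: python3 stat_delta.py before.txt after.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        changes = delta(parse_varnishstat(f1), parse_varnishstat(f2))
    for name, diff in sorted(changes.items()):
        print("%-24s %+d" % (name, diff))
```

A growing MAIN.sess_queued or a non-zero MAIN.thread_queue_len between
two snapshots would point at sessions waiting for a worker thread.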
At this stage we welcome any kind of help / hints for debugging (and
given the business impact, we can consider the help of a professional).
PS: I always have the option to scale out by spinning up a lot of new
Varnish instances, but this seems very frustrating...