Varnish nightmare after upgrading: need help

Tue Nov 14 22:24:48 UTC 2017

Hello list,

First of all, despite my mail subject, I really appreciate Varnish.
We use it a lot at work (hundreds of instances) with success and,
unfortunately, some pain these days.

TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
infrastructures brought us serious trouble and instability on this
platform.
And we are a bit desperate/frustrated.

Long story.

A bit of context:

This is a very complex platform serving an IPTV service with some
traffic (8k req/s at peak, even more when it works well).
It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
instances for stage 1, 2 for stage 2, so 8 in total) and a lot of
different backends (PHP applications, Node.js apps, remote backends
*sigh*, and even a piped one). This is a big historical spaghetti app.
We plan to rebuild it from scratch in 2018.
The first-stage Varnish instances are separated into two pools handling
different topologies of clients.

A lot of the logic is in Varnish/VCL itself: lots of URL rewrites, lots
of header manipulation, backend selection, and even ESI processing...
The VCL of the stage 1 Varnish instances is almost 3000 lines long.

But for now we have to live/deal with it.

History of the problem:

At the beginning all Varnish instances were on version 2.x. Things
worked almost well.
This summer we needed to upgrade Varnish to handle very long headers
(a product requirement).
So after a short battle porting our VCL to VCL 4.0 we started using
Varnish 4.
Shortly after, things began to go very badly.

The first issue we hit was memory exhaustion on both stages, and the
oom-killer...
We tested a lot of things, and in the battle we upgraded to Varnish 5.
We fixed it by resizing the pool and switching to the file backend
(instead of memory, as before).
Memory is now stable (we have a large pool, 32G, and, strangely, we
never see objects being nuked, which may be good or bad, it depends).
We have also fixed a lot of things in our VCL.

The problem we are fighting now occurs only on the stage 1 Varnish
instances, and specifically on one pool (the busiest one).
When everything goes well the average CPU usage is 30%, memory
stabilizes around 12G, and the cache hit ratio is around 0.85.
The problem happens randomly (not every day) but always during our
peaks. The CPU usage quickly climbs to 350% (4 cores) and the load
goes above 3.
When the problem is present Varnish still delivers requests (we didn't
see dropped or rejected connections) but our application begins to lose
users, and with them a big share of business. I suspect this is because
timeouts are very aggressive on the client side and Varnish must be
answering slowly.

- First question: how can I see the response time of requests on the
Varnish server? (varnishncsa something?)
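
To make the first question concrete, here is a rough sketch of what I
have in mind. It assumes varnishncsa is run with a custom format such as
-F '%D %s %U' (as far as I understand, %D is the time taken to serve the
request, in microseconds) and its output saved to a file. The function
names, the 1 second "slow" threshold and the file argument are only
illustrative assumptions, not anything we run today.

```python
import math
import sys


def parse_line(line):
    """Parse '<microseconds> <status> <url>'; return (seconds, status, url) or None."""
    parts = line.split(None, 2)
    if len(parts) < 3 or not parts[0].isdigit():
        return None  # skip lines that do not match the assumed format
    return int(parts[0]) / 1e6, parts[1], parts[2].strip()


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def summarize(lines, slow_threshold=1.0):
    """Return count, p50, p95, max and number of slow requests (in seconds)."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    times = sorted(t for t, _, _ in parsed)
    if not times:
        return None
    return {
        "count": len(times),
        "p50": percentile(times, 50),
        "p95": percentile(times, 95),
        "max": times[-1],
        "slow": sum(1 for t in times if t >= slow_threshold),
    }


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        print(summarize(handle))
```

Comparing the p95 and the "slow" count between a normal period and a bad
peak should at least tell us whether Varnish itself is answering slowly.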

I also suspect some kind of request queuing; stracing Varnish when it
happens shows a lot of futex waits?!
The frustrating part is that restarting Varnish fixes the problem
immediately, and the CPU stays normal afterwards, even if the traffic
peak is not finished.
So there is clearly something piling up in Varnish that causes our
problem.

- Second question: how can I see the number of queued connections,
long-lived connections and so on?
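
For the second question, my guess is that the thread and session
counters in `varnishstat -1` are the place to look. The sketch below
assumes each output line starts with the counter name followed by its
value, and that MAIN.thread_queue_len, MAIN.sess_queued,
MAIN.sess_dropped, MAIN.threads and MAIN.threads_limited are the
relevant counters. Both the list and the "--run" switch are my own
assumptions for illustration.

```python
import subprocess
import sys

# Counters I assume are relevant to queuing; adjust as needed.
WATCHED = (
    "MAIN.thread_queue_len",
    "MAIN.sess_queued",
    "MAIN.sess_dropped",
    "MAIN.threads",
    "MAIN.threads_limited",
)


def parse_varnishstat(text):
    """Map counter name -> integer value from 'varnishstat -1' style output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def queue_report(text, watched=WATCHED):
    """Return the watched counters; missing ones are reported as None."""
    counters = parse_varnishstat(text)
    return {name: counters.get(name) for name in watched}


# Only calls varnishstat when explicitly asked to with "--run".
if __name__ == "__main__" and "--run" in sys.argv[1:]:
    output = subprocess.run(
        ["varnishstat", "-1"], capture_output=True, text=True, check=True
    ).stdout
    for name, value in queue_report(output).items():
        print(name, value)
```

Running this every few seconds during a peak and watching whether the
queue-related values grow would confirm or rule out the queuing theory.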

At this stage we welcome any kind of help / hints for debugging (and
given the business impact we can consider paying for professional
support).

PS: I always have the option to scale out by spinning up a lot of new
Varnish instances, but this seems very frustrating...

Best,

--
Raphael Mazelier