<div dir="auto">Hi,<div dir="auto"><br></div><div dir="auto">Let's look at the usual suspects first, can we get the output of "ps aux |grep varnish" and a pastebin of "varnishncsa -1"?</div><div dir="auto"><br></div><div dir="auto">Are you using any vmod?</div><div dir="auto"><br></div><div dir="auto">man varnishncsa will help craft a format line with the response time (on mobile now, I don't have access to it)</div><div dir="auto"><br></div><div dir="auto">Cheers,<br><br><div data-smartmail="gmail_signature" dir="auto">-- <br>Guillaume Quintard </div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Nov 14, 2017 23:25, "Raphael Mazelier" <<a href="mailto:raph@futomaki.net">raph@futomaki.net</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello list,<br>
<br>
First of all, despite the subject of this mail, I really appreciate Varnish.<br>
We use it a lot at work (hundreds of instances) with success, and unfortunately with some pain these days.<br>
<br>
TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our infrastructures brought us serious trouble and instability on this platform.<br>
And we are a bit desperate/frustrated.<br>
<br>
<br>
Long story.<br>
<br>
A bit of context :<br>
<br>
This is a very complex platform serving an IPTV service with some traffic (8k req/s at peak, even more when it works well).<br>
It is composed of a two-stage reverse proxy cache (3 x 2 Varnish for stage 1, 2 Varnish for stage 2, so 8 in total) and a lot of different backends (PHP applications, Node.js apps, remote backends *sigh*, and even a piped one). This is a big historical spaghetti app. We plan to rebuild it from scratch in 2018.<br>
The first-stage Varnish instances are split into two pools handling different topologies of clients.<br>
<br>
A lot of the logic is in Varnish/VCL itself: lots of URL rewrites, lots of header manipulation, backend selection, and even ESI processing...<br>
The VCL of the stage 1 Varnish instances is almost 3000 lines long.<br>
<br>
But for now we have to live/deal with it.<br>
<br>
History of the problem :<br>
<br>
At the beginning all Varnish instances were on version 2.x. Things worked almost well.<br>
This summer we needed to upgrade Varnish to handle very long headers (a product requirement).<br>
So after a short battle porting our VCL to VCL 4.0 we started using Varnish 4.<br>
Shortly after, things began to go very badly.<br>
<br>
The first issue we hit was memory exhaustion on both stages, and the oom-killer...<br>
We tested a lot of things, and in the battle we upgraded to Varnish 5.<br>
We fixed it by resizing the pool and switching to the file backend (we used memory before).<br>
Memory is now stable (we have a large pool, 32G, and strangely we never see objects being nuked, which may be good or bad).<br>
We have also fixed a lot of things in our VCL.<br>
<br>
The problem we are fighting now occurs only on the stage 1 Varnish instances, and specifically on one pool (the busiest one).<br>
When everything goes well the average CPU usage is 30%, memory stabilizes around 12G, and the cache hit ratio is around 0.85.<br>
The problem happens randomly (not every day) but during our peaks. The CPU usage increases quickly to reach 350% (4 cores) and load > 3/<br>
While the problem is occurring Varnish still delivers requests (we didn't see dropped or rejected connections), but our application begins to lose users, which costs us a lot of business. I suspect this is because timeouts are very aggressive on the client side and Varnish is probably answering slowly.<br>
<br>
-first question: how can we see the response time of requests on the Varnish server? (varnishncsa something?)<br>
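<br>
A minimal sketch for this, assuming the log is produced with something like varnishncsa -F '%m %U %s %D', where %D is the time taken to serve the request in microseconds according to the varnishncsa man page (to be checked against the installed version). The function names are illustrative only; the script reads such lines on stdin and summarizes the last field.<br>

```python
import statistics
import sys


def parse_durations_us(lines):
    """Collect the last whitespace-separated field of each line as an int.

    Assumes the log format ends with %D (microseconds); lines whose last
    field is not an integer are skipped.
    """
    durations = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            durations.append(int(fields[-1]))
        except ValueError:
            continue
    return durations


def summarize_ms(durations_us):
    """Return median, 95th percentile (nearest rank) and max, in milliseconds."""
    ordered = sorted(durations_us)
    if not ordered:
        return None
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integers.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median_ms": statistics.median(ordered) / 1000,
        "p95_ms": ordered[rank - 1] / 1000,
        "max_ms": ordered[-1] / 1000,
    }


if __name__ == "__main__":
    print(summarize_ms(parse_durations_us(sys.stdin)))
```

Comparing the summary from a quiet period with one taken during a CPU spike would show whether responses really slow down enough to hit the client-side timeouts.<br>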
<br>
I also suspect some kind of request queuing; stracing Varnish when it happens also shows a lot of futex waits?!<br>
The frustrating part is that restarting Varnish fixes the problem immediately, and the CPU usage remains normal afterwards, even if the traffic peak is not over.<br>
So there is clearly something stuck in Varnish which causes our problem.<br>
<br>
-second question: how can we see the number of stuck connections, long-lived connections and so on?<br>
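<br>
One possible way to look at queuing, sketched under assumptions: the script below parses the text printed by varnishstat -1 (one counter per line: name, value, rate, description) and keeps a few thread and session counters. The counter names in QUEUE_COUNTERS are the ones I believe exist in Varnish 4 and 5; they should be checked against the varnish-counters man page of the installed version, and the function names are illustrative only.<br>

```python
import sys

# Counter names assumed from Varnish 4/5; verify with `man varnish-counters`.
QUEUE_COUNTERS = (
    "MAIN.threads",
    "MAIN.threads_limited",
    "MAIN.thread_queue_len",
    "MAIN.sess_queued",
    "MAIN.sess_dropped",
)


def parse_varnishstat(text):
    """Map counter name to integer value from `varnishstat -1` style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect at least "NAME VALUE"; ignore anything that does not match.
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def queue_report(text):
    """Keep only the queue-related counters that are present in the input."""
    counters = parse_varnishstat(text)
    return {name: counters[name] for name in QUEUE_COUNTERS if name in counters}


if __name__ == "__main__":
    for name, value in queue_report(sys.stdin.read()).items():
        print(name, value)
```

Running it periodically, for example by piping varnishstat -1 into it every few seconds during a peak, would show whether the queue length and the number of queued sessions grow when the CPU usage climbs.<br>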
<br>
At this stage we welcome any kind of help / hints for debugging (and given the business impact we can consider paying for professional support).<br>
<br>
PS: I always have the option to scale out by spinning up a lot of new Varnish instances, but this seems very frustrating...<br>
<br>
Best,<br>
<br>
--<br>
Raphael Mazelier<br>
<br>
<br>
______________________________<wbr>_________________<br>
varnish-misc mailing list<br>
<a href="mailto:varnish-misc@varnish-cache.org" target="_blank">varnish-misc@varnish-cache.org</a><br>
<a href="https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc" rel="noreferrer" target="_blank">https://www.varnish-cache.org/<wbr>lists/mailman/listinfo/varnish<wbr>-misc</a><br>
</blockquote></div></div>