Varnish nightmare after upgrading: need help

Wed Nov 15 22:52:50 UTC 2017

Hi,

Of course the evening was fairly quiet, so I have no abnormal output to
show (Schrödinger effect).

Anyway, here is the pastebin of the busiest period tonight:
https://pastebin.com/536LM9Nx

We use the std and directors vmods.

Btw: I found a format for varnishncsa that does the job:

    varnishncsa -F '%h %r %s %{Varnish:handling}x %{Varnish:side}x %T %D'

Side question: why not include hit/miss in the default output?
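
To read that output in bulk rather than by eye, here is a minimal Python
sketch. It assumes every line follows exactly the format string above (so
the last five whitespace-separated fields are status, handling, side, %T
and %D, and everything between the client address and the status is the
request line). The names parse_line and summarize, and the 500 ms "slow"
threshold, are illustrative assumptions, not anything Varnish provides.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Entry:
    client: str      # %h
    request: str     # %r (contains spaces)
    status: str      # %s
    handling: str    # %{Varnish:handling}x, e.g. hit, miss, pass, pipe
    side: str        # %{Varnish:side}x, "c" (client) or "b" (backend)
    seconds: float   # %T
    micros: float    # %D


def parse_line(line: str) -> Entry:
    # %r contains spaces, so peel the five fixed fields off the right first.
    head, status, handling, side, secs, micros = line.strip().rsplit(None, 5)
    client, request = head.split(None, 1)
    return Entry(client, request, status, handling, side,
                 float(secs), float(micros))


def summarize(lines: Iterable[str], slow_ms: float = 500.0) -> dict:
    entries = [parse_line(l) for l in lines if l.strip()]
    # Only client-side transactions reflect what end users experienced.
    client = [e for e in entries if e.side == "c"]
    # The hit ratio is computed over cache lookups only (hit or miss).
    lookups = [e for e in client if e.handling in ("hit", "miss")]
    hits = sum(1 for e in lookups if e.handling == "hit")
    durations = sorted(e.micros / 1000.0 for e in client)  # milliseconds
    # Nearest-rank 95th percentile of the response time.
    p95 = durations[math.ceil(0.95 * len(durations)) - 1] if durations else None
    return {
        "requests": len(client),
        "hit_ratio": hits / len(lookups) if lookups else None,
        "p95_ms": p95,
        "slow": sum(1 for d in durations if d >= slow_ms),
    }
```

Calling summarize() on the lines of a capture taken during a peak and on
one taken during a quiet period would show whether the response time really
degrades when the CPU climbs.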

Thanks for the help.

Best,

--
Raphael Mazelier

On 14/11/2017 23:41, Guillaume Quintard wrote:
> Hi,
>
> Let's look at the usual suspects first, can we get the output of "ps 
> aux |grep varnish" and a pastebin of "varnishncsa -1"?
>
> Are you using any vmod?
>
> man varnishncsa will help craft a format line with the response time 
> (on mobile now, I don't have access to it)
>
> Cheers,
>
> -- 
> Guillaume Quintard
>
> On Nov 14, 2017 23:25, "Raphael Mazelier" <raph at futomaki.net> wrote:
>
>     Hello list,
>
>     First of all, despite my mail subject, I really appreciate Varnish.
>     We use it a lot at work (hundreds of instances) with success, and
>     unfortunately with some pain lately.
>
>     TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
>     infrastructures brought us serious trouble and instability on
>     this platform.
>     And we are a bit desperate/frustrated.
>
>
>     Long story.
>
>     A bit of context:
>
>     This is a very complex platform serving an IPTV service with some
>     traffic (8k req/s at peak, even more when it works well).
>     It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
>     for stage 1, 2 Varnish for stage 2, so 8 in total) and a lot of
>     different backends (PHP applications, Node.js apps, remote backends
>     *sigh*, and even a piped one). This is a big historical spaghetti
>     app. We plan to rebuild it from scratch in 2018.
>     The stage 1 Varnish instances are split into two pools handling
>     different client topologies.
>
>     A lot of the logic is in Varnish/VCL itself: lots of URL rewriting,
>     lots of header manipulation, backend selection, and even ESI
>     processing...
>     The VCL of the stage 1 Varnish instances is almost 3000 lines long.
>
>     But for now we have to live/deal with it.
>
>     History of the problem:
>
>     At the beginning all Varnish instances were on version 2.x. Things
>     worked almost well.
>     This summer we needed to upgrade Varnish to handle very
>     long headers (a product requirement).
>     So after a short battle porting our VCL to VCL 4.0 we started using
>     Varnish 4.
>     Shortly after, things began to go very badly.
>
>     The first issue we hit was memory exhaustion on both stages, and
>     the oom-killer...
>     We tested a lot of things, and in the battle we upgraded to
>     Varnish 5.
>     We fixed it by resizing the pool and switching to the file storage
>     backend (from memory before).
>     Memory is now stable (we have a large pool, 32G, and strangely
>     we never see objects being nuked, which may be good or bad).
>     We have also fixed a lot of things in our VCL.
>
>     The problem we are fighting now is only on the stage 1 Varnish,
>     and specifically on one pool (the busiest one).
>     When everything goes well the average CPU usage is 30%, memory
>     stabilizes around 12G, and the cache hit ratio is around 0.85.
>     The problem happens randomly (not every day) but during our peaks.
>     The CPU quickly climbs to 350% (4 cores) and the load goes above 3.
>     When the problem occurs Varnish still delivers requests (we didn't
>     see dropped or rejected connections) but our application begins to
>     lose users, and with them a lot of business. I suspect this is
>     because timeouts are very aggressive on the client side and Varnish
>     must be answering slowly.
>
>     - First question: how can we see the response time of requests on
>     the Varnish server? (varnishncsa something?)
>
>     I also suspect some kind of request queuing; stracing Varnish
>     when it happens shows a lot of futex waits?!
>     The frustrating part is that restarting Varnish fixes the problem
>     immediately, and the CPU stays normal afterwards, even if the
>     traffic peak is not finished.
>     So there is clearly something piling up in Varnish which causes our
>     problem.
>
>     - Second question: how can we see the number of stacked (queued)
>     connections, long-lived connections and so on?
>
>     At this stage we welcome any kind of help / hints for debugging (and
>     given the business impact we can consider paying for
>     professional support).
>
>     PS: I always have the option to scale out by spinning up a lot of
>     new Varnish instances, but that seems very frustrating...
>
>     Best,
>
>     --
>     Raphael Mazelier
>
>
>
