Varnish nightmare after upgrading: need help

Thu Nov 16 08:09:31 UTC 2017

I think we just replicate the NCSA default format line.

-- 
Guillaume Quintard

On Nov 15, 2017 23:52, "Raphael Mazelier" <raph at futomaki.net> wrote:

> Hi,
>
> Of course, the evening was quite quiet and I have no spurious output to
> show (Schrödinger effect).
>
> Anyway, here is the pastebin of tonight's busiest period:
> https://pastebin.com/536LM9Nx
>
> We use the std and directors vmods.
>
> Btw: I found the correct format for varnishncsa (varnishncsa -F '%h %r
> %s %{Varnish:handling}x %{Varnish:side}x %T %D' does the job).
> Side question: why not include hit/miss in the default output?
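
For reading the output of that format line, here is a minimal sketch, not a definitive tool. It assumes each line follows exactly the format quoted above, that %D is the time taken in microseconds (as described in man varnishncsa), and that %r may contain spaces. The function names and the 500 ms threshold are illustrative assumptions, not anything from the thread.

```python
import statistics
from collections import defaultdict


def parse_line(line):
    """Parse one line written with the format
    '%h %r %s %{Varnish:handling}x %{Varnish:side}x %T %D'.

    %r (the request line) contains spaces, so the five fixed trailing
    fields are peeled off from the right-hand side.
    Returns None for lines that do not fit.
    """
    parts = line.split()
    if len(parts) < 7:
        return None  # truncated or unexpected line
    status, handling, side, _seconds, micros = parts[-5:]
    try:
        micros = int(float(micros))
    except ValueError:
        return None  # e.g. "-" when the field is unavailable
    return {
        "host": parts[0],
        "request": " ".join(parts[1:-5]),
        "status": status,
        "handling": handling,  # hit, miss, pass, ...
        "side": side,          # c (client) or b (backend)
        "micros": micros,
    }


def summarize(lines, slow_ms=500.0):
    """Group response times (in ms) by handling and count slow requests."""
    by_handling = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            by_handling[rec["handling"]].append(rec["micros"] / 1000.0)
    return {
        handling: {
            "count": len(times),
            "median_ms": statistics.median(times),
            "max_ms": max(times),
            "slow": sum(1 for t in times if t >= slow_ms),
        }
        for handling, times in by_handling.items()
    }
```

Calling summarize() on the lines piped from the varnishncsa command (for example summarize(sys.stdin)) during a peak would show whether hits, misses, or passes are the ones getting slow.
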
>
>
> Thanks for the help.
>
> Best,
>
> --
> Raphael Mazelier
>
> On 14/11/2017 23:41, Guillaume Quintard wrote:
>
> Hi,
>
> Let's look at the usual suspects first, can we get the output of "ps aux
> |grep varnish" and a pastebin of "varnishncsa -1"?
>
> Are you using any vmod?
>
> man varnishncsa will help craft a format line with the response time (on
> mobile now, I don't have access to it)
>
> Cheers,
>
> --
> Guillaume Quintard
>
> On Nov 14, 2017 23:25, "Raphael Mazelier" <raph at futomaki.net> wrote:
>
>> Hello list,
>>
>> First of all, despite my mail subject, I really appreciate Varnish.
>> We use it a lot at work (hundreds of instances) with success and,
>> unfortunately, some pain lately.
>>
>> TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
>> infrastructures brought us serious trouble and instability on this
>> platform.
>> And we are a bit desperate/frustrated.
>>
>>
>> Long story.
>>
>> A bit of context:
>>
>> This is a very complex platform serving an IPTV service with some
>> traffic (8k req/s at peak, even more when it works well).
>> It is composed of a two-stage reverse proxy cache (3 x 2 Varnish for
>> stage 1 and 2 Varnish for stage 2, so 8 in total) and a lot of
>> different backends (PHP applications, Node.js apps, remote backends
>> *sigh*, and even piped ones).
>> This is a big historical spaghetti app. We plan to rebuild it from
>> scratch in 2018.
>> The first-stage Varnish instances are split into two pools handling
>> different topologies of clients.
>>
>> A lot of the logic is in Varnish/VCL itself: lots of URL rewriting,
>> lots of header manipulation, backend selection, and even ESI
>> processing...
>> The VCL of the stage 1 Varnish instances is almost 3000 lines long.
>>
>> But for now we have to live with it.
>>
>> History of the problem:
>>
>> At the beginning all Varnish instances were on version 2.x. Things
>> worked almost well.
>> This summer we needed to upgrade Varnish to handle very long headers
>> (a product requirement).
>> So after a short battle porting our VCL to VCL 4.0 we started using
>> Varnish 4.
>> Shortly after, things began to go very badly.
>>
>> The first issue we hit was memory exhaustion on both stages, and the
>> oom-killer...
>> We tested a lot of things, and in the battle we upgraded to Varnish 5.
>> We fixed it by resizing the pool and by now using the file backend
>> (memory before).
>> Memory is now stable (we have a large pool, 32G, and strangely we
>> never see objects being nuked, which may be good or bad).
>> We have also fixed a lot of things in our VCL.
>>
>> The problem we are fighting now is only on the stage 1 Varnish, and
>> specifically on one pool (the busiest one).
>> When everything goes well the average CPU usage is 30%, memory
>> stabilizes around 12G, and the cache hit ratio is around 0.85.
>> The problem happens randomly (not every day) but during our peaks. The
>> CPU usage quickly rises to 350% (4 cores) and the load goes above 3.
>> When the problem occurs Varnish still delivers requests (we didn't see
>> dropped or rejected connections) but our application begins to lose
>> users, and with them a lot of business. I suspect this is because
>> timeouts are very aggressive on the client side and Varnish must be
>> answering slowly.
>>
>> - First question: how can we see the response time of requests on the
>> Varnish server? (varnishncsa something?)
>>
>> I also suspect some kind of request queuing; stracing Varnish when
>> it happens shows a lot of futex waits?!
>> The frustrating part is that restarting Varnish fixes the problem
>> immediately, and the CPU remains normal afterwards, even if the
>> traffic peak is not over.
>> So there is clearly something piling up in Varnish that causes our
>> problem.
>>
>> - Second question: how can we see the number of queued connections,
>> long-lived connections, and so on?
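
One way to watch for queuing is to poll the thread and session counters that varnishstat exposes. The sketch below is only an illustration under a few assumptions: that "varnishstat -j" is available on the host, that the counters listed exist in the installed version (they are documented in varnish-counters for 4.x and 5.x), and that the JSON is either the flat layout of those versions or nested under a "counters" key as in later releases. The helper names are hypothetical.

```python
import json
import subprocess

# Counters related to worker threads and queued sessions.
QUEUE_COUNTERS = (
    "MAIN.threads",
    "MAIN.threads_limited",
    "MAIN.thread_queue_len",
    "MAIN.sess_queued",
    "MAIN.sess_dropped",
)


def extract_counters(stats_json, names=QUEUE_COUNTERS):
    """Return {counter name: value} for the requested counters.

    Accepts the flat layout (counter names at the top level) and,
    as an assumption about later releases, a nested "counters" object.
    Counters missing from the input are simply skipped.
    """
    data = json.loads(stats_json)
    counters = data.get("counters", data)
    result = {}
    for name in names:
        entry = counters.get(name)
        if isinstance(entry, dict) and "value" in entry:
            result[name] = entry["value"]
    return result


def read_varnishstat():
    """Run 'varnishstat -j' once and extract the queue-related counters."""
    completed = subprocess.run(
        ["varnishstat", "-j"],
        check=True,
        capture_output=True,
        text=True,
    )
    return extract_counters(completed.stdout)
```

Calling read_varnishstat() every few seconds and comparing MAIN.sess_queued and MAIN.thread_queue_len between a normal period and a CPU spike would show whether requests are waiting for worker threads.
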
>>
>> At this stage we welcome any kind of help / hints for debugging (and
>> given the business impact we can consider the help of professional
>> support).
>>
>> PS: I always have the option to scale out by adding a lot of new
>> Varnish instances, but that seems very frustrating...
>>
>> Best,
>>
>> --
>> Raphael Mazelier
>>
>>
>> _______________________________________________
>> varnish-misc mailing list
>> varnish-misc at varnish-cache.org
>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>
>
>