Varnish nightmare after upgrading : need help
Raphael Mazelier
raph at futomaki.net
Wed Nov 15 22:52:50 UTC 2017
Hi,
Of course the evening was quiet and I have no spurious output to
show (Schrödinger effect).
Anyway, here is the pastebin of the busiest period tonight:
https://pastebin.com/536LM9Nx
We use the std and directors vmods.
Btw: I found the correct format for varnishncsa. This does the job:
varnishncsa -F '%h %r %s %{Varnish:handling}x %{Varnish:side}x %T %D'
Side question: why not include hit/miss in the default output?
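
To turn the output of that format into response-time numbers, something
like the following Python could be used. It is only a minimal sketch: it
assumes %r is the only field containing spaces, that %D is the time taken
in microseconds, and the function names (parse_line, nearest_rank,
summarize) are made up for illustration.

```python
from collections import defaultdict


def parse_line(line):
    """Parse one line of the custom format; return None if it does not fit."""
    # %r (the request line) contains spaces, so split from the right:
    # the last five fields are status, handling, side, %T and %D.
    parts = line.strip().rsplit(None, 5)
    if len(parts) != 6:
        return None
    head, status, handling, side, _seconds, micros = parts
    # What remains on the left is "%h %r": host first, then the request line.
    host, _, request = head.partition(" ")
    try:
        elapsed_us = int(micros)
    except ValueError:
        return None
    return {
        "host": host,
        "request": request,
        "status": status,
        "handling": handling,
        "side": side,
        "elapsed_us": elapsed_us,
    }


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def summarize(lines):
    """Group %D by Varnish:handling and report count, p50, p99 and max."""
    by_handling = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is not None:  # silently skip lines that do not match
            by_handling[entry["handling"]].append(entry["elapsed_us"])
    summary = {}
    for handling, values in by_handling.items():
        values.sort()
        summary[handling] = {
            "count": len(values),
            "p50_us": nearest_rank(values, 50),
            "p99_us": nearest_rank(values, 99),
            "max_us": values[-1],
        }
    return summary
```

With the varnishncsa output saved to a file (say ncsa.log, a hypothetical
name), summarize(open("ncsa.log")) would return one entry per handling
value, which makes it easier to see whether hits or misses slow down
during a peak.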
Thanks for the help.
Best,
--
Raphael Mazelier
On 14/11/2017 23:41, Guillaume Quintard wrote:
> Hi,
>
> Let's look at the usual suspects first, can we get the output of "ps
> aux |grep varnish" and a pastebin of "varnishncsa -1"?
>
> Are you using any vmod?
>
> man varnishncsa will help craft a format line with the response time
> (on mobile now, I don't have access to it)
>
> Cheers,
>
> --
> Guillaume Quintard
>
> On Nov 14, 2017 23:25, "Raphael Mazelier" <raph at futomaki.net> wrote:
>
> Hello list,
>
> First of all, despite my mail subject, I really appreciate Varnish.
> We use it a lot at work (hundreds of instances) with success and,
> unfortunately, some pain these days.
>
> TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
> infrastructures brought us some serious trouble and instability on
> this platform.
> And we are a bit desperate/frustrated.
>
>
> Long story.
>
> A bit of context :
>
> This is a very complex platform serving an IPTV service with some
> traffic (8k req/s at peak, even more when it works well).
> It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
> instances for stage 1, 2 for stage 2, so 8 in total) and a lot of
> different backends (PHP applications, Node.js apps, remote backends
> *sigh*, and even a piped one). This is a big historical spaghetti app.
> We plan to rebuild it from scratch in 2018.
> The stage 1 Varnish instances are split into two pools handling
> different topologies of clients.
>
> A lot of the logic is in Varnish/VCL itself: lots of URL rewriting,
> lots of header manipulation, backend selection, and even ESI
> processing...
> The VCL of the stage 1 Varnish instances is almost 3000 lines long.
>
> But for now we have to live/deal with it.
>
> History of the problem :
>
> At the beginning all Varnish instances were on version 2.x. Things
> worked almost well.
> This summer we needed to upgrade Varnish to handle very long
> headers (a product requirement).
> So after a short battle porting our VCL to VCL 4.0, we started using
> Varnish 4.
> Shortly after, things began to go very badly.
>
> The first issue we hit was memory exhaustion on both stages, and
> the oom-killer...
> We tested a lot of things, and in the battle we upgraded to Varnish 5.
> We fixed it by resizing the pool and by now using the file storage
> backend (memory before).
> Memory is now stable (we have a large pool, 32G, and strangely
> we never see objects being nuked, which is good or bad, it depends).
> We have also fixed a lot of things in our VCL.
>
> The problem we are fighting now is only on the stage 1 Varnish,
> and specifically on one pool (the busiest one).
> When everything goes well the average CPU usage is 30%, memory
> stabilizes around 12G, and the cache hit ratio is around 0.85.
> The problem happens randomly (not every day) but during our peaks. The
> CPU usage quickly climbs to 350% (4 cores) and load > 3.
> While the problem is occurring, Varnish still delivers requests (we
> did not see dropped or rejected connections), but our application
> begins to lose users, which costs us a lot of business. I suspect this
> is because timeouts are very aggressive on the client side and Varnish
> must be answering slowly.
>
> - First question: how can we see the response time of requests on
> the Varnish server? (varnishncsa something?)
>
> I also suspect some kind of request queuing; stracing Varnish
> when it happens also shows a lot of futex waits?!
> The frustrating part is that restarting Varnish fixes the problem
> immediately, and the CPU usage remains normal afterwards, even if the
> traffic peak is not finished.
> So there is clearly something stacking up in Varnish which causes our
> problem.
>
> - Second question: how can we see the number of stacked-up
> connections, long-lived connections and so on?
>
> At this stage we welcome any kind of help or hints for debugging (and
> given the business impact, we can consider getting professional
> support).
>
> PS: I always have the option to scale out by popping a lot of new
> Varnish instances, but that seems very frustrating...
>
> Best,
>
> --
> Raphael Mazelier