<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<tt>Hi,<br>
<br>
Of course the evening was fairly quiet, so I have no spurious
output to show (Schrödinger effect).<br>
<br>
Anyway, here is the pastebin of the busiest period of the night:
<a class="moz-txt-link-freetext" href="https://pastebin.com/536LM9Nx">https://pastebin.com/536LM9Nx</a>.<br>
<br>
We use the std and directors vmods.<br>
<br>
Btw: I found the correct format for varnishncsa (varnishncsa -F
'%h %r %s %{Varnish:handling}x %{Varnish:side}x %T %D' does the
job).<br>
Side question: why not include hit/miss in the default output?<br>
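<br>
A minimal sketch of how that output could be filtered for slow
requests, assuming %D (time taken to serve the request, in
microseconds) stays the last field of the format above. The function
name slow_requests and the 500000 default threshold are made up for
illustration:<br>
<pre>
```shell
# slow_requests [MIN_US]: read log lines on stdin and print only those
# whose last space-separated field (assumed to be %D, in microseconds)
# is at least MIN_US. The name and default threshold are hypothetical.
slow_requests() {
    min_us="${1:-500000}"
    while IFS= read -r line; do
        # Keep everything after the last space: the %D value.
        duration_us="${line##* }"
        # Skip lines whose last field is not a plain integer.
        case "$duration_us" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$duration_us" -ge "$min_us" ]; then
            printf '%s\n' "$line"
        fi
    done
}
```
</pre>
It would be used as: varnishncsa -F '%h %r %s %{Varnish:handling}x
%{Varnish:side}x %T %D' | slow_requests 500000<br>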
<br>
<br>
Thanks for the help.<br>
<br>
Best,<br>
<br>
--<br>
Raphael Mazelier<br>
</tt><br>
<div class="moz-cite-prefix">On 14/11/2017 23:41, Guillaume Quintard
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ6ZYQyQU4zvtcb7m5cBBoWnop24HLMh=jF+gzNj1XZZHJmTHw@mail.gmail.com">
<div dir="auto">Hi,
<div dir="auto"><br>
</div>
<div dir="auto">Let's look at the usual suspects first, can we
get the output of "ps aux |grep varnish" and a pastebin of
"varnishncsa -1"?</div>
<div dir="auto"><br>
</div>
<div dir="auto">Are you using any vmod?</div>
<div dir="auto"><br>
</div>
<div dir="auto">man varnishncsa will help craft a format line
with the response time (on mobile now, I don't have access to
it)</div>
<div dir="auto"><br>
</div>
<div dir="auto">Cheers,<br>
<br>
<div data-smartmail="gmail_signature" dir="auto">-- <br>
Guillaume Quintard </div>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Nov 14, 2017 23:25, "Raphael
Mazelier" &lt;<a href="mailto:raph@futomaki.net"
moz-do-not-send="true">raph@futomaki.net</a>&gt; wrote:<br
type="attribution">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hello
list,<br>
<br>
First of all, despite my mail subject, I really appreciate
Varnish.<br>
We use it a lot at work (hundreds of instances) with success,
and unfortunately with some pain lately.<br>
<br>
TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of
our infrastructures brought us serious trouble and
instability on this platform.<br>
And we are a bit desperate/frustrated.<br>
<br>
<br>
Long story.<br>
<br>
A bit of context:<br>
<br>
This is a very complex platform serving an IPTV service with
some traffic (8k req/s at peak, even more when it works
well).<br>
It is composed of a two-stage reverse proxy cache (3 x 2
Varnish instances for stage 1 and 2 for stage 2, so 8 in total)
and a lot of different backends (PHP applications, Node.js
apps, remote backends *sigh*, and even a piped one). This is a big
historical spaghetti app. We plan to rebuild it from scratch
in 2018.<br>
The first-stage Varnish instances are split into two pools handling
different client topologies.<br>
<br>
A lot of the logic is in Varnish/VCL itself: many URL
rewrites, a lot of header manipulation, backend
selection, and even ESI processing...<br>
The VCL of the stage 1 Varnish instances is almost 3000 lines long.<br>
<br>
But for now we have to live with it.<br>
<br>
History of the problem:<br>
<br>
At the beginning all Varnish instances were on version 2.x. Things
worked reasonably well.<br>
This summer we needed to upgrade Varnish to handle
very long headers (a product requirement).<br>
So after a short battle porting our VCL to VCL 4.0 we started
using Varnish 4.<br>
Shortly after, things began to go very badly.<br>
<br>
The first issue we hit was memory exhaustion on both
stages, and the oom-killer...<br>
We tested a lot of things, and in the battle we upgraded to
Varnish 5.<br>
We fixed it by resizing the pool and now using the file storage backend
(memory before).<br>
Memory is now stable (we have a large pool, 32G, and strangely
we never see objects being nuked, which is good or bad
depending on how you look at it).<br>
We have also fixed a lot of things in our VCL.<br>
<br>
The problem we are fighting now occurs only on the stage 1
Varnish instances, and specifically on one pool (the busiest one).<br>
When everything goes well the average CPU usage is 30%,
memory stabilizes around 12G, and the cache hit ratio is around 0.85.<br>
The problem happens randomly (not every day) but during our peaks.
The CPU usage increases quickly to reach 350% (4 cores) and load >
3/<br>
When the problem occurs Varnish still delivers requests (we
didn't see dropped or rejected connections) but our
application begins to lose users, and with them a large amount of
business. I suspect this is because timeouts are very
aggressive on the client side and Varnish must be answering
slowly.<br>
<br>
- First question: how can we see the response time of requests served by the
Varnish server? (varnishncsa something?)<br>
<br>
I also suspect some kind of request queuing; stracing
Varnish when it happens shows a lot of futex waits?!<br>
The frustrating part is that restarting Varnish fixes the problem
immediately, and the CPU usage remains normal afterwards, even if the
traffic peak is not finished.<br>
So there is clearly something getting stuck in Varnish which causes
our problem.<br>
<br>
- Second question: how can we see the number of stuck connections,
long-lived connections and so on?<br>
<br>
At this stage we welcome any kind of help / hints for
debugging (and given the business impact we can consider
paying for professional support).<br>
<br>
PS: I always have the option to scale out by spinning up a lot of
new Varnish instances, but that seems very frustrating...<br>
<br>
Best,<br>
<br>
--<br>
Raphael Mazelier<br>
<br>
<br>
______________________________<wbr>_________________<br>
varnish-misc mailing list<br>
<a href="mailto:varnish-misc@varnish-cache.org"
target="_blank" moz-do-not-send="true">varnish-misc@varnish-cache.org</a><br>
<a
href="https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://www.varnish-cache.org/<wbr>lists/mailman/listinfo/varnish<wbr>-misc</a><br>
</blockquote>
</div>
</div>
</blockquote>
<br>
</body>
</html>