Compared performance of Varnish Cache on x86_64 and aarch64

Martin Grigorov martin.grigorov at
Mon Aug 3 15:14:33 UTC 2020


Thank you all for the feedback!
After some debugging, it turned out to be a bug in wrk: most of the
requests' latencies were 0 in the raw reports.

I've looked for a better-maintained HTTP load testing tool and found one I
like: it provides (correct-looking) statistics, can measure latencies while
using a constant rate, and last but not least can produce plot charts!
I will update my article and let you know once I'm done!


On Fri, Jul 31, 2020 at 4:43 PM Pål Hermunn Johansen <
hermunn at> wrote:

> I am sorry for being so late to the game, but here it goes:
> On Wed, Jul 29, 2020 at 14:12, Poul-Henning Kamp <phk at> wrote:
> > Your measurement says that there is 2/3 chance that the latency
> > is between:
> >
> >         655.40µs - 798.70µs     = -143.30µs
> >
> > and
> >         655.40µs + 798.70µs     = 1454.10µs
> No, it does not. There is no claim anywhere that the numbers are
> following a normal distribution or an approximation of it. Of course,
> the calculations you do demonstrate that the data is far from normally
> distributed (as expected).
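
To make the point above concrete, here is a minimal sketch in Python. It
takes only the two numbers quoted above (average 655.40µs, standard deviation
798.70µs) and asks what a normal distribution with those parameters would
imply. The normal model is purely a hypothetical assumption for the sake of
the argument; nothing here says anything about the real shape of the data.

```python
from statistics import NormalDist

mean_us = 655.40    # average latency quoted above, in microseconds
stdev_us = 798.70   # standard deviation quoted above, in microseconds

# The "one sigma" interval from the quoted calculation.
lower = mean_us - stdev_us
upper = mean_us + stdev_us

# Hypothetical model: pretend the latencies were normally distributed
# and ask how much probability mass would fall below zero.
model = NormalDist(mu=mean_us, sigma=stdev_us)
p_negative = model.cdf(0.0)

print(f"one-sigma interval: {lower:.2f}µs .. {upper:.2f}µs")
print(f"share of latencies a normal model puts below zero: {p_negative:.1%}")
```

Under that assumption roughly a fifth of all requests would have a negative
latency, which is impossible, so the normal model (and any "2/3 chance"
reading of the standard deviation) does not fit these numbers.
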
> > You cannot conclude _anything_ from those numbers.
> There are two numbers, the average and the standard deviation, and
> they are calculated from the data, but the truth is hidden deeper in
> the data. By looking at the particular numbers, I agree completely
> that it is wrong to conclude that one is better than the other. I am
> not saying that the statements in the article are false, just that you
> do not have data to draw the conclusions.
> Furthermore I have to say that Geoff got things right (see below). As
> a mathematician, I have to say that statistics is hard, and trusting
> the output of wrk to draw conclusions is outright the wrong thing to
> do.
> In this case we have a luxury which you typically do not have: Data is
> essentially free. You can run many tests and you can run short or long
> tests with different parameters. A 30 second test is simply not enough
> for anything.
> As Geoff indicated, for each transaction you can extract many relevant
> values from varnishlog, with the status, hit/miss, time to first byte
> and time to last byte being the most obvious ones. They can be
> extracted and saved to a csv file by using varnishncsa with a custom
> format string, and you can use R (used it myself as a tool in my
> previous job - not a fan) to do statistical analysis on the data. The
> Student T suggestion from Geoff is a good idea, but just looking at
> one set of numbers without considering other factors is mathematically
> problematic.
> Anyway, some obvious questions then arise. For example:
> - How do the numbers between wrk and varnishlog/varnishncsa compare?
> Did wrk report a different total number of transactions than Varnish? If
> there is a discrepancy, then the errors might be because of some resource
> constraint (number of sockets or dropped SYN packets?).
> - How do the average and maximum compare between Varnish and wrk?
> - What is the CPU usage of the kernel, the benchmarking tool and the
> varnish processes in the tests?
> - What is the difference between the time to first byte and the time
> to last byte in Varnish for different object sizes?
> When Varnish writes to a socket, it hands bytes over to the kernel,
> and when the write call returns, we do not know how far the bytes have
> come, and how long it will take before they get to the final
> destination. The bytes may be in a kernel buffer, they might be on the
> network card, they might already have been received by the client's
> kernel, or they might have made it all the way into wrk (which may or may
> not have timestamped the response). Typically, depending on many things,
> Varnish will report faster times than wrk does, but since returning
> from the write call means that the calling thread must be rescheduled,
> it is even possible that wrk will see that some requests are faster
> than what Varnish reports. Running wrk2 with different speeds in a
> series of tests seems natural to me, so that you can observe when (and
> how) the system starts running into bottlenecks. Note that the
> bottleneck can just as well be in wrk2 itself or on the combined CPU
> usage of kernel + Varnish + wrk2.
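
One possible way to read such a series of constant-rate runs, sketched in
Python. The function name, the 95% tolerance and the numbers are all
assumptions made up for illustration, not measurements: the idea is simply to
find the first offered rate at which the achieved throughput falls noticeably
short of what was requested.

```python
def first_saturated_rate(runs, tolerance=0.95):
    """Return the lowest offered rate whose achieved rate is below
    tolerance * offered, or None if no run falls short.

    runs is an iterable of (offered_rps, achieved_rps) pairs.
    """
    for offered, achieved in sorted(runs):
        if achieved < tolerance * offered:
            return offered
    return None


# Placeholder (offered, achieved) requests/second pairs, NOT real results.
runs = [(1000, 1000), (2000, 1998), (4000, 3990), (8000, 6100), (16000, 6150)]
knee = first_saturated_rate(runs)
print(f"first offered rate that was not sustained: {knee}")
```

This only says where some bottleneck starts to show; as noted above, it does
not say whether the limit is in Varnish, the kernel or the load generator.
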
> To complicate things even further: On your ARM vs. x64 tests, my guess
> is that both kernel parameters and parameters for the network are
> different, and the distributions probably have good reason to choose
> different values. It is very likely that these differences affect the
> performance of the systems in many ways, and that different tests will
> have different "optimal" tunings of kernel and network parameters.
> Sorry for rambling, but getting the statistics wrong is so easy. The
> question is very interesting, but if you want to draw conclusions, you
> should do the analysis, and (ideally) give access to the raw data in
> case anyone wants to have a look.
> Best,
> Pål
> On Fri, Jul 31, 2020 at 08:45, Geoff Simmons <geoff at> wrote:
> >
> > On 7/28/20 13:52, Martin Grigorov wrote:
> > >
> > > I've just posted an article [1] about comparing the performance of
> > > Varnish Cache on two similar
> > > machines - the main difference is the CPU architecture - x86_64 vs
> > > aarch64.
> > > It uses a specific use case - the backend service just returns
> > > static content. The idea is
> > > to compare Varnish on the different architectures but also to compare
> > > Varnish against the backend HTTP server.
> > > What is interesting is that Varnish gives the same throughput as the
> > > backend server on x86_64 but on aarch64 it is around 30% slower than
> > > the backend.
> >
> > Does your test have an account of whether there were any errors in
> > backend fetches? Don't know if that explains anything, but with a
> > connect timeout of 10s and first byte timeout of 5m, any error would
> > have a considerable effect on the results of a 30 second test.
> >
> > The test tool output doesn't say anything I can see about error rates --
> > whether all responses had status 200, and if not, how many had which
> > other status. Ideally it should be all 200, otherwise the results may
> > not be valid.
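
A small sketch of the check described above, in Python. It assumes the
response status of every request has already been collected as a string (for
example from a log); the sample list is a placeholder, not test output.

```python
from collections import Counter


def status_summary(statuses):
    """Return (counts per status, share of responses that were not 200)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    non_200 = total - counts.get("200", 0)
    return counts, (non_200 / total if total else 0.0)


# Placeholder statuses, NOT taken from a real run.
sample = ["200", "200", "503", "200"]
counts, error_rate = status_summary(sample)
for status, n in sorted(counts.items()):
    print(f"{status}: {n}")
print(f"non-200 share: {error_rate:.1%}")
```

If the non-200 share is anything but zero, the latency and throughput numbers
of that run should be treated with suspicion.
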
> >
> > I agree with phk that a statistical analysis is needed for a robust
> > statement about differences between the two platforms. For that, you'd
> > need more than the summary stats shown in your blog post -- you need to
> > collect all of the response times. What I usually do is query Varnish
> > client request logs for Timestamp:Resp and save the number in the last
> > column.
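
The extraction step above could look roughly like this in Python. It assumes
log lines in the usual varnishlog text layout, where a Timestamp record
carries a label followed by three numbers (absolute time, time since the
start of the work unit, time since the previous timestamp). The sample lines
are made up for illustration.

```python
import re

# Matches a "Timestamp Resp:" record and captures its three numbers.
RESP_RE = re.compile(r"Timestamp\s+Resp:\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def resp_durations(lines):
    """Return the last column of every Timestamp:Resp record, in seconds."""
    durations = []
    for line in lines:
        match = RESP_RE.search(line)
        if match:
            durations.append(float(match.group(3)))
    return durations


# Illustrative log lines, NOT captured from a real Varnish instance.
sample_log = [
    "-   Timestamp      Start: 1596467673.000100 0.000000 0.000000",
    "-   Timestamp      Resp: 1596467673.000400 0.000300 0.000050",
    "-   ReqMethod      GET",
    "-   Timestamp      Resp: 1596467674.000900 0.000450 0.000070",
]
print(resp_durations(sample_log))
```

The resulting list is what would then be fed into the statistical test.
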
> >
> > t.test() in R runs Student's t-test (me R fanboi).
> >
> >

More information about the varnish-dev mailing list