Varnish performance musings

Devon H. O'Dell dho at fastly.com
Tue Apr 5 00:30:54 CEST 2016


On Mon, Apr 4, 2016 at 2:37 PM, Nils Goroll <slink at schokola.de> wrote:
> Hi Devon,
>
> thank you very much for the interesting writeup - despite the fact that I have
> so much unfinished Varnish work on my list already, I'd like to dump some
> thoughts in the hope that others may pick up on them:

Thanks for taking the time to reply!

>> All VRT functions operate on the same timestamps (for each VCL callback)
>
> This sounds perfectly reasonable, I think we should just do this.

I plan on sending a patch, but I need to learn more about VRT_CTX to
do it. As I mentioned previously, I'm still working on effectively a
2.1.4 codebase, and I'm not ultra-familiar with the architectural
changes, especially in the VCL/VRT area. If someone else wants to do
this quickly, that'd be fine by me. But I'd also be happy to learn a
bit more about current architecture and do it myself, if people are
happy to wait a little bit.

>>  1. TIM_real in our tree showed up in top functions of a synthetic,
>> all-hit workload.
>
> I once optimized a CLOCK_REALTIME bound app by caching real time and offsetting
> it with the TSC as long as the thread didn't migrate. This turned a syscall into
> a handful instructions for the fast path, but there's a portability question,
> and (unless one has constant_tsc) the inaccuracy due to speed stepping.

On modern x86 (anything with an invariant TSC), the TSC runs at a
fixed rate regardless of C-state and P-state transitions, so while
clock_gettime(3) might require constant_tsc to work as expected,

#include <stdint.h>

static inline uint64_t rdtscp(void) {
    uint32_t eax, edx;
    __asm__ __volatile__("rdtscp" : "=a" (eax), "=d" (edx) :: "%ecx", "memory");
    return (((uint64_t)edx << 32) | eax);
}

will always provide a value increasing at a constant rate. (This may
not be true in a VM, and of course it doesn't help with portability,
nor on machines where you need to serialize rdtsc with cpuid because
no rdtscp instruction is present. None of that applies to us. One
thing that is true is that C/P-state switches that change clock speed
can skew the result of the measurement, but there are ways to solve
this as well -- sometimes just by ignoring it entirely, since the
numbers are usually only good for eyeballing anyway.) When we need
performance measurements, we're usually using this `rdtscp` wrapper;
see for example https://github.com/fastly/librip for implementation
and https://9vx.org/post/ripping-rings/#ripping-rings:3dc2715b7d0d6dc78b3ff9173cd0b415
for write-up.

We also use this wrapper to plot the time spent holding and waiting
on locks, which we report, e.g.:

dho at cachenode:~$ varnishstat -1 | egrep lock.*expire
lock_cycles_held_expire 16595254263058  71858036.60 Cycles exp_mtx was held
lock_cycles_wait_expire 1821824709085   7888565.28 Cycles to acquire exp_mtx

These numbers get plotted along with other counters in our monitoring
services. I've found the resulting graphs extremely useful in
debugging production scenarios with livelock (both hold/wait time
spike tremendously) and deadlock (hold/wait time fall to 0) at a
glance. (The librip thing mentioned above is useful for finding out
exactly where deadlock happens, though that can be done semi-trivially
with defined lock hierarchies.) Hand-rolled TSC opens a ton of
possibilities when used judiciously.
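A minimal sketch of how hold/wait counters like the ones above can be
kept (illustrative only; `timed_lock` and these field names are mine,
not our actual implementation):

```c
#include <pthread.h>
#include <stdint.h>

static inline uint64_t
cycles(void)
{
    uint32_t eax, edx;
    __asm__ __volatile__("rdtscp" : "=a" (eax), "=d" (edx) :: "%ecx", "memory");
    return (((uint64_t)edx << 32) | eax);
}

struct timed_lock {
    pthread_mutex_t mtx;
    uint64_t        cycles_wait;  /* total cycles spent acquiring */
    uint64_t        cycles_held;  /* total cycles spent holding */
    uint64_t        acquired_at;  /* TSC when the lock was last taken */
};

static void
timed_lock_acquire(struct timed_lock *tl)
{
    uint64_t t0 = cycles();

    pthread_mutex_lock(&tl->mtx);
    tl->acquired_at = cycles();
    tl->cycles_wait += tl->acquired_at - t0;
}

static void
timed_lock_release(struct timed_lock *tl)
{
    tl->cycles_held += cycles() - tl->acquired_at;
    pthread_mutex_unlock(&tl->mtx);
}
```

The two accumulators map directly onto the `lock_cycles_held_*` and
`lock_cycles_wait_*` counters shown above.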

>> 64-bit counters that represent the number of nanoseconds since the
>> epoch. We will run out of those in something like 540 years, and I'm
>> happy to make that someone else's problem :).
>
> Besides the fact that this should be "(far) enough (in the future) for
> everyone", I'm not even convinced that we need nanos in varnish. Couldn't we
> shave off some 3-8 bits or even use only micros?

Shaving off bits doesn't seem hugely useful to me without some other
information to pack in. A `double` is already 64 bits per the C
standard, so on C implementations that provide `uint64_t`, we may as
well use all the bits.

There are some cases where we have timestamps that may be <1µs apart
depending on various error paths. Consider connect(2) returning EAGAIN
as a straw man. This failure mode might cause the back-end side of
things to error out in <1µs, and logging only microseconds isn't
hugely helpful in that case -- if the duration is 0, does that mean I
didn't run the code path, or that it just executed faster than could
be measured in microseconds? This lack of information would have caused
us headaches in debugging various failure modes in the state machine
over the years.
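To make the precision point concrete: near the 2016 epoch, a double's
52-bit mantissa resolves only about 238 ns, so sub-microsecond
durations can vanish outright, while 64-bit nanosecond integers stay
exact. A small demonstration (the timestamp value is illustrative):

```c
#include <stdint.h>

/* At ~1.46e9 seconds since the epoch, the ulp of a double is
 * 2^(30-52) s, about 238 ns, so adding 100 ns rounds back to the
 * original value. */
static int
double_loses_nanos(void)
{
    double d = 1459814400.0;   /* April 2016, seconds since the epoch */
    double d2 = d + 100e-9;    /* try to add 100 ns */

    return (d2 == d);          /* 1: the addition was lost entirely */
}

static int
u64_keeps_nanos(void)
{
    uint64_t n = 1459814400ULL * 1000000000ULL;  /* same instant, in ns */
    uint64_t n2 = n + 100;

    return (n2 - n == 100);    /* 1: integer nanoseconds are exact */
}
```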

>> ## PCRE
>>
>> There are other things we've done (like optimizing regexes that are
>> obviously prefix and suffix matches -- turns out lots of people write
>> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if
>> (strncmp(req.http.x-foo, "bar", 3) == 0)` because it's easy), but I don't
>> see those as being as high priority for upstream; they're largely
>> issues for our multi-tenant use case.
>
> Years ago I learned about the fastly "Host header switch" problem and actually
> it has an interesting generalization: As we compile VCC we could as well compile
> pattern matchers also. I do this in
> https://code.uplex.de/uplex-varnish/dcs_classifier - and the results are pretty
> impressive.

The icache thrash from the domain lookup function is actually pretty
bad, and it's hard to maintain. We have an implementation of a minimal
perfect hash table that we were considering as a candidate, but it
doesn't work well for prefix or wildcard matching.
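For what it's worth, the anchored-prefix rewrite I mentioned amounts
to something like the following sketch (hypothetical helper names; a
real version would also handle escapes, character classes, and other
anchor forms):

```c
#include <stdbool.h>
#include <string.h>

/* A pattern of the form "^literal.*$" (or "^literal"), with no regex
 * metacharacters in the literal, reduces to a strncmp on the prefix. */
static bool
prefix_match(const char *subject, const char *literal)
{
    return (strncmp(subject, literal, strlen(literal)) == 0);
}

/* Decide whether a pattern qualifies, extracting the literal prefix:
 * "^bar.*$" -> "bar". */
static bool
is_prefix_pattern(const char *pat, char *lit, size_t litsz)
{
    size_t i = 0;

    if (*pat++ != '^')
        return (false);
    while (*pat != '\0' && strchr(".*+?[](){}|\\^$", *pat) == NULL) {
        if (i + 1 >= litsz)
            return (false);
        lit[i++] = *pat++;
    }
    lit[i] = '\0';
    /* Only a trailing ".*$" (or nothing) may remain after the literal. */
    return (strcmp(pat, ".*$") == 0 || *pat == '\0');
}
```

Patterns that qualify skip PCRE entirely; everything else falls
through to the normal regex path.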

> I have no experience with it, but there's also re2c, which takes a generic
> approach to the problem.
>
> VCC could generate optimal matcher code for common yet simple expressions.
> Adding a glob pattern type could be an option.
>
> Ideally, I'd like the "Host header switch problem" solved by having a VCC
> statement which would compile something like this...
>
> select req.http.Host {
>         "www.foo.com": {
>                 call vcl_recv_foo;
>                 break;
>         }
>         "www.bar.*": {
>                 call vcl_recv_bar;
>                 break;
>         }
>         "*": {
>                 call ...
>         }
> }
>
> ...into classifier tree code.

Our "table" extension to VCL would effectively allow this, except it
does not support wildcards (we have a version that does, but it
requires N-1 lookups, where N is the number of labels in the domain). I
have also done some work with qp tries
(http://dotat.at/prog/qp/README.html), and suspect that a
prefix-matching variant can be constructed. This is effectively how we
do matching on our surrogate keys, but using critbit. (I am looking
forward to replacing that with QP hopefully soon -- but then, I also
have a large backlog of Stuff To Do -- if only there were 48 hours in
a day!)

Currently all of our customer VCLs are compiled into separate DSOs, so
we do not have a single VCL that we indirect into based on
host/ip/service. I believe there's also a strong argument to be made
for the benefit of process isolation in multi-tenant environments, and
I'd like to encourage a solution to this problem that moves more
towards routing than centralization.

--dho

>
> Nils
