Varnish performance musings

Guillaume Quintard guillaume at varnish-software.com
Wed Apr 6 17:03:57 CEST 2016


Thanks for all the info, all of you.

I'm wondering: H/2 mandates lower-case header names, which lets us get
rid of strncmp, and since the name length is also known, checking for
equality boils down to a memcmp. Would that affect the results you are
seeing, Devon? I'm genuinely curious about the memory/speed balance of
the LUT versus the simple list we have, especially for "small" (<20?)
sets of headers.
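
Roughly what I picture (a minimal sketch; the struct here is made up
just to illustrate):

    #include <string.h>

    struct hdr {
        const char *name; /* already lower-cased per H/2 */
        size_t len;       /* known up front, no strlen() needed */
    };

    static int
    hdr_eq(const struct hdr *h, const char *name, size_t len)
    {
        /* length check first, then a straight memcmp -- no
         * case-insensitive scanning required */
        return (h->len == len && memcmp(h->name, name, len) == 0);
    }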

Also, H/2 makes it an error to send two same-named headers, except for
cookies (exceptions are fun!), so I'm wondering if the section about
duplicate headers is worth it (again, I'm not judging, really asking).

-- 
Guillaume Quintard

On Wed, Apr 6, 2016 at 2:46 PM, Federico Schwindt <fgsch at lodoss.net> wrote:

> Hi,
>
> I'm all for the LUT idea and I think it makes even more sense when you
> consider H/2.
>
> Having the same timestamp during a particular method also makes sense.
> I believe we only need to change the current-time calls to use ctx->now
> instead of fetching the time on each call, but I haven't looked into it
> in detail.
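>
> Roughly the kind of change I mean (sketch only; vmod_age() is a
> made-up example, assuming the usual vrt.h environment):
>
>     VCL_DURATION
>     vmod_age(VRT_CTX, VCL_TIME then)
>     {
>         /* was: return (VTIM_real() - then); -- fresh time per call */
>         return (ctx->now - then); /* shared per-method snapshot */
>     }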
>
> Being a bit smarter wrt some regular expressions might also be a good
> idea, especially for cases like the one you mentioned.
> FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32.
> Perhaps it's worth checking if this provides any benefit before we consider
> optimizing the custom patterns.
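>
> For reference, checking for and enabling JIT is just the stock PCRE
> API (assuming `re` is an already-compiled pcre *):
>
>     #include <pcre.h>
>
>     const char *err;
>     int has_jit = 0;
>
>     (void)pcre_config(PCRE_CONFIG_JIT, &has_jit);
>     /* this study flag is what the JIT setting toggles */
>     pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);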
>
> On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell <dho at fastly.com> wrote:
>
>> Hi all,
>>
>> Probably best to bring this discussion out of IRC at this point. I've
>> looked into Varnish performance (from a CPU consumption perspective) on
>> our installations for nearly 3 years, and wanted to share some
>> findings (as well as some thoughts on how to address these things).
>> Unfortunately no TL;DR, but maybe it's good to separate these topics.
>>
>> It's going to be a while before I have time to do a Varnish 4
>> test to get numbers -- our fork is based on 2.1.4 and things do not
>> line up 1:1. Furthermore, getting noise-free numbers for just Varnish
>> is difficult for a number of reasons, but effectively I cannot get a
>> good sample of an individual process due in part to how perf performs
>> and aggregates its samples, and in part to how many threads we run.
>> But here are some of the things I've seen as problematic over the last
>> few years, and ideas for fixing them.
>>
>> ## Header Scanning
>>
>> The number one CPU hog in Varnish right now outside of PCRE is header
>> processing. Any header lookup or set is a linear scan, and any
>> operation on a non-existent header has worst-case time. Any headers
>> added with VCL become worst-case. Our plan for this turns header
>> access / modification into O(1):
>>
>>  a. All headers are accessed by offset into a LUT.
>>  b. All headers in http_headers.h have a guaranteed slot in the LUT.
>>  c. VCL-accessed headers are managed at vcc-time and receive slots in
>> the LUT at initialization-time.
>>  d. VMODs have an interface such that headers they access outside of
>> headers defined in http_headers.h or VCL can be registered at
>> initialization time.
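>>
>> As a sketch, (a) through (c) could look something like this (I'm
>> hand-waving the exact shape of the HTTPH() macro):
>>
>>     /* Every header in http_headers.h gets a fixed slot... */
>>     enum hdr_slot_id {
>>     #define HTTPH(str, sym, ...) SLOT_##sym,
>>     #include "http_headers.h"
>>     #undef HTTPH
>>         SLOT_KNOWN_MAX,
>>         /* ...and slots for VCL- and VMOD-registered names are
>>          * appended after this at initialization time. */
>>     };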
>>
>> This also provides a means for accessing multiple same-named headers.
>> We would introduce some VCL syntax to access a specific header (e.g.
>> beresp.http.Set-Cookie[2]) and to get the number of occurrences of a
>> header with a particular name. An interface would also exist to apply
>> some function to all headers (possibly also to all headers matching a
>> specific name). The latter is something we have already -- we've added
>> a "collection" type to VCL, as well as some functions to apply a
>> function (called a "callback") to members of the collection. Callbacks
>> operate in a context in which they are provided a key and a value;
>> they are not closures. This has workspace overhead that I'm not
>> entirely happy with, so we have not made it generally accessible yet.
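>>
>> In rough terms, the callback interface looks something like this
>> (names here are illustrative, not our actual API):
>>
>>     /* Callbacks receive opaque state plus a key/value pair; they
>>      * are deliberately not closures. */
>>     typedef void hdr_cb(void *priv, const char *key, const char *val);
>>
>>     /* Apply cb to every header in hp, or only to headers named
>>      * `name` when name is non-NULL. */
>>     void HDR_Foreach(struct http *hp, const char *name,
>>         hdr_cb *cb, void *priv);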
>>
>> In the case of multiple headers with the same name, the storage format
>> would still be a LUT, but the next member for a header would appear
>> "chained" in some later part of the table; each entry is defined as a
>> 3-tuple of (p, q, next). When next is NULL, only one instance of a
>> particular header appears. Since Varnish bounds the number of headers
>> that can be handled in a request, this table doesn't have to be very
>> large and can probably be bounded to (n_known_headers + max_headers)
>> entries.
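>>
>> A minimal sketch of such an entry (p and q being the usual start/end
>> pointers into the header text; the names are illustrative):
>>
>>     struct hdr_entry {
>>         char             *p;    /* start of "Name: value" text */
>>         char             *q;    /* end of the text */
>>         struct hdr_entry *next; /* next same-named header, or NULL */
>>     };
>>
>>     /* Bounded as described above; the constants are placeholders. */
>>     #define N_KNOWN_HEADERS 96
>>     #define MAX_HEADERS     64
>>     struct hdr_entry tbl[N_KNOWN_HEADERS + MAX_HEADERS];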
>>
>> ## Time Storage as double-precision FP and printf(3)-family
>>
>> Varnish uses double-precision FP for storing timestamps. The rationale
>> for this is reasonable: a native type exists that can support
>> fractional seconds. Arithmetic between two timestamps can be easily
>> applied in code without relying on APIs that make said arithmetic
>> difficult to read. This is a good argument for having times stored in
>> a native format. Unfortunately, there are a few downsides: FP
>> operations are typically slow and error-prone at a hardware / compiler
>> level (https://github.com/varnishcache/varnish-cache/issues/1875 is a
>> recent example), and stringifying floating-point numbers correctly is
>> really hard.
>>
>> I have just done a measurement of our production Varnish, and TIM_foo
>> functions no longer appear as significant CPU users. I believe this is
>> because of a change I made a year or two ago in our fork that
>> snapshots timestamps before calling into VCL. All VRT functions
>> operate on the same timestamps, and therefore all VCL callbacks appear
>> to occur at the same time. (This has numerous beneficial properties
>> and only a few negative ones). Each VCL function gets its own
>> snapshot.
>>
>> However, printf(3)-family functions are still super-heavy CPU
>> consumers, accounting for ~6.5% of total CPU time in our Varnish. A
>> third of this time is spent in `__printf_fp`, which is the glibc
>> function that handles representation of floating-point values. The
>> *only* FP type Varnish really uses is double; it's logical to assume,
>> without doing a full audit, that something like 20% of
>> printf(3)-family time is spent converting double-precision numbers to
>> strings and that the majority of the remaining time is format string
>> parsing. From this perspective, it is still worth analyzing the
>> performance of VTIM-family functions to get an idea of their overhead:
>>
>>  1. TIM_real in our tree showed up in the top functions of a
>> synthetic, all-hit workload. Annotating the function shows where
>> samples saw most of the time spent. In this sample, we can see that
>> nearly 2/3rds of the time is spent setting up the call stack and 1/3
>> is spent on FP ops.
>>
>>       │    000000000000aec0 <TIM_real>:
>>  36.20 │      sub    $0x18,%rsp
>>  26.70 │      xor    %edi,%edi
>>        │      mov    %rsp,%rsi
>> ...
>>   1.81 │40:   cvtsi2 0x8(%rsp),%xmm1
>>  10.86 │      mulsd  0x8511(%rip),%xmm1        # 13420 <__func__.5739+0x13>
>>  19.00 │      cvtsi2 (%rsp),%xmm0
>>
>>  2. Other time-related functions have FP-dominated profiles.
>> TIM_format, for example, spends nearly 2/3rds of its time on a single
>> FP instruction (this is probably partly due to pipeline stalling).
>>
>>        │    000000000000ae40 <TIM_format>:
>>  62.16 │      cvttsd %xmm0,%rax
>>
>> Inlining these functions would be beneficial in the TIM_real sense
>> (sorry, I am still operating in V2 terminology, but I believe all of
>> this still applies to V4), but moving away from double as a time
>> storage format would be beneficial in general. This would be done by
>> using 64-bit counters that represent the number of nanoseconds since
>> the epoch. We will run out of those in something like 540 years, and
>> I'm happy to make that someone else's problem :).
>>
>>  a. It reduces a significant portion of the overhead in VTIM-family
>> functions
>>  b. It reduces a significant portion of the overhead in printf
>>  c. It maintains composability of time arithmetic
>>
>> The major downside is that timestamp printing now needs additional
>> work to print fractional time components.
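>>
>> For example (a sketch, assuming non-negative timestamps):
>>
>>     #include <inttypes.h>
>>     #include <stdint.h>
>>     #include <stdio.h>
>>
>>     typedef int64_t vtim_ns;            /* ns since the epoch */
>>     #define NS_PER_SEC 1000000000LL
>>
>>     /* Arithmetic still composes: t2 - t1, t + d, etc. Printing
>>      * splits into whole and fractional seconds, so no FP (and no
>>      * __printf_fp) appears anywhere in the path. */
>>     static void
>>     ns_format(vtim_ns t, char *buf, size_t len)
>>     {
>>         snprintf(buf, len, "%" PRId64 ".%09" PRId64,
>>             t / NS_PER_SEC, t % NS_PER_SEC);
>>     }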
>>
>> Finally, this gets a little into printf(3)-family inefficiencies as
>> well. Because the printf(3) family parses format strings on every
>> call, we've optimized a number of places where we were using
>> sprintf(3)-like interfaces to simply use string buffers. There is VSB
>> of course, but we also use https://github.com/dhobsd/vstring
>> (partially for FP stuff, partially for allowing static-backed buffers
>> to upgrade to dynamic ones if necessary). The code overhead of string
>> building is unfortunate, but at 6.5% overhead to use printf(3), this
>> is a real win. (Some of the unlabeled blocks are things like
>> _IO_default_xsputn, so the overhead of printf(3) here is likely still
>> higher than 6.5%.) See https://9vx.org/images/fg.png -- this was taken
>> on a machine that is handling nearly 12k RPS on top of ~3-4k threads.
>> By moving to integer times, conversion and printing would likely
>> reduce the overhead of printf(3) by 20% without changing the code that
>> calls printf.
>>
>> I am unclear how this applies to Varnish 4, but I think relatively
>> little is changed in this context between the versions.
>>
>> ## PCRE
>>
>> There are other things we've done (like optimizing regexes that are
>> obviously prefix and suffix matches -- it turns out lots of people
>> write things like `if (req.http.x-foo ~ "^bar.*$")` that are
>> effectively `if (strncmp(req.http.x-foo, "bar", 3) == 0)` because
>> it's easy), but I don't see those as being as high a priority for
>> upstream; they're largely issues for our multi-tenant use case.
>> Another thing we would like to do is to check regexes for things like
>> backtracking and use DFA-based matching where possible. In the flame
>> graph screenshot, the obvious VRT functions are PCRE.
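>>
>> The prefix detection itself is mechanical; roughly (sketch only):
>>
>>     #include <string.h>
>>
>>     /* Recognize "^<literal>" or "^<literal>.*$" and hand back the
>>      * literal so the caller can use strncmp instead of PCRE. */
>>     static int
>>     is_prefix_pattern(const char *pat, const char **lit, size_t *len)
>>     {
>>         size_t l;
>>
>>         if (pat[0] != '^')
>>             return (0);
>>         l = strcspn(pat + 1, "\\.[](){}*+?|^$"); /* literal run */
>>         if (l == 0)
>>             return (0);
>>         if (pat[1 + l] != '\0' && strcmp(pat + 1 + l, ".*$") != 0)
>>             return (0);
>>         *lit = pat + 1;
>>         *len = l;
>>         return (1);
>>     }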
>>
>> ## Expected Impact
>>
>> The expected impact of fixing these things is almost purely in
>> latency. For this machine handling 12k RPS, that is the constant
>> throughput bound, but we are bursting up to nearly 4k threads to serve
>> the load. If header processing, PCRE, and printf were reduced to 50%
>> of their current overhead, we'd expect to be able to handle the same
>> load with something like 350 fewer threads, which is a real win for
>> us. Note that even our 99%ile latency is largely covered by cache
>> hits, so these effects would improve service for the vast majority of
>> requests.
>>
>> Anyway, those are some thoughts. Looking forward to comments, though
>> maybe there's a better venue for that than this ML?
>>
>> --dho
>>