Varnish performance musings

Federico Schwindt fgsch at lodoss.net
Wed Apr 6 14:46:56 CEST 2016


Hi,

I'm all for the LUT idea, and I think it makes even more sense when you
consider HTTP/2.

Having the same timestamp during a particular method also makes sense.
I believe we only need to change the now calls to use ctx->now instead
of getting the time on each call, but I haven't looked into it in detail.
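
For instance (a rough sketch with V4-style names, not a patch), a VRT or
VMOD function would read the snapshot instead of asking for the time:

    #include "vrt.h"

    VCL_DURATION
    vmod_age(VRT_CTX, VCL_TIME created)
    {
            /* ctx->now is snapshotted when the VCL method is entered,
             * so every call within the method sees the same instant */
            return (ctx->now - created);
    }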

Being a bit smarter wrt some regular expressions might also be a good
idea, especially for cases like the one you mentioned.
FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32.
Perhaps it's worth checking if this provides any benefit before we consider
optimizing the custom patterns.
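
To give an idea of what's involved, using the JIT boils down to studying
the compiled pattern with the JIT flag, roughly like this (minimal
sketch, error handling omitted):

    #include <stdio.h>
    #include <string.h>
    #include <pcre.h>

    int
    main(void)
    {
            const char *err, *subject = "barbaz";
            int erroff, ov[30];

            pcre *re = pcre_compile("^bar", 0, &err, &erroff, NULL);
            /* PCRE_STUDY_JIT_COMPILE enables the JIT matcher (8.20+) */
            pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);
            int rc = pcre_exec(re, extra, subject, (int)strlen(subject),
                0, 0, ov, 30);
            printf("rc=%d\n", rc);
            pcre_free_study(extra);
            pcre_free(re);
            return (0);
    }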

On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell <dho at fastly.com> wrote:

> Hi all,
>
> Probably best to bring this discussion out of IRC at this point. I've
> looked into Varnish performance (from CPU consumption perspective) on
> our installations for nearly 3 years, and wanted to share some
> findings (as well as some thoughts on how to address these things).
> Unfortunately no TL;DR, but maybe it's good to separate these topics.
>
> It's going to be a bit before I'm going to have time to do a Varnish 4
> test to get numbers -- our fork is based on 2.1.4 and things do not
> line up 1:1. Furthermore, getting noise-free numbers for just Varnish
> is difficult for a number of reasons, but effectively I cannot get a
> good sample of an individual process due in part to how perf performs
> and aggregates its samples, and in part to how many threads we run.
> But here are some of the things I've seen as problematic over the last
> few years, and ideas for fixing them.
>
> ## Header Scanning
>
> The number one CPU hog in Varnish right now outside of PCRE is header
> processing. Any header lookup or set is a linear scan, and any
> operation on a non-existent header has worst-case time. Any headers
> added with VCL become worst-case. Our plan for this turns header
> access / modification into O(1):
>
>  a. All headers are accessed by offset into a LUT.
>  b. All headers in http_headers.h have a guaranteed slot in the LUT.
>  c. VCL-accessed headers are managed at vcc-time and receive slots in
> the LUT at initialization-time.
>  d. VMODs have an interface such that headers they access outside of
> headers defined in http_headers.h or VCL can be registered at
> initialization time.
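> 
> As a rough illustration of (b)-(d) -- everything here is a sketch with
> made-up names, not a patch -- known headers get fixed slots and
> everything else registers a slot once at initialization:
> 
>     #include <string.h>
> 
>     enum known_hdr { HDR_HOST, HDR_CONTENT_LENGTH, /* ... */ HDR_N_KNOWN };
> 
>     #define MAX_SLOTS 256   /* bounded; see the sizing note below */
> 
>     static const char *slot_name[MAX_SLOTS];
>     static unsigned n_slots = HDR_N_KNOWN;
> 
>     /* Find or assign the slot for a header name at init time; after
>      * this, every lookup or set is a direct index into the LUT.
>      * (Bounds checking elided for brevity.) */
>     unsigned
>     hdr_register(const char *name)
>     {
>             for (unsigned i = HDR_N_KNOWN; i < n_slots; i++)
>                     if (strcmp(slot_name[i], name) == 0)
>                             return (i);
>             slot_name[n_slots] = name;
>             return (n_slots++);
>     }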
>
> This also provides a means for accessing multiple same-named headers.
> We would introduce some VCL syntax to be able to access a specific
> header (e.g. beresp.http.Set-Cookie[2]) and to get the number of
> occurrences of a header with a particular name. An interface would also
> exist to be able to apply some function to all headers (possibly also
> to all headers matching a specific name). The latter of these is
> something we have already -- we've added a "collection" type to VCL,
> as well as some functions to apply a function (called a "callback") to
> members of the collection. Callbacks operate in a context in which
> they are provided a key and value; they are not closures. This has
> workspace overhead that I'm not entirely happy with, so we have not
> made it a generally accessible thing yet.
>
> In the case of multiple headers with the same name, the storage format
> would still be a LUT, but the next member for a header would appear
> "chained" in some later part of the headers, any offset is defined as
> a 3-tuple of (p, q, next). When next is NULL, only one header of a
> particular instance appears. Since Varnish bounds the number of
> headers that can be handled in a request, this table doesn't have to
> be very large and can probably be bounded to (n_known_headers +
> max_headers) entries.
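> 
> Sketched out (again illustrative, not our actual code), the chained
> entries and the callback application could look like:
> 
>     /* One LUT entry; the (p, q, next) tuple described above. */
>     struct hdr_ent {
>             const char      *p;     /* start of the header value */
>             const char      *q;     /* past-the-end of the value */
>             struct hdr_ent  *next;  /* next same-named header; NULL
>                                      * if only one instance appears */
>     };
> 
>     /* Callbacks get key/value plus an explicit priv pointer; they
>      * are not closures. */
>     typedef void hdr_cb_f(void *priv, const char *key,
>         const char *p, const char *q);
> 
>     static void
>     hdr_foreach(struct hdr_ent *e, const char *key, hdr_cb_f *cb,
>         void *priv)
>     {
>             for (; e != NULL; e = e->next)
>                     cb(priv, key, e->p, e->q);
>     }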
>
> ## Time Storage as double-precision FP and printf(3)-family
>
> Varnish uses double-precision FP for storing timestamps. The rationale
> for this is reasonable: a native type exists that can support
> fractional seconds. Arithmetic between two timestamps can be easily
> applied in code without relying on APIs that make said arithmetic
> difficult to read. This is a good argument for having times stored in
> a native format. Unfortunately, there are a few downsides: FP
> operations are typically slow, error-prone at a hardware / compiler
> level (https://github.com/varnishcache/varnish-cache/issues/1875 as a
> recent example), and stringifying floating point numbers correctly is
> really hard.
>
> I have just done a measurement of our production Varnish, and the
> TIM_foo functions no longer appear as significant CPU users. I believe this is
> because of a change I made a year or two ago in our fork that
> snapshots timestamps before calling into VCL. All VRT functions
> operate on the same timestamps, and therefore all VCL callbacks appear
> to occur at the same time. (This has numerous beneficial properties
> and only a few negative ones). Each VCL function gets its own
> snapshot.
>
> However, printf(3)-family functions are still super-heavy CPU
> consumers, accounting for ~6.5% of total CPU time in our Varnish. A
> third of this time is spent in `__printf_fp`, which is the glibc
> function that handles representation of floating-point values. The
> *only* FP type Varnish really uses is double; it's logical to assume,
> without doing a full audit, that something like 20% of
> printf(3)-family time is spent converting double-precision numbers to
> strings and that the majority of the remaining time is format string
> parsing. From this perspective, it is still worth analyzing the
> performance of VTIM-family functions to get an idea of their overhead:
>
>  1. TIM_real in our tree showed up in the top functions of a
> synthetic, all-hit workload. Annotating the function shows where
> samples saw most of the time spent. In this sample, we can see that
> nearly 2/3rds of the time is spent setting up the call stack and 1/3
> of the time is doing FP ops.
>
>       │    000000000000aec0 <TIM_real>:
>  36.20 │      sub    $0x18,%rsp
>  26.70 │      xor    %edi,%edi
>        │      mov    %rsp,%rsi
> ...
>   1.81 │40:   cvtsi2 0x8(%rsp),%xmm1
>  10.86 │      mulsd  0x8511(%rip),%xmm1        # 13420 <__func__.5739+0x13>
>  19.00 │      cvtsi2 (%rsp),%xmm0
>
>  2. Other time-related functions have FP-dominated components.
> TIM_format, for example, spends nearly 2/3rds of its time on a single
> FP instruction (this is probably partially due to pipeline stalling).
>
>        │    000000000000ae40 <TIM_format>:
>  62.16 │      cvttsd %xmm0,%rax
>
> Inlining these functions would be beneficial in the TIM_real (sorry, I
> am still operating in V2 terminology, but I believe all of this still
> applies to V4) sense, but moving away from double as a time storage
> format would be beneficial in general. This would be done by using
> 64-bit counters that represent the number of nanoseconds since the
> epoch. We will run out of those in something like 540 years, and I'm
> happy to make that someone else's problem :).
>
>  a. It reduces a significant portion of the overhead in VTIM-family functions
>  b. It reduces a significant portion of the overhead in printf
>  c. It maintains composability of time arithmetic
>
> The major downside is that timestamp printing now needs additional
> work to print fractional time components.
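> 
> A minimal sketch of both points (illustrative names only):
> 
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <time.h>
> 
>     typedef int64_t vtim_ns;        /* nanoseconds since the epoch */
>     #define NS_PER_SEC 1000000000LL
> 
>     static vtim_ns
>     tim_real_ns(void)
>     {
>             struct timespec ts;
> 
>             clock_gettime(CLOCK_REALTIME, &ts);
>             /* pure integer math: no cvtsi2sd/mulsd in the hot path */
>             return ((vtim_ns)ts.tv_sec * NS_PER_SEC + ts.tv_nsec);
>     }
> 
>     /* the extra work mentioned above: splitting fractional seconds */
>     static void
>     tim_format_ns(vtim_ns t, char *buf, size_t len)
>     {
>             snprintf(buf, len, "%lld.%09lld",
>                 (long long)(t / NS_PER_SEC), (long long)(t % NS_PER_SEC));
>     }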
>
> Finally, this gets a little into printf(3)-family inefficiencies as
> well. Because printf parses format strings on every call, we've
> optimized a number of places that were using sprintf(3)-like interfaces to
> simply use string buffers. There is VSB of course, but we also use
> https://github.com/dhobsd/vstring (partially for FP stuff, partially
> for allowing static-backed buffers to upgrade to dynamic ones if
> necessary). The code overhead of string building is unfortunate, but
> at 6.5% overhead to use printf(3), this is a real win. (Some of the
> unlabeled blocks are things like _IO_default_xsputn, so the overhead
> of printf(3) here is likely still higher than 6.5%). See
> https://9vx.org/images/fg.png -- this was taken on a machine that is
> handling nearly 12k RPS on top of ~3-4k threads. By moving to integer
> times, the conversion and printing changes would likely reduce the
> overhead of printf(3) by 20% without actually changing how often we
> call printf.
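> 
> For illustration, the kind of format-string-free conversion a string
> buffer can do internally (sketch; buf must hold at least 21 bytes):
> 
>     #include <stddef.h>
>     #include <stdint.h>
> 
>     static size_t
>     append_u64(char *buf, uint64_t v)
>     {
>             char tmp[20];
>             size_t n = 0;
> 
>             /* emit digits in reverse, then flip them into buf */
>             do {
>                     tmp[n++] = '0' + (char)(v % 10);
>                     v /= 10;
>             } while (v != 0);
>             for (size_t i = 0; i < n; i++)
>                     buf[i] = tmp[n - 1 - i];
>             buf[n] = '\0';
>             return (n);
>     }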
>
> I am unclear how this applies to Varnish 4, but I think relatively
> little is changed in this context between the versions.
>
> ## PCRE
>
> There are other things we've done, like optimizing regexes that are
> obviously prefix and suffix matches -- it turns out lots of people
> write things like `if (req.http.x-foo ~ "^bar.*$")` that are
> effectively `if (strncmp(req.http.x-foo, "bar", 3) == 0)` because it's
> easy -- but I don't see those as being as high priority for upstream;
> they're largely issues for our multi-tenant use case. Another thing we
> would like to do is to check regexes for things like backtracking and
> use DFA-based matching where possible. In the flame graph screenshot,
> the obvious VRT functions are PCRE.
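> 
> As a sketch of the prefix rewrite (not our actual implementation):
> 
>     #include <string.h>
> 
>     /* s ~ "^bar.*$" reduces to a bounded prefix comparison; strncmp
>      * stops at either string's NUL, so short subjects are safe. */
>     static int
>     match_literal_prefix(const char *s, const char *lit, size_t litlen)
>     {
>             return (strncmp(s, lit, litlen) == 0);
>     }
> 
>     /* e.g.: match_literal_prefix(x_foo, "bar", 3) */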
>
> ## Expected Impact
>
> The expected impact of fixing these things is almost purely in
> latency. For this machine handling 12k RPS, that is the constant
> throughput bound, but we are bursting up to nearly 4k threads to serve
> the load. If header processing, PCRE, and printf were reduced to 50%
> of their current overhead, we'd expect to be able to handle the same
> load with something like 350 fewer threads, which is a real win for
> us. Note that even our 99%ile latency is largely covered by cache
> hits, so these effects would improve service for the vast majority of
> requests.
>
> Anyway, those are some thoughts. Looking forward to comments, though
> maybe there's a better venue for that than this ML?
>
> --dho
>