Varnish performance musings

Devon H. O'Dell dho at fastly.com
Mon Apr 4 21:21:39 CEST 2016


Hi all,

Probably best to bring this discussion out of IRC at this point. I've
looked into Varnish performance (from CPU consumption perspective) on
our installations for nearly 3 years, and wanted to share some
findings (as well as some thoughts on how to address these things).
Unfortunately no TL;DR, but maybe it's good to separate these topics.

It's going to be a while before I have time to do a Varnish 4 test to
get numbers -- our fork is based on 2.1.4 and things do not line up
1:1. Furthermore, getting noise-free numbers for just Varnish
is difficult for a number of reasons, but effectively I cannot get a
good sample of an individual process due in part to how perf performs
and aggregates its samples, and in part to how many threads we run.
But here are some of the things I've seen as problematic over the last
few years, and ideas for fixing them.

## Header Scanning

The number one CPU hog in Varnish right now, outside of PCRE, is
header processing. Any header lookup or set is a linear scan; any
operation on a non-existent header takes worst-case time, and headers
added from VCL land at the end of the list, so accesses to them are
worst-case as well. Our plan turns header access / modification into
O(1):
 a. All headers are accessed by offset into a LUT.
 b. All headers in http_headers.h have a guaranteed slot in the LUT.
 c. VCL-accessed headers are managed at vcc-time and receive slots in
the LUT at initialization-time.
 d. VMODs have an interface such that headers they access outside of
headers defined in http_headers.h or VCL can be registered at
initialization time.
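
To make the shape of this concrete, here is a minimal C sketch of
(a)-(c); all names are hypothetical, not from an actual patch:

    /* Hypothetical header LUT; names are illustrative only. */
    #include <stddef.h>

    #define MAX_DYN_HEADERS 64  /* illustrative bound for VCL/VMOD slots */

    enum known_hdr {
        HDR_CONTENT_LENGTH,     /* fixed slots for http_headers.h entries */
        HDR_CONTENT_TYPE,
        HDR_SET_COOKIE,
        /* ... */
        HDR_N_KNOWN             /* slots >= this are handed out at */
                                /* vcc- and VMOD-initialization time */
    };

    struct hdr_entry {
        const char *p;          /* start of header value, or NULL */
        const char *q;          /* end of header value */
    };

    struct hdr_table {
        struct hdr_entry slots[HDR_N_KNOWN + MAX_DYN_HEADERS];
    };

    /* O(1): no linear scan; a missing header is just an empty slot. */
    static inline const struct hdr_entry *
    hdr_get(const struct hdr_table *t, unsigned slot)
    {
        return (t->slots[slot].p != NULL ? &t->slots[slot] : NULL);
    }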

The LUT approach also provides a means for accessing multiple
same-named headers. We would introduce VCL syntax both to access a
specific instance of a header (e.g. beresp.http.Set-Cookie[2]) and to
get the number of occurrences of a header with a particular name. An
interface would also exist to apply some function to all headers
(possibly restricted to headers matching a specific name). We have the
latter already: we've added a "collection" type to VCL, as well as
functions that apply a function (called a "callback") to members of
the collection. Callbacks operate in a context in which they are
provided a key and a value; they are not closures. This has workspace
overhead that I'm not entirely happy with yet, so we have not made it
generally accessible.
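
For what it's worth, the shape of that interface in our fork is
roughly as follows (illustrative signatures continuing the sketch
above; the real API differs in detail):

    /* A callback receives a key/value pair and opaque state; it is
     * not a closure, so any state must arrive via the priv pointer. */
    typedef void hdr_cb_f(void *priv, const char *key, const char *val);

    /* Apply cb to every header in the table; a name-filtered variant
     * would take a slot and walk only that header's chain. */
    void hdr_foreach(const struct hdr_table *t, hdr_cb_f *cb, void *priv);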

In the case of multiple headers with the same name, the storage format
would still be a LUT, but the next instance of a header would appear
"chained" in a later part of the table; each entry is defined as a
3-tuple of (p, q, next). When next is NULL, only one instance of that
header appears. Since Varnish bounds the number of headers it will
handle in a request, this table doesn't have to be very large and can
probably be bounded to (n_known_headers + max_headers) entries.
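
Continuing the earlier sketch, the entry grows a next pointer, and
looking up the nth instance is n pointer chases from the header's
fixed slot (hypothetical code):

    struct hdr_entry {
        const char       *p;    /* start of value */
        const char       *q;    /* end of value */
        struct hdr_entry *next; /* next same-named header, or NULL if
                                 * this is the only instance */
    };

    /* e.g. Set-Cookie[2]: chase the chain from the header's fixed
     * slot (0-based indexing purely for the sake of the sketch). */
    static const struct hdr_entry *
    hdr_get_nth(const struct hdr_entry *e, unsigned n)
    {
        while (e != NULL && n-- > 0)
            e = e->next;
        return (e);
    }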

## Time Storage as double-precision FP and printf(3)-family

Varnish uses double-precision FP for storing timestamps. The rationale
for this is reasonable: a native type exists that can support
fractional seconds. Arithmetic between two timestamps can be easily
applied in code without relying on APIs that make said arithmetic
difficult to read. This is a good argument for having times stored in
a native format. Unfortunately, there are a few downsides: FP
operations are typically slow, error-prone at a hardware / compiler
level (https://github.com/varnishcache/varnish-cache/issues/1875 as a
recent example), and stringifying floating point numbers correctly is
really hard.

I have just done a measurement of our production Varnish and TIM_foo
functions no longer appear as significant CPU users. I believe this is
because of a change I made a year or two ago in our fork that
snapshots timestamps before calling into VCL. All VRT functions
operate on the same timestamps, and therefore all VCL callbacks appear
to occur at the same time. (This has numerous beneficial properties
and only a few negative ones). Each VCL function gets its own
snapshot.
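
A rough sketch of that change, with hypothetical names -- the point is
simply that the clock is read once per VCL function instead of on
every VRT time access:

    #include <time.h>

    struct call_ctx {
        double now;             /* snapshot for this VCL call */
    };

    static void
    vcl_call(struct call_ctx *ctx /* , VCL function to invoke, ... */)
    {
        struct timespec ts;

        /* One clock read; every time access made from this VCL
         * callback (and thus from any VRT function it reaches)
         * sees the same value. */
        clock_gettime(CLOCK_REALTIME, &ts);
        ctx->now = ts.tv_sec + 1e-9 * ts.tv_nsec;
        /* ... run the VCL subroutine ... */
    }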

However, printf(3)-family functions are still super-heavy CPU
consumers, accounting for ~6.5% of total CPU time in our Varnish. A
third of this time is spent in `__printf_fp`, which is the glibc
function that handles representation of floating-point values. The
*only* FP type Varnish really uses is double; it's logical to assume,
without doing a full audit, that something like 20% of
printf(3)-family time is spent converting double-precision numbers to
strings, with the majority of the remaining time going to format
string parsing. From this perspective, it is still worth analyzing the
performance of VTIM-family functions to get an idea of their overhead:

 1. TIM_real in our tree showed up in the top functions of a
synthetic, all-hit workload. Annotating the function shows where
samples saw most of the time spent. In this sample, nearly 2/3rds of
the time is spent setting up the call stack and 1/3 of the time is
doing FP ops.

      │    000000000000aec0 <TIM_real>:
 36.20 │      sub    $0x18,%rsp
 26.70 │      xor    %edi,%edi
       │      mov    %rsp,%rsi
...
  1.81 │40:   cvtsi2sd 0x8(%rsp),%xmm1
 10.86 │      mulsd    0x8511(%rip),%xmm1        # 13420 <__func__.5739+0x13>
 19.00 │      cvtsi2sd (%rsp),%xmm0

 2. Other time-related functions have FP-dominated profiles.
TIM_format, for example, spends nearly 2/3rds of its time on a single
FP instruction (probably partially due to pipeline stalling).

       │    000000000000ae40 <TIM_format>:
 62.16 │      cvttsd2si %xmm0,%rax

Inlining these functions would help in the TIM_real case (sorry, I am
still operating in V2 terminology, but I believe all of this still
applies to V4), but moving away from double as the time storage format
would be beneficial in general. This would be done by using 64-bit
counters that represent the number of nanoseconds since the epoch. We
will run out of those in something like 540 years, and I'm happy to
make that someone else's problem :). This change has three benefits:

 a. It removes a significant portion of the overhead in VTIM-family
functions.
 b. It removes a significant portion of the overhead in printf(3).
 c. It preserves the composability of time arithmetic.

The major downside is that timestamp printing now needs additional
work to print fractional time components.
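
A minimal sketch of what that representation could look like
(hypothetical names, not a patch; printing here is at microsecond
precision purely for illustration):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* ns since the epoch; an unsigned counter wraps ~540 years out. */
    typedef uint64_t vtim_ns;

    #define NS_PER_SEC UINT64_C(1000000000)

    static vtim_ns
    vtim_now(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_REALTIME, &ts);
        return ((vtim_ns)ts.tv_sec * NS_PER_SEC + ts.tv_nsec);
    }

    /* Deltas and comparisons stay plain integer arithmetic; the added
     * cost is on output, where whole and fractional seconds must be
     * split by hand. */
    static int
    vtim_print(char *buf, size_t len, vtim_ns t)
    {
        return (snprintf(buf, len, "%" PRIu64 ".%06" PRIu64,
            t / NS_PER_SEC, (t % NS_PER_SEC) / 1000));
    }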

Finally, this gets a little into printf(3)-family inefficiencies as
well. Because printf(3) parses its format string on every call, we've
optimized a number of places that used sprintf(3)-like interfaces to
simply build strings directly. There is VSB of course, but we also use
https://github.com/dhobsd/vstring (partially for FP stuff, partially
for allowing static-backed buffers to upgrade to dynamic ones if
necessary). The code overhead of string building is unfortunate, but
at 6.5% of CPU going to printf(3), this is a real win. (Some of the
unlabeled blocks are things like _IO_default_xsputn, so the overhead
of printf(3) here is likely still higher than 6.5%.) See
https://9vx.org/images/fg.png -- this was taken on a machine that is
handling nearly 12k RPS on top of ~3-4k threads. By moving to integer
times, conversion and printing would likely reduce the overhead of
printf(3) by 20% without changing how often we call it.
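
To illustrate the flavor of those rewrites, here is a contrived sketch
(hypothetical code; note the direct version does not NUL-terminate):

    #include <stdio.h>
    #include <string.h>

    /* Before: snprintf re-parses the format string on every request. */
    static size_t
    emit_status_printf(char *buf, size_t len, const char *reason)
    {
        return ((size_t)snprintf(buf, len, "HTTP/1.1 200 %s\r\n", reason));
    }

    /* After: memcpy known pieces; no format parsing, no FP machinery.
     * (Returns bytes written; does not NUL-terminate.) */
    static size_t
    emit_status_direct(char *buf, size_t len, const char *reason)
    {
        static const char pfx[] = "HTTP/1.1 200 ";
        size_t pl = sizeof(pfx) - 1;
        size_t rl = strlen(reason);

        if (pl + rl + 2 > len)
            return (0);
        memcpy(buf, pfx, pl);
        memcpy(buf + pl, reason, rl);
        memcpy(buf + pl + rl, "\r\n", 2);
        return (pl + rl + 2);
    }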

I am unclear how this applies to Varnish 4, but I think relatively
little is changed in this context between the versions.

## PCRE

There are other things we've done, like optimizing regexes that are
obviously prefix or suffix matches -- it turns out lots of people
write things like `if (req.http.x-foo ~ "^bar.*$")`, which is
effectively `if (strncmp(req.http.x-foo, "bar", 3) == 0)`, because
it's easy. I don't see those as being as high priority for upstream;
they're largely issues for our multi-tenant use case. Another thing we
would like to do is check regexes for features like backtracking and
use DFA-based matching where possible. In the flame graph screenshot,
the obvious VRT functions are PCRE.
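
As a sketch of the prefix case (a hypothetical helper, not our actual
implementation), the vcc could recognize patterns of the form
"^<literal>", "^<literal>.*", or "^<literal>.*$" and emit a strncmp()
instead of a PCRE call:

    #include <string.h>

    /* If `re` is an anchored literal prefix, set *lit to the start of
     * the literal and return its length; return 0 to fall back to
     * PCRE. */
    static size_t
    literal_prefix(const char *re, const char **lit)
    {
        const char *p;

        if (*re++ != '^')
            return (0);
        for (p = re; *p != '\0'; p++)
            if (strchr("\\.[](){}*+?|^$", *p) != NULL)
                break;
        if (*p != '\0' && strcmp(p, ".*") != 0 && strcmp(p, ".*$") != 0)
            return (0);
        *lit = re;
        return ((size_t)(p - re));
    }

A match then becomes `strncmp(value, lit, n) == 0` for `n =
literal_prefix(re, &lit)`, with `n == 0` meaning "hand it to PCRE";
the suffix case is symmetric, comparing against the end of the value.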

## Expected Impact

The expected impact of fixing these things is almost purely in
latency. For this machine handling 12k RPS, that is the constant
throughput bound, but we are bursting up to nearly 4k threads to serve
the load. If header processing, PCRE, and printf were reduced to 50%
of their current overhead, we'd expect to be able to handle the same
load with something like 350 fewer threads, which is a real win for
us. Note that even our 99th-percentile latency is largely covered by
cache hits, so these effects would improve service for the vast
majority of requests.

Anyway, those are some thoughts. Looking forward to comments, though
maybe there's a better venue for that than this ML?

--dho


