VIP23 (VSL refactoring) design sketch

Fri Apr 12 14:37:28 UTC 2019

On Fri, Apr 12, 2019 at 4:00 PM Geoff Simmons <geoff at uplex.de> wrote:
>
> So I said we shouldn't get to optimizing too soon, and here I am doing
> something like that. Because the discussion made me curious.
>
> In the attachment you'll find:
>
> - A first implementation of a function VSLwv() to write binary fields
> into a log record buffer, using a printf-like format string and a va_list.
>
> - Test calls for Timestamp, ReqAcct, ReqHeader and RespStatus (these are
> #if-0'd out here).
>
> - printf() calls to output the data from the binary log record (also
> #if-0'd out). This confirms that we can get the data back out as it went
> in, and experiments at how we could go about that.
>
> - A benchmark with ca. one gazillion runs of VSLwv() and the
> corresponding sprintf() calls, to compare the binary function with
> library printf code.
>
> - Results of running that through callgrind and perf record. To view the
> results, you'll need a tool like kcachegrind, and perf report.
>
> VSLwv() doesn't have bounds checking, to make sure that we don't write
> past vsl_reclen, which of course a real version must have. That will
> mean more conditional branching, but the branches will be predicted
> almost always.

Interesting, for textual logs the effect of vsl_reclen is
straightforward but how should we deal with binary records truncation?

> I grepped through current VSL*() calls, and these are more or less all
> of the format characters we currently use: %s, %.*s, %.ns (for a length
> n, as in %.20s), %d, %u, %x, %f, %j* and %z* (for (u)intmax_t and size_t
> types). There may be one or two more or less, but that's close to all of
> it. I added %hu for a response status as unsigned short (we currently
> have %u for that). (There are some %p's for tags like ExpKill, mostly
> only interesting for devs.)
>
> Note that for the binary writes, we don't have to care if the data type
> is unsigned, just copy the right number of bytes. We'll want the %u
> formats for type checking, but VSLwv() can do the same thing for %d and
> %u, and %zd and %zu, and so on. It's the output code that needs to get
> signedness right.

Should we though? If someone runs Varnish in a virtual machine on a
mainframe server and runs varnishlog -w don't we want to be able to
varnishlog -r that file on a "regular" workstation?

> This version always writes a length field as an unsigned short before a
> string (maybe it should be uint16_t), and string fields are not
> NUL-terminated.

Until the vsl_reclen question is sorted out, you can always copy the
string and grab its length while doing so, and once it's written and
you have the length you can write it at the field length's address.

I agree we should settle for uin16_t for that.

Having null-terminated strings open a zero-copy opportunity on the VSL
consumer side.

> So a final version would need the bounds checking, may handle one or two
> more or fewer format chars, and may handle strings differently. Other
> than that, I think this is about everything we'll need it to do.
>
> What did the performance tools say?
>
> Overall, VSLwv() did noticeably better than the library printf calls on
> the various measures: fewer CPU cycles, better memory cache hit rates
> (both instruction and data), and, generally, fewer branch mispredicts.
> Not surprising, because library printf has quite a bit more to do.
>
> As @Nils suspected, the weak spot is branch prediction. My gcc compiled
> the switch statement as a jump table (jmpq *rax), and kcachegrind
> reports 86% of indirect branch mispredicts there, far more than anything
> else -- glibc vfprintf comes in next with 13%.
>
> On the other hand, the library printf code has about twice as many
> mispredicts for conditional branches, resulting in more mispredicts for
> all types of branches.
>
> So: this first shot at coding something up has a sore spot, but overall
> appears to be measurably better than the printf stuff.
>
> Some other lessons learned:
>
> The binary records are not necessarily shorter than the ASCII-formatted
> records. Particularly ReqAcct, which we have as 6 uintmax_t fields. That
> comes to 48 bytes on my machine, about twice as long as a string like
> "110 0 110 210 224 434". I don't think we can do much about that.
> Request/responses can transfer up to gigabytes, which is why we need
> uintmax_t. Unless we want to try something like variable-length integer
> fields (which I doubt).

We could, but then that would force us to pre-compute variable-length
field sizes. (are there others than strings?)

> HTTP header records, with two variable-length strings, also won't be any
> shorter than the ASCII format. If they have two 2-byte length fields,
> they may come out as a couple of bytes longer (especially if they're
> also NUL-terminated).

I think we never discussed the prefix filtering. With binary logs it
should become explicit and a different kind of field that is
essentially a string for a specific purpose.

Since the presence of a prefix may be optional, we'd need to change
std.log() and its underlying VRT function to something like this:

$Module std
$Function VOID log([STRING prefix], STRING msg)

And inside, if VCL <= 4.1 and prefix is null, look for a colon
character and extract the prefix from the message if present.

Individual records in the shmlog would then need a bit flag to tell
whether the first field is a prefix or not.

> In the large, I think we'll be saving VSL space -- the binary Timestamp
> records, for example, end up about half as long as the ASCII-formatted
> version. But we won't be saving space in every record.

But we also expect to save a lot on the consumer side.

> As you may notice, some of this code is ... not pretty. With idioms like
> this:
>
>         *((int *)p) = va_arg(ap, int);
>
>         printf("%.*s: %f %f %f\n\n", *((unsigned short *)&buf[24]),
>                &buf[26], *((double *)&buf[0]), *((double *)&buf[8]),
>                *((double *)&buf[16]));
>
> ... I had pleasant thoughts like "phk is gonna kill me".
>
> If we're going with something like this, we'll probably want some macros
> and/or inline functions for better sanity, say:
>
> #define CASTP(ptr, type) (*((type *)(ptr)))
>
> And then:
>
>         CASTP(p, int) = va_arg(ap, int);
>
> For the output code, we could have the Python tool generate format
> strings (perhaps more than one, such as "space-separated" and
> "header-like"), field accessors, and varargs lists.
>
> Say something like:
>
> WS_FMT(SLT_Timestamp) ==> "%.*s: %.6f %.6f %.6f"
> FLD_LEN(buf, SLT_Timestamp, 0) ==> CASTP(&buf[24], unsigned short)
> FLD(buf, SLT_Timestamp, 0) ==> &buf[26]
> FLD(buf, SLT_Timestamp, 1) ==> CASTP(&buf[0], double)
> FLD(buf, SLT_Timestamp, 2) ==> CASTP(&buf[8], double)
> FLD(buf, SLT_Timestamp, 3) ==> CASTP(&buf[16], double)
>
> FMT_ARGS(buf, SLT_Timestamp) ==> \
> FLD_LEN(buf, SLT_Timestamp, 0), FLD(buf, SLT_Timestamp, 0), \
>         FLD(buf, SLT_Timestamp, 2), FLD(buf, SLT_Timestamp, 3)
>
> So then it's:
>
>         printf(WS_FMT(SLT_Timestamp), FMT_ARGS(buf, SLT_Timestamp));
>
> ... all generated so it's sure to be right. Something like the FLD()
> accessors will of course also be needed for VSLQ, varnishncsa etc.
>
>
> Best,
> Geoff
> --
> ** * * UPLEX - Nils Goroll Systemoptimierung
>
> Scheffelstraße 32
> 22301 Hamburg
>
> Tel +49 40 2880 5731
> Mob +49 176 636 90917
> Fax +49 40 42949753
>
> http://uplex.de