Varnish performance musings

Mon Apr 4 23:30:49 CEST 2016

On Mon, Apr 4, 2016 at 1:12 PM, Geoffrey Simmons <geoff at uplex.de> wrote:
>> On Apr 4, 2016, at 9:21 PM, Devon H. O'Dell <dho at fastly.com> wrote:
>>
>> ## PCRE
>>
>> There are other things we've done (like optimizing regexes that are
>> obviously prefix and suffix matches -- turns out lots of people write
>> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if
>> (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't
>> see those as being as high priority for upstream; they're largely
>> issues for our multi-tenant use case. We have done this already;
>> another thing we would like to do is to check regexes for things like
>> backtracking and use DFA-based matching where possible. In the flame
>> graph screenshot, the obvious VRT functions are PCRE.
>
> You might be interested in this, although it's new as can be (just today tagged as v0.1) -- a VMOD to access Google's RE2 regular expression lib:
>
> https://code.uplex.de/uplex-varnish/libvmod-re2
>
> For those not familiar with RE2: it limits the syntax so that patterns are regular languages in the strictly formal sense. Most notably, backrefs within a pattern are not allowed. That means that the matcher can run as DFAs/NFAs, there is never any backtracking, and the time requirement for matches is always linear in the length of the string to be matched.

Thanks for pointing this out! We have also considered RE2, but one of
the problems is determining whether a particular regular expression is
actually regular. One way would be to first try to compile with RE2,
then fall back to PCRE if that didn't work. But there are also other
problems -- we expose regular expression groups into VCL through
`re.group.N` where `N` is some number in [0, 9] (we do not currently
support named captures, and I don't have any plans to implement that)
-- and providing cross-API consistency there is also problematic.

PCRE does allow one to execute regexps in a DFA context using
`pcre_dfa_exec`, but this also changes the semantics of matching as
well due to what PCRE calls "auto-possessification". So we've
continued to punt on PCRE improvements in our Varnish until we have
time to think about all the semantic changes.

> So far this is just a proof of concept, and I haven't done any performance testing. From the documentation, I suspect that there are certain kinds of use cases for Varnish where RE2 would perform better than PCRE, and many cases where it doesn't make much difference (such as the prefix or suffix matches you mentioned). But that's all speculation until it's been tested.

The prefix and suffix cases are really interesting. Several people
here were surprised at the limitations of regex optimizations
performed by the engines. There are some optimizations that can be
performed, but we don't currently run pcre_study either, and I'm not
sure the range of things it actually improves. Most of the performance
problems I see with regular expressions are more to do with the
regexes themselves either being poorly written, or just the wrong tool
for the job. Anyway, I believe that RE2 will outperform PCRE for a
great many things; I haven't benchmarked it against pcre_dfa_exec
(though I've seen some benchmarks in the context of pcre-jit, which is
another thing I haven't tried at all), but my primary concern about
just adopting RE2 is about the classes of regexes that simply don't
work in RE2 context. We do have a non-trivial number of customers who
rely on being able to use effectively non-regular expressions.

--dho

>
> Best,
> Geoff
>
>