<div dir="ltr"><div>Thank you so much, Geoff, for that very useful knowledge dump!</div><div><br></div><div>Good call on the .*; I realized I carried them over when I copy-pasted the regex from the pure VCL example (where they're needed) to the vmod one.</div><div><br></div><div>And so, just to be clear about it:</div><div>- vmod-re is based on libpcre2</div><div>- vmod-re2 is based on libre2</div><div>Correct?<br></div><div><br></div><div>I see no way I'm going to misremember that, at all :-D<br></div><div><br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>-- <br></div><div>Guillaume Quintard<br></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 1, 2023 at 7:47 AM Geoff Simmons &lt;<a href="mailto:geoff@uplex.de">geoff@uplex.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sorry, I get nerdy about this subject and can't help following up.<br>

<br>

I said:<br>

<br>

> - pcre2 regex matching is generally faster than re2 matching. The point <br>

> of re2 regexen is that matches won't go into catastrophic backtracking <br>

> on pathological cases.<br>

<br>

Should have mentioned that pcre2 is even better at subexpression <br>

capture, which is what the OP's question is all about.<br>

<br>

> sub vcl_init {<br>

>      new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");<br>

> }<br>

<br>

OMG no. Like this please:<br>

<br>

        new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");<br>

<br>

I have sent an example of a pcre regex with .* (two of them!) to a <br>

public mailing list, for which I will burn in hell.<br>

<br>

To match a name-value pair in a cookie, use a regex with \b for 'word <br>

boundary' in front of the name. That way it will match at the beginning <br>

of the Cookie value or right after a non-word character such as an <br>

ampersand, but not in the middle of a longer name.<br>

<br>

And ?: tells pcre not to bother capturing the last expression in <br>

parentheses (they're just for grouping).<br>

<br>
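For illustration, here is a minimal sketch of how the pattern above might be used to pull out the value of q. It assumes the pair lives in the Cookie header, as discussed here. The backend definition and the X-Q header name are placeholders and not part of the original question. Group 1 is the literal "q=", group 2 is the value, and the trailing group is not captured.<br>

```vcl
vcl 4.1;

import re;

# Placeholder backend so that the sketch is complete on its own.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_init {
    # \b anchors the name; (?:...) groups without capturing.
    new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");
}

sub vcl_recv {
    if (query_pattern.match(req.http.Cookie)) {
        # Backref 2 is the value of q. The second argument is the
        # fallback string used if the backref cannot be retrieved.
        # X-Q is a hypothetical header name chosen for this sketch.
        set req.http.X-Q = query_pattern.backref(2, "");
    }
}
```

Since the last group is non-capturing, only backrefs 1 and 2 are meaningful after a successful match.<br>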

Avoid .* in pcre regexen if you possibly can. You can, almost always.<br>

<br>

With .* at the beginning, the pcre matcher searches all the way to the <br>

end of the string, and then backtracks all the way back, looking for the <br>

first letter to match, in this case 'q'. It will also stop to attempt a <br>

match at any other 'q' that it finds while working backwards.<br>

<br>

pcre2 fortunately has an optimization that ignores a trailing .* if it <br>

has found a match up until there, so that it doesn't busily match the <br>

dot against every character left in the string. So this time .* does no <br>

harm, but it's superfluous, and violates the golden rule of pcre: avoid <br>

.* if at all possible.<br>

<br>

Incidentally, this is an area where re2 does have an advantage over <br>

pcre2. The efficiency of pcre2 matching depends crucially on how you <br>

write the regex, because details like \b instead of .* give it hints for <br>

pruning the search. While re2 matching usually isn't as fast as pcre2 <br>

matching against well-written patterns, re2 doesn't depend so much on <br>

that sort of thing.<br>

<br>
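If predictable matching time matters more than raw speed, the same pattern could be tried with vmod-re2. The following is only a sketch: it assumes vmod-re2 is installed, that its regex object provides match() and backref() in the same way as the vmod-re object, and that libre2 accepts this pattern unchanged. The backend and the X-Q header name are again placeholders.<br>

```vcl
vcl 4.1;

import re2;

# Placeholder backend so that the sketch is complete on its own.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_init {
    # Same pattern as the pcre2 version, compiled by libre2 instead.
    # query_pattern_re2 is a hypothetical object name.
    new query_pattern_re2 = re2.regex("\b(q=)(.*?)(?:\&|$)");
}

sub vcl_recv {
    if (query_pattern_re2.match(req.http.Cookie)) {
        # Backref 2 is the value of q, with "" as the fallback.
        set req.http.X-Q = query_pattern_re2.backref(2, "");
    }
}
```

Whether this is faster or slower than the pcre2 version depends on the input, so it is worth measuring both against real traffic.<br>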

<br>

OK I can chill now,<br>

Geoff<br>

-- <br>

** * * UPLEX - Nils Goroll Systemoptimierung<br>

<br>

Scheffelstraße 32<br>

22301 Hamburg<br>

<br>

Tel +49 40 2880 5731<br>

Mob +49 176 636 90917<br>

Fax +49 40 42949753<br>

<br>

<a href="http://uplex.de" rel="noreferrer" target="_blank">http://uplex.de</a><br>

<br>

_______________________________________________<br>

varnish-misc mailing list<br>

<a href="mailto:varnish-misc@varnish-cache.org" target="_blank">varnish-misc@varnish-cache.org</a><br>

<a href="https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc" rel="noreferrer" target="_blank">https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc</a><br>

</blockquote></div>