<div dir="ltr"><div>Thank you so much, Geoff, for that very useful knowledge dump!</div><div><br></div><div>Good call-out on the .*: I realized I carried them over too when I copy-pasted the regex from the pure VCL example (where they're needed) to the vmod one.</div><div><br></div><div>And so, just to be clear about it:</div><div>- vmod-re is based on libpcre2</div><div>- vmod-re2 is based on libre2</div><div>Correct?<br></div><div><br></div><div>I see no way I'm going to misremember that, at all :-D<br></div><div><br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>-- <br></div><div>Guillaume Quintard<br></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 1, 2023 at 7:47 AM Geoff Simmons &lt;<a href="mailto:geoff@uplex.de">geoff@uplex.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sorry, I get nerdy about this subject and can't help following up.<br>
<br>
I said:<br>
<br>
> - pcre2 regex matching is generally faster than re2 matching. The point <br>
> of re2 regexen is that matches won't go into catastrophic backtracking <br>
> on pathological cases.<br>
<br>
Should have mentioned that pcre2 is even better at subexpression <br>
capture, which is what the OP's question is all about.<br>
<br>
> sub vcl_init {<br>
> new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");<br>
> }<br>
<br>
OMG no. Like this please:<br>
<br>
new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");<br>
<br>
I have sent an example of a pcre regex with .* (two of them!) to a <br>
public mailing list, for which I will burn in hell.<br>
<br>
To match a name-value pair in a cookie, use a regex with \b for 'word <br>
boundary' in front of the name. That way it will match either at the <br>
beginning of the Cookie value, or following an ampersand.<br>
<br>
And ?: tells pcre not to bother capturing the last expression in <br>
parentheses (they're just for grouping).<br>
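<br>
To try both points outside Varnish, here is a minimal sketch in Python. It assumes Python's re module, which is a backtracking engine but not pcre2 itself; \b and (?:...) behave the same way for this particular pattern. The helper name query_value and the sample strings are made up for illustration and do not come from the thread.<br>
<pre>
```python
import re

# Pattern suggested above: \b anchors "q=" at a word boundary, and
# (?:...) groups the terminator without creating a capture group.
QUERY_PATTERN = re.compile(r"\b(q=)(.*?)(?:\&|$)")


def query_value(s):
    """Return the value of the q parameter, or None if nothing matches."""
    m = QUERY_PATTERN.search(s)
    return m.group(2) if m else None


# Hypothetical sample inputs.
print(query_value("q=varnish&lang=en"))  # name at the start of the string
print(query_value("lang=en&q=varnish"))  # name following an ampersand
print(query_value("faq=1"))              # "q=" inside a longer name
print(QUERY_PATTERN.groups)              # number of capturing groups
```
</pre>
The "faq=1" input does not match, because there is no word boundary between "a" and "q". The pattern reports two capturing groups, since the (?:...) group is not counted.<br>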
<br>
Avoid .* in pcre regexen if you possibly can. You can, almost always.<br>
<br>
With .* at the beginning, the pcre matcher searches all the way to the <br>
end of the string, and then backtracks all the way back, looking for the <br>
first letter to match. In this case 'q', and it will stop and search and <br>
backtrack at any other 'q' that it may find while working backwards.<br>
<br>
pcre2 fortunately has an optimization that ignores a trailing .* if it <br>
has found a match up until there, so that it doesn't busily match the <br>
dot against every character left in the string. So this time .* does no <br>
harm, but it's superfluous, and violates the golden rule of pcre: avoid <br>
.* if at all possible.<br>
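<br>
As a rough check that the .* at either end adds nothing to the capture, the two patterns from this thread can be compared side by side. This is only a sketch using Python's re module as a stand-in for pcre2, so it says nothing about pcre2's optimizations or timing; the helper name whole_and_value and the sample string are assumptions for illustration.<br>
<pre>
```python
import re

# Pattern originally posted in the thread, with a leading and a trailing .*
WITH_DOTSTAR = re.compile(r".*(q=)(.*?)(\&|$).*")
# Rewritten pattern from the thread, anchored with \b instead
WITHOUT_DOTSTAR = re.compile(r"\b(q=)(.*?)(?:\&|$)")


def whole_and_value(pattern, s):
    """Return (entire matched text, captured value), or None without a match."""
    m = pattern.search(s)
    return (m.group(0), m.group(2)) if m else None


sample = "lang=en&q=varnish&page=2"  # hypothetical input

print(whole_and_value(WITH_DOTSTAR, sample))
print(whole_and_value(WITHOUT_DOTSTAR, sample))
```
</pre>
Both calls capture the same value in group 2. The only difference is that the .* version stretches the overall match across the entire string, while the \b version matches just "q=varnish&".<br>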
<br>
Incidentally, this is an area where re2 does have an advantage over <br>
pcre2. The efficiency of pcre2 matching depends crucially on how you <br>
write the regex, because details like \b instead of .* give it hints for <br>
pruning the search. While re2 matching usually isn't as fast as pcre2 <br>
matching against well-written patterns, re2 doesn't depend so much on <br>
that sort of thing.<br>
<br>
<br>
OK I can chill now,<br>
Geoff<br>
-- <br>
** * * UPLEX - Nils Goroll Systemoptimierung<br>
<br>
Scheffelstraße 32<br>
22301 Hamburg<br>
<br>
Tel +49 40 2880 5731<br>
Mob +49 176 636 90917<br>
Fax +49 40 42949753<br>
<br>
<a href="http://uplex.de" rel="noreferrer" target="_blank">http://uplex.de</a><br>
<br>
_______________________________________________<br>
varnish-misc mailing list<br>
<a href="mailto:varnish-misc@varnish-cache.org" target="_blank">varnish-misc@varnish-cache.org</a><br>
<a href="https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc" rel="noreferrer" target="_blank">https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc</a><br>
</blockquote></div>