[Varnish] #666: pmatch: regular expression subexpression references
Varnish
varnish-bugs at projects.linpro.no
Sat Mar 13 16:34:03 CET 2010
#666: pmatch: regular expression subexpression references
-------------------------+--------------------------------------------------
Reporter: slink | Owner: phk
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: trunk
Severity: normal | Keywords: pmatch, match, subexpression, back reference
-------------------------+--------------------------------------------------
== Intro ==
The patch in this ticket adds a "pmatch" feature to VCL for
referencing the sub-expressions of the regular expression used
with the match (~) operator.
Please refer to [http://perldoc.perl.org/perlre.html Perl Regular
Expression Syntax documentation] for information on the regular
expressions accepted by the match operator in varnish trunk.
== Usage ==
Usage of the pmatch feature is simple. The variable
pmatch.''<n>''
contains the ''nth'' subexpression of the last successful match operator
preceding the statement in the current VCL procedure. pmatch.0
contains the full match.
pmatch scope is local to the VCL procedure (sub <name>) it is used in,
so pmatch will never contain sub-expressions from matches in other VCL
procedures.
Note that it is not possible to use pmatch to retrieve matches from a
match before the last successful match. It will only ever contain
matches from the previous successful match within the same VCL
procedure.
Unsuccessful matches do not alter pmatch, and neither do regsub or
regsuball expressions.
Referencing an invalid subexpression returns the empty string ("").
To summarize, pmatch should behave like the ($1, $2, ...) variables in
Perl.
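The rules above can be sketched in plain C on top of the POSIX regex functions, whose regexec() argument is itself called pmatch. This is only an illustration of the described semantics, not the code from the patch: struct pm_state, pm_init, pm_match, pm_get and the PM_MAX cap are hypothetical names, and POSIX extended syntax stands in for the Perl-compatible syntax used by trunk.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define PM_MAX 10   /* hypothetical cap on tracked groups, whole match included */

/* Hypothetical per-procedure state; not the patch's struct vrt_pmatch. */
struct pm_state {
    const char *subject;    /* subject of the last successful match, or NULL */
    regmatch_t  m[PM_MAX];  /* offsets of groups 0..PM_MAX-1 */
};

/* Reset at procedure entry, so matches never leak across procedures. */
static void pm_init(struct pm_state *st)
{
    assert(st != NULL);
    st->subject = NULL;
}

/*
 * Match operator (~): returns 1 on a match, 0 otherwise.
 * The state is overwritten only on success. The subject string must
 * outlive the state, because only offsets into it are stored.
 */
static int pm_match(struct pm_state *st, const char *subject, const char *re)
{
    regex_t rx;
    regmatch_t tmp[PM_MAX];
    int hit;

    assert(st != NULL && subject != NULL && re != NULL);
    if (regcomp(&rx, re, REG_EXTENDED) != 0)
        return 0;
    hit = (regexec(&rx, subject, PM_MAX, tmp, 0) == 0);
    regfree(&rx);
    if (hit) {
        st->subject = subject;
        memcpy(st->m, tmp, sizeof tmp);
    }
    return hit;
}

/* pmatch.<n>: copy group n into buf; empty string if n is invalid or unset. */
static const char *pm_get(const struct pm_state *st, unsigned n,
                          char *buf, size_t len)
{
    size_t sz;

    assert(st != NULL && buf != NULL && len > 0);
    buf[0] = '\0';
    if (st->subject == NULL || n >= PM_MAX || st->m[n].rm_so < 0)
        return buf;
    sz = (size_t)(st->m[n].rm_eo - st->m[n].rm_so);
    if (sz >= len)
        sz = len - 1;       /* truncate rather than overflow */
    memcpy(buf, st->subject + st->m[n].rm_so, sz);
    buf[sz] = '\0';
    return buf;
}
```

A failed pm_match() leaves the previous offsets in place, which is the property that makes pmatch behave like Perl's $1, $2, ... after an unsuccessful match.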
== Examples ==
The following two simple examples, derived from the bundled test
(tests/v00027.vtc), should illustrate the use:
{{{
# back reference to whole matched string
if (resp.http.Foobar ~ "ar") {
    set resp.http.Snafu1 = pmatch.0;
}
}}}
After a successful match, pmatch.0 references the whole matched
string, so in this example, the Snafu1 header will be set to "ar".
{{{
if (resp.http.Foobar ~ "(b)(a)(r)(f)") {
    # back references to other sub-res
    set resp.http.Snafu3 = "_" pmatch.0 "_"
        pmatch.5
        pmatch.4
        pmatch.3
        pmatch.2
        "p_";
}
}}}
This example illustrates several aspects:
 * pmatch.5 refers to an undefined sub-expression, so it will return
 the empty string (there is no fifth sub-expression in the regular
 expression).
 * pmatch.0 contains the full match, "barf".
 * pmatch.4, pmatch.3 and pmatch.2 contain "f", "r" and "a", which,
 followed by the literal "p_", yield "frap_".
So, upon a successful match, the Snafu3 header will be set to
"_barf_frap_".
The examples were taken from the regsub test.
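For readers who want to check the expected Snafu3 value outside of Varnish, the concatenation can be reproduced with the POSIX regex functions. This is a rough sketch only: group() and snafu3() are hypothetical helpers, the Foobar value passed in is whatever the caller supplies (the ticket does not state the header value used in the test), and POSIX extended syntax is used in place of PCRE.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Copy group n of a finished regexec() call into out, or "" if unset. */
static void group(const char *subj, const regmatch_t *m, size_t nm,
                  size_t n, char *out, size_t len)
{
    assert(subj != NULL && m != NULL && out != NULL && len > 0);
    out[0] = '\0';
    if (n < nm && m[n].rm_so >= 0)
        snprintf(out, len, "%.*s",
                 (int)(m[n].rm_eo - m[n].rm_so), subj + m[n].rm_so);
}

/* Build the Snafu3 value; returns 0 on success, -1 if there is no match. */
static int snafu3(const char *foobar, char *out, size_t len)
{
    regex_t rx;
    regmatch_t m[5];    /* whole match plus four groups */
    char g0[16], g2[16], g3[16], g4[16], g5[16];
    int rc;

    if (regcomp(&rx, "(b)(a)(r)(f)", REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&rx, foobar, 5, m, 0);
    regfree(&rx);
    if (rc != 0)
        return -1;

    group(foobar, m, 5, 0, g0, sizeof g0);
    group(foobar, m, 5, 5, g5, sizeof g5);  /* no fifth group: stays "" */
    group(foobar, m, 5, 4, g4, sizeof g4);
    group(foobar, m, 5, 3, g3, sizeof g3);
    group(foobar, m, 5, 2, g2, sizeof g2);

    /* Same order as the VCL: "_" 0 "_" 5 4 3 2 "p_" */
    snprintf(out, len, "_%s_%s%s%s%sp_", g0, g5, g4, g3, g2);
    return 0;
}
```

Calling snafu3() with any subject containing "barf" produces the same header value as the VCL example, since only the matched groups enter the result.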
== Implementation notes ==
 * The name "pmatch" stems from the fact that I originally
 implemented this feature for Varnish 2.0, which uses the C library RE
 functions. I'm open to suggestions for better names.
 * To hold references, a struct vrt_pmatch called local_pmatch is declared
 and initialized for each VCL procedure using VRT_init_proc.
 (include/vrt.h, lib/libvcl/vcc_parse.c)
 I chose this approach after using a truly VCL-global scope, with state
 saved in the sp, for some time and realizing that it raises too many
 additional problems (unclear state of pmatch in procedures, question
 of memory management).
 * The struct vrt_pmatch does not provide space to store the actual
 matches. That space is allocated from sp->http->ws (as I don't know of a
 better alternative). (bin/varnishd/cache_vrt_re.c:VRT_re_match())
* I've put some effort into trying to minimize the space consumed by only
keeping in the workspace those references which could possibly be reached.
* Also, at VCL compile time, an upper bound on the maximum pmatch
reference for each match is determined to minimize ws space requirements.
(lib/libvcl/vcc_string.c:_vcc_regexp())
One aspect I am not completely happy with is that I need a list per
struct proc in the VCC parser, so I moved the declarations of structures
private to vcc_xref.c to vcc_compile.h.
If this is considered unclean, I shall implement a more general service
for other components to store information per procedure.
* While space for the matches is allocated in VRT_re_match (if at all),
the substrings themselves are only copied in VRT_r_pmatch, so
subexpressions which are not referenced should never consume space on the
ws for the strings, and only use minimal space for references if it has
been determined that they could ever be accessed.
(Note that subexpressions should be avoided if they are not needed, for
instance by using the (?:) pcre syntax for grouping)
 * On the other hand, currently each VRT_re_match will result in a
 separate copy of the subexpression. I've added code to implement a caching
 mechanism, but disabled it, because I think that without additional usage
 statistics, we cannot determine when caching the subexpression strings
 will pay off (see bin/varnishd/cache_vrt.c).
* I've added pcre_study to VRE_compile to get best performance at run
time and added a minor memory management optimization to VRE_exec.
Please note that the patch depends on the patch in ticket #665 (and the
patch in #663 should also be applied).
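The space-saving ideas from the implementation notes (an upper bound on the referenced index, and copying substrings only on access) can be illustrated with a small stand-alone sketch. It is not the patch's code: struct ref, struct ws, max_ref, slots_needed and ref_copy are hypothetical names, the workspace is a simple character buffer rather than sp->http->ws, and returning "" when the workspace is exhausted is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical offsets of one group in the subject; so < 0 means unset. */
struct ref { int so, eo; };

/* Hypothetical char-only bump allocator standing in for the workspace. */
struct ws { char *buf; size_t used, size; };

/* Compile-time step: highest pmatch index referenced, or -1 if none. */
static int max_ref(const int *refs, size_t nrefs)
{
    int max = -1;

    for (size_t i = 0; i < nrefs; i++)
        if (refs[i] > max)
            max = refs[i];
    return max;
}

/* Number of offset slots a match must keep; 0 if pmatch is never read. */
static size_t slots_needed(int max)
{
    return max < 0 ? 0 : (size_t)max + 1;
}

/*
 * Access step: the substring is copied into the workspace only when it
 * is read. Every call makes a fresh copy (no caching).
 */
static const char *ref_copy(struct ws *ws, const char *subj,
                            const struct ref *r, size_t nslots, size_t n)
{
    size_t len;
    char *p;

    assert(ws != NULL && subj != NULL);
    if (r == NULL || n >= nslots || r[n].so < 0)
        return "";
    len = (size_t)(r[n].eo - r[n].so);
    if (len + 1 > ws->size - ws->used)
        return "";          /* assumed policy when the workspace is full */
    p = ws->buf + ws->used;
    memcpy(p, subj + r[n].so, len);
    p[len] = '\0';
    ws->used += len + 1;
    return p;
}
```

With this split, a match whose groups are never referenced reserves no slots and copies no strings, while a referenced group costs one offset pair at match time plus one string copy per access.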
--
Ticket URL: <http://www.varnish-cache.org/ticket/666>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator