[Varnish] #666: pmatch: regular expression subexpression references

Varnish varnish-bugs at projects.linpro.no
Sat Mar 13 16:34:03 CET 2010


#666: pmatch: regular expression subexpression references
-------------------------+--------------------------------------------------
 Reporter:  slink        |       Owner:  phk                                         
     Type:  enhancement  |      Status:  new                                         
 Priority:  normal       |   Milestone:                                              
Component:  varnishd     |     Version:  trunk                                       
 Severity:  normal       |    Keywords:  pmatch, match, subexpression, back reference
-------------------------+--------------------------------------------------
 == Intro ==


 The patch in this ticket adds the "pmatch" feature to the VCL to
 reference sub-expressions specified in the regular expression used
 with the match (~) operator.

 Please refer to [http://perldoc.perl.org/perlre.html Perl Regular
 Expression Syntax documentation] for information on the regular
 expressions accepted by the match operator in varnish trunk.

 == Usage ==

 Usage of the pmatch feature is simple. The variable

     pmatch.''<n>''

 contains the ''nth'' subexpression of the last successful match operator
 preceding the statement in the current VCL procedure. pmatch.0
 contains the full match.

 pmatch scope is local to the VCL procedure (sub <name>) it is used in,
 so pmatch will never contain sub-expressions from matches in other VCL
 procedures.

 Note that it is not possible to use pmatch to retrieve matches from a
 match before the last successful match. It will only ever contain
 matches from the previous successful match within the same VCL
 procedure.

 Unsuccessful matches do not alter pmatch, and neither do regsub or
 regsuball expressions.

 Referencing an invalid subexpression returns the empty string ("").

 To summarize, pmatch should behave like the ($1, $2, ...) variables in
 perl.
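
 The rules above can be emulated outside Varnish with the POSIX regex
 functions. What follows is only a sketch of the described semantics, not
 code from the patch: struct last_match, match_op() and pmatch_get() are
 hypothetical names, MAX_GROUPS is an arbitrary limit, and POSIX extended
 syntax stands in for the PCRE syntax used by trunk.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical per-procedure state: subject and offsets of the last
 * successful match. */
struct last_match {
    const char *subject;
    regmatch_t  groups[MAX_GROUPS];
    int         valid;
};

/* Emulates `subject ~ pattern`. Returns 1 on a match, 0 otherwise.
 * The saved state is replaced only when the match succeeds. */
int
match_op(struct last_match *lm, const char *subject, const char *pattern)
{
    regex_t re;
    regmatch_t tmp[MAX_GROUPS];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (0);
    rc = regexec(&re, subject, MAX_GROUPS, tmp, 0);
    regfree(&re);
    if (rc != 0)
        return (0);             /* failed match: lm stays untouched */
    lm->subject = subject;
    memcpy(lm->groups, tmp, sizeof tmp);
    lm->valid = 1;
    return (1);
}

/* Emulates pmatch.<n>: copies subexpression n into buf (len must be > 0).
 * An invalid or unset subexpression yields the empty string. */
const char *
pmatch_get(const struct last_match *lm, int n, char *buf, size_t len)
{
    size_t sz;

    buf[0] = '\0';
    if (!lm->valid || n < 0 || n >= MAX_GROUPS || lm->groups[n].rm_so == -1)
        return (buf);
    sz = (size_t)(lm->groups[n].rm_eo - lm->groups[n].rm_so);
    if (sz >= len)
        sz = len - 1;           /* truncate rather than overflow buf */
    memcpy(buf, lm->subject + lm->groups[n].rm_so, sz);
    buf[sz] = '\0';
    return (buf);
}
```

 In this sketch, calling match_op() with a pattern that does not match
 leaves the previously saved offsets in place, which mirrors the rule that
 unsuccessful matches do not alter pmatch.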

 == Examples ==

 The following two simple examples, derived from the bundled test
 (tests/v00027.vtc), should illustrate the use:

 {{{
 # back reference to whole matched string
 if (resp.http.Foobar ~ "ar") {
         set resp.http.Snafu1 = pmatch.0;
 }
 }}}

 After a successful match, pmatch.0 references the whole matched
 string, so in this example, the Snafu1 header will be set to "ar".

 {{{
 if (resp.http.Foobar ~ "(b)(a)(r)(f)") {
         # back references to other sub-res
         set resp.http.Snafu3 = "_" pmatch.0 "_"
             pmatch.5
             pmatch.4
             pmatch.3
             pmatch.2
             "p_";
 }
 }}}

 This example illustrates several aspects:

  * pmatch.5 refers to an undefined sub-expression, so it will return
    the empty string (there is no fifth sub-expression in the regular
    expression).

  * pmatch.0 contains the full match, "barf".

  * pmatch.4, pmatch.3 and pmatch.2 contain "f", "r" and "a".

 So, upon a successful match, the Snafu3 header will be set to
 "_barf_frap_".

 The examples were taken from the regsub test.

 == Implementation notes ==

  * The name "pmatch" stems from the fact that I originally implemented
 this feature for Varnish 2.0, which uses the C library RE functions. I'm
 open to suggestions for better names.

  * To hold references, a struct vrt_pmatch called local_pmatch is declared
 and initialized for each VCL procedure using VRT_init_proc.
 (include/vrt.h, lib/libvcl/vcc_parse.c)

   I chose this approach after spending some time with a truly VCL-global
 scope, with state saved in the sp, and realizing that it raises too many
 additional problems (unclear state of pmatch in procedures, questions of
 memory management).

  * The struct vrt_pmatch does not provide space to store the actual
 matches. That space is allocated from sp->http->ws (as I don't know of a
 better alternative). (bin/varnishd/cache_vrt_re.c:VRT_re_match())

  * I've put some effort into trying to minimize the space consumed by only
 keeping in the workspace those references which could possibly be reached.

  * Also, at VCL compile time, an upper bound on the maximum pmatch
 reference for each match is determined to minimize ws space requirements.
 (lib/libvcl/vcc_string.c:_vcc_regexp())

   One aspect I am not completely happy with is that I need a list per
 struct proc in the VCC parser, so I moved the declarations of structures
 private to vcc_xref.c to vcc_compile.h.

   If this is considered unclean, I shall implement a more general service
 for other components to store information per procedure.

  * While space for the matches is allocated in VRT_re_match (if at all),
 the substrings themselves are only copied in VRT_r_pmatch, so
 subexpressions which are not referenced should never consume space on the
 ws for the strings, and only use minimal space for references if it has
 been determined that they could ever be accessed.

   (Note that subexpressions should be avoided if they are not needed, for
 instance by using the (?:) pcre syntax for grouping)

  * On the other hand, each VRT_re_match currently results in a separate
 copy of the subexpression. I've added code to implement a caching
 mechanism, but disabled it, because I think that without additional usage
 statistics, we cannot determine when caching the subexpression strings
 will pay off (see bin/varnishd/cache_vrt.c).

  * I've added pcre_study to VRE_compile to get best performance at run
 time and added a minor memory management optimization to VRE_exec.
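
 The compile-time upper bound mentioned in the notes above can be
 illustrated with a much simplified sketch. It is not the code in
 _vcc_regexp(): the patch works inside the VCC parser, whereas this sketch
 scans plain text. max_pmatch_ref() and slots_needed() are hypothetical
 helpers that find the highest pmatch.<n> referenced in a piece of VCL
 text and derive how many offset slots a match would need to reserve.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: highest N among "pmatch.N" references in src,
 * or -1 if pmatch is not referenced at all. */
int
max_pmatch_ref(const char *src)
{
    static const char tok[] = "pmatch.";
    const char *p = src;
    int max = -1;

    while ((p = strstr(p, tok)) != NULL) {
        p += sizeof tok - 1;    /* skip past "pmatch." */
        if (isdigit((unsigned char)*p)) {
            int n = (int)strtol(p, NULL, 10);
            if (n > max)
                max = n;
        }
    }
    return (max);
}

/* Offset slots to reserve for a match: subexpressions 0..max, or none
 * when no pmatch reference can be reached. */
size_t
slots_needed(const char *src)
{
    int max = max_pmatch_ref(src);

    return (max < 0 ? 0 : (size_t)max + 1);
}
```

 With such a bound, a match whose results are never referenced reserves no
 slots at all, and a match followed only by pmatch.2 reserves three.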

 Please note that the patch depends on the patch in ticket #665 (and the
 patch in #663 should also be applied).

-- 
Ticket URL: <http://www.varnish-cache.org/ticket/666>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator



