Parsing non-English URLs

Felix felix at seconddrawer.com.au
Tue May 25 12:53:14 CEST 2010


Just to add to the discussion, I have Varnish running in front of a
couple of Thai language sites.

The URL '/กลางประเทศไทย' corresponds to the following entry in
varnishlog in Varnish 2.1.1:

13 RxURL  c  /%E0%B8%81%E0%B8%A5%E0%B8%B2%E0%B8%87%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2

which is just the UTF-8 bytes of the path, each escaped as a %nn
sequence. This is actually the result of the browser (in this case
Chrome) doing the conversion. This is confirmed by a netcat session.
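
As a rough illustration of that conversion, here is a minimal Python
sketch. Python and its urllib.parse module are my choice for the
example and are not part of the Varnish setup. It percent-encodes the
same Thai path so the result can be compared with the RxURL line above.

```python
from urllib.parse import quote

# Path taken from the message above.
path = "/กลางประเทศไทย"

# quote() encodes the text as UTF-8 and escapes every byte that is not
# an unreserved ASCII character; "/" is kept because it is in the
# default safe set.
escaped = quote(path)

print(escaped)
```

Each Thai character becomes three %nn groups because each one takes
three bytes in UTF-8.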

I am not sure if all browsers do the same conversion. Some more details
might be gleaned from here:
http://code.google.com/p/browsersec/wiki/Part1#Unicode_in_URLs

But obviously Varnish needs to be able to cope with these conversions.
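
To see what a rule would have to compare against, the sketch below
goes in both directions, again in Python as an outside helper rather
than anything Varnish provides. It assumes the client percent-encodes
UTF-8 the way Chrome did above. Whether a VCL string literal accepts
the escaped form verbatim depends on the Varnish version, which I have
not checked here.

```python
from urllib.parse import quote, unquote

# Arabic path from the quoted VCL rule below ("reports").
arabic_path = "/تقارير"

# Escaped form, as a client that percent-encodes UTF-8 would send it.
wire_form = quote(arabic_path)

# Escaped fragment quoted further down in this thread, decoded back
# into readable text.
logged = "%D9%88%D8%B3%D9%88%D9%85%D8%A7%D8%AA"
readable = unquote(logged)

print(wire_form)
print(readable)
```

The escaped string is plain ASCII, so it can be compared with what
varnishlog shows in RxURL without any locale or charset handling.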

-felix

On Tue, May 25, 2010 at 01:08:26PM +0300, Angie T. Muhammad wrote:
> Thank you Sam for your response. I already logged requests to cached Arabic
> URLs and here is the result of one request:
> ===========================================================================================
> Cookie: SESScfc90a62c81b7bfc6f292320b1d0b8ca=t7t650vu5qu02916unbtil9o66;
> SESS50745c6a3729e7f46278f7d281511580=qjc658f7cthp6dvj65rt6a8c64;
> SESS8348e9a0e0f6133hash*%ntrol: max-age=0%c9c2n9td5uuvj0hp73;
> SESSb323fb39997d18c5bde4c32f7bc0ffe1=0r5ve4k3i2ubmqu
> ▒±␊: 0  ┼: ┐␊␊⎻-▒┌␋┴␊     806  
> ===========================================================================================
> 
> I tried opening the log file with less, vim, and tail, but all I am
> getting is either binary (less) or stuff like the above (tail).
> I even tried limiting the accepted charset header sent by the browser to
> UTF-8 but failed. Here is my config for limiting the charset under sub
> vcl_rcv { } :
> ======================================
>   if (req.http.Accept-Charset) {
>     remove req.http.Accept-Charset;
>     set req.http.Accept-Charset = "utf-8";
>   }
> ======================================
> 
> I also tried including C header files as follows:
> ===================================
> C{
> #include <string.h>
> #include <locale.h>
> #include <wctype.h>
> #include <wchar.h>
> #include <curses.h>
> }C
> ===================================
> but it did not give me any result.
> 
> I am thinking of recompiling with ncurses wchar enabled. Any ideas?
> 
> 
> 2010/5/24 Sam Crawford <samcrawford at gmail.com>
> 
> > It's not one that I'm familiar with, but if it were me, I'd try
> > running varnishlog whilst putting a request for one of these URLs
> > through. See how Varnish prints it out in the RxURL field. This might
> > give you some clues as to how to specify it in the rules.
> >
> > Thanks,
> >
> > Sam
> >
> >
> > 2010/5/23 Angie T. Muhammad <angie.tawfik at gmail.com>:
> > > Hello Varnish team
> > >
> > > I have Varnish v. 2.1.2 on production and test servers. We are running a
> > > bilingual news website.
> > > On my test server I am trying to parse non-English URLs as follows:
> > >
> > > .......................
> > >   else if (req.url == "/تقارير") {
> > >       set beresp.http.X-Cacheable = "Yes";
> > >       set beresp.ttl = 60m;
> > >       return(deliver);
> > >   }
> > >  .......................
> > >
> > > The word in bold red is in Arabic, which is a right-to-left language. The
> > > link cannot be made in English and has no English equivalent. In case you
> > > are wondering, the word means "reports". My sole problem now is that Varnish
> > > applies all the other if-statements with full English URLs but not this one
> > > with Arabic. Even if I try a regex, say req.url ~ "^/تقارير" instead of the ==
> > > sign, it starts with no errors but does not apply the rule.
> > >
> > > I tried the following:
> > > 1- Reversing the letters of the Arabic word, so تقارير would be ريراقت,
> > > but it did not work
> > > 2- Copying the link directly into /etc/varnish/default.vcl, it produces
> > > something like: %D9%88%D8%B3%D9%88%D9%85%D8%A7%D8%AA
> > >      Such html address handling prevents varnish from starting
> > >
> > > Any ideas? Your help is really appreciated.
> > >
> > >
> > > --
> > > All the best,
> > > Angie
> > >
> > > _______________________________________________
> > > varnish-misc mailing list
> > > varnish-misc at varnish-cache.org
> > > http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
> > >
> >
> 
> 
> 
> -- 
> All the best,
> Angie



-- 
  email: felix at seconddrawer.com.au
    web: http://seconddrawer.com.au/
    gpg: E6FC 5BC6 268D B874 E546 8F6F A2BB 220B D5F6 92E3

Please don't send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html



