Parsing non-English URLs
Felix
felix at seconddrawer.com.au
Tue May 25 12:53:14 CEST 2010
Just to add to the discussion, I have Varnish running in front of a
couple of Thai language sites.
The URL '/กลางประเทศไทย' corresponds to the following entry in
varnishlog in Varnish 2.1.1:
13 RxURL c /%E0%B8%81%E0%B8%A5%E0%B8%B2%E0%B8%87%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
which is the UTF-8 encoding of the URL with every non-ASCII byte escaped
as a %nn sequence. This is actually the result of the browser (in this
case Chrome) doing the conversion; a netcat session confirms it.
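As a rough illustration (not part of the original test), the same escaping can be reproduced with `urllib.parse.quote` from Python's standard library. This is only a sketch of what the browser appears to do, assuming the path is encoded as UTF-8 before escaping:

```python
from urllib.parse import quote

# The Thai path quoted earlier in this message.
path = "/กลางประเทศไทย"

# quote() encodes the text as UTF-8 and percent-escapes every byte that is
# not an unreserved ASCII character; "/" is kept as-is (default safe="/").
encoded = quote(path)

# Each Thai character occupies three UTF-8 bytes, so each one becomes
# three %nn sequences.
print(encoded)
```

The printed value should be the same string that varnishlog reports in the RxURL field.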
I am not sure if all browsers do the same conversion. Some more details
might be gleaned from here:
http://code.google.com/p/browsersec/wiki/Part1#Unicode_in_URLs
But obviously Varnish needs to be able to cope with these conversions.
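One way to cope, offered as an untested assumption rather than a verified Varnish 2.1 recipe, is to compare against the percent-encoded form, since that is what shows up in RxURL. The sketch below derives that form for the Arabic path from the original question and decodes a logged value back to readable text. The helper names `to_wire_form` and `to_readable` are hypothetical, and how the result has to be written inside a VCL string literal depends on the VCL version's escaping rules, which this sketch does not cover.

```python
from urllib.parse import quote, unquote


def to_wire_form(path: str) -> str:
    """Percent-encode a Unicode path as UTF-8, leaving "/" untouched."""
    return quote(path, safe="/")


def to_readable(rx_url: str) -> str:
    """Decode a logged, percent-encoded URL back to text (assumes UTF-8)."""
    return unquote(rx_url)


# The Arabic path from the original question.
arabic_path = "/تقارير"

# The ASCII-only string a rule would have to match against.
wire = to_wire_form(arabic_path)
print(wire)

# Decoding the wire form gives back the original path.
print(to_readable(wire) == arabic_path)
```

Working from the encoded form keeps the rule pure ASCII, which also sidesteps any question of how the editor or terminal stores right-to-left text.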
-felix
On Tue, May 25, 2010 at 01:08:26PM +0300, Angie T. Muhammad wrote:
> Thank you Sam for your response. I already logged requests to cached Arabic
> URLs and here is the result of one request:
> ===========================================================================================
> Cookie: SESScfc90a62c81b7bfc6f292320b1d0b8ca=t7t650vu5qu02916unbtil9o66;
> SESS50745c6a3729e7f46278f7d281511580=qjc658f7cthp6dvj65rt6a8c64;
> SESS8348e9a0e0f6133hash*%ntrol: max-age=0%c9c2n9td5uuvj0hp73;
> SESSb323fb39997d18c5bde4c32f7bc0ffe1=0r5ve4k3i2ubmqu
> ▒±␊: 0 ┼: ┐␊␊⎻-▒┌␋┴␊ 806
> ===========================================================================================
>
> I tried opening the log file with less, vim, and tail, but all I am
> getting is either binary (less) or output like the above (tail).
> I even tried limiting the Accept-Charset header sent by the browser to
> UTF-8, but that failed. Here is my config for limiting the charset under sub
> vcl_recv { } :
> ======================================
> if (req.http.Accept-Charset) {
>     remove req.http.Accept-Charset;
>     set req.http.Accept-Charset = "utf-8";
> }
> ======================================
>
> I also tried including C header files as follows:
> ===================================
> C{
> #include <string.h>
> #include <locale.h>
> #include <wctype.h>
> #include <wchar.h>
> #include <curses.h>
> }C
> ===================================
> but it did not give me any result.
>
> I am thinking of recompiling with ncurses wchar enabled. Any ideas?
>
>
> 2010/5/24 Sam Crawford <samcrawford at gmail.com>
>
> > It's not one that I'm familiar with, but if it were me, I'd try
> > running varnishlog whilst putting a request for one of these URLs
> > through. See how Varnish prints it out in the RxURL field. This might
> > give you some clues as to how to specify it in the rules.
> >
> > Thanks,
> >
> > Sam
> >
> >
> > 2010/5/23 Angie T. Muhammad <angie.tawfik at gmail.com>:
> > > Hello Varnish team
> > >
> > > I have Varnish v. 2.1.2 on production and test servers. We are running a
> > > bilingual news website.
> > > On my test server I am trying to parse non-English URLs as follows:
> > >
> > > .......................
> > > else if (req.url == "/تقارير") {
> > >     set beresp.http.X-Cacheable = "Yes";
> > >     set beresp.ttl = 60m;
> > >     return(deliver);
> > > }
> > > .......................
> > >
> > > The word in bold red is in Arabic, which is a right-to-left language. The
> > > link cannot be made in English and has no English equivalent. In case you
> > > are wondering, the word means "reports". My sole problem now is that Varnish
> > > applies all the other if-statements with fully English URLs but not this one
> > > with Arabic. Even if I try a regex, say req.url ~ "^/تقارير" instead of the ==
> > > sign, it starts with no errors but does not apply the rule.
> > >
> > > I tried the following:
> > > 1- Reversing the letters of the Arabic word, so تقارير would be ريراقت, but
> > > it did not work
> > > 2- Copying the link directly into /etc/varnish/default.vcl, which produces
> > > something like: %D9%88%D8%B3%D9%88%D9%85%D8%A7%D8%AA
> > > Such HTML address handling prevents Varnish from starting
> > >
> > > Any ideas? Your help is really appreciated.
> > >
> > >
> > > --
> > > All the best,
> > > Angie
> > >
> > > _______________________________________________
> > > varnish-misc mailing list
> > > varnish-misc at varnish-cache.org
> > > http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
> > >
> >
>
>
>
> --
> All the best,
> Angie
--
email: felix at seconddrawer.com.au
web: http://seconddrawer.com.au/
gpg: E6FC 5BC6 268D B874 E546 8F6F A2BB 220B D5F6 92E3
Please don't send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html