ESI and search engine spiders

Stewart Robinson stewsnooze at gmail.com
Wed Aug 11 17:25:52 CEST 2010


Hi,

Whilst this looks excellent, and I may use it to serve different
content to other types of users, I think you should read this URL, if
you haven't already, as it discourages this sort of behaviour:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355

Great VCL though!
Stew


On 11 August 2010 16:20, Rob S <rtshilston at gmail.com> wrote:
>
> Michael Loftis wrote:
>>
>> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S <rtshilston at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On one site we run behind Varnish, we've got a "most popular" widget
>>> displayed on every page (much like http://www.bbc.co.uk/news/).  However,
>>> we have difficulties where this pollutes search engine results, as
>>> searches for a specific popular headline tend not to link directly to
>>> the article itself, but to one of the index pages with high Google
>>> PageRank or similar.
>>>
>>> What I'd like to know is how other Varnish users might have served
>>> different ESI content based on whether the visitor is a bot or not.
>>>
>>> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
>>> generates the most-popular fragment, then do something like (though
>>> untested):
>>>
>>
>> ESI goes through all the normal steps, so an <esi:include
>> src="/esi/blargh"> is fired off starting with vcl_recv, looking exactly
>> as if the browser had hit the cache with that as the req.url -- the
>> entire req object is the same.  I am *not* certain that headers you've
>> added get propagated, as I've not tested that (and all of my rules are
>> built on the assumption that this is not the case, just to be sure).
>>
>> So there's no need to do it in vcl_deliver; in fact, you're far better
>> off handling it in vcl_recv and/or vcl_hash (actually you really SHOULD
>> handle it in vcl_hash and change the hash for these search-engine-specific
>> objects, or else you'll serve them to regular users)...
>>
>>
>> For example -- assume vcl_recv sets X-BotDetector in the req header
>> (not tested):
>>
>>
>> sub vcl_hash {
>>   // Always take the URL and the Host header (or server IP) into account
>>   set req.hash += req.url;
>>   if (req.http.host) {
>>     set req.hash += req.http.host;
>>   } else {
>>     set req.hash += server.ip;
>>   }
>>
>>   // Give bot traffic its own cache objects
>>   if (req.http.X-BotDetector == "1") {
>>     set req.hash += "bot detector";
>>   }
>>   return (hash);
>> }
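>>
>> For completeness, a rough, untested sketch of the vcl_recv side that
>> would set X-BotDetector (the User-Agent substrings are examples only,
>> not a complete bot list):
>>
>> sub vcl_recv {
>>   // Classify the client: flag it as a bot if the User-Agent matches
>>   set req.http.X-BotDetector = "0";
>>   if (req.http.user-agent ~ "(Googlebot|msnbot|Yahoo! Slurp)") {
>>     set req.http.X-BotDetector = "1";
>>   }
>> }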
>>
>>
>> You still have to do the detection inside Varnish; I don't see any way
>> around that.  The reason is that only Varnish knows who it's talking to,
>> and Varnish needs to decide which object to spit out.  When it's working
>> properly, what happens is essentially that the web server sends back a
>> 'template' for the page, containing the page-specific content and
>> pointers to a bunch of ESI fragments.  The ESI fragments are also cache
>> objects/requests.  So the cache takes this template and fills in the ESI
>> fragments (from cache if it can, fetching them if it needs to, treating
>> them just as if the web browser had requested the ESI URL itself).
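>>
>> To make that concrete, a minimal, untested sketch in Varnish 2.x syntax:
>> the backend page contains something like the <esi:include src="/esi/blargh">
>> tag from above, and vcl_fetch has to switch on ESI processing for that
>> page, otherwise the tag is passed through to the client untouched (the
>> URL patterns below are just examples):
>>
>> sub vcl_fetch {
>>   // Only parse pages we know carry ESI tags; parsing everything is wasteful
>>   if (req.url ~ "^/$" || req.url ~ "^/news/") {
>>     esi;
>>   }
>> }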
>>
>>
>> This is actually exactly how I handle menus that change based on a user's
>> authentication status.  The browser gets a cookie.  The ESI URL is formed
>> as either 'authenticated', 'personalized', or 'global' -- authenticated
>> means it varies only on the client's login state, personalized takes into
>> account the actual session we're working with, and global means everyone
>> gets the same cache regardless (we strip cookies going into these ESI
>> URLs and coming back from them in the vcl_recv/vcl_fetch code; the
>> vcl_fetch code looks for special headers indicating that vcl_recv has
>> decided it needs to ditch Set-Cookies -- this is mostly a safety measure
>> to prevent a session sticking to a client it shouldn't due to bugs in
>> the code).
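>>
>> A stripped-down, untested sketch of that safety measure, assuming
>> Varnish 2.1's beresp in vcl_fetch (the "/esi/global/" prefix and the
>> marker header name are made up for the example):
>>
>> sub vcl_recv {
>>   // Global fragments: everyone shares one object, so drop the cookie
>>   if (req.url ~ "^/esi/global/") {
>>     unset req.http.Cookie;
>>     set req.http.X-Drop-Set-Cookie = "1";
>>   }
>> }
>>
>> sub vcl_fetch {
>>   // Never let a Set-Cookie from a shared fragment stick a session
>>   // to the wrong client
>>   if (req.http.X-Drop-Set-Cookie == "1") {
>>     unset beresp.http.Set-Cookie;
>>   }
>> }
>>
>> Stripping on the way in and on the way out means a bug in either
>> direction can't leak a session into the shared object.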
>>
>> The basic idea is borrowed from
>> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and
>> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>>
>> HTH!
>
> Thanks.  We've proved this works with a simple setup:
>
> sub vcl_recv {
>       ....
>       // Establish if the visitor is a search engine:
>       set req.http.X-IsABot = "0";
>       if (req.http.user-agent ~ "Yahoo! Slurp") { set req.http.X-IsABot = "1"; }
>       if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") { set req.http.X-IsABot = "1"; }
>       if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") { set req.http.X-IsABot = "1"; }
>       ....
> }
> ...
> sub vcl_hash {
>       set req.hash += req.url;
>       if (req.http.host) {
>               set req.hash += req.http.host;
>       } else {
>               set req.hash += server.ip;
>       }
>
>       if (req.http.X-IsABot == "1") {
>               set req.hash += "for-bot";
>       } else {
>               set req.hash += "for-non-bot";
>       }
>       hash;
> }
>
> The main HTML has a simple ESI include, which loads a page fragment whose PHP reads:
>
> if ($_SERVER["HTTP_X_ISABOT"]) {
>       // Varnish sets X-IsABot to "0" or "1"; "0" is falsy in PHP
>       echo "<!-- The list of popular posts is not displayed to search engines -->";
> } else {
>       // calculate most popular
>       echo "The most popular article is XYZ";
> }
>
>
>
> Thanks again.
>