ESI and search engine spiders

Chris Hecker checker at d6.com
Wed Aug 11 17:32:58 CEST 2010


On that note, why not use robots.txt and a clear path name to turn off 
bots for the lists?
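
For example, a minimal robots.txt sketch (assuming the pages you want
hidden live under one dedicated path; the path itself is hypothetical):

User-agent: *
Disallow: /popular/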

Chris

On 2010/08/11 08:25, Stewart Robinson wrote:
> Hi,
>
> Whilst this looks excellent, and I may use it to serve different
> content to other types of users, I think you should read the URL
> below (if you haven't already), which discourages this sort of behaviour.
>
> http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355
>
> Great VCL though!
> Stew
>
>
> On 11 August 2010 16:20, Rob S <rtshilston at gmail.com> wrote:
>>
>> Michael Loftis wrote:
>>>
>>>
>>> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S <rtshilston at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On one site we run behind varnish, we've got a "most popular" widget
>>>> displayed on every page (much like http://www.bbc.co.uk/news/).  However,
>>>> this pollutes search engine results: searches for a specific popular
>>>> headline tend not to link directly to the article itself, but to one of
>>>> the index pages with high Google PageRank or similar.
>>>>
>>>> What I'd like to know is how other Varnish users might have served
>>>> different ESI content based on whether it's a bot or not.
>>>>
>>>> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
>>>> generates the most-popular fragment, then do something like (though
>>>> untested):
>>>>
>>>
>>> ESI goes through all the normal steps, so an <esi:include
>>> src="/esi/blargh"> is fired off starting with vcl_recv, looking
>>> exactly as if the browser had hit the cache with that as the req.url --
>>> the entire req object is the same.  I am *not* certain that headers
>>> you've added get propagated, as I've not tested that (and all of my
>>> rules are built on the assumption that they don't, just to be sure).
>>>
>>> So there's no need to do it in vcl_deliver; in fact, you're far better
>>> off handling it in vcl_recv and/or vcl_hash (you really SHOULD handle
>>> it in vcl_hash and change the hash for these search-engine-specific
>>> objects, else you'll serve them to regular users)...
>>>
>>>
>>> For example -- assume vcl_recv sets X-BotDetector in the req header
>>> (not tested):
>>>
>>>
>>> sub vcl_hash {
>>>   // always take into account the url and host
>>>   set req.hash += req.url;
>>>   if (req.http.host) {
>>>     set req.hash += req.http.host;
>>>   } else {
>>>     set req.hash += server.ip;
>>>   }
>>>
>>>   // bot traffic gets its own cache objects for the same URL
>>>   if (req.http.X-BotDetector == "1") {
>>>     set req.hash += "bot detector";
>>>   }
>>>
>>>   // terminate here so the built-in vcl_hash isn't appended as well
>>>   hash;
>>> }
>>>
>>>
>>> You still have to do the detection inside of varnish; I don't see any
>>> way around that.  The reason is that only varnish knows who it's
>>> talking to, and varnish needs to decide which object to spit out.
>>> Working properly, what happens is essentially that the webserver sends
>>> back a 'template' for the page containing the page-specific stuff and
>>> pointers to a bunch of ESI fragments.  The ESI fragments are also
>>> cache objects/requests... so the cache takes this template and fills in
>>> the ESI fragments (from cache if it can, fetching them if it needs to,
>>> treating them just as if the web browser had requested the ESI URL).
>>>
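>>> To make that concrete, a trimmed sketch of such a 'template' as the
>>> backend might emit (markup only; the fragment URL is made up):
>>>
>>> <html>
>>>   <body>
>>>     <h1>Article headline rendered by the backend</h1>
>>>     <esi:include src="/esi/most-popular"/>
>>>   </body>
>>> </html>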
>>>
>>> This is actually exactly how I handle menus that change based on a
>>> user's authentication status.  The browser gets a cookie.  The ESI URL
>>> is formed as either 'authenticated', 'personalized', or 'global' --
>>> 'authenticated' means the fragment varies only on the client's login
>>> state, 'personalized' takes into account the actual session we're
>>> working with, and 'global' means everyone gets the same cached object
>>> regardless.  (We strip cookies going into these ESI URLs, and strip
>>> Set-Cookie headers coming back from them, in the vcl_recv/vcl_fetch
>>> code; vcl_fetch looks for a special header, set when vcl_recv decides
>>> the Set-Cookie must be ditched -- this is mostly a safety measure to
>>> prevent a session sticking to a client it shouldn't due to any bugs
>>> in code.)
>>>
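>>> In VCL terms, a minimal sketch of that stripping (untested; the path
>>> and marker header are made up, and the beresp syntax assumes Varnish 2.1):
>>>
>>> sub vcl_recv {
>>>   // 'global' fragments must never vary per user: drop the cookie
>>>   // and tell vcl_fetch to refuse any Set-Cookie coming back
>>>   if (req.url ~ "^/esi/global/") {
>>>     unset req.http.Cookie;
>>>     set req.http.X-Ditch-Set-Cookie = "1";
>>>   }
>>> }
>>>
>>> sub vcl_fetch {
>>>   // safety measure: a shared fragment must not hand out a session
>>>   if (req.http.X-Ditch-Set-Cookie == "1") {
>>>     unset beresp.http.Set-Cookie;
>>>   }
>>> }
>>>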
>>> The basic idea is borrowed from
>>> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers>  and
>>> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>>>
>>> HTH!
>>
>> Thanks.  We've proved this works with a simple setup:
>>
>> sub vcl_recv {
>>        ....
>>        // Establish if the visitor is a search engine:
>>        set req.http.X-IsABot = "0";
>>        if (req.http.user-agent ~ "Yahoo! Slurp") {
>>                set req.http.X-IsABot = "1";
>>        }
>>        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
>>                set req.http.X-IsABot = "1";
>>        }
>>        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
>>                set req.http.X-IsABot = "1";
>>        }
>>        ....
>>
>> }
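>>
>> For what it's worth, the three user-agent checks could be collapsed into
>> a single alternation (untested sketch, same behaviour):
>>
>>        if (req.http.user-agent ~ "(Yahoo! Slurp|Googlebot|msnbot)") {
>>                set req.http.X-IsABot = "1";
>>        }
>>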
>> ...
>> sub vcl_hash {
>>        set req.hash += req.url;
>>        if (req.http.host) {
>>                set req.hash += req.http.host;
>>        } else {
>>                set req.hash += server.ip;
>>        }
>>
>>        // bots and non-bots get separate cache objects for the same URL
>>        if (req.http.X-IsABot == "1") {
>>                set req.hash += "for-bot";
>>        } else {
>>                set req.hash += "for-non-bot";
>>        }
>>        hash;
>> }
>>
>> The main HTML has a simple ESI include, which loads a page fragment whose PHP reads:
>>
>> if (isset($_SERVER["HTTP_X_ISABOT"]) && $_SERVER["HTTP_X_ISABOT"] === "1") {
>>        // crawlers get a placeholder instead of the widget
>>        echo "<!-- The list of popular posts is not displayed to search engines -->";
>> } else {
>>        // calculate most popular
>>        echo "The most popular article is XYZ";
>> }
>>
>>
>>
>> Thanks again.
>>
>>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
>



