ESI and search engine spiders

Rob S rtshilston at gmail.com
Wed Aug 11 17:20:32 CEST 2010


Michael Loftis wrote:
>
>
> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S 
> <rtshilston at gmail.com> wrote:
>
>> Hi,
>>
>> On one site we run behind varnish, we've got a "most popular" widget
>> displayed on every page (much like http://www.bbc.co.uk/news/).  However,
>> we have difficulties where this pollutes search engines, as searches for
>> a specific popular headline tend not to link directly to the article
>> itself, but to one of the index pages with high Google pagerank or
>> similar.
>>
>> What I'd like to know is how other Varnish users might have served
>> different ESI content based on whether it's a bot or not.
>>
>> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
>> generates the most-popular fragment, then do something like (though
>> untested):
>>
>
> ESI goes through all the normal steps, so an <esi:include 
> src="/esi/blargh"> is fired off starting with vcl_recv, looking 
> exactly as if the browser had hit the cache with that as the req.url. 
> The entire req object is the same. I am *not* certain that headers 
> you've added get propagated, as I've not tested that (and all of my 
> rules are built on the assumption that they are not, just to be sure).
>
> So there's no need to do it in vcl_deliver; in fact, you're far better 
> off handling it in vcl_recv and/or vcl_hash (actually you really SHOULD 
> handle it in vcl_hash and change the hash for these search engine 
> specific objects, else you'll serve them to regular users)...
>
>
> for example -- assume vcl_recv sets X-BotDetector in the req header... 
> (not tested)::
>
>
> sub vcl_hash {
>  // always take into account the url and host
>  set req.hash += req.url;
>  if (req.http.host) {
>    set req.hash += req.http.host;
>  } else {
>    set req.hash += server.ip;
>  }
>
>  if(req.http.X-BotDetector == "1") {
>    set req.hash += "bot detector";
>  }
> }
>
>
> You still have to do the detection inside of varnish; I don't see any 
> way around that.  Only varnish knows who it's talking to, and varnish 
> needs to decide which object to send.  When it works properly, the 
> webserver sends back a 'template' for the page, containing the 
> page-specific content and pointers to a bunch of ESI fragments.  The 
> ESI fragments are also cache objects/requests, so the cache takes this 
> template and fills in the ESI fragments (from cache if it can, fetching 
> them if it needs to), treating each one just as if the web browser had 
> requested the ESI URL directly.
>
>
> This is actually exactly how I handle menus that change based on a 
> user's authentication status.  The browser gets a cookie.  The ESI URL 
> is formed as either 'authenticated', 'personalized' or 'global'. 
> 'Authenticated' means it varies only on the client's login state; 
> 'personalized' takes into account the actual session we're working 
> with; and 'global' means everyone gets the same cached object 
> regardless (we strip cookies going into these ESI URLs and coming from 
> these ESI URLs in the vcl_recv/vcl_fetch code; the vcl_fetch code 
> looks for some special headers which indicate that vcl_recv has 
> decided it needs to ditch Set-Cookie headers.  This is mostly a safety 
> measure to prevent a session sticking to a client it shouldn't due to 
> any bugs in the code).
>
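
(Inline note: I haven't seen the actual rules, so this is only a guess at
the shape of the cookie handling described above.  The /esi/global/ prefix
and the X-Strip-Cookies header are names made up for illustration, and
beresp assumes Varnish 2.1; on 2.0 the fetched object is called obj.)

sub vcl_recv {
        // Never trust a marker header supplied by the client.
        remove req.http.X-Strip-Cookies;

        // Hypothetical URL layout: shared fragments live under /esi/global/.
        if (req.url ~ "^/esi/global/") {
                remove req.http.Cookie;
                set req.http.X-Strip-Cookies = "1";
        }
}

sub vcl_fetch {
        // vcl_recv flagged this request, so drop any Set-Cookie the
        // backend sent before the object is cached.
        if (req.http.X-Strip-Cookies == "1") {
                remove beresp.http.Set-Cookie;
        }
}
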
> The basic idea is borrowed from 
> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and 
> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>
> HTH!

Thanks.  We've proved this works with a simple setup:

sub vcl_recv {
        ....
        // Establish if the visitor is a search engine:
        set req.http.X-IsABot = "0";
        if (req.http.user-agent ~ "Yahoo! Slurp") {
                set req.http.X-IsABot = "1";
        }
        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
                set req.http.X-IsABot = "1";
        }
        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
                set req.http.X-IsABot = "1";
        }
        ....

}
...
sub vcl_hash {
        set req.hash += req.url;
        if (req.http.host) {
                set req.hash += req.http.host;
        } else {
                set req.hash += server.ip;
        }

        if (req.http.X-IsABot == "1") {
                set req.hash += "for-bot";
        } else {
                set req.hash += "for-non-bot";
        }
        hash;
}
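
A possible refinement, which I have not tested: the three user-agent checks
could be folded into one regex, and the bot flag could be added to the hash
only for the fragment URLs, so that ordinary pages are not cached twice.
Since each ESI fragment is looked up as its own request, the page template
itself should not need to vary.  The /esi/ prefix below is an assumption
about the URL layout, not something from our real config:

sub vcl_recv {
        // Overwrite any client-supplied value, then flag known crawlers.
        set req.http.X-IsABot = "0";
        if (req.http.user-agent ~ "(Yahoo! Slurp|Googlebot|msnbot)") {
                set req.http.X-IsABot = "1";
        }
}

sub vcl_hash {
        set req.hash += req.url;
        if (req.http.host) {
                set req.hash += req.http.host;
        } else {
                set req.hash += server.ip;
        }

        // Assumption: every bot-sensitive fragment lives under /esi/.
        if (req.url ~ "^/esi/" && req.http.X-IsABot == "1") {
                set req.hash += "for-bot";
        }
        hash;
}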

The main HTML has a simple ESI include, which loads a page fragment whose PHP reads:

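// Varnish sets X-IsABot to "0" or "1" in vcl_recv; treat a missing
// header (e.g. a request that bypassed Varnish) as a non-bot.
if (isset($_SERVER["HTTP_X_ISABOT"]) && $_SERVER["HTTP_X_ISABOT"] === "1") {
        echo "<!-- The list of popular posts is not displayed to search engines -->";
} else {
        // calculate most popular
        echo "The most popular article is XYZ";
}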
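
For anyone reproducing this, the main page also needs ESI processing
switched on in vcl_fetch.  A minimal sketch for Varnish 2.x, assuming the
fragments live under a hypothetical /esi/ path and that anything else may
contain include tags (in practice you would probably narrow this to HTML
responses):

sub vcl_fetch {
        // The page template contains a tag such as
        //   <esi:include src="/esi/most-popular"/>
        // so parse everything except the fragments themselves.
        if (req.url !~ "^/esi/") {
                esi;
        }
}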



Thanks again.



