ESI and search engine spiders

Rob S rtshilston at gmail.com
Wed Aug 11 17:48:41 CEST 2010


Chris, Stew,

This was always going to be controversial.  It's a very tough balancing 
act.  In our view, we're not serving additional content to seed the 
search engines, and so it's reasonable.  We are removing content that 
users find useful, but which might make it harder for the search engine 
to make a good judgement about the site overall.  Yahoo explains why 
this is desirable:
> Webpages often include headers, footers, navigational sections, 
> repeated boilerplate text, copyright notices, ad sections, or dynamic 
> content that is useful to users, but not to search engines. Webmasters 
> can apply the "robots-nocontent" attribute to indicate to search 
> engines any content that is extraneous to the main unique content of 
> the page.
A few blogs have picked up on examples of prominent sites implementing 
cloaking to different extents, such as 
http://www.seroundtable.com/archives/021504.html and 
http://www.seoegghead.com/blog/seo/the-google-cloaking-hypocrisy-p32.html.  
Then there's also the First Click Free approach 
(http://www.google.com/support/webmasters/bin/answer.py?answer=74536), 
which many people might feel is a bit borderline.

I agree many people could use the code below to try to boost their 
search engine results by including lots of keywords or links, but I'm 
confident that there are many legitimate reasons to do this.  I'd love 
not to do this in Varnish / HTTP, but there don't appear to be other 
widely supported solutions.  In this case (and addressing Chris' point) 
it's not possible to use robots.txt as we're not trying to block the 
entire page, just a subset of it.   There are ways of hinting to a 
Google Search Appliance to turn off indexing of a portion of the page 
(googleon/off tags, see 
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/), 
but these aren't supported by the normal googlebot, nor other search 
engines.  Yahoo has a robots-nocontent class that can be added to HTML 
elements, but again, it's a single solution for just one search engine 
(http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-14.html).  
I've heard of (but can't find a link to) a discussion about adding an 
attribute for this purpose to the new HTML5 standard, but apparently it 
wasn't adopted.
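
For reference, those two hints end up in the page markup looking roughly 
like this (an illustrative fragment only -- the comment syntax is the 
Google Search Appliance's, and the class name is the one from Yahoo's 
Slurp documentation):

<!--googleoff: index-->
<div class="robots-nocontent">
  ... the most-popular widget, or whatever shouldn't influence indexing ...
</div>
<!--googleon: index-->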

Someone reading this might know a magic answer, but in the meantime, 
we'll be making minor page alterations to help ensure users find 
relevant results when searching Google, even if that involves a little 
cloaking to suppress a small portion of the page.


Rob


Chris Hecker wrote:
>
> On that note, why not use robots.txt and a clear path name to turn off 
> bots for the lists?
>
> Chris
>
> On 2010/08/11 08:25, Stewart Robinson wrote:
>> Hi,
>>
>> Whilst this looks excellent, and I may use it to serve different
>> content to other types of users, I think you should read, if you
>> haven't already, this URL, which discourages this sort of behaviour.
>>
>> http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355 
>>
>>
>> Great VCL though!
>> Stew
>>
>>
>> On 11 August 2010 16:20, Rob S <rtshilston at gmail.com> wrote:
>>>
>>> Michael Loftis wrote:
>>>>
>>>>
>>>> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob 
>>>> S <rtshilston at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On one site we run behind varnish, we've got a "most popular" widget
>>>>> displayed on every page (much like http://www.bbc.co.uk/news/).  
>>>>> However,
>>>>> we have difficulties where this pollutes search engines, as 
>>>>> searches for
>>>>> a specific popular headline tend not to link directly to the article
>>>>> itself, but to one of the index pages with high Google pagerank or
>>>>> similar.
>>>>>
>>>>> What I'd like to know is how other Varnish users might have served
>>>>> different ESI content based on whether it's a bot or not.
>>>>>
>>>>> My initial idea was to set an "X-Not-For-Bots: 1" header on the 
>>>>> URL that
>>>>> generates the most-popular fragment, then do something like (though
>>>>> untested):
>>>>>
>>>>
>>>> ESI goes through all the normal steps, so a <esi:include
>>>> src="/esi/blargh"> is fired off starting with vcl_recv, looking
>>>> exactly like the browser had hit the cache with that as the req.url
>>>> -- the entire req object is the same -- I am *not* certain that
>>>> headers you've added get propagated, as I've not tested that (and all
>>>> of my rules are built on the assumption that they are not, just to be
>>>> sure).
>>>>
>>>> So there's no need to do it in vcl_deliver; in fact, you're far better
>>>> off handling it in vcl_recv and/or vcl_hash (actually you really SHOULD
>>>> handle it in vcl_hash and change the hash for these search-engine-specific
>>>> objects, or else you'll serve them to regular users)...
>>>>
>>>>
>>>> For example -- assume vcl_recv sets X-BotDetector in the req header...
>>>> (not tested):
>>>>
>>>>
>>>> sub vcl_hash {
>>>>   // always take into account the url and host
>>>>   set req.hash += req.url;
>>>>   if (req.http.host) {
>>>>    set req.hash += req.http.host;
>>>>   } else {
>>>>    set req.hash += server.ip;
>>>>   }
>>>>
>>>>   if (req.http.X-BotDetector == "1") {
>>>>    set req.hash += "bot detector";
>>>>   }
>>>>   // return the computed hash rather than falling through to the default
>>>>   hash;
>>>> }
>>>>
>>>>
>>>> You still have to do the detection inside of Varnish; I don't see any
>>>> way around that.  The reason is that only Varnish knows who it's
>>>> talking to, and Varnish needs to decide which object to spit out.
>>>> Working properly, what happens is essentially that the webserver sends
>>>> back a 'template' for the page containing the page-specific stuff, plus
>>>> pointers to a bunch of ESI fragments.  The ESI fragments are also cache
>>>> objects/requests... So the cache takes this template and fills in the
>>>> ESI fragments (from cache if it can, fetching them if it needs to,
>>>> treating them just as if the web browser had requested the ESI URL).
>>>>
>>>>
>>>> This is actually exactly how I handle menus that change based on a
>>>> user's authentication status.  The browser gets a cookie.  The ESI URL
>>>> is formed as either 'authenticated', 'personalized' or 'global' --
>>>> authenticated means it varies only on the client's login state,
>>>> personalized takes into account the actual session we're working with,
>>>> and global means everyone gets the same cached object regardless.  (We
>>>> strip cookies going into these ESI URLs, and coming out of them, in the
>>>> vcl_recv/vcl_fetch code; vcl_fetch looks for some special headers that
>>>> indicate vcl_recv has decided it needs to ditch Set-Cookies -- this is
>>>> mostly a safety measure to prevent a session sticking to a client it
>>>> shouldn't due to any bugs in code.)
>>>>
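>>>> A rough, untested sketch of that cookie handling (simplified to match
>>>> on the URL directly rather than on a flag header, and assuming the
>>>> fragment URLs all live under an /esi/ prefix -- that prefix is just an
>>>> example):
>>>>
>>>> sub vcl_recv {
>>>>   if (req.url ~ "^/esi/") {
>>>>    // never send the client's cookie along with a fragment request
>>>>    remove req.http.Cookie;
>>>>   }
>>>> }
>>>>
>>>> sub vcl_fetch {
>>>>   if (req.url ~ "^/esi/") {
>>>>    // and never let a fragment hand a Set-Cookie back out
>>>>    // (beresp is called obj on Varnish 2.0.x)
>>>>    remove beresp.http.Set-Cookie;
>>>>   }
>>>> }
>>>>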
>>>> The basic idea is borrowed from
>>>> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers>  and
>>>> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>>>>
>>>> HTH!
>>>
>>> Thanks.  We've proved this works with a simple setup:
>>>
>>> sub vcl_recv {
>>>        ....
>>>        // Establish if the visitor is a search engine:
>>>        set req.http.X-IsABot = "0";
>>>        if (req.http.user-agent ~ "Yahoo! Slurp") { set req.http.X-IsABot = "1"; }
>>>        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") { set req.http.X-IsABot = "1"; }
>>>        if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") { set req.http.X-IsABot = "1"; }
>>>        ....
>>>
>>> }
>>> ...
>>> sub vcl_hash {
>>>        set req.hash += req.url;
>>>        if (req.http.host) {
>>>                set req.hash += req.http.host;
>>>        } else {
>>>                set req.hash += server.ip;
>>>        }
>>>
>>>        if (req.http.X-IsABot == "1") {
>>>                set req.hash += "for-bot";
>>>        } else {
>>>                set req.hash += "for-non-bot";
>>>        }
>>>        hash;
>>> }
>>>
>>> The main HTML has a simple ESI, which loads a page fragment whose
>>> PHP reads:
>>>
>>> if ($_SERVER["HTTP_X_ISABOT"]) {
>>>        echo "<!-- The list of popular posts is not displayed to search engines -->";
>>> } else {
>>>        // calculate most popular
>>>        echo "The most popular article is XYZ";
>>> }
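>>>
>>> (The main page pulls that fragment in with an ordinary include tag,
>>> something like <esi:include src="/esi/most-popular"/> -- the fragment
>>> URL here is just illustrative.)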
>>>
>>>
>>>
>>> Thanks again.
>>>




