ESI and search engine spiders

Chris Hecker checker at d6.com
Wed Aug 11 18:25:05 CEST 2010


Ah, right, I've never used ESI, so I forgot it's already glommed 
together when the client gets it.

Chris


On 2010/08/11 08:48, Rob S wrote:
> Chris, Stew,
>
> This was always going to be controversial. It's a very tough balancing
> act. In our view, we're not serving additional content to seed the
> search engines, and so it's reasonable. We are removing content that
> users find useful, but which might make it harder for the search engine
> to make a good judgement about the site overall. Yahoo explains why this
> is desirable:
>> Webpages often include headers, footers, navigational sections,
>> repeated boilerplate text, copyright notices, ad sections, or dynamic
>> content that is useful to users, but not to search engines. Webmasters
>> can apply the "robots-nocontent" attribute to indicate to search
>> engines any content that is extraneous to the main unique content of
>> the page.
> A few blogs have picked up on examples of prominent sites implementing
> cloaking to different extents, such as
> http://www.seroundtable.com/archives/021504.html and
> http://www.seoegghead.com/blog/seo/the-google-cloaking-hypocrisy-p32.html.
> Then there's also the First Click Free approach
> (http://www.google.com/support/webmasters/bin/answer.py?answer=74536),
> which many people might feel is a bit borderline.
>
> I agree many people could use the code below to try to boost their
> search engine results by including lots of keywords or links, but I'm
> confident that there are many legitimate reasons to do this. I'd love
> not to do this in Varnish / HTTP, but there don't appear to be other
> widely supported solutions. In this case (and addressing Chris' point)
> it's not possible to use robots.txt as we're not trying to block the
> entire page, just a subset of it. There are ways of hinting to a Google
> Search Appliance to turn off indexing of a portion of the page
> (googleon/off tags, see
> http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/),
> but these aren't supported by the normal Googlebot or by other search
> engines. Yahoo has a robots-nocontent class that can be added to HTML
> elements, but again, it's a solution for just one search engine
> (http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-14.html). I've
> also heard of (but can't find a link to) a discussion about adding a
> specific attribute for this to any HTML element in the new HTML5
> standard, but apparently it wasn't adopted.
>
> Someone reading this might know a magic answer, but in the meantime,
> we'll be making minor page alterations to help ensure users find
> relevant results when searching Google, even if that involves a little
> cloaking to suppress a small portion of the page.
>
>
> Rob
>
>
> Chris Hecker wrote:
>>
>> On that note, why not use robots.txt and a clear path name to turn off
>> bots for the lists?
>>
>> Chris
>>
>> On 2010/08/11 08:25, Stewart Robinson wrote:
>>> Hi,
>>>
>>> Whilst this looks excellent, and I may use it to serve different
>>> content to other types of users, I think you should read, if you
>>> haven't already, this URL, which discourages this sort of behaviour.
>>>
>>> http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355
>>>
>>>
>>> Great VCL though!
>>> Stew
>>>
>>>
>>>> On 11 August 2010 16:20, Rob S <rtshilston at gmail.com> wrote:
>>>>
>>>> Michael Loftis wrote:
>>>>>
>>>>>
>>>>> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S
>>>>> <rtshilston at gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On one site we run behind varnish, we've got a "most popular" widget
>>>>>> displayed on every page (much like http://www.bbc.co.uk/news/).
>>>>>> However, this pollutes search engine results: searches for a specific
>>>>>> popular headline tend not to link directly to the article itself, but
>>>>>> to one of the index pages with high Google PageRank or similar.
>>>>>>
>>>>>> What I'd like to know is how other Varnish users might have served
>>>>>> different ESI content based on whether it's a bot or not.
>>>>>>
>>>>>> My initial idea was to set an "X-Not-For-Bots: 1" header on the
>>>>>> URL that
>>>>>> generates the most-popular fragment, then do something like (though
>>>>>> untested):
>>>>>>
>>>>>
>>>>> ESI goes through all the normal steps, so an <esi:include
>>>>> src="/esi/blargh"> is fired off starting with vcl_recv, looking exactly
>>>>> as if the browser had hit the cache with that as the req.url -- the
>>>>> entire req object is the same. I am *not* certain that headers you've
>>>>> added get propagated, as I've not tested that (and all of my rules are
>>>>> built on the assumption that they aren't, just to be sure).
>>>>>
>>>>> So there's no need to do it in vcl_deliver; in fact, you're far better
>>>>> off handling it in vcl_recv and/or vcl_hash (actually you really SHOULD
>>>>> handle it in vcl_hash and change the hash for these search-engine-specific
>>>>> objects, or else you'll serve them to regular users)...
>>>>>
>>>>>
>>>>> for example -- assume vcl_recv sets X-BotDetector in the req header...
>>>>> (not tested):
>>>>>
>>>>>
>>>>> sub vcl_hash {
>>>>>     // always take into account the url and host
>>>>>     set req.hash += req.url;
>>>>>     if (req.http.host) {
>>>>>         set req.hash += req.http.host;
>>>>>     } else {
>>>>>         set req.hash += server.ip;
>>>>>     }
>>>>>
>>>>>     if (req.http.X-BotDetector == "1") {
>>>>>         set req.hash += "bot detector";
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> You still have to do the detection inside of varnish; I don't see any
>>>>> way around that. The reason is that only varnish knows who it's talking
>>>>> to, and varnish needs to decide which object to spit out. When it's
>>>>> working properly, the web server essentially sends back a 'template'
>>>>> for the page containing the page-specific content and pointers to a
>>>>> bunch of ESI fragments. The ESI fragments are also cache
>>>>> objects/requests... so the cache takes this template and fills in the
>>>>> ESI fragments (from cache if it can, fetching them if it needs to,
>>>>> treating each one just as if the web browser had requested the ESI URL
>>>>> itself).
>>>>>
>>>>>
>>>>> This is actually exactly how I handle menus that change based on a
>>>>> user's authentication status. The browser gets a cookie. The ESI URL is
>>>>> formed as either 'authenticated', 'personalized' or 'global' --
>>>>> authenticated means it varies only on the client's login state,
>>>>> personalized takes into account the actual session we're working with,
>>>>> and global means everyone gets the same cached object regardless. We
>>>>> strip cookies going into these ESI URLs, and coming back from them, in
>>>>> the vcl_recv/vcl_fetch code; vcl_fetch looks for special headers that
>>>>> indicate vcl_recv has decided it needs to ditch Set-Cookies -- this is
>>>>> mostly a safety measure to prevent a session sticking to a client it
>>>>> shouldn't due to any bugs in code.
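>>>>>
>>>>> As a rough, untested sketch of that cookie handling (the "/esi/global/"
>>>>> prefix and the X-Strip-Cookies header name are only illustrative, not
>>>>> our actual config; beresp is the Varnish 2.1 name, it's obj in 2.0):
>>>>>
>>>>> sub vcl_recv {
>>>>>     if (req.url ~ "^/esi/global/") {
>>>>>         // shared fragment: never let a client cookie fragment the cache
>>>>>         remove req.http.Cookie;
>>>>>         // flag for vcl_fetch that any Set-Cookie should be dropped too
>>>>>         set req.http.X-Strip-Cookies = "1";
>>>>>     }
>>>>> }
>>>>>
>>>>> sub vcl_fetch {
>>>>>     if (req.http.X-Strip-Cookies == "1") {
>>>>>         // safety net: don't let a stray session stick to the cached
>>>>>         // fragment and get served to other clients
>>>>>         remove beresp.http.Set-Cookie;
>>>>>     }
>>>>> }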
>>>>>
>>>>> The basic idea is borrowed from
>>>>> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and
>>>>> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>>>>>
>>>>> HTH!
>>>>
>>>> Thanks. We've proved this works with a simple setup:
>>>>
>>>> sub vcl_recv {
>>>>     ....
>>>>     // Establish if the visitor is a search engine:
>>>>     set req.http.X-IsABot = "0";
>>>>     if (req.http.user-agent ~ "Yahoo! Slurp") {
>>>>         set req.http.X-IsABot = "1";
>>>>     }
>>>>     if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
>>>>         set req.http.X-IsABot = "1";
>>>>     }
>>>>     if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
>>>>         set req.http.X-IsABot = "1";
>>>>     }
>>>>     ....
>>>> }
>>>> ...
>>>> sub vcl_hash {
>>>>     set req.hash += req.url;
>>>>     if (req.http.host) {
>>>>         set req.hash += req.http.host;
>>>>     } else {
>>>>         set req.hash += server.ip;
>>>>     }
>>>>
>>>>     if (req.http.X-IsABot == "1") {
>>>>         set req.hash += "for-bot";
>>>>     } else {
>>>>         set req.hash += "for-non-bot";
>>>>     }
>>>>     hash;
>>>> }
>>>>
>>>> The main HTML has a simple ESI include, which loads a page fragment
>>>> whose PHP reads:
>>>>
>>>> if ($_SERVER["HTTP_X_ISABOT"]) {
>>>>     echo "<!-- The list of popular posts is not displayed to search engines -->";
>>>> } else {
>>>>     // calculate most popular
>>>>     echo "The most popular article is XYZ";
>>>> }
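>>>>
>>>> (For completeness, the main page has to be flagged for ESI processing
>>>> in vcl_fetch; the following is only a rough Varnish 2.x-style sketch
>>>> rather than our exact config, and the URL match is just an example:)
>>>>
>>>> sub vcl_fetch {
>>>>     if (req.url !~ "^/esi/") {
>>>>         // Varnish 2.x: enable ESI parsing of this object so the
>>>>         // <esi:include> in the page body gets processed
>>>>         esi;
>>>>     }
>>>> }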
>>>>
>>>>
>>>>
>>>> Thanks again.
>>>>
>
>




More information about the varnish-misc mailing list