ESI and search engine spiders
Chris Hecker
checker at d6.com
Wed Aug 11 17:32:58 CEST 2010
On that note, why not use robots.txt and a clear path name to turn off
bots for the lists?
Chris
On 2010/08/11 08:25, Stewart Robinson wrote:
> Hi,
>
> Whilst this looks excellent and I may use it to serve different
> content to other types of users, I think you should read, if you
> haven't already, this URL, which discourages this sort of behaviour.
>
> http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355
>
> Great VCL though!
> Stew
>
>
> On 11 August 2010 16:20, Rob S <rtshilston at gmail.com> wrote:
>>
>> Michael Loftis wrote:
>>>
>>>
>>> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S <rtshilston at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On one site we run behind varnish, we've got a "most popular" widget
>>>> displayed on every page (much like http://www.bbc.co.uk/news/). However,
>>>> we have difficulties where this pollutes search engines, as searches for
>>>> a specific popular headline tend not to link directly to the article
>>>> itself, but to one of the index pages with high Google pagerank or
>>>> similar.
>>>>
>>>> What I'd like to know is how other Varnish users might have served
>>>> different ESI content based on whether it's a bot or not.
>>>>
>>>> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
>>>> generates the most-popular fragment, then do something like (though
>>>> untested):
>>>>
>>>
>>> ESI goes through all the normal steps, so an <esi:include
>>> src="/esi/blargh"> is fired off starting with vcl_recv, looking exactly
>>> as if the browser had hit the cache with that as the req.url. The
>>> entire req object is the same. I am *not* certain that headers you've
>>> added get propagated, as I've not tested that (and all of my rules are built
>>> on the assumption that they are not, just to be sure).
>>>
>>> So there's no need to do it in vcl_deliver, in fact, you're far better
>>> handling it in vcl_recv and/or vcl_hash (actually you really SHOULD handle
>>> it in vcl_hash and change the hash for these search engine specific objects
>>> else you'll serve them to regular users)...
>>>
>>>
>>> for example -- assume vcl_recv sets X-BotDetector in the req header...
>>> (not tested)::
>>>
>>>
>>> sub vcl_hash {
>>>     // always take into account the url and host
>>>     set req.hash += req.url;
>>>     if (req.http.host) {
>>>         set req.hash += req.http.host;
>>>     } else {
>>>         set req.hash += server.ip;
>>>     }
>>>
>>>     // keep objects built for search engines apart from everyone else's
>>>     if (req.http.X-BotDetector == "1") {
>>>         set req.hash += "bot detector";
>>>     }
>>> }
>>>
>>>
>>> You still have to do the detection inside of varnish; I don't see any way
>>> around that. The reason is that only varnish knows who it's talking to, and
>>> varnish needs to decide which object to spit out. When it's working
>>> properly, the webserver essentially sends back a 'template' for the page
>>> containing the page-specific stuff and pointers to a bunch of ESI
>>> fragments. The ESI fragments are also cache objects/requests. So the
>>> cache takes this template and fills in the ESI fragments (from cache
>>> if it can, fetching them if it needs to, treating them just as if the web
>>> browser had requested the ESI URL).
>>>
>>>
>>> This is actually exactly how I handle menus that change based on a user's
>>> authentication status. The browser gets a cookie. The ESI URL is formed as
>>> either 'authenticated', 'personalized' or 'global'. Authenticated means it
>>> varies only on the client's login state; personalized takes into account the
>>> actual session we're working with; and global means everyone gets the same
>>> cached object regardless. (We strip cookies going into these ESI URLs and
>>> coming from these ESI URLs in the vcl_recv/vcl_fetch code. The vcl_fetch
>>> code looks for some special headers which indicate that vcl_recv has decided
>>> it needs to ditch Set-Cookie headers. This is mostly a safety measure to
>>> prevent a session sticking to a client it shouldn't due to any bugs in code.)
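>>>
>>> A minimal, untested sketch of that kind of cookie stripping, assuming
>>> Varnish 2.1 (where vcl_fetch sees the backend response as beresp) and
>>> assuming it is applied to the 'global' fragments. The "/esi/global/" path
>>> and the X-Strip-Set-Cookie header are hypothetical names used only for
>>> illustration, not the ones from the setup described above:
>>>
>>> sub vcl_recv {
>>>     // never trust a marker header supplied by the client
>>>     unset req.http.X-Strip-Set-Cookie;
>>>
>>>     // hypothetical path for fragments shared by everyone
>>>     if (req.url ~ "^/esi/global/") {
>>>         // drop cookies on the way in so the backend cannot personalize
>>>         unset req.http.Cookie;
>>>         // tell vcl_fetch to drop cookies on the way out as well
>>>         set req.http.X-Strip-Set-Cookie = "1";
>>>     }
>>> }
>>>
>>> sub vcl_fetch {
>>>     if (req.http.X-Strip-Set-Cookie == "1") {
>>>         // safety net: a shared object must never carry a session cookie
>>>         unset beresp.http.Set-Cookie;
>>>     }
>>> }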
>>>
>>> The basic idea is borrowed from
>>> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and
>>> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>>>
>>> HTH!
>>
>> Thanks. We've proved this works with a simple setup:
>>
>> sub vcl_recv {
>>     ....
>>     // Establish if the visitor is a search engine:
>>     set req.http.X-IsABot = "0";
>>     if (req.http.user-agent ~ "Yahoo! Slurp") {
>>         set req.http.X-IsABot = "1";
>>     }
>>     if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
>>         set req.http.X-IsABot = "1";
>>     }
>>     if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
>>         set req.http.X-IsABot = "1";
>>     }
>>     ....
>>
>> }
>> ...
>> sub vcl_hash {
>>     set req.hash += req.url;
>>     if (req.http.host) {
>>         set req.hash += req.http.host;
>>     } else {
>>         set req.hash += server.ip;
>>     }
>>
>>     // store separate objects for bots and for everyone else
>>     if (req.http.X-IsABot == "1") {
>>         set req.hash += "for-bot";
>>     } else {
>>         set req.hash += "for-non-bot";
>>     }
>>     hash;
>> }
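>>
>> Note that the vcl_hash above stores two copies of every object, not just
>> of the fragment. One possible refinement, as an untested sketch in the
>> same Varnish 2.x syntax, is to vary the hash only for the fragment URL.
>> The "/esi/most-popular" path is a hypothetical name for illustration;
>> the real fragment URL is not given here. It relies on vcl_recv running
>> again for the ESI sub-request, so that X-IsABot is set there too:
>>
>> sub vcl_hash {
>>     set req.hash += req.url;
>>     if (req.http.host) {
>>         set req.hash += req.http.host;
>>     } else {
>>         set req.hash += server.ip;
>>     }
>>
>>     // only the fragment differs between bots and people, so only
>>     // the fragment needs two cache entries (hypothetical path)
>>     if (req.url ~ "^/esi/most-popular") {
>>         if (req.http.X-IsABot == "1") {
>>             set req.hash += "for-bot";
>>         } else {
>>             set req.hash += "for-non-bot";
>>         }
>>     }
>>     hash;
>> }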
>>
>> The main HTML has a simple ESI, which loads a page fragment whose PHP reads:
>>
>> // vcl_recv always sets X-IsABot to the string "0" or "1"
>> if (isset($_SERVER["HTTP_X_ISABOT"]) && $_SERVER["HTTP_X_ISABOT"] === "1") {
>>     echo "<!-- The list of popular posts is not displayed to search engines -->";
>> } else {
>>     // calculate most popular
>>     echo "The most popular article is XYZ";
>> }
>>
>>
>>
>> Thanks again.
>>
>> _______________________________________________
>> varnish-misc mailing list
>> varnish-misc at varnish-cache.org
>> http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
>>