ESI and search engine spiders
Rob S
rtshilston at gmail.com
Wed Aug 11 17:20:32 CEST 2010
Michael Loftis wrote:
>
>
> --On Tuesday, August 10, 2010 9:05 PM +0100 Rob S
> <rtshilston at gmail.com> wrote:
>
>> Hi,
>>
>> On one site we run behind varnish, we've got a "most popular" widget
>> displayed on every page (much like http://www.bbc.co.uk/news/).
>> However,
>> we have difficulties where this pollutes search engines, as searches for
>> a specific popular headline tend not to link directly to the article
>> itself, but to one of the index pages with high Google pagerank or
>> similar.
>>
>> What I'd like to know is how other Varnish users might have served
>> different ESI content based on whether it's a bot or not.
>>
>> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
>> generates the most-popular fragment, then do something like (though
>> untested):
>>
>
> ESI goes through all the normal steps, so an <esi:include
> src="/esi/blargh"> is fired off starting with vcl_recv, looking
> exactly as if the browser had hit the cache with that as the req.url --
> the entire req object is the same. I am *not* certain that headers
> you've added get propagated, as I've not tested that (and all of my
> rules are built on the assumption that they are not, just to be sure).
>
> So there's no need to do it in vcl_deliver; in fact, you're far better
> off handling it in vcl_recv and/or vcl_hash (actually you really SHOULD
> handle it in vcl_hash and change the hash for these
> search-engine-specific objects, or else you'll serve them to regular
> users)...
>
>
> for example -- assume vcl_recv sets X-BotDetector in the req header...
> (not tested)::
>
>
> sub vcl_hash {
>     // always take into account the url and host
>     set req.hash += req.url;
>     if (req.http.host) {
>         set req.hash += req.http.host;
>     } else {
>         set req.hash += server.ip;
>     }
>
>     if (req.http.X-BotDetector == "1") {
>         set req.hash += "bot detector";
>     }
> }
>
>
> You still have to do the detection inside of varnish; I don't see any
> way around that. The reason is that only varnish knows who it's
> talking to, and varnish needs to decide which object to spit out.
> When it's working properly, the webserver essentially sends back
> a 'template' for the page containing the page-specific stuff and
> pointers to a bunch of ESI fragments. The ESI fragments are also
> cache objects/requests. So the cache takes this template and fills
> in the ESI fragments (from cache if it can, fetching them if it
> needs to, treating them just as if the web browser had requested
> the ESI URL).
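>
> As a rough illustration of the template idea (a sketch only, not
> tested, assuming the Varnish 2.x syntax used elsewhere in this thread;
> the /index.html URL is made up): the backend page carries a tag such
> as <esi:include src="/esi/blargh"/>, and vcl_fetch tells varnish to
> parse that page for ESI tags:
>
> ```vcl
> sub vcl_fetch {
>     // Hypothetical rule: only this template page is scanned for ESI.
>     if (req.url == "/index.html") {
>         esi;
>     }
> }
> ```
>
> Each include found this way is then looked up (or fetched) as its own
> request, so it passes through vcl_recv and vcl_hash like any other.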
>
>
> This is actually exactly how I handle menus that change based on a
> user's authentication status. The browser gets a cookie. The ESI URL
> is formed as either 'authenticated', 'personalized' or 'global':
> 'authenticated' means it varies only on the client's login state, and
> 'personalized' takes into account the actual session we're working
> with. 'Global' means everyone gets the same cached object regardless
> (we strip cookies going into these ESI URLs and coming from these ESI
> URLs in the vcl_recv/vcl_fetch code; the vcl_fetch code looks for some
> special headers that indicate that vcl_recv has decided it needs
> to ditch Set-Cookie headers -- this is mostly a safety measure to
> prevent a session sticking to a client it shouldn't due to any bugs
> in code).
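>
> A minimal sketch of the 'global' case (not tested, and not my actual
> rules: the /esi/global/ prefix and the X-Strip-Cookies header name are
> made up for illustration; this assumes Varnish 2.1, where the backend
> response in vcl_fetch is called beresp, whereas 2.0 calls it obj):
>
> ```vcl
> sub vcl_recv {
>     // Hypothetical prefix for fragments that are identical for everyone.
>     if (req.url ~ "^/esi/global/") {
>         // Don't let the client's cookies reach the backend...
>         unset req.http.Cookie;
>         // ...and flag the request so vcl_fetch knows to drop Set-Cookie.
>         set req.http.X-Strip-Cookies = "1";
>     }
> }
>
> sub vcl_fetch {
>     // Safety net: never cache a session cookie on a shared fragment.
>     if (req.http.X-Strip-Cookies == "1") {
>         unset beresp.http.Set-Cookie;
>     }
> }
> ```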
>
> The basic idea is borrowed from
> <http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and
> <http://varnish-cache.org/wiki/VCLExampleCacheCookies>
>
> HTH!
Thanks. We've proved this works with a simple setup:
sub vcl_recv {
    ....
    // Establish if the visitor is a search engine:
    set req.http.X-IsABot = "0";
    if (req.http.user-agent ~ "Yahoo! Slurp") {
        set req.http.X-IsABot = "1";
    }
    if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
        set req.http.X-IsABot = "1";
    }
    if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
        set req.http.X-IsABot = "1";
    }
    ....
}
...
sub vcl_hash {
    set req.hash += req.url;
    if (req.http.host) {
        set req.hash += req.http.host;
    } else {
        set req.hash += server.ip;
    }
    if (req.http.X-IsABot == "1") {
        set req.hash += "for-bot";
    } else {
        set req.hash += "for-non-bot";
    }
    hash;
}
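
One possible refinement, which we have not tested: since only the
fragment differs for bots, the hash could vary just for the fragment
URL, so ordinary pages aren't cached twice. This is only a sketch, and
/esi/most-popular is a made-up path standing in for the real fragment
URL:

```vcl
sub vcl_hash {
    set req.hash += req.url;
    if (req.http.host) {
        set req.hash += req.http.host;
    } else {
        set req.hash += server.ip;
    }
    // Hypothetical fragment URL: only this object has a bot-specific copy.
    if (req.url ~ "^/esi/most-popular" && req.http.X-IsABot == "1") {
        set req.hash += "for-bot";
    }
    hash;
}
```
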
The main HTML has a simple ESI, which loads a page fragment whose PHP reads:
// X-IsABot arrives as "0" or "1"; PHP treats the string "0" as false.
if ($_SERVER["HTTP_X_ISABOT"]) {
    echo "<!-- The list of popular posts is not displayed to search engines -->";
} else {
    // calculate most popular
    echo "The most popular article is XYZ";
}
Thanks again.