ESI and search engine spiders

Michael Loftis mloftis at wgops.com
Tue Aug 10 23:12:11 CEST 2010



--On Tuesday, August 10, 2010 9:05 PM +0100 Rob S <rtshilston at gmail.com> 
wrote:

> Hi,
>
> On one site we run behind varnish, we've got a "most popular" widget
> displayed on every page (much like http://www.bbc.co.uk/news/).  However,
> we have difficulties where this pollutes search engines, as searches for
> a specific popular headline tend not to link directly to the article
> itself, but to one of the index pages with high Google pagerank or
> similar.
>
> What I'd like to know is how other Varnish users might have served
> different ESI content based on whether it's a bot or not.
>
> My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
> generates the most-popular fragment, then do something like (though
> untested):
>

ESI goes through all the normal steps, so an <esi:include src="/esi/blargh">
is fired off starting with vcl_recv, looking exactly as if the browser had
hit the cache with that as the req.url -- the entire req object is
otherwise the same. I am *not* certain that headers you've added get
propagated, as I've not tested that (and all of my rules are built on the
assumption that they are not, just to be sure).

So there's no need to do it in vcl_deliver. In fact, you're far better off
handling it in vcl_recv and/or vcl_hash (actually you really SHOULD handle
it in vcl_hash and change the hash for these search-engine-specific objects,
or else you'll serve them to regular users)...


For example -- assume vcl_recv sets X-BotDetector in the req headers (not
tested)::


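sub vcl_hash {
  // always take into account the url and host
  set req.hash += req.url;
  if (req.http.host) {
    set req.hash += req.http.host;
  } else {
    set req.hash += server.ip;
  }

  // give bots their own cache objects so they are never served to
  // regular users (and vice versa)
  if (req.http.X-BotDetector == "1") {
    set req.hash += "bot detector";
  }

  // url and host are already hashed above, so skip the default vcl_hash
  return (hash);
}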
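
The vcl_hash above assumes something in vcl_recv has already set
X-BotDetector. A minimal, untested sketch of that step might look like the
following. The User-Agent pattern is only an illustrative assumption (it is
not a complete or authoritative bot list), and the syntax assumes Varnish
2.x. Because the header is re-derived from User-Agent on every pass through
vcl_recv, the sketch does not rely on added headers being propagated to ESI
sub-requests.

```vcl
sub vcl_recv {
  // Never trust a client-supplied copy of the marker header.
  unset req.http.X-BotDetector;

  // Illustrative pattern only (assumption): extend it to match the
  // crawlers you actually care about.
  if (req.http.User-Agent ~ "Googlebot|msnbot|bingbot|Slurp") {
    set req.http.X-BotDetector = "1";
  }
}
```

The backend can then look at X-BotDetector when it renders the
most-popular fragment, and vcl_hash keeps the two variants apart.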


You still have to do the detection inside of varnish; I don't see any way
around that.  The reason is that only varnish knows who it's talking to,
and varnish needs to decide which object to spit out.  When it's working
properly, the webserver essentially sends back a 'template' for the page
containing the page-specific stuff and pointers to a bunch of ESI
fragments.  The ESI fragments are also cache objects/requests.  So the
cache takes this template and fills in the ESI fragments (from cache if it
can, fetching them if it needs to, treating them just as if the web browser
had requested the ESI url itself).
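
As a rough sketch of the template side, assuming Varnish 2.x syntax and a
hypothetical layout where every fragment lives under /esi/ and everything
else is a page template, ESI processing could be switched on like this
(untested):

```vcl
sub vcl_fetch {
  // Hypothetical layout (assumption): fragments live under /esi/, and
  // any other response may be a template containing tags such as
  //   <esi:include src="/esi/blargh"/>
  if (req.url !~ "^/esi/") {
    esi;  // Varnish 2.x: parse this response for ESI tags
  }
}
```

Each include found in the template then becomes its own request, starting
at vcl_recv with req.url set to the src of the tag.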


This is actually exactly how I handle menus that change based on a user's
authentication status.  The browser gets a cookie.  The ESI URL is formed
as one of 'authenticated', 'personalized' or 'global'.  Authenticated means
it varies only on the client's login state; personalized takes into account
the actual session we're working with; and global means everyone gets the
same cached object regardless.  For global, we strip cookies going into
these ESI URLs and coming from them in the vcl_recv/vcl_fetch code: the
vcl_fetch code looks for some special headers indicating that vcl_recv has
decided it needs to ditch Set-Cookie headers.  This is mostly a safety
measure to prevent a session sticking to a client it shouldn't due to any
bugs in code.
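
A minimal sketch of the 'global' case, untested and under stated
assumptions: the /esi/global/ URL prefix and the X-Strip-Set-Cookie header
name are hypothetical, chosen only for illustration, and beresp is the
Varnish 2.1 name for the backend response (2.0.x calls it obj).

```vcl
sub vcl_recv {
  // Hypothetical URL layout: shared fragments live under /esi/global/.
  if (req.url ~ "^/esi/global/") {
    // Everyone gets the same object, so the request carries no session.
    unset req.http.Cookie;
    // Hypothetical marker header telling vcl_fetch to drop Set-Cookie.
    set req.http.X-Strip-Set-Cookie = "1";
  } else {
    // Don't let a client forge the marker on other URLs.
    unset req.http.X-Strip-Set-Cookie;
  }
}

sub vcl_fetch {
  if (req.http.X-Strip-Set-Cookie == "1") {
    // Safety net: a shared object must never hand out a session cookie.
    unset beresp.http.Set-Cookie;
  }
}
```

The 'authenticated' and 'personalized' variants would additionally need the
login state or session mixed into the hash in vcl_hash, in the same way as
the bot marker.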

The basic idea is borrowed from 
<http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers> and 
<http://varnish-cache.org/wiki/VCLExampleCacheCookies>

HTH!





More information about the varnish-misc mailing list