Best practice for not caching content requested by crawlers

Lasse Karstensen lasse.karstensen at gmail.com
Wed Jul 25 13:17:06 CEST 2012


Damon Snyder:
> Hi Lasse,
> Thanks! I forgot to mention this in the original email, but we are using
> varnish 2.1.5. Here is what I ended up doing:
> sub vcl_fetch {
>     ...
>     if (req.http.User-Agent ~ "(?i)(msn|google|bing|yandex|youdao|exa|mj12|omgili|flr-|ahrefs|blekko)bot" ||
>         req.http.User-Agent ~ "(?i)(magpie|mediapartners|sogou|baiduspider|nutch|yahoo.*slurp|genieo)") {
>         set beresp.http.X-Bot-Bypass = "YES";
>         set beresp.ttl = 0s;
>         return (pass);
>     }
>     ...
> }

Hi Damon.

Just a quick note: doing this check in vcl_fetch and setting beresp.ttl to 0s
means no hit_for_pass object is stored, so concurrent requests for the same
object get serialised on the backend. This will hurt your HTTP response times,
and since these bots take response time into account, it will probably also
hurt your search engine visibility.

I'd advise you to do this test in vcl_miss instead, and also not to override
beresp.ttl, so that Varnish stores the hit_for_pass object for a while.
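In 2.1.x syntax that would look roughly like the sketch below (untested, and
with the User-Agent regex shortened to a few illustrative bot names; use your
full pattern in practice):

```vcl
# Decide before the backend fetch: bot traffic bypasses the cache.
sub vcl_miss {
    if (req.http.User-Agent ~ "(?i)(googlebot|bingbot|yandex)") {
        return (pass);
    }
}

sub vcl_fetch {
    # Note: no "set beresp.ttl = 0s;" here. When vcl_fetch does
    # return (pass), leaving the TTL alone lets Varnish keep a
    # hit_for_pass object around, so later requests for the same
    # object pass immediately instead of queueing.
}
```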

If you need the debug header, you can store it temporarily in
req.http.x-bot-bypass, then check it and set resp.http.x-bot-bypass in
vcl_deliver.
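Something along these lines (again an untested sketch with a shortened
regex):

```vcl
sub vcl_miss {
    if (req.http.User-Agent ~ "(?i)(googlebot|bingbot|yandex)") {
        # Remember the decision on the request object...
        set req.http.X-Bot-Bypass = "YES";
        return (pass);
    }
}

sub vcl_deliver {
    # ...and copy it onto the response so the client can see it.
    if (req.http.X-Bot-Bypass) {
        set resp.http.X-Bot-Bypass = req.http.X-Bot-Bypass;
    }
}
```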

-- 
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/



More information about the varnish-misc mailing list