Best practice for not caching content requested by crawlers

Damon Snyder damon at huddler-inc.com
Tue Jul 24 18:53:16 CEST 2012


Hi Lasse,
Thanks! I forgot to mention this in the original email, but we are using
Varnish 2.1.5. Here is what I ended up doing:

sub vcl_fetch {
    ...

    if (req.http.User-Agent ~ "(?i)(msn|google|bing|yandex|youdao|exa|mj12|omgili|flr-|ahrefs|blekko)bot" ||
        req.http.User-Agent ~ "(?i)(magpie|mediapartners|sogou|baiduspider|nutch|yahoo.*slurp|genieo)") {
        set beresp.http.X-Bot-Bypass = "YES";
        set beresp.ttl = 0s;
        return (pass);
    }

    ...
}
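
One way to sanity-check the two patterns before deploying them is to run
them against sample User-Agent strings outside Varnish. The sketch below is
Python rather than VCL, and it assumes Python's `re` module handles these
particular patterns the same way the regex engine in Varnish does. The
`is_bot` helper and the sample user agents are made up for illustration.

```python
import re

# The two patterns from the vcl_fetch block above, copied verbatim.
BOT_SUFFIX = re.compile(
    r"(?i)(msn|google|bing|yandex|youdao|exa|mj12|omgili|flr-|ahrefs|blekko)bot"
)
OTHER_CRAWLERS = re.compile(
    r"(?i)(magpie|mediapartners|sogou|baiduspider|nutch|yahoo.*slurp|genieo)"
)


def is_bot(user_agent):
    """Return True if either pattern matches anywhere in the string.

    search() is used instead of match() because the VCL ~ operator is
    an unanchored match.
    """
    return bool(
        BOT_SUFFIX.search(user_agent) or OTHER_CRAWLERS.search(user_agent)
    )


# Hypothetical sample user agents, not taken from real access logs.
SAMPLES = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1",
]

for ua in SAMPLES:
    print(is_bot(ua), ua)
```

Running a larger sample of real User-Agent strings through a check like this
would show both crawlers the patterns miss and ordinary browsers they catch
by accident.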

The X-Bot-Bypass header was only there to test this configuration. With this
filtering and a lower TTL for some of our other objects, our nuking is now
at 0. The hit rate hasn't changed, but I think we need more granularity in
our hit rate metrics. For example, perhaps we should be looking at the hit
rate for non-bot traffic separately.
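
As a rough sketch of what that split could look like: given per-request
records of the user agent and whether the request was a cache hit, the hit
rate can be computed separately for bots and for everyone else. The record
format, the `hit_rates` function, and the shortened bot pattern are all
assumptions for illustration. Extracting such records from the Varnish logs
is not shown here.

```python
import re
from collections import Counter

# Shortened, illustrative pattern; in practice it should mirror the VCL one.
BOT_RE = re.compile(r"(?i)(msn|google|bing|yandex)bot|baiduspider|yahoo.*slurp")


def hit_rates(records):
    """Compute cache hit rates per traffic class.

    records: iterable of (user_agent, was_hit) pairs.
    Returns {"bot": rate, "non-bot": rate}; a class with no requests
    maps to None instead of a rate.
    """
    hits, total = Counter(), Counter()
    for user_agent, was_hit in records:
        cls = "bot" if BOT_RE.search(user_agent) else "non-bot"
        total[cls] += 1
        if was_hit:
            hits[cls] += 1
    return {
        cls: (hits[cls] / total[cls] if total[cls] else None)
        for cls in ("bot", "non-bot")
    }


# Hypothetical records: 2 bot requests (1 hit), 4 non-bot requests (3 hits).
records = [
    ("Googlebot/2.1", False),
    ("Googlebot/2.1", True),
    ("Firefox/14.0", True),
    ("Firefox/14.0", True),
    ("Firefox/14.0", False),
    ("Firefox/14.0", True),
]
print(hit_rates(records))
```

Keeping the two classes apart means a change in crawler traffic would not
mask a change in the hit rate that regular visitors see.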

Thanks,
Damon


On Fri, Jul 20, 2012 at 3:44 AM, Lasse Karstensen <
lasse.karstensen at gmail.com> wrote:

> Lasse Karstensen:
> [..]
> > sub vcl_miss {
> >     if (req.http.user-agent ~ "(?i)yandex|msnbot") {
> >         return(pass);
> >     }
> > }
> > You can probably use openddr/deviceatlas/$favorite_detectionengine to get
> > better accuracy than this regex.
>
> I took a look at some access logs and updated devicedetect.vcl a bit so
> it has rudimentary bot detection:
>
>
> https://github.com/varnish/varnish-devicedetect/blob/master/devicedetect.vcl
>
>
> --
> Lasse Karstensen
> Varnish Software AS
> http://www.varnish-software.com/