Best practice for not caching content requested by crawlers
Lasse Karstensen
lasse.karstensen at gmail.com
Fri Jul 20 11:04:09 CEST 2012
Damon Snyder:
> We have reason to believe that we have some amount of cache pollution from
> crawlers. We believe this to be the case after we attempted to determine
> the size of our hot data set.
[..]
> So my question is, what is the best practice for doing this? If a request
> comes from the crawler and it's in the cache, I'm fine serving it from the
> cache. However, if the request comes from the crawler and it's not in the
> cache, I don't want Varnish to cache it.
I'm not sure whether this is a good idea, but you can do it in VCL
like this:
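    sub vcl_miss {
        # Cache miss: if the client looks like a crawler, fetch from the
        # backend without storing the response. Hits are served as usual.
        if (req.http.user-agent ~ "(?i)yandex|msnbot") {
            return(pass);
        }
    }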
You can probably use OpenDDR, DeviceAtlas or $favorite_detectionengine to
get better accuracy than this regex.
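
If the detection logic is likely to change (a longer regex now, a detection
engine later), one option is to keep it in vcl_recv and have vcl_miss check
only a marker header. The following is a rough, untested sketch of that idea.
The header name X-Is-Crawler and the extra bot names in the regex are made up
for illustration; note that the marker header is also forwarded to the backend
unless you remove it.

```vcl
sub vcl_recv {
    # Drop any client-supplied copy of the (hypothetical) marker header.
    unset req.http.X-Is-Crawler;

    # Illustrative pattern only; swap in your own list or detection engine.
    if (req.http.User-Agent ~ "(?i)yandex|msnbot|googlebot|bingbot") {
        set req.http.X-Is-Crawler = "1";
    }
}

sub vcl_miss {
    # Crawler request that missed the cache: fetch without caching.
    if (req.http.X-Is-Crawler) {
        return (pass);
    }
}
```

Cache hits never reach vcl_miss, so crawlers are still served from the cache
when the object is already there.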
--
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/
More information about the varnish-misc mailing list