Best practice for not caching content requested by crawlers
damon at huddler-inc.com
Thu Jul 19 19:09:27 CEST 2012
We have reason to believe that we have some amount of cache pollution from
crawlers. We came to this conclusion after attempting to measure the size
of our hot data set.
To determine the size of our hot data set, we summed the response sizes
of all objects with a hit count > 1 over a nine-hour period that
included the traffic peaks for the day. The total from this measurement
came out to about 5GB.
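For concreteness, the measurement described above can be sketched as follows. This is a minimal illustration, not our actual script; it assumes the access log has already been parsed into (url, size) pairs, e.g. from varnishncsa output:

```python
from collections import defaultdict

def hot_set_size(requests):
    """Sum the response sizes of objects requested more than once.

    `requests` is an iterable of (url, size_in_bytes) pairs covering
    the measurement window (here, nine hours including the peaks).
    """
    hits = defaultdict(int)
    size = {}
    for url, nbytes in requests:
        hits[url] += 1
        size[url] = nbytes  # last observed size for this object
    # Only objects seen more than once count toward the hot set.
    return sum(size[url] for url, n in hits.items() if n > 1)

# Example: /a is requested twice (hot), /b only once (cold).
log = [("/a", 1000), ("/b", 500), ("/a", 1000)]
print(hot_set_size(log))  # -> 1000
```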
We have allocated 18GB (using malloc) to varnish and we are nuking at the
rate of about 80 per second on a box whose hit rate is hovering around 70%.
This suggests to me that we have a lot of data in the cache that is not
actively being requested. It's not hot. The goal of this effort is to more
accurately determine whether we need to add varnish capacity (more
memory). I'm using the "Sizing your cache" page of the Varnish
documentation as a guide and taking the advice there to try to reduce the
n_lru_nuked rate, hopefully driving it to 0.
As an experiment to both improve our hit rate and ensure we are getting
the most out of the memory we have allocated to varnish, I want to explore
configuring varnish not to cache responses to requests coming from
crawlers. I'm defining crawlers as requests with User-Agent headers
containing strings like Googlebot, msnbot, etc.
So my question is: what is the best practice for doing this? If a request
comes from a crawler and the object is in the cache, I'm fine serving it
from the cache. However, if the request comes from a crawler and the
object is not in the cache, I don't want varnish to cache it.
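Something like the following is what I have in mind (an untested Varnish 3 VCL sketch; the bot list is illustrative, not exhaustive). Since vcl_fetch only runs on a cache miss, hits would still be served to crawlers normally:

```vcl
sub vcl_fetch {
    # On a miss triggered by a crawler, deliver the fetched object but
    # keep it out of the cache: a hit_for_pass object with a 0s TTL is
    # never reused, so requests from real users are unaffected.
    if (req.http.User-Agent ~ "(?i)googlebot|msnbot|bingbot|slurp") {
        set beresp.ttl = 0s;
        return (hit_for_pass);
    }
}
```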
Any suggestions would be appreciated.