Varnish 3.x Expiry Thread overload

Mon Aug 26 11:10:10 CEST 2013

Hi all,

I decided to share this issue with you because I was unable to confirm it's
a bug, a performance degradation over the time or a miss-configuration on
my side. So, here's the problem:

We have an scenario where we have several varnish servers that suffer this
issue at some point, always once the mmap file storage is filled up, but
not immediately after it gets filled up (in fact, weeks afterwards).

We have ~1100GB of SSD disk storage and ~64GB of RAM per node. Our content
is what industry defines it as "cold content" or "long tail". We have
objects which size is 1K-5K and others that is 50K. Our Varnish startup
configuration is [1]. Our VCL config is very simple: some Host and
Content-Encoding header homogenization and some content switching rules in
order to route the request to different backends.

Time ago (varnish 3.0.3) we suffered "Can't get storage" errors and we
decided to increase the nuke_limit to 500 (when default was 10, and
reverted to 50 by mainstream due to a unexpected behavior [5]) because we
have planned to store objects bigger than 60K and hence a 500K object may
need to evict from cache up to 500 1K objects in order to get space.
Probably this was not the smartest decision and we should deploy different
varnish instances for different type of objects.

Now we are facing a new issue, and we're quite lost about the root cause.
The timeline of the problem is the following:

   - Node is added to production and start storing data in the file storage
   (July 14th)
   - There is a point that stored objects are bigger than the active ram,
   so IOPS over the disk increase. All working as expected.
   - mmap storage gets filled up, so evictions start happening (July 18h)
      - lru_nuked_objects is 200 per sec at peak
      - cache insertions is the same rate, so apparently 1 object is
      inserted and 1 object is added. (Am I understanding incorrectly the
      nuke_limit parameter and/or lru_nuked_objects item?)
   - We detect the issue on August 8th, so we removed the affected node
   from production (10:53AM), and we have seen the following behavior:
      - Varnish keeps consuming a single core (I guess Expiry Thread, which
      is a single thread) until 11:16AM
      - Cache insertions stopped at 10:53AM, but lru_nuked_objects have
      been happening until 11:16AM

As soon as the node is out of production, varnishlog reports no client
request but prints LRU ExpKill messages [2] until 11:16AM. After this only
Backend_health messages and requests issued by monitoring system are seen.
At this point the varnishstat items "SMF.s0.c_freed" and "SMF.s0.c_fail"
increase, even "SMF.s0.c_bytes" only increments from time to time.

If we add the node back to production without restarting the process it
runs fine only for ~15 seconds, and then it happens again.  We kept this
node for experiments because it's totally reproducible.

Other remarkable fact is that some request appear to be taking tens of
hundreds of seconds [3]. Our timeouts are now the default, but apparently
threads get stuck for a while until they get space for storing the object
in cache, but when thread successfully stores the object it's too late and
client already closed connection due to timeout. We know we have thread
pile-up issues because file descriptors increase by 4x-5x until reaching 5k.

My theory is that Expiry thread is being overloaded by an incorrect
nuke_limit parameter (it tries to evict too many objects), so cache does
not freed space fast enough and threads are blocked waiting for storage. We
can't protect against it (fail fast approach) because there is no
configurable timeout for writing on storage [4]. Obviously, this is Varnish
 internals.

As the time goes on, we have seen this behavior happening again (indeed,
twice) in 4 different nodes that already suffered this issue in the past.
Fortunately we are maintaining a spreadsheet with the odd behaviors and
they happened 23, 25 and 30 days after process was started. It's just a
curious coincidence, right?

Some facts that may be useful:
- If we restart the process, problem is solved.
- We are using "critbit" hashing algorithm instead of "classic". In fact,
we haven't tried any other because we have seen it's the 3.x default
hashing and we assumed it's stable enough and has no weird bugs.
- We haven't tried branch 2.x because we are running Debian Wheezy and 3.x
was the version it provides. We upgraded to 3.0.4 from Varnish Debian
packages in order to make sure it was not a bug solved on 3.0.4.
- Our traffic per node is not really heavy (200 mbps). Indeed, the only
challenge is the big catalog of objects from different sizes we have.

Does anybody have an idea about what's going on?

Regards,

[1]
/usr/sbin/varnishd -P /var/run/varnishd.pid -a :80 -T localhost:6082 -f
/etc/varnish/default.vcl -S /etc/varnish/secret -s
file,/srv/cache/varnish_cache,1100G -p nuke_limit 500 -u varnish -g varnish -w
500,8000,30

[2]
0 ExpKill      - 2385044933 LRU

[3]
 2614 ReqEnd       - 2627025831 1375951760.351629972 1375952903.284181118
0.000064850 1142.932451248 0.000099897
 2167 ReqEnd       - 2627022893 1375951754.715034962 1375952903.284289598
0.000098705 1148.569141150 0.000113487
 1891 ReqEnd       - 2627024599 1375951758.109493732 1375952903.284304857
0.000055552 1145.174724579 0.000086546
 1469 ReqEnd       - 2627040233 1375951785.068578243 1375952903.284320831
0.000058889 1118.215640306 0.000102282
 2155 ReqEnd       - 2627022876 1375951754.677411079 1375952903.284334898
0.000061274 1148.606820107 0.000103712

[4]
https://www.varnish-software.com/static/book/Tuning.html

[5]
https://www.varnish-cache.org/trac/ticket/1012

-- 
Ricardo Bartolomé Méndez
Systems Engineer
+34.672.194.729
tuenti.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20130826/896d9c77/attachment.html>