Varnish threads growing until no response

Michael Dosser mic at strg.at
Mon Jul 25 11:17:42 CEST 2016


Hi all,

we are encountering a strange problem with one of our setups. We have been using Varnish since 2.1.5 (now on 4.1.3) - this is the first time we are using ESI heavily.

Problem description:
Randomly (sometimes after 20 days, but most of the time every 1-2 days) the thread count grows from the default 400 until Varnish cannot create any more (2000 is the configured maximum) and the site stops responding. The site serves an average of about 23 reqs/s with a normal daily peak of 70 reqs/s.

We also see the MAIN.busy_sleep counter increasing - I don’t know whether this is relevant.

I’m not sure if we are hitting a bug or if we have a vcl misconfiguration on our side.
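A varnishstat invocation that snapshots the relevant counters together (field names from Varnish 4.x; a diagnostic sketch, not a fix) looks roughly like:

```shell
# Snapshot thread and waiting-list counters in one go:
# MAIN.threads         - current number of worker threads
# MAIN.threads_limited - times a new thread was needed but a limit was hit
# MAIN.busy_sleep      - requests put to sleep on a busy object
# MAIN.sess_queued     - sessions queued waiting for a free thread
varnishstat -1 -f MAIN.threads -f MAIN.threads_limited \
               -f MAIN.busy_sleep -f MAIN.sess_queued
```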

Here is some information about our setup:

1.) Startup

varnishd -f /etc/varnish/fuf.vcl \
        -h critbit \
        -a 127.0.0.1:6081 \
        -T 127.0.0.1:6082 \
        -t 120 \
        -S /etc/varnish/secret \
        -s malloc,12G \
        -p thread_pool_min=100 \
        -p thread_pool_max=500 \
        -p thread_pools=4 \
        -p thread_pool_add_delay=2 \
        -p vcc_allow_inline_c=on \
        -p feature=+esi_disable_xml_check \
        -p feature=+esi_ignore_other_elements \
        -p feature=+esi_ignore_https
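Note that thread_pool_min and thread_pool_max are per-pool values, so with thread_pools=4 the effective totals are the 400 and 2000 mentioned above. A quick sanity check of that arithmetic (values hard-coded from the command line):

```shell
# Effective thread limits scale with the number of pools.
thread_pools=4
thread_pool_min=100
thread_pool_max=500
echo "min threads: $((thread_pools * thread_pool_min))"  # 400
echo "max threads: $((thread_pools * thread_pool_max))"  # 2000
```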

2.) Varnish configuration:

- fuf.vcl (main configuration, includes other files) - http://pastebin.com/bVMH6E1t
- fuf-acl.vcl - http://pastebin.com/Y2RdLchK
- fuf-error.vcl - http://pastebin.com/ypr1SBGX
- fuf-extended_cache_control.vcl - http://pastebin.com/grB9qaPB
- fuf-hash.vcl - http://pastebin.com/ZXVQJaz3
- fuf-local.vcl - http://pastebin.com/uhBucRti

3.) sysctl.conf

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
net.ipv6.conf.eth0.disable_ipv6=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
vm.swappiness=10
net.ipv4.tcp_fin_timeout=15
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=0
net.ipv4.tcp_syncookies=0
net.ipv4.tcp_max_syn_backlog=60000
net.core.somaxconn=60000
net.ipv4.tcp_max_orphans=262144

4.) Various debugging files before we restarted Varnish (before hitting 2000 threads)
- netstat -an (public IPs deleted) - http://pastebin.com/HunafJpG
- varnishstat -1 output - http://pastebin.com/FDVDvssp
- strace on Varnish child process - available but quite big (68 MB). I can provide it if necessary! There are obviously a lot of clone() syscalls …

5.) Setup:
Single physical server, 32 GB RAM. No swapping, no I/O wait. /var/lib/varnish is mounted on tmpfs.
Pound on Port 80,443 (HTTPS termination) -> Varnish on 127.0.0.1:6081 -> Nginx on 127.0.0.1:8080 -> UWSGI -> Score application
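When the site stops responding, each tier in that chain can be probed directly on loopback to see where requests hang (ports as listed above; "/" is just a placeholder path, any URL the application serves works):

```shell
# Bypass Pound and hit Varnish, then Nginx, directly.
curl -sv -o /dev/null --max-time 5 http://127.0.0.1:6081/  # Varnish
curl -sv -o /dev/null --max-time 5 http://127.0.0.1:8080/  # Nginx
```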

6.) Used software:
- OS: Debian 8.5; firewall enabled, ports 80 and 443 open to the public on the INPUT chain, OUTPUT policy ACCEPT
- Pound
- Varnish 4.1.3 from Varnish repository
- Nginx (Gzip compression turned off, all timeout settings are default)
- UWSGI
- Score (application)
- Postgresql
- Elasticsearch


If you need more information please tell me what you need!

Thanks a lot for your help - we are desperate to find the misconfiguration …

Michael Dosser

-- 
strg.at gmbh  michael.dosser at strg.at
    gumpendorferstrasse 132, top 9, 1060 wien
    tel +43 (1) 526 56 29  mobile +43 699 1 7777 164



