ban lurker causing deadlock in varnish 2.1.3

Fri Oct 15 03:48:23 CEST 2010

I'll try to get a ticket filed for this tomorrow, but I also had some  
related questions, so here's the short version:

We have three CentOS servers running varnish 2.1.3, fronting around  
850 sites.  About once a week, varnish will hang on one of those  
servers.  We finally managed to get a backtrace, and it seems to be  
pointing the finger at the ban lurker.  As far as I can tell, if the
ban lurker happens to start processing an object at the same time that
a request is looking up that object from the cache, the two can
deadlock: each thread holds one of ban_mtx and oh->mtx and blocks
waiting for the other.

The backtrace shows the ban lurker thread, which would already have  
ban_mtx locked at this point:

#0  0x000000390a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000390a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x000000390a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000421a69 in Lck__Lock ()
#4  0x000000000041afea in HSH_FindBan ()
#5  0x0000000000410b43 in ban_lurker ()
#6  0x0000000000424429 in wrk_bgthread ()
#7  0x000000390a80673d in start_thread () from /lib64/libpthread.so.0
#8  0x000000390a0d3d1d in clone () from /lib64/libc.so.6

and a large number of other threads, each of which has locked its own
oh->mtx and is trying to lock ban_mtx:

#0  0x000000390a80d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000390a808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x000000390a808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000421a69 in Lck__Lock ()
#4  0x000000000040f926 in ban_check_object ()
#5  0x000000000041c42f in HSH_Lookup ()
#6  0x0000000000411810 in cnt_lookup ()
#7  0x0000000000413ce4 in CNT_Session ()
#8  0x0000000000424668 in wrk_do_cnt_sess ()
#9  0x000000000042396e in wrk_thread_real ()
#10 0x000000390a80673d in start_thread () from /lib64/libpthread.so.0
#11 0x000000390a0d3d1d in clone () from /lib64/libc.so.6
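
To make the suspected lock ordering concrete, here is a minimal sketch
of the pattern the two backtraces suggest.  This is not Varnish's
actual code: the two mutexes are hypothetical stand-ins for ban_mtx and
a single oh->mtx, and lookup_path() and lurker_try_once() are names I
made up for illustration.  The request path takes oh->mtx and then
ban_mtx; the lurker takes them in the opposite order.  One common way
to avoid the deadlock without reordering the locks is for the second
acquisition to use a trylock and back off on failure, which is what
the lurker side does here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Varnish's ban_mtx and one objhead's oh->mtx. */
static pthread_mutex_t ban_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t oh_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Abort if a pthread call that should never fail returns an error. */
static void must(int rc)
{
    assert(rc == 0);
    (void)rc;
}

/*
 * Request path: oh->mtx first, then ban_mtx.  This mirrors the order
 * implied by the HSH_Lookup -> ban_check_object frames.
 */
static void lookup_path(void)
{
    must(pthread_mutex_lock(&oh_mtx));
    must(pthread_mutex_lock(&ban_mtx));
    /* ... check the object against the ban list ... */
    must(pthread_mutex_unlock(&ban_mtx));
    must(pthread_mutex_unlock(&oh_mtx));
}

/*
 * Lurker path: ban_mtx first, then oh->mtx (the opposite order).
 * Instead of blocking on the second lock, try it once and back off,
 * so this thread never waits while holding ban_mtx.
 * Returns true if the object was processed, false if we backed off.
 */
static bool lurker_try_once(void)
{
    must(pthread_mutex_lock(&ban_mtx));
    if (pthread_mutex_trylock(&oh_mtx) != 0) {
        /* A lookup holds oh->mtx; release ban_mtx so it can proceed. */
        must(pthread_mutex_unlock(&ban_mtx));
        return false;
    }
    /* ... test the object against the tail of the ban list ... */
    must(pthread_mutex_unlock(&oh_mtx));
    must(pthread_mutex_unlock(&ban_mtx));
    return true;
}
```

When lurker_try_once() returns false, the caller would presumably
sleep for a while and retry.  If both paths used a blocking
pthread_mutex_lock() for the second lock, they could end up waiting on
each other forever, which matches what the backtraces above look like.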

Now for the questions.  First, aside from the above information and  
the full backtrace, is there anything else that would be helpful to  
include in the ticket?  Getting a full core dump would be problematic,  
as we're using "-s malloc,45G", and even without that varnish has a  
nasty habit of dying if we even think about using gdb or strace on  
it.  Since it might be relevant, ban_lurker_sleep is set to 0.0005.   
I'm sure this increases the odds of a deadlock occurring as compared  
to setting it to 0.1 or 0.01, but it also helps keep our ban list  
fairly short.

Second, if we were to turn off the ban lurker, or even just slow it  
down, how large can we allow the ban list to get before we might see  
an impact on performance?  Each machine has 8 quad-core 2GHz CPUs, so  
I assume the answer is "quite large", but one of the servers had  
almost 2400 bans added over an 8-hour period today, and it's nice to  
have the lurker keeping the active list short.  Am I worrying for no  
good reason?

Has anybody else had similar problems?

Ryan