Thread creation rate problems in Varnish (3.0 and trunk)

Fri Jun 22 14:23:57 CEST 2012

Kristian and I have been looking into the algorithms governing thread
creation in Varnish lately, in order to explain some situations where
Varnish is starved for threads but new threads are not created as fast
as expected. The problems have typically been seen with workloads where
the connections are very long lived. As connections come in and the idle
threads have been exhausted, throughput suffers because new threads are
not created quickly enough. In tailored test setups we have seen delays
in excess of 5 seconds before a new thread is created, while the queue
length is increasing and we are far from max threads.

From looking at the algorithms we have discovered some problems. I am
posting them to the list to open up a discussion on how they should be
resolved.

1. There is no guarantee that a signal on the herder_cond will actually
cause a thread to be created. As the herder runs without any locking, any
signal sent while it is busy (e.g. creating a thread) is lost, and it will
go back to sleep without ever dealing with the second signal. The busy time
includes the thread_pool_add_delay wait time, increasing the likelihood of
this happening. This applies both to 3.0 and trunk, although the problem is
somewhat less visible in trunk because the herder_cond sleep time is
limited by thread_pool_purge_delay (from the merge of the add and remove
thread herders).

2. There is a mechanism in the thread breeder that looks at the queue
length seen on the previous run, and will only create a thread if the
queue length has increased in the meantime. I guess the thought behind
this is to stop creating threads when we are in a positive trend and the
queue is getting shorter. This mechanism does not work very well when the
connections are long lived and the threads that are already executing are
likely to stay busy serving the same connection, as it then limits the
number of threads created to one, regardless of queue length. Possibly
the test should be a larger_than_or_equal, or this limiting factor should
go away. This problem exists both in 3.0 and trunk.

3. The thread herder does not lock the data structures when determining
whether a thread should be created or not. This opens up races on the
queue lengths, and the herder may choose not to add a thread even when
there is a queue.

Specific to trunk and the new acceptor code, these problems open up a
situation where the acceptor task is queued for a later thread, and the
breeder is signaled but that signal is lost and the acceptor is left not
running. Thus no new connections are accepted and no new tasks are queued,
so the breeder isn't signaled and no threads are created (queue length is
not increasing). This scenario probably accounts for the longest periods of
no threads created during testing.

Regards,
Martin Blix Grydeland

-- 
Martin Blix Grydeland
Varnish Software AS