[Varnish] #1823: vcl in discarded mode does not clear up
Varnish
varnish-bugs at varnish-cache.org
Mon Dec 7 03:07:50 CET 2015
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
 Reporter:  hamed.gholamian  |       Owner:
     Type:  defect           |      Status:  new
 Priority:  normal           |   Milestone:
Component:  varnishd         |     Version:  4.1.0
 Severity:  normal           |  Resolution:
 Keywords:                   |
-----------------------------+--------------------
Comment (by lochii):
I've been staring at the code for a while and I have a theory about what
is happening.
In the test case, we send 1000 requests to the acceptor and, although the
backend is unreachable, we end up with a busy object and thus a waitlist.
While debugging, I count over 1000 requests being added to the waitlist,
but only 300 being removed before activity ceases (and the acceptor TCP
sockets time out), leaving a high refcount on the objhead.
The 300 actually comes from 100 x 3, as 3 is the default rush exponent
(so each hsh_rush() call removes 3 items from the list). Case in point:
if we increase the rush exponent parameter on the command line, more
waitlist items are removed per hsh_rush() call and the situation
improves.
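
To make the arithmetic above concrete, here is a minimal sketch of it as a
toy model. It is not Varnish code: the function names are made up for
illustration, and it assumes each hsh_rush() call removes at most
"rush exponent" items and that exactly 1000 requests were waitlisted (the
observed count was "over 1000").

```python
import math


def waitlist_outcome(waitlisted, rush_calls, rush_exponent):
    """Toy model: each rush call removes at most rush_exponent waiters."""
    removed = min(waitlisted, rush_calls * rush_exponent)
    return removed, waitlisted - removed


def min_exponent_to_drain(waitlisted, rush_calls):
    """Smallest exponent that empties the list in this toy model."""
    return math.ceil(waitlisted / rush_calls)


removed, stranded = waitlist_outcome(1000, 100, 3)
print(removed, stranded)                 # 300 700
print(min_exponent_to_drain(1000, 100))  # 10
```

Under these assumptions, 700 requests stay parked, and the exponent would
have to be at least 10 for 100 calls to drain the list.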
Now, why is hsh_rush() only called 100 times?
Debugging suggests:
- 25 of these attempts come from the backend fetch error code path,
vbf_stp_error() -> HSH_Unbusy() -> hsh_rush()
- 25 of these come from cnt_deliver() -> HSH_DerefObjCore() ->
hsh_rush()
- 25 of these come from exp_expire() -> HSH_DerefObjCore() -> hsh_rush()
- 25 of these come from VBO_DerefBusyObj() -> HSH_DerefObjCore() ->
hsh_rush()
It seems that these come from requests that were in flight before the
waitlisting started. After the waitlisting starts, all other requests seem
to go onto the list to wait for the above to come and remove them (but
this doesn't happen, as we never reach the required number of hsh_rush()
calls).
I did hope for some epoll magic to come and save the day with the expiring
acceptor sockets, but alas, the sockfds are only checked by a
straightforward poll(), from VTCP_check_hup(), and this only happens after
the request is back off the waitlist (in the HTTP1 FSM)!
So, I noticed that hsh_rush() has a ditching mechanism: if it tries to
reschedule the req using SES_Reschedule_Req() and there are no available
workers to do this on, the entire waitlist (and req, and session) is
ditched.
It seems that if we open up this ditching mechanism to the HSH_Unbusy()
that is called from vbf_stp_error(), then when the backend is (or becomes)
dysfunctional, we can stop queuing requests (as we may not be able to
service them any time soon).
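
The ditching idea described above can be sketched as a toy model. This is
not the attached patch and not Varnish's implementation: ToyObjHead, park(),
rush() and the ditch_all flag are hypothetical names, and the model assumes
a request drops its objhead reference as soon as it leaves the waitlist.

```python
from collections import deque


class ToyObjHead:
    """Hypothetical stand-in for an objhead: a waitlist plus a refcount."""

    def __init__(self):
        self.waitlist = deque()
        self.refcnt = 0

    def park(self, req):
        """Put a request on the waitlist; it holds one reference."""
        self.waitlist.append(req)
        self.refcnt += 1

    def rush(self, limit, free_workers, ditch_all=False):
        """Wake up to `limit` waiters.

        The rest of the list is ditched if no worker is free to take a
        woken request, or if the caller sets ditch_all (standing in for
        the proposed "backend fetch failed" case).
        """
        woken = []
        while self.waitlist and len(woken) < limit and not ditch_all:
            if free_workers == 0:
                ditch_all = True  # rescheduling failed: give up on the rest
                break
            woken.append(self.waitlist.popleft())
            free_workers -= 1
            self.refcnt -= 1
        ditched = []
        if ditch_all:
            ditched = list(self.waitlist)
            self.waitlist.clear()
            self.refcnt -= len(ditched)
        return woken, ditched


oh = ToyObjHead()
for i in range(1000):
    oh.park(i)

# Normal rush: only 3 waiters leave, the rest keep their references.
woken, ditched = oh.rush(limit=3, free_workers=10)
print(len(woken), len(ditched), oh.refcnt)  # 3 0 997

# Rush from the (assumed) fetch-error path: everything is ditched.
woken, ditched = oh.rush(limit=3, free_workers=10, ditch_all=True)
print(len(woken), len(ditched), oh.refcnt)  # 0 997 0
```

In this model the second call leaves the refcount at zero, which is the
effect the proposal is after; whether ditching waiting clients is
acceptable behaviour is a separate question.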
I'm not sure whether this is a good thing or not, but I wrote a simple
patch to demonstrate it (see attached). It does appear to work and, in the
test case above, resolves the stuck waitlist (and refcnts) issue.
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:11>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator