[Varnish] #1823: vcl in discarded mode does not clear up

Mon Dec 7 03:07:50 CET 2015

#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
 Reporter:  hamed.gholamian  |       Owner:
     Type:  defect           |      Status:  new
 Priority:  normal           |   Milestone:
Component:  varnishd         |     Version:  4.1.0
 Severity:  normal           |  Resolution:
 Keywords:                   |
-----------------------------+--------------------

Comment (by lochii):

 I've been staring at the code for a while and I have a theory about what
 is happening.

 In the test case, we send 1000 requests to the acceptor and, although the
 backend is unreachable, we end up with a busy object and thus a waiting
 list.

 While debugging, I count over 1000 requests being added to the waiting
 list, but only 300 being removed before activity ceases (and the acceptor
 TCP sockets time out), leaving a high refcount on the objhead.

 The 300 comes from 100 x 3, as 3 is the default rush exponent: each
 hsh_rush() call removes 3 items from the waiting list, so 100 calls remove
 300. Case in point: if we increase the rush exponent parameter on the
 command line, each hsh_rush() call removes more waiting list items and the
 situation improves.
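
 The arithmetic above can be sketched as a toy model. This is not Varnish
 code: the struct and the model_* names are hypothetical, the figure of
 1000 queued requests is rounded from the "over 1000" observed, and the
 model assumes each rush removes at most rush_exponent entries.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the waiting-list arithmetic; NOT Varnish code.
 * All names here are hypothetical and exist only for illustration.
 */
struct waitlist_model {
	unsigned waiting;	/* requests still parked on the waiting list */
	unsigned woken;		/* requests removed from the list so far */
};

/* Model one hsh_rush() call: remove at most rush_exponent entries. */
static unsigned
model_rush(struct waitlist_model *wl, unsigned rush_exponent)
{
	unsigned n;

	assert(wl != NULL);
	assert(rush_exponent > 0);
	n = wl->waiting < rush_exponent ? wl->waiting : rush_exponent;
	wl->waiting -= n;
	wl->woken += n;
	return (n);
}

/* Run `calls` modelled rushes; return how many requests stay stranded. */
static unsigned
model_stranded(unsigned queued, unsigned calls, unsigned rush_exponent)
{
	struct waitlist_model wl = { queued, 0 };
	unsigned i;

	for (i = 0; i < calls; i++)
		(void)model_rush(&wl, rush_exponent);
	return (wl.waiting);
}
```

 With the figures from this comment, model_stranded(1000, 100, 3) leaves
 700 entries on the list, while raising the exponent to 10 with the same
 100 calls leaves none.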

 Now, why is hsh_rush() only called 100 times?

 Debugging suggests:

 - 25 of these come from the backend fetch error code path,
   vbf_stp_error() -> HSH_Unbusy() -> hsh_rush()
 - 25 of these come from cnt_deliver() -> HSH_DerefObjCore() -> hsh_rush()
 - 25 of these come from exp_expire() -> HSH_DerefObjCore() -> hsh_rush()
 - 25 of these come from VBO_DerefBusyObj() -> HSH_DerefObjCore() ->
   hsh_rush()

 It seems that these calls come from requests that were already in flight
 before the waitlisting started. Once waitlisting starts, all other
 requests seem to go onto the list to wait for the calls above to come and
 remove them, but this doesn't happen, as we never reach the required
 number of hsh_rush() calls.

 I did hope for some epoll magic to come and save the day with the
 expiring acceptor sockets, but alas, the socket fds are only checked by a
 straightforward poll() from VTCP_check_hup(), and this only happens (in
 the HTTP1 FSM) after the request is back off the waiting list!

 I then noticed that hsh_rush() has a ditching mechanism: if it tries to
 reschedule the req using SES_Reschedule_Req() and there are no available
 workers to reschedule it onto, the entire waiting list (along with the
 req and session) is ditched.
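
 A toy sketch of that ditching behaviour, as I read it, follows. This is
 not the Varnish implementation: toy_waitlist, toy_reschedule() and
 toy_rush() are hypothetical names, and a plain counter of free workers
 stands in for whatever SES_Reschedule_Req() actually checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; NOT Varnish code. All names are hypothetical. */
struct toy_waitlist {
	unsigned waiting;	/* requests parked on the waiting list */
	unsigned rescheduled;	/* requests handed back to a worker */
	unsigned ditched;	/* requests dropped with the list */
};

/* Stand-in for rescheduling: succeeds only while a worker is free. */
static bool
toy_reschedule(unsigned *free_workers)
{
	assert(free_workers != NULL);
	if (*free_workers == 0)
		return (false);
	(*free_workers)--;
	return (true);
}

/*
 * Model one rush: wake up to `max` requests. If rescheduling fails,
 * ditch everything that is still waiting and stop.
 */
static void
toy_rush(struct toy_waitlist *wl, unsigned max, unsigned *free_workers)
{
	unsigned i;

	assert(wl != NULL);
	for (i = 0; i < max && wl->waiting > 0; i++) {
		if (toy_reschedule(free_workers)) {
			wl->waiting--;
			wl->rescheduled++;
		} else {
			wl->ditched += wl->waiting;
			wl->waiting = 0;
			break;
		}
	}
}
```

 For example, with 10 waiting requests, a rush size of 3 and only 2 free
 workers, two requests are rescheduled and the remaining 8 are ditched, so
 nothing is left holding a reference.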

 It seems that if we open up this ditching mechanism to the HSH_Unbusy()
 call made from vbf_stp_error(), then when the backend is (or becomes)
 dysfunctional, we can stop queuing requests (as we may not be able to
 service them any time soon).

 I'm not sure whether this is a good thing or not, but I wrote a simple
 patch to demonstrate it (see attached). It does appear to work and, in
 the test case above, resolves the stuck waiting list (and refcount)
 issue.

-- 
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:11>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator