[Varnish] #1823: vcl in discarded mode does not clear up

Varnish varnish-bugs at varnish-cache.org
Mon Dec 7 16:14:52 CET 2015


#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
 Reporter:  hamed.gholamian  |       Owner:
     Type:  defect           |      Status:  new
 Priority:  normal           |   Milestone:
Component:  varnishd         |     Version:  4.1.0
 Severity:  normal           |  Resolution:
 Keywords:                   |
-----------------------------+--------------------

Comment (by lochii):

 {{{
 [12:37] <ruben_varnish:#varnish-hacking> right
 [12:37] <martin:#varnish-hacking> phk: to me the cause of #1823 isn't
 mapped yet. I do not believe it is the SES_Reschedule_Req() that fails and
 stops the propagation
 [12:37] -beb:#varnish-hacking- #1823: vcl in discarded mode does not clear
 up [new] https://www.varnish-cache.org/trac/ticket/1823
 [12:40] <martin:#varnish-hacking> and the single DerefObjcore on any
 objcore of an objhead should be sufficient to cause the entire list to be
 rescheduled
 [12:40] <phk:#varnish-hacking> martin, not if we have a "leak" somewhere
 that stops the chain.
 [12:41] <martin:#varnish-hacking> phk: exactly, so i still believe there
 must be a leak somewhere that we haven't spotted
 [12:41] <martin:#varnish-hacking> the patch as it is would prob just mask
 it
 [12:41] <phk:#varnish-hacking> martin, agree.  But now we know that to be
 the case.
 [12:42] <martin:#varnish-hacking> yes, so that is definitive progress
 [12:42] <phk:#varnish-hacking> agreed too, I don't think that's the proper
 solution.  It may still be a good idea to speed up draining the waiting
 list though, but it's not a proper fix for the bug.
 [12:43] <martin:#varnish-hacking> my hunch of friday that I was following
 was the single OC ref held by the busyobj, and somehow the initiating
 client worker and the fetch worker ref of the busyobj got screwed up
 [12:44] <martin:#varnish-hacking> causing that ref to sit idle
 [12:44] <phk:#varnish-hacking> martin, Well, that would be quick to test
 if we have a VTC.
 [12:45] <martin:#varnish-hacking> yeah
 [12:45] <phk:#varnish-hacking> I think we should focus on writing the
 VTC...
 [12:45] <lochii:#varnish-hacking> martin: re: 'single DerefObjcore' , how
 should that work?
 [12:45] <lochii:#varnish-hacking> ultimately, it just calls a normal
 hsh_rush()
 [12:45] <lochii:#varnish-hacking> it isn't doing anything special
 [12:45] <martin:#varnish-hacking> which in turn would wake up some thread
 somewhere
 [12:45] <lochii:#varnish-hacking> yes, it does
 [12:45] <lochii:#varnish-hacking> which calls another hsh_rush()
 [12:45] <martin:#varnish-hacking> which would have a ref of some objcore
 on the same objhead
 [12:46] <lochii:#varnish-hacking> but not enough times
 [12:46] <lochii:#varnish-hacking> if the wl is big enough
 [12:46] <phk:#varnish-hacking> lochii, the point is that "big enough"
 shouldn't matter.
 [12:46] <martin:#varnish-hacking> which would in turn cause hsh_rush to be
 called on the WL of the objhead when it's dereferenced
 [12:46] <martin:#varnish-hacking> so the chain is propagated
 [12:47] <lochii:#varnish-hacking> right, but the hsh_rush only removes
 $rush_exponent items
 [12:47] <lochii:#varnish-hacking> from the wl
 [12:47] <lochii:#varnish-hacking> in this case, 3
 [12:47] <martin:#varnish-hacking> true, but for each of those
 $rush_exponent more should be woken
 [12:48] <phk:#varnish-hacking> lochii, but each of those should also
 remove 3
 [12:48] <phk:#varnish-hacking> lochii, it's supposed to be a cascade, and
 somewhere it isn't
 [12:48] <lochii:#varnish-hacking> hsh_rush() does
 wrk->stats->busy_wakeup++;
 [12:48] <lochii:#varnish-hacking> but that's about it
 [12:50] <lochii:#varnish-hacking> how does (or should) this wakeup work?
 [12:50] <martin:#varnish-hacking> lochii: it calls SES_Reschedule_Req()
 [12:50] <lochii:#varnish-hacking> yes
 [12:50] <martin:#varnish-hacking> that causes a req to be processed that
 is already waiting on this objhead
 [12:51] <lochii:#varnish-hacking> ok, so that puts a SES_Proto_Req on the
 task list
 [12:51] <martin:#varnish-hacking> so it must get an objcore ref on the
 same objhead
 [12:51] <martin:#varnish-hacking> which when deref'ed would call hsh_rush
 again
 [12:51] <martin:#varnish-hacking> so if those statements are all true, the
 list would not stall
 [12:52] <lochii:#varnish-hacking> I only ever see 25 SES_Reschedule_Req()s
 being called
 [12:52] <lochii:#varnish-hacking> for 1000 requests
 [12:52] <martin:#varnish-hacking> yeah, and that is wrong
 [12:52] <lochii:#varnish-hacking> sorry, 300
 [12:52] <phk:#varnish-hacking> are there any failures to get worker
 threads ?
 [12:53] <phk:#varnish-hacking> that might break the chain
 [12:53] <martin:#varnish-hacking> the stats are negative on that
 [12:53] <lochii:#varnish-hacking> well, if the reschedule fails, it is
 supposed to ditch the wl
 [12:54] <lochii:#varnish-hacking> since you cascade the return of
 pool_task
 [12:57] <lochii:#varnish-hacking> when it calls SES_Proto_Req ->
 HTTP1_Session , it ends up in S_STP_H1BUSY
 [12:58] <lochii:#varnish-hacking> and this does in turn HSH_DerefObjHead()
 and Req_Cleanup()
 [12:58] <martin:#varnish-hacking> ah
 [12:58] <martin:#varnish-hacking> there you have it
 [12:58] <martin:#varnish-hacking> hsh_derefobjhead wouldn't call hsh_rush
 [12:59] <lochii:#varnish-hacking> yes
 [12:59] <lochii:#varnish-hacking> it doesn't
 [12:59] <lochii:#varnish-hacking> I don't know why
 [12:59] <lochii:#varnish-hacking> but I've just been guessing by staring
 at this code for so long, as to how it is supposed to work
 [13:00] <martin:#varnish-hacking> right - so that's the broken link
 [13:00] <lochii:#varnish-hacking> ok
 [13:00] <martin:#varnish-hacking> the VTCP_check_hup notices the client
 has gone away while on the waitinglist
 [13:00] <lochii:#varnish-hacking> yes, I saw that
 [13:00] <martin:#varnish-hacking> which causes a quick escape of the
 handling
 [13:00] <lochii:#varnish-hacking> wondering why when epoll waiter is
 enabled that we don't add epoll monitoring for these sockets
 [13:03] <lochii:#varnish-hacking> ok, testing now with hsh_rush() in the
 HSH_DerefObjHead() path
 [13:06] <phk:#varnish-hacking> lochii, the waiting list predates the
 general waiter
 [13:06] <phk:#varnish-hacking> lochii, it's also not obvious that it would
 be cheaper.
 [13:10] <lochii:#varnish-hacking> ok, seems to be working , albeit slowly
 [13:11] <lochii:#varnish-hacking> will keep an eye on it
 [13:11] <phk:#varnish-hacking> lochii, slowly in what way ?
 [13:11] <lochii:#varnish-hacking> as in, the refcnt is draining
 [13:11] <lochii:#varnish-hacking> as the cascade is happening
 [13:11] <lochii:#varnish-hacking> and threads are being woken up
 [13:11] <lochii:#varnish-hacking> just happening really slowly
 [13:12] <phk:#varnish-hacking> lochii so the cascade is not widening ?
 [13:12] <lochii:#varnish-hacking> no
 [13:12] <lochii:#varnish-hacking> ok, it's stuck again
 [13:13] <lochii:#varnish-hacking> ah
 [13:13] <lochii:#varnish-hacking> objcore refcnt is now zero
 [13:13] <lochii:#varnish-hacking> so we are no longer deref objcore
 [13:14] <lochii:#varnish-hacking> which means we stopped deref objhead
 [13:14] <lochii:#varnish-hacking> but I've still got the oh with refcnt of
 800+
 [13:14] <lochii:#varnish-hacking> so it's stalled here now
 [13:16] <phk:#varnish-hacking> that sounds strange...
 [14:50] <lochii:#varnish-hacking> ok, looks like I've got it working
 [14:51] <lochii:#varnish-hacking> if hsh_rush() is called properly in
 HSH_DerefObjHead() it does indeed work
 [14:51] <lochii:#varnish-hacking> I need to test some stuff with the
 private oh
 [14:52] <lochii:#varnish-hacking> but generally it works
 [14:52] <lochii:#varnish-hacking> so let me re-make the patch
 [14:52] <lochii:#varnish-hacking> (though it would be nice to see ditch
 functionality, I'll keep it out of this if it isn't needed)

 another patch follows.
 }}}
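
 The cascade discussed in the log can be sketched as a toy model. This is
 not Varnish code: `drain`, `struct result`, and the parameters `alive` and
 `rush_on_early_exit` are hypothetical names made up for illustration. The
 model assumes that every rush wakes at most rush_exponent waiters, and that
 a woken request triggers the next rush only if it either has a live client
 or leaves through an early-exit path that itself rushes.

```c
#include <assert.h>

/* Toy model only; none of these names exist in Varnish. */
struct result {
    int woken;     /* requests taken off the waiting list */
    int stranded;  /* requests still on the list when the cascade stops */
};

/*
 * waiting:            requests parked on the waiting list
 * rush_exponent:      waiters woken per rush (3 in the log above)
 * alive:              the first `alive` woken requests still have a client
 * rush_on_early_exit: nonzero if the early-exit path (client gone) also
 *                     rushes the waiting list when it drops its reference
 */
struct result
drain(int waiting, int rush_exponent, int alive, int rush_on_early_exit)
{
    struct result r = { 0, 0 };
    int pending_rushes = 1;  /* the initial deref that starts the cascade */

    assert(waiting >= 0 && rush_exponent > 0);

    while (pending_rushes > 0 && waiting > 0) {
        pending_rushes--;
        for (int i = 0; i < rush_exponent && waiting > 0; i++) {
            int client_alive;

            waiting--;
            r.woken++;
            client_alive = (r.woken <= alive);
            /* A woken request keeps the chain going only if its exit
             * path ends in another rush. */
            if (client_alive || rush_on_early_exit)
                pending_rushes++;
        }
    }
    r.stranded = waiting;
    return r;
}
```

 Under these assumptions, drain(1000, 3, 0, 0) wakes 3 requests and leaves
 997 stranded, because no woken request rushes again. drain(1000, 3, 0, 1)
 drains the whole list, which matches the observation in the log that
 calling hsh_rush() in the HSH_DerefObjHead() path gets the list moving.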

-- 
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:12>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator