[Varnish] #610: rushing too often may overflow session workspace

Wed Jan 13 09:17:29 CET 2010

#610: rushing too often may overflow session workspace
----------------------+-----------------------------------------------------
 Reporter:  slink     |        Owner:  phk
     Type:  defect    |       Status:  new
 Priority:  high      |    Milestone:     
Component:  varnishd  |      Version:  2.0
 Severity:  major     |   Resolution:     
 Keywords:            |  
----------------------+-----------------------------------------------------
Comment (by slink):

 Hi Paul,

 thanks for looking into this. Maybe I should clarify that this bug is
 really about two different issues:

  1. HSH_rush being called too often (when the obj being waited for is not
 necessarily ready)
  2. Exhaustion of the session workspace when cnt_lookup->HSH_Prepare is
 being called too often.

 2) is a consequence of 1).

 But I agree that changing if (sp->obj == NULL) into if (sp->objhead ==
 NULL) at the top of cnt_lookup would also make sense: The only way
 sp->objhead ever gets set is when we wait for a busy object, and IIUC
 waiting for a busy object is the only case when we reenter cnt_lookup.

 (We could probably also check sp->hashptr, as that is set in
 HSH_Prepare.)
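
 To illustrate why the choice of guard matters, here is a minimal,
 self-contained sketch. It is not the varnishd source: the structs,
 prepare(), simulate() and the PREPARE_COST and workspace sizes are
 hypothetical stand-ins, and only the two conditions (sp->obj == NULL
 versus sp->objhead == NULL) come from the discussion above. It assumes
 that each pass through the prepare step takes a fixed amount of session
 workspace and that sp->objhead is set once the session has parked on a
 busy object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the session state. */
struct objhead { int busy; };
struct object  { int unused; };

struct sess {
    struct object  *obj;      /* set once a lookup has produced an object */
    struct objhead *objhead;  /* set while parked on a busy object */
    size_t          ws_used;  /* bytes taken from the session workspace */
    size_t          ws_size;  /* total session workspace */
};

enum { PREPARE_COST = 64 };   /* assumed workspace cost of one prepare */

/* Stand-in for HSH_Prepare: every call takes more workspace. */
static int
prepare(struct sess *sp)
{
    if (sp->ws_used + PREPARE_COST > sp->ws_size)
        return (-1);          /* workspace overflow */
    sp->ws_used += PREPARE_COST;
    return (0);
}

/* Guard on sp->obj: still NULL on every re-entry, so we prepare again. */
static int
lookup_guard_obj(struct sess *sp)
{
    if (sp->obj == NULL)
        return (prepare(sp));
    return (0);
}

/* Guard on sp->objhead: skips the prepare when re-entering after a wait. */
static int
lookup_guard_objhead(struct sess *sp)
{
    if (sp->objhead == NULL)
        return (prepare(sp));
    return (0);
}

typedef int (*lookup_fn)(struct sess *);

/*
 * Enter the lookup once, then once more per wakeup that finds the object
 * still busy. Returns the workspace used, or -1 if it overflowed.
 */
static int
simulate(lookup_fn lookup, int wakeups)
{
    struct objhead oh = { 1 };
    struct sess sp = { NULL, NULL, 0, 256 };
    int i;

    assert(wakeups >= 0);
    for (i = 0; i <= wakeups; i++) {
        if (lookup(&sp) != 0)
            return (-1);
        sp.objhead = &oh;     /* object is busy: park on its objhead */
    }
    return ((int)sp.ws_used);
}
```

 In this model, simulate(lookup_guard_obj, n) overflows once the number
 of premature wakeups n is large enough, while
 simulate(lookup_guard_objhead, n) uses the same amount of workspace no
 matter how often the session is woken.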

 I will test this suggestion, but this will probably take some more
 time.

 Having said that, I still think that my suggested change (which
 targets issue 1) is an important fix/improvement even if it did not
 have the consequence of workspace exhaustion:

  * Waking up waiting sessions unnecessarily may lead to extremely high
   peak load

   In particular, HSH_Drop calls HSH_Deref (which would rush), so
   whenever we decide or are being forced to give up in cnt_fetch, we
   rush all the other waiters (which will end up waiting again).

   On the production system I am working on, we've seen cases where,
   with a slow backend, the load on varnish servers would suddenly
   rise into the thousands, and this scenario would explain the effect.

  * I would like to have better control over restarts

   Another change which I've made for production use (and which I still
   need to document) is to increment sp->restarts in hsh_rush,
   following the idea that a session which has waited for a busy object
   has effectively been (internally) restarted.

   At any rate, it has waited (probably quite a while) for the busy
   object, so we might want to choose different parameters for the
   second fetch (like increasing the grace time).

   In order to get an exact figure for the restarts (with this new
   semantics), I need to make sure that hsh_rush is only called when a
   busy object becomes available.
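
 The two points above can be sketched together as a small model. This
 is an illustration under stated assumptions, not the patch attached to
 this ticket: park(), rush(), the two deref variants, simulate() and the
 struct layouts are hypothetical names, and the only behaviour taken from
 the text is that a rush wakes all waiters and that each wakeup is
 counted in sp->restarts.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_WAITERS = 8 };

/* Hypothetical, simplified session and object head. */
struct sess { int restarts; };

struct objhead {
    struct sess *waiting[MAX_WAITERS];
    int          nwaiting;
    int          busy;        /* 1 while the fetch is still in progress */
};

/* Put a session on the waiting list of a busy object. */
static void
park(struct objhead *oh, struct sess *sp)
{
    assert(oh->nwaiting < MAX_WAITERS);
    oh->waiting[oh->nwaiting++] = sp;
}

/* Wake every waiter and count the wakeup as an internal restart. */
static int
rush(struct objhead *oh)
{
    int i, n = oh->nwaiting;

    for (i = 0; i < n; i++)
        oh->waiting[i]->restarts++;
    oh->nwaiting = 0;
    return (n);               /* number of sessions woken */
}

/* Rush on every deref, even while the object is still busy. */
static int
deref_always_rush(struct objhead *oh)
{
    return (rush(oh));
}

/* Rush only once the busy object has become available. */
static int
deref_rush_when_ready(struct objhead *oh)
{
    if (oh->busy)
        return (0);
    return (rush(oh));
}

/*
 * Three sessions wait on one busy object. The object is dereferenced
 * early_derefs times while still busy, then becomes available.
 * Returns the restart count seen by the first session.
 */
static int
simulate(int (*deref)(struct objhead *), int early_derefs)
{
    struct sess s[3] = { {0}, {0}, {0} };
    struct objhead oh = { {NULL}, 0, 1 };
    int i, j, woken;

    for (j = 0; j < 3; j++)
        park(&oh, &s[j]);
    for (i = 0; i < early_derefs; i++) {
        woken = deref(&oh);
        /* Still busy: every woken session ends up waiting again. */
        for (j = 0; j < woken; j++)
            park(&oh, &s[j]);
    }
    oh.busy = 0;
    (void)deref(&oh);
    return (s[0].restarts);
}
```

 With deref_always_rush the restart count grows with every premature
 wakeup, so it says little about how often the session really had to
 start over. With deref_rush_when_ready each waiter is woken, and
 counted, exactly once.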

 Maybe it is of interest that my suggested changes have been running
 in production without any issues since January 4th.

 To conclude, I think changing the test in cnt_lookup makes sense, but
 I also think we still need the changes which I suggested.

 I must say that I still haven't looked at the trunk because, at this
 point, I need to focus on improving the stability of production
 versions.

 Nils

-- 
Ticket URL: <http://varnish.projects.linpro.no/ticket/610#comment:2>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator