The stevedore API

Martin Blix Grydeland martin at varnish-software.com
Fri Nov 14 17:15:32 CET 2014


Hi,

This stevedore API looks great, and I believe it would work very well
from the existing Varnish stevedores' point of view. But I have
difficulty seeing it work well for a design that relies on double
buffers. There might be details that I'm overlooking or not
understanding fully at this time, so I'll try to outline my concerns.

As I read the API, Varnish will also allow clients to stream from
the interim storage segments. In a double-buffered scenario, this
means that the buffers can't be reused until all of the clients have
finished with them. This is possible to do using refcounting in the
ref()/rel() functions, but with a sufficiently popular object there
will be slow clients pinning the buffers, effectively doubling the
memory pressure of a fetch.
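
To make the pinning concern concrete, here is a minimal sketch of
refcounted interim buffers. The struct and function names are made up
for illustration and are not taken from the Varnish source; the point
is only that a buffer stays unusable for the fetch as long as a single
slow client still holds a reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interim buffer; not actual Varnish code. */
struct interim_buf {
	unsigned	refs;		/* streaming clients currently reading */
	size_t		len;		/* bytes of valid data */
	char		data[4096];
};

/* ref(): a client starts reading from the buffer. */
static void
buf_ref(struct interim_buf *b)
{
	b->refs++;
}

/* rel(): a client is done with the buffer. */
static void
buf_rel(struct interim_buf *b)
{
	assert(b->refs > 0);
	b->refs--;
}

/* The fetch may only overwrite a buffer that no client still holds. */
static int
buf_reusable(const struct interim_buf *b)
{
	return (b->refs == 0);
}
```

With two buffers and one slow reader on each, the fetch has to allocate
further buffers (or stall) until both readers have called rel().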

In my opinion the API needs to allow the stevedore to do double
buffering using a single buffer that is reused for each call to
alloc(). This means that no streaming client should be allowed to
touch any data until after the commit() call has been made, which
should remove the need for the interim seg_ids.
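
A rough sketch of what I mean by a single reused buffer follows. It is
a toy, with invented names and sizes, and it leaves out busyobj, priv
and seg_id entirely; it only shows that alloc() can keep handing out
the same staging area as long as readers see nothing before commit().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGE_SIZE 8	/* illustrative sizes only */
#define STORE_SIZE 64

/* Hypothetical toy stevedore state. */
struct toy_stv {
	char	stage[STAGE_SIZE];	/* single bounce buffer, reused */
	char	store[STORE_SIZE];	/* stands in for the real storage */
	size_t	committed;		/* bytes visible to streaming clients */
};

/* alloc(): always hands out the same staging buffer. */
static char *
toy_alloc(struct toy_stv *s, size_t want, size_t *len)
{
	assert(len != NULL);
	*len = want < STAGE_SIZE ? want : STAGE_SIZE;
	return (s->stage);
}

/* commit(): copy the staged bytes into the store and publish them. */
static int
toy_commit(struct toy_stv *s, size_t used)
{
	if (used > STAGE_SIZE || s->committed + used > STORE_SIZE)
		return (-1);
	memcpy(s->store + s->committed, s->stage, used);
	s->committed += used;
	return (0);
}
```

Streaming clients would only ever read store[0..committed), so the
staging buffer is never pinned by a reader.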

During the IRC discussions on this, it was argued that it would be
possible to have a buffer size returned from alloc() that is smaller
than the size of the segment it buffers for. The real segment will
then be appended to on each commit(), and the seg_id returned from
commit() will then just be repeated for each append.

I find this approach lacking. First, there will be no opportunity to
coalesce the seg_ids. This can be handled in the stevedore during
segment lookups (the ref() API call), by returning a length of 0 on
all but the first ref() for the same seg_id by each client. This of
course requires per-client state in the stevedore. Also, the seg_id
list becomes unnecessarily big, wasting space.

Secondly, since the commit() function is supposed to be the point
where trim() functionality happens, I fail to see how the code should
be able to distinguish a commit() as an append from a commit() where a
trim is needed.


Based on this, I have a slightly modified suggestion for the API:


New objects are created with stv->newobj():

* Arguments:
  * busyobj
  * total size estimate
* Returns:
  * yes/no

busyobj/objcore has private field(s) for the stevedore's private use
from here on.

The stevedore can fail this if the object is undesirable, in which
case another stevedore will be attempted (if nothing else, Transient).


Storage segments are allocated with stv->alloc():

* Arguments:
  * busyobj
  * desired number of bytes
  * (uintptr_t) seg_id or 0.
* Returns:
  * (void *) priv
  * data pointer
  * length

The calling code can supply a previous seg_id that it would be OK to
expand if the stevedore is able to (e.g. there is more room in the
segment). This would typically be the previous seg_id received from
commit(). Each call to alloc() needs to be followed by a commit(),
passing the priv received. alloc() can fail, and indicates failure by
returning a NULL data pointer.


Once the segment has been written to: stv->commit()

* Arguments:
  * busyobj
  * (void *) priv
  * length used
* Returns:
  * (uintptr_t) seg_id

This commits the storage and returns a seg_id. This could be a new
seg_id, or the same as the one passed to alloc(), in which case it
should not be added to the seg_id list forming the object. commit()
can fail, and indicates failure by returning a 0 seg_id.
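
The alloc()/commit() cycle above could look roughly like this from the
calling code. Everything here is a toy under assumed names: fixed-size
segments, seg_id as index + 1, a small priv struct, and no busyobj.
What it illustrates is passing the previous seg_id to alloc(), and
adding a seg_id to the object's list only when commit() returns a
different one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_CAP  16	/* capacity of each toy segment */
#define MAX_SEGS 8

struct toy_seg { size_t used; char data[SEG_CAP]; };
struct toy_stv { struct toy_seg segs[MAX_SEGS]; unsigned nsegs; };
struct toy_priv { uintptr_t seg_id; };	/* handed from alloc() to commit() */

/* alloc(): expand `prev` if it has room, otherwise start a new segment. */
static char *
toy_alloc(struct toy_stv *s, size_t want, uintptr_t prev,
    struct toy_priv *priv, size_t *len)
{
	struct toy_seg *sg;

	if (prev != 0 && prev <= s->nsegs && s->segs[prev - 1].used < SEG_CAP)
		priv->seg_id = prev;
	else if (s->nsegs == MAX_SEGS)
		return (NULL);		/* out of storage */
	else
		priv->seg_id = ++s->nsegs;
	sg = &s->segs[priv->seg_id - 1];
	*len = SEG_CAP - sg->used;
	if (*len > want)
		*len = want;
	return (sg->data + sg->used);
}

/* commit(): publish `used` bytes; returns the seg_id, or 0 on failure. */
static uintptr_t
toy_commit(struct toy_stv *s, const struct toy_priv *priv, size_t used)
{
	struct toy_seg *sg = &s->segs[priv->seg_id - 1];

	if (used > SEG_CAP - sg->used)
		return (0);
	sg->used += used;
	return (priv->seg_id);
}

/* Caller: store a body, recording each distinct seg_id exactly once. */
static int
store_body(struct toy_stv *s, const char *body, size_t len, size_t chunk,
    uintptr_t *ids, int max_ids)
{
	uintptr_t prev = 0, id;
	struct toy_priv priv;
	size_t got;
	char *p;
	int n = 0;

	assert(chunk > 0);
	while (len > 0) {
		p = toy_alloc(s, len < chunk ? len : chunk, prev, &priv, &got);
		if (p == NULL)
			return (-1);
		memcpy(p, body, got);
		id = toy_commit(s, &priv, got);
		if (id == 0)
			return (-1);
		if (id != prev) {	/* same id means append: don't re-add */
			if (n == max_ids)
				return (-1);
			ids[n++] = id;
		}
		prev = id;
		body += got;
		len -= got;
	}
	return (n);
}
```

The seg_id list then has one entry per real segment, no matter how
many alloc()/commit() rounds it took to fill each of them.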


Trimming a segment: stv->trim()

* Arguments:
  * busyobj
  * (uintptr_t) seg_id
* Returns:
  * (uintptr_t) seg_id

The calling code typically calls this at the end of a fetch
operation. It can return a seg_id that is to replace the one passed,
in which case the calling code should free() the old seg_id once it
can ensure there are no readers (e.g. at busyobj destruction). A trim
that doesn't change the semantics (e.g. -sfile) will return 0, meaning
no replacement is necessary.
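
On the calling side, handling the trim() result could be sketched as
below. The struct, the deferred-free list and the two toy trim
functions are assumptions for illustration, and the busyobj argument
is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEFERRED 4

/* Hypothetical caller-side state for one fetch. */
struct toy_fetch {
	uintptr_t	last_seg;		/* last seg_id of the object */
	uintptr_t	deferred[MAX_DEFERRED];	/* to free() when no readers */
	unsigned	ndeferred;
};

typedef uintptr_t trim_f(uintptr_t seg_id);

/* Returns 1 if the segment was replaced, 0 if trim() left it alone. */
static int
finish_fetch(struct toy_fetch *f, trim_f *trim)
{
	uintptr_t repl = trim(f->last_seg);

	if (repl == 0)
		return (0);		/* no replacement necessary */
	assert(f->ndeferred < MAX_DEFERRED);
	f->deferred[f->ndeferred++] = f->last_seg;	/* free later */
	f->last_seg = repl;
	return (1);
}

/* Toy stevedores: one never relocates, one always does. */
static uintptr_t
trim_noop(uintptr_t seg_id)
{
	(void)seg_id;
	return (0);
}

static uintptr_t
trim_relocate(uintptr_t seg_id)
{
	return (seg_id + 100);
}
```

The old seg_id stays valid for streaming clients until the caller
walks the deferred list, for instance at busyobj destruction.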


Object is finalized with stv->final():

* Arguments:
  * busyobj (not NULL)
  * seg_id (key to resurrect persistent object)
  * off_t (subkey to resurrect persistent object)
* Returns:
  none (failure point?)


Objects are killed with stv->deleteobj():

* Arguments:
  * objcore


All allocated segments are individually freed with stv->free():

* Arguments:
  * (uintptr_t) seg_id
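
Put together, the suggested interface could be written down as a table
of function pointers along these lines. This is only my reading of the
lists above: the struct name, the typedef names, the use of out
parameters for the multiple return values of alloc(), and the int
return of newobj() are all assumptions, not existing Varnish
declarations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* off_t */

struct busyobj;		/* opaque here */
struct objcore;		/* opaque here */

typedef int stv_newobj_f(struct busyobj *, size_t size_estimate);
/* Returns the data pointer (NULL on failure); priv and len are outputs. */
typedef void *stv_alloc_f(struct busyobj *, size_t want, uintptr_t seg_id,
    void **priv, size_t *len);
/* Returns the seg_id, or 0 on failure. */
typedef uintptr_t stv_commit_f(struct busyobj *, void *priv, size_t used);
/* Returns a replacement seg_id, or 0 if none is needed. */
typedef uintptr_t stv_trim_f(struct busyobj *, uintptr_t seg_id);
typedef void stv_final_f(struct busyobj *, uintptr_t seg_id, off_t subkey);
typedef void stv_deleteobj_f(struct objcore *);
typedef void stv_free_f(uintptr_t seg_id);

/* Hypothetical method table for one stevedore. */
struct stv_methods_sketch {
	stv_newobj_f	*newobj;
	stv_alloc_f	*alloc;
	stv_commit_f	*commit;
	stv_trim_f	*trim;
	stv_final_f	*final;
	stv_deleteobj_f	*deleteobj;
	stv_free_f	*free;
};

/* Sanity check: does the stevedore implement every method? */
static int
stv_methods_complete(const struct stv_methods_sketch *m)
{
	assert(m != NULL);
	return (m->newobj != NULL && m->alloc != NULL && m->commit != NULL &&
	    m->trim != NULL && m->final != NULL && m->deleteobj != NULL &&
	    m->free != NULL);
}
```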


Regards,
Martin Blix Grydeland

On 13 November 2014 10:02, Poul-Henning Kamp <phk at phk.freebsd.dk> wrote:

> I've been mucking about with the stevedore api for some days in order
> to resolve a number of silly issues, including the vast overallocation
> for tiny objects when streaming.
>
> There are suprisingly many dead ends in this area.
>
> I think I have finally managed to come up with something that is usable,
> both from stevedore and varnishd side.
>
> The crucial insight is that we do not need to store the OA first
> since we now have accessor functions.  Instead we can collect the
> OAs in the busyobj until we know their final size, and only then
> allocate space for them.
>
> But we still need to alert persistent stevedores before we make the
> first allocation for a new object, and we also (which is new!) want
> to give stevedores a chance to relocate segments once their size is
> known for sure.   In that case it's the stevedores responsibility
> to keep track of streamers of the interrim segment before freeing it.
>
> Persistent stevedores also must be told what "magic key" it must
> present to resurrect an object.
>
> And finally we make it the cache_obj.c codes responsibility to free
> all allocated segments individually, so that the stevedore does not
> need to keep track of that.
>
> Seen from the stevedore:
>
>     New objects are created with stv->newobj().
>
>         Arguments:
>                 busyobj
>                 total size estimate
>         Returns:
>                 yes/no
>
>         busyobj/objcore has private field(s) for the stevedores
>         private use from here on.
>
>         The stevedore can fail this if the object is undesirable,
>         and another stevedore will be attempted, if nothing else,
>         Transient.
>
>     Storage segments are allocated with stv->alloc()
>         Arguments:
>                 busyobj
>                 desired number of bytes
>         Returns:
>                 interrim seg_id (uintptr_t)
>                 data pointer
>                 length
>
>     Once the segment has been written to:  stv->commit()
>         Arguments:
>                 busyobj
>                 interrim seg_id
>                 length used
>         Returns:
>                 final seg_id, may be different from interim seg_id
>
>     Other threads may reference segments with stv->ref()
>         Arguments:
>                 busyobj (possibly NULL)
>                 seg_id (interrim or final)
>         Returns:
>                 data pointer
>                 length
>
>     References are released with stv->rel()
>         Arguments:
>                 busyobj (possibly NULL)
>                 seg_id (interrim or final)
>         Returns:
>                 none
>
>     Object is finalized with stv->final()
>         Arguments:
>                 busyobj (not NULL)
>                 seg_id (key to resurrect persistent object)
>                 off_t  (subkey to resurrect persistent object)
>         Returns:
>                 none
>
>     Objects are killed with stv->deleteobj()
>         Arguments:
>                 objcore
>
>     All allocated segments are individually freed with stv->free()
>         Arguments:
>                 seg_id
>
> Seen from varnishd:
>
>         stv->newobj()
>         collect OA's in busyobj and access from there
>         stv->alloc/commit until body is received
>         all OA's defined.
>         byte-stream encode OAs into last storage segment
>         (including OA listing storage segment id's)
>         (if space allows) else into separate segment
>         stv->final(coords for OA bytestream)
>         OA's now pulled out of stored object
>
> Comments most welcome...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk at FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.



-- 
Martin Blix Grydeland <http://varnish-software.com>
Senior Developer | Varnish Software AS
Mobile: +47 992 74 756
We Make Websites Fly!