[master] 36cad5428 cache_obj: Add an asynchronous iteration API
Nils Goroll
nils.goroll at uplex.de
Fri Jul 4 17:04:03 UTC 2025
commit 36cad542831457edf2fb939a886ad0142ef88b7b
Author: Nils Goroll <nils.goroll at uplex.de>
Date: Mon Jan 6 22:02:26 2025 +0100
cache_obj: Add an asynchronous iteration API
This commit adds a new object iteration API to support asynchronous IO.
Background
To process object bodies, the Object API so far provides ObjIterate(), which
calls a storage-specific iterator function, which in turn calls a caller-provided
objiterate_f function on individual, contiguous segments of data (extents).
objiterate_f gets called with either no flags or one of OBJ_ITER_FLUSH
and OBJ_ITER_END. The storage iterator uses these flags to signal the lifetime of
the provided extents: they remain valid until a flag is present, so the caller
may delay using them until an extent arrives with a flag set. Or, seen from the
other end, objiterate_f needs to ensure it does not use any previously received
extent once a flag is seen.
objiterate_f cannot make any assumptions about if or when it is going to be
called: if the storage iterator function needs time to retrieve data, or a
streaming fetch is in progress, then so be it; objiterate_f may eventually get
called, or not.
Or, again seen from the other end, the storage iterator function assumes it is
called from a thread and may block at any time.
Why this change?
The model described above is fundamentally incompatible with asynchronous,
event-driven IO models, where a single thread might serve multiple requests in
parallel to benefit from efficiency gains, and thus no called function may ever
block.
This additional API is intended to provide an interface suitable for such
asynchronous models. As before, also the asynchronous iterator is owned by a
storage specific implementation, but now, instead of using a thread for its
state, that state exists in a data structure opaque to the caller.
Batching with scatter arrays (VSCARAB)
As recapitulated above, the existing objiterate_f works on one buffer at a time,
yet even before asynchronous I/O, issuing one system call for each buffer would
be inefficient. So, for the case of HTTP/1, the V1L layer collects buffers into
an array of io vectors (struct iovec), which are handed over to the kernel using
writev(). These arrays of io vectors seem to have no established name even after
decades of existence; elsewhere they are called siov or sarray, so in this API
we are going to call them scatter arrays.
With the new API, we use scatter arrays for all the processing steps: The goal
is that storage fills a scatter array, which then gets processed and maybe
replaced by filters, until finally some transport hands many I/Os at once to the
kernel.
Established interfaces follow the signature of writev(), they have a pointer to
an array of struct iovec and a count (struct iovec *iov, int iovcnt).
Yet for our plans, we want to have something which can be passed around in a
single unit, to ensure that the array is always used with the right count,
something which can vary in size and live on the heap or the stack.
This is the VSCARAB (struct vscarab), the Varnish SCatter ARAy of Buffers,
basically a container struct with a flexible array member (fam). The VSCARAB has
a capacity, a used count, and is annotated with v_counted_by_() such that, when
support for bounds checking is improved by compilers, we get additional sanity
checks (and possibly optimizations).
The flags member of struct vscarab has one used bit so far, VSCARAB_F_END, which
is to signal "EOF", like VDP_END. It should be set together with the last bits
of data, but can also be set later.
We add macros to work on VSCARABs for (bounds) allocation (on the stack and
heap), initialization, checking (magic and limits), iterating, and adding
elements.
VSCARET and VFLA
Managing scatter arrays is one side of the coin; when we are done using buffers,
we need to return them to storage, such that storage can do LRU things or reuse
memory. As before, we want to batch these operations for efficiency.
As an easy to use, flexible data structure, we add the VSCARAB's sibling, the
VSCARET. And, because both are basically the same, we generalize the macros as
VFLA, Varnish Flexible Arrays.
API Usage
The basic model for the API is that the storage engine "leases" to the caller a
number of extents, which the caller is then free to use until it returns the
leases to the storage engine.
The storage engine can also signal to the caller that it cannot return more
extents unless some are returned, or that it simply cannot return any at this
time for other reasons (for example, because it is waiting for data on a
streaming fetch). In both cases, the storage engine promises to call the
caller's notification function when it is ready to provide more extents or when
iteration has ended.
The API consists of four functions:
- ObjVAIinit() requests an asynchronous iteration on an object. The caller
provides an optional workspace for the storage engine to use for its state,
and the notification callback / private pointer introduced with the previous
commit. Its use is explained below.
ObjVAIinit() returns either an opaque handle owned jointly by the Object layer
in Varnish-Cache and the storage engine, or NULL if the storage engine does
not provide asynchronous iteration.
All other API functions work on the handle returned by ObjVAIinit():
- ObjVAIlease() returns the next extents from the object body in a
caller-provided VSCARAB. Each extent is a struct viov, which contains a struct
iovec (see iovec(3type) / readv(2)) with the actual extent, and an integer
identifying the lease. For the VSCARAB containing the last extent and/or any
later call (for which the return value is 0), VSCARAB_F_END is set in flags.
The "lease" integer (uint64_t) of each viov is opaque to the caller and needs
to be returned as-is later, but is guaranteed by storage to be a multiple of
8. This can be used by the caller to temporarily stash a tiny amount of
additional state into the lower bits of the lease.
ObjVAIlease() either returns a positive integer giving the number of available
leases, zero if no more leases can be provided, or a negative integer for
"call again later" and error conditions:
-EAGAIN signals that no more data is available at this point, and the storage
engine will call the notification function when the condition changes.
-ENOBUFS behaves identically, but also requires the caller to return more
leases.
-EPIPE mirrors BOS_FAILED on the busy object.
Any other -(errno) can be used by the storage engine to signal other error
conditions.
To summarize, the return value is either negative for errors or gives the
number of extents _added_ to the VSCARAB.
To determine EOF, callers must only check the flags member of the VSCARAB for
VSCARAB_F_END.
- ObjVAIreturn() returns a VSCARET of leases to the storage engine when the
caller is done with them.
For efficiency, leases of extents which are no longer in use should be
collected in a VSCARET and returned using ObjVAIreturn() before any blocking
condition. They must be returned when ObjVAIlease() requests it by returning
-ENOBUFS and, naturally, when iteration over the object body ends.
- ObjVAIfini() finalizes iteration. The handle must not be used thereafter.
Implementation
One particular aspect of the implementation is that the storage engine returns
the "lease", "return" and "fini" functions to be used with the handle. This
allows the storage engine to provide functions tailored to the attributes of the
storage object; for example, streaming fetches require more elaborate handling
than settled storage objects.
Consequently, the vai_hdl which is, by design, opaque to the caller, is not
entirely opaque to the object layer: The implementation requires it to start
with a struct vai_hdl_preamble containing the function pointers to be called by
ObjVAIlease(), ObjVAIreturn() and ObjVAIfini().
More details about the implementation will become clear with the next commit,
which implements SML's synchronous iterator using the new API.
diff --git a/bin/varnishd/cache/cache.h b/bin/varnishd/cache/cache.h
index e92959f43..118f65a34 100644
--- a/bin/varnishd/cache/cache.h
+++ b/bin/varnishd/cache/cache.h
@@ -42,6 +42,7 @@
#include <pthread.h>
#include <stdarg.h>
#include <sys/types.h>
+#include <sys/uio.h>
#include "vdef.h"
#include "vrt.h"
@@ -775,6 +776,152 @@ int ObjCheckFlag(struct worker *, struct objcore *, enum obj_flags of);
typedef void *vai_hdl;
typedef void vai_notify_cb(vai_hdl, void *priv);
+
+/*
+ * VSCARAB: Varnish SCatter ARAy of Buffers:
+ *
+ * an array of viovs, elsewhere also called an siov or sarray
+ */
+struct viov {
+ uint64_t lease;
+ struct iovec iov;
+};
+
+struct vscarab {
+ unsigned magic;
+#define VSCARAB_MAGIC 0x05ca7ab0
+ unsigned flags;
+#define VSCARAB_F_END 1 // last viov is last overall
+ unsigned capacity;
+ unsigned used;
+ struct viov s[] v_counted_by_(capacity);
+};
+
+// VFLA: starting generic container-with-flexible-array-member macros
+// aka "struct hack"
+//
+// type : struct name
+// name : a pointer to struct type
+// mag : the magic value for this VFLA
+// cptr : pointer to container struct (aka "head")
+// fam : member name of the flexible array member
+// cap : capacity
+//
+// common properties of all VFLAs:
+// - are a miniobj (have magic as the first element)
+// - capacity member is the fam capacity
+// - used member is the number of fam elements used
+//
+// VFLA_SIZE ignores the cap == 0 case, we assert in _INIT
+// offsetoff ref: https://gustedt.wordpress.com/2011/03/14/flexible-array-member/
+//lint -emacro(413, VFLA_SIZE)
+#define VFLA_SIZE(type, fam, cap) (offsetof(struct type, fam) + \
+ (cap) * sizeof(((struct type *)0)->fam[0]))
+#define VFLA_INIT_(type, cptr, mag, fam, cap, save) do { \
+ unsigned save = (cap); \
+ AN(save); \
+ memset((cptr), 0, VFLA_SIZE(type, fam, save)); \
+ (cptr)->magic = (mag); \
+ (cptr)->capacity = (save); \
+} while (0)
+#define VFLA_INIT(type, cptr, mag, fam, cap) \
+ VFLA_INIT_(type, cptr, mag, fam, cap, VUNIQ_NAME(save))
+// declare, allocate and initialize a local VFLA
+// the additional VLA buf declaration avoids
+// "Variable-sized object may not be initialized"
+#define VFLA_LOCAL_(type, name, mag, fam, cap, bufname) \
+ char bufname[VFLA_SIZE(type, fam, cap)]; \
+ struct type *name = (void *)bufname; \
+ VFLA_INIT(type, name, mag, fam, cap)
+#define VFLA_LOCAL(type, name, mag, fam, cap) \
+ VFLA_LOCAL_(type, name, mag, fam, cap, VUNIQ_NAME(buf))
+// malloc and initialize a VFLA
+#define VFLA_ALLOC(type, name, mag, fam, cap) do { \
+ (name) = malloc(VFLA_SIZE(type, fam, cap)); \
+ if ((name) != NULL) \
+ VFLA_INIT(type, name, mag, fam, cap); \
+} while(0)
+#define VFLA_FOREACH(var, cptr, fam) \
+ for (var = &(cptr)->fam[0]; var < &(cptr)->fam[(cptr)->used]; var++)
+// continue iterating after a break of a _FOREACH
+#define VFLA_FOREACH_RESUME(var, cptr, fam) \
+ for (; var != NULL && var < &(cptr)->fam[(cptr)->used]; var++)
+#define VFLA_GET(cptr, fam) ((cptr)->used < (cptr)->capacity ? &(cptr)->fam[(cptr)->used++] : NULL)
+// asserts sufficient capacity
+#define VFLA_ADD(cptr, fam, val) do { \
+ assert((cptr)->used < (cptr)->capacity); \
+ (cptr)->fam[(cptr)->used++] = (val); \
+} while(0)
+
+#define VSCARAB_SIZE(cap) VFLA_SIZE(vscarab, s, cap)
+#define VSCARAB_INIT(scarab, cap) VFLA_INIT(vscarab, scarab, VSCARAB_MAGIC, s, cap)
+#define VSCARAB_LOCAL(scarab, cap) VFLA_LOCAL(vscarab, scarab, VSCARAB_MAGIC, s, cap)
+#define VSCARAB_ALLOC(scarab, cap) VFLA_ALLOC(vscarab, scarab, VSCARAB_MAGIC, s, cap)
+#define VSCARAB_FOREACH(var, scarab) VFLA_FOREACH(var, scarab, s)
+#define VSCARAB_FOREACH_RESUME(var, scarab) VFLA_FOREACH_RESUME(var, scarab, s)
+#define VSCARAB_GET(scarab) VFLA_GET(scarab, s)
+#define VSCARAB_ADD(scarab, val) VFLA_ADD(scarab, s, val)
+//lint -emacro(64, VSCARAB_ADD_IOV_NORET) weird flexelint bug?
+#define VSCARAB_ADD_IOV_NORET(scarab, vec) \
+ VSCARAB_ADD(scarab, ((struct viov){.lease = VAI_LEASE_NORET, .iov = (vec)}))
+#define VSCARAB_LAST(scarab) (&(scarab)->s[(scarab)->used - 1])
+
+#define VSCARAB_CHECK(scarab) do { \
+ CHECK_OBJ(scarab, VSCARAB_MAGIC); \
+ assert(scarab->used <= scarab->capacity); \
+} while(0)
+
+#define VSCARAB_CHECK_NOTNULL(scarab) do { \
+ AN(scarab); \
+ VSCARAB_CHECK(scarab); \
+} while(0)
+
+/*
+ * VSCARET: Varnish SCatter Array Return
+ *
+ * an array of leases obtained from a vscarab
+ */
+
+struct vscaret {
+ unsigned magic;
+#define VSCARET_MAGIC 0x9c1f3d7b
+ unsigned capacity;
+ unsigned used;
+ uint64_t lease[] v_counted_by_(capacity);
+};
+
+#define VSCARET_SIZE(cap) VFLA_SIZE(vscaret, lease, cap)
+#define VSCARET_INIT(scaret, cap) VFLA_INIT(vscaret, scaret, VSCARET_MAGIC, lease, cap)
+#define VSCARET_LOCAL(scaret, cap) VFLA_LOCAL(vscaret, scaret, VSCARET_MAGIC, lease, cap)
+#define VSCARET_ALLOC(scaret, cap) VFLA_ALLOC(vscaret, scaret, VSCARET_MAGIC, lease, cap)
+#define VSCARET_FOREACH(var, scaret) VFLA_FOREACH(var, scaret, lease)
+#define VSCARET_GET(scaret) VFLA_GET(scaret, lease)
+#define VSCARET_ADD(scaret, val) VFLA_ADD(scaret, lease, val)
+
+#define VSCARET_CHECK(scaret) do { \
+ CHECK_OBJ(scaret, VSCARET_MAGIC); \
+ assert(scaret->used <= scaret->capacity); \
+} while(0)
+
+#define VSCARET_CHECK_NOTNULL(scaret) do { \
+ AN(scaret); \
+ VSCARET_CHECK(scaret); \
+} while(0)
+
+/*
+ * VSCARABs can contain leases which are not to be returned to storage, for
+ * example static data or fragments of larger leases to be returned later. For
+ * these cases, use this magic value as the lease. This is deliberately not 0 to
+ * catch oversights.
+ */
+#define VAI_LEASE_NORET ((uint64_t)0x8)
+
+vai_hdl ObjVAIinit(struct worker *, struct objcore *, struct ws *,
+ vai_notify_cb *, void *);
+int ObjVAIlease(struct worker *, vai_hdl, struct vscarab *);
+void ObjVAIreturn(struct worker *, vai_hdl, struct vscaret *);
+void ObjVAIfini(struct worker *, vai_hdl *);
+
/* cache_req_body.c */
ssize_t VRB_Iterate(struct worker *, struct vsl_log *, struct req *,
objiterate_f *func, void *priv);
diff --git a/bin/varnishd/cache/cache_main.c b/bin/varnishd/cache/cache_main.c
index 32a44e3ea..31b94829f 100644
--- a/bin/varnishd/cache/cache_main.c
+++ b/bin/varnishd/cache/cache_main.c
@@ -405,9 +405,55 @@ static struct cli_proto child_cmds[] = {
{ NULL }
};
+#define CAP 17U
+
+static void
+t_vscarab1(struct vscarab *scarab)
+{
+ struct viov *v;
+ uint64_t sum;
+
+ VSCARAB_CHECK_NOTNULL(scarab);
+ AZ(scarab->used);
+
+ v = VSCARAB_GET(scarab);
+ AN(v);
+ v->lease = 12;
+
+ VSCARAB_ADD(scarab, (struct viov){.lease = 30});
+
+ sum = 0;
+ VSCARAB_FOREACH(v, scarab)
+ sum += v->lease;
+
+ assert(sum == 42);
+}
+
+static void
+t_vscarab(void)
+{
+ char testbuf[VSCARAB_SIZE(CAP)];
+ struct vscarab *frombuf = (void *)testbuf;
+ VSCARAB_INIT(frombuf, CAP);
+ t_vscarab1(frombuf);
+
+ // ---
+
+ VSCARAB_LOCAL(scarab, CAP);
+ t_vscarab1(scarab);
+
+ // ---
+
+ struct vscarab *heap;
+ VSCARAB_ALLOC(heap, CAP);
+ t_vscarab1(heap);
+ free(heap);
+}
+
void
child_main(int sigmagic, size_t altstksz)
{
+ t_vscarab();
if (sigmagic)
child_sigmagic(altstksz);
diff --git a/bin/varnishd/cache/cache_obj.c b/bin/varnishd/cache/cache_obj.c
index c5f2e54fc..6fe72f448 100644
--- a/bin/varnishd/cache/cache_obj.c
+++ b/bin/varnishd/cache/cache_obj.c
@@ -183,6 +183,101 @@ ObjIterate(struct worker *wrk, struct objcore *oc,
return (om->objiterator(wrk, oc, priv, func, final));
}
+/*====================================================================
+ * ObjVAI...(): Asynchronous Iteration
+ *
+ *
+ * ObjVAIinit() returns an opaque handle, or NULL if not supported
+ *
+ * A VAI handle must not be used concurrently
+ *
+ * the vai_notify_cb(priv) will be called asynchronously by the storage
+ * engine when a -EAGAIN / -ENOBUFS condition is over and ObjVAIlease()
+ * can be called again.
+ *
+ * Note:
+ * - the callback gets executed by an arbitrary thread
+ * - WITH the boc mtx held
+ * so it should never block and only do minimal work
+ *
+ * ObjVAIlease() fills the vscarab with leases. returns:
+ *
+ * -EAGAIN: nothing available at the moment, storage will notify, no use to
+ * call again until notification
+ * -ENOBUFS: caller needs to return leases, storage will notify
+ * -EPIPE: BOS_FAILED for busy object
+ * -(errno): other problem, fatal
+ *
+ * >= 0: number of viovs added (the increase of scarab->used)
+ *
+ * struct vscarab:
+ *
+ * the leases can be used by the caller until returned with
+ * ObjVAIreturn(). The storage guarantees that the lease member is a
+ * multiple of 8 (that is, the lower three bits are zero). These can be
+ * used by the caller between lease and return, but must be cleared to
+ * zero before returning.
+ *
+ * ObjVAIreturn() returns leases collected in a struct vscaret
+ *
+ * it must be called with a vscaret, which holds an array of lease values from viovs
+ * received when the caller can guarantee that they are no longer accessed
+ *
+ * ObjVAIfini() finalizes iteration
+ *
+ * it must be called when iteration is done, irrespective of error status
+ */
+
+vai_hdl
+ObjVAIinit(struct worker *wrk, struct objcore *oc, struct ws *ws,
+ vai_notify_cb *cb, void *cb_priv)
+{
+ const struct obj_methods *om = obj_getmethods(oc);
+
+ CHECK_OBJ_NOTNULL(wrk, WORKER_MAGIC);
+
+ if (om->vai_init == NULL)
+ return (NULL);
+ return (om->vai_init(wrk, oc, ws, cb, cb_priv));
+}
+
+int
+ObjVAIlease(struct worker *wrk, vai_hdl vhdl, struct vscarab *scarab)
+{
+ struct vai_hdl_preamble *vaip = vhdl;
+
+ AN(vaip);
+ assert(vaip->magic2 == VAI_HDL_PREAMBLE_MAGIC2);
+ AN(vaip->vai_lease);
+ return (vaip->vai_lease(wrk, vhdl, scarab));
+}
+
+void
+ObjVAIreturn(struct worker *wrk, vai_hdl vhdl, struct vscaret *scaret)
+{
+ struct vai_hdl_preamble *vaip = vhdl;
+
+ AN(vaip);
+ assert(vaip->magic2 == VAI_HDL_PREAMBLE_MAGIC2);
+ /* vai_return is optional */
+ if (vaip->vai_return != NULL)
+ vaip->vai_return(wrk, vhdl, scaret);
+ else
+ VSCARET_INIT(scaret, scaret->capacity);
+}
+
+void
+ObjVAIfini(struct worker *wrk, vai_hdl *vhdlp)
+{
+ AN(vhdlp);
+ struct vai_hdl_preamble *vaip = *vhdlp;
+
+ AN(vaip);
+ assert(vaip->magic2 == VAI_HDL_PREAMBLE_MAGIC2);
+ AN(vaip->vai_lease);
+ vaip->vai_fini(wrk, vhdlp);
+}
+
/*====================================================================
* ObjGetSpace()
*
diff --git a/bin/varnishd/cache/cache_obj.h b/bin/varnishd/cache/cache_obj.h
index f6ee8618e..0aff7c8b2 100644
--- a/bin/varnishd/cache/cache_obj.h
+++ b/bin/varnishd/cache/cache_obj.h
@@ -70,6 +70,78 @@ struct vai_qe {
void *priv;
};
+#define VAI_ASSERT_LEASE(x) AZ((x) & 0x7)
+
+/*
+ * start an iteration. the ws can be used (reserved) by storage
+ * the void * will be passed as the second argument to vai_notify_cb
+ */
+typedef vai_hdl vai_init_f(struct worker *, struct objcore *, struct ws *,
+ vai_notify_cb *, void *);
+
+/*
+ * lease io vectors from storage
+ *
+ * vai_hdl is from vai_init_f
+ * the vscarab is provided by the caller to return leases
+ *
+ * return:
+ * -EAGAIN: nothing available at the moment, storage will notify, no use to
+ * call again until notification
+ * -ENOBUFS: caller needs to return leases, storage will notify
+ * -EPIPE: BOS_FAILED for busy object
+ * -(errno): other problem, fatal
+ * >= 0: number of viovs added
+ */
+typedef int vai_lease_f(struct worker *, vai_hdl, struct vscarab *);
+
+/*
+ * return leases
+ */
+typedef void vai_return_f(struct worker *, vai_hdl, struct vscaret *);
+
+/*
+ * finish iteration, vai_return_f must have been called on all leases
+ */
+typedef void vai_fini_f(struct worker *, vai_hdl *);
+
+/*
+ * vai_hdl must start with this preamble such that when cast to it, cache_obj.c
+ * has access to the methods.
+ *
+ * The first magic is owned by storage, the second magic is owned by cache_obj.c
+ * and must be initialized to VAI_HDL_PREAMBLE_MAGIC2
+ *
+ */
+
+//lint -esym(768, vai_hdl_preamble::reserve)
+struct vai_hdl_preamble {
+ unsigned magic; // owned by storage
+ unsigned magic2;
+#define VAI_HDL_PREAMBLE_MAGIC2 0x7a15d162
+ vai_lease_f *vai_lease;
+ vai_return_f *vai_return; // optional
+ uintptr_t reserve[4]; // abi fwd compat
+ vai_fini_f *vai_fini;
+};
+
+#define INIT_VAI_HDL(to, x) do { \
+ (void)memset(to, 0, sizeof *(to)); \
+ (to)->preamble.magic = (x); \
+ (to)->preamble.magic2 = VAI_HDL_PREAMBLE_MAGIC2; \
+} while (0)
+
+#define CHECK_VAI_HDL(obj, x) do { \
+ assert(obj->preamble.magic == (x)); \
+ assert(obj->preamble.magic2 == VAI_HDL_PREAMBLE_MAGIC2);\
+} while (0)
+
+#define CAST_VAI_HDL_NOTNULL(obj, ptr, x) do { \
+ AN(ptr); \
+ (obj) = (ptr); \
+ CHECK_VAI_HDL(obj, x); \
+} while (0)
+
struct obj_methods {
/* required */
objfree_f *objfree;
@@ -84,5 +156,6 @@ struct obj_methods {
objslim_f *objslim;
objtouch_f *objtouch;
objsetstate_f *objsetstate;
+ /* async iteration (VAI) */
+ vai_init_f *vai_init;
};
-