[master] 6e616f98d Chapter 5 & 6 in the CHERI saga

Poul-Henning Kamp phk at FreeBSD.org
Wed Nov 30 14:39:05 UTC 2022


commit 6e616f98d83cced716fd300388fca86a00f24076
Author: Poul-Henning Kamp <phk at FreeBSD.org>
Date:   Wed Nov 30 14:38:34 2022 +0000

    Chapter 5 & 6 in the CHERI saga

diff --git a/doc/sphinx/phk/cheri5.rst b/doc/sphinx/phk/cheri5.rst
new file mode 100644
index 000000000..762d72c0a
--- /dev/null
+++ b/doc/sphinx/phk/cheri5.rst
@@ -0,0 +1,197 @@
+.. _phk_cheri_5:
+
+How Varnish met CHERI 5/N
+=========================
+
+Varnish Workspaces
+------------------
+
+To process a HTTP request or response, varnish must allocate bits
+of memory which will only be used for the duration of that processing,
+and all of it can be released back at the same time.
+
+To avoid calling ``malloc(3)`` a lot, which comes with a locking
+overhead in a heavily multithreaded process, but even more to
+avoid having to keep track of all these allocations in order to be able
+to ``free(3)`` them all, varnish has "workspaces":
+
+.. code-block:: none
+
+    struct ws {
+        […]
+        char    *s;     /* (S)tart of buffer */
+        char    *f;     /* (F)ree/front pointer */
+        char    *r;     /* (R)eserved length */
+        char    *e;     /* (E)nd of buffer */
+    };
+
+The ``s`` pointer points at the start of a slab of memory, owned
+exclusively by the current thread and ``e`` points to the end.
+
+Initially ``f`` is the same as ``s``, but as allocations are made
+from the workspace, it moves towards ``e``.  The ``r`` pointer is
+used to make "reservations", we will ignore that for now.
+
+Workspaces look easy to create:
+
+.. code-block:: none
+
+    ws->s = space;
+    ws->e = ws->s + len;
+    ws->f = ws->s;
+    ws->r = NULL;
+
+… only, given the foot-shooting-abetting nature of the C language,
+we have bolted on a lot of seat-belts:
+
+.. code-block:: none
+
+    #define WS_ID_SIZE 4
+    
+    struct ws {
+        unsigned        magic;
+    #define WS_MAGIC    0x35fac554
+        char            id[WS_ID_SIZE]; /* identity */
+        char            *s;             /* (S)tart of buffer */
+        char            *f;             /* (F)ree/front pointer */
+        char            *r;             /* (R)eserved length */
+        char            *e;             /* (E)nd of buffer */
+    };
+    
+    void
+    WS_Init(struct ws *ws, const char *id, void *space, unsigned len)
+    {
+        unsigned l;
+    
+        DSLb(DBG_WORKSPACE,
+            "WS_Init(%s, %p, %p, %u)", id, ws, space, len);
+        assert(space != NULL);
+        assert(PAOK(space));
+        INIT_OBJ(ws, WS_MAGIC);
+        ws->s = space;
+        l = PRNDDN(len - 1);
+        ws->e = ws->s + l;
+        memset(ws->e, WS_REDZONE_END, len - l);
+        ws->f = ws->s;
+        assert(id[0] & 0x20);           // cheesy islower()
+        bstrcpy(ws->id, id);
+        WS_Assert(ws);
+    }
+
+Let me walk you through that:
+
+The ``DSLb()`` call can be used to trace all operations on the
+workspace, so we can see what actually goes on.
+
+(Hint: Your ``malloc(3)`` may have something similar,
+look for ``utrace`` in the manual page.)
+
+Next we check the provided space pointer is not NULL, and
+that it is properly aligned, these are both following
+a varnish style-pattern, to sprinkle asserts liberally,
+both as code documentation, but also because it allows
+the compiler to optimize things better.
+
+The ``INIT_OBJ() and ``magic`` field is a style-pattern
+we use throughout varnish:  Each structure is tagged with
+a unique magic, which can be used to ensure that pointers
+are what we are told, when they get passed through a ``void*``.
+
+We set the ``s`` pointer.
+
+We calculate a length at least one byte shorter than what
+we were provided, align it, and point ``e`` at that.
+
+We fill that extraspace at and past ``e``, with a "canary" to
+stochastically detect overruns.  It catches most but not
+all overruns.
+
+We set the name of the workspace, ensuring it is not already
+marked as overflowed.
+
+And finally check that the resulting workspace complies with
+the defined invariants, as captured in the ``WS_Assert()``
+function.
+
+With CHERI, it looks like this:
+
+.. code-block:: none
+
+    void
+    WS_Init(struct ws *ws, const char *id, void *space, unsigned len)
+    {
+        unsigned l;
+ 
+        DSLb(DBG_WORKSPACE,
+            "WS_Init(%s, %p, %p, %u)", id, ws, space, len);
+        assert(space != NULL);
+        INIT_OBJ(ws, WS_MAGIC);
+        assert(PAOK(space));
+        ws->s = cheri_bounds_set(space, len);
+        ws->e = ws->s + len
+        ws->f = ws->s;
+        assert(id[0] & 0x20);           // cheesy islower()
+        bstrcpy(ws->id, id);
+        WS_Assert(ws);
+    }
+
+All the gunk to implement a canary to detect overruns went
+away, because with CHERI we can restrict the ``s`` pointer so writing
+outside the workspace is *by definition* impossible, as long as your
+pointer is derived from ``s``.
+
+Less memory wasted, much stronger check and more readable source-code,
+what's not to like ?
+
+When an allocation is made from the workspace, CHERI makes it possible
+to restrict the returned pointer to just the allocated space:
+
+.. code-block:: none
+
+    void *
+    WS_Alloc(struct ws *ws, unsigned bytes)
+    {
+        char *r;
+     
+        […]
+        r = ws->f;
+        ws->f += bytes;
+        return(cheri_bounds_set(r, bytes));
+    }
+
+Varnish String Buffers
+----------------------
+
+Back in the mists of time, Dag-Erling Smørgrav and I designed a
+safe string API called ``sbuf`` for the FreeBSD kernel.
+
+The basic idea is you set up your buffer, you call functions to stuff
+text into it, and those functions do all the hard work to ensure
+you do not overrun the buffer.  When the string is complete, you
+call a function to "finish" the buffer, and if returns a flag which
+tells you if overrun (or other problems) happened, and then you can
+get a pointer to the resulting string from another function.
+
+Varnish has adopted sbuf's under the name ``vsb``.  This should
+really not surprise anybody:  Dag-Erling was also involved
+in the birth of varnish.
+
+It should be obvious that internally ``vsb`` almost always operate
+on a bigger buffer than the result, so this is another obvious
+place to have CHERI cut a pointer down to size:
+
+.. code-block:: none
+
+    char *
+    VSB_data(const struct vsb *s)
+    {
+
+        assert_VSB_integrity(s);
+        assert_VSB_state(s, VSB_FINISHED);
+
+        return (cheri_bounds_set(s->s_buf, s->s_len + 1));
+    }
+
+Still no bugs though.
+
+*/phk*
diff --git a/doc/sphinx/phk/cheri6.rst b/doc/sphinx/phk/cheri6.rst
new file mode 100644
index 000000000..adf5d83cb
--- /dev/null
+++ b/doc/sphinx/phk/cheri6.rst
@@ -0,0 +1,179 @@
+.. _phk_cheri_6:
+
+How Varnish met CHERI 6/N
+=========================
+
+Varnish Socket Addresses
+------------------------
+
+Socket addresses are a bit of a mess, in particular because nobody
+dared shake up all the IPv4 legacy code when IPv6 was introduced.
+
+In varnish we encapsulate all that uglyness in a ``struct suckaddr``,
+so named because it sucks that we have to spend time and code on this.
+
+In a case like this, it makes sense to make the internals strictly
+read-only, to ensure nobody gets sneaky ideas:
+
+.. code-block:: none
+
+    struct suckaddr *
+    VSA_Build(void *d, const void *s, unsigned sal)
+    {
+        struct suckaddr *sua;
+     
+        [… lots of ugly stuff …]
+    
+        return (RO(sua));
+    }
+
+It would then seem logical to use C's ``const`` to signal this fact,
+but since the current VSA api is currently defined such that users
+call ``free(3)`` on the suckaddrs when they are done with them, that does
+not work, because the prototype for ``free(3)`` is:
+
+.. code-block:: none
+
+	void free(void *ptr);
+
+So you cannot call it with a ``const`` pointer.
+
+All hail the ISO-C standards committee!
+
+This brings me to a soft point with CHERI: Allocators.
+
+How to free things with CHERI
+-----------------------------
+
+A very common design-pattern in encapsulating classes look something
+like this:
+
+.. code-block:: none
+
+    struct real_foo {
+        struct foo foo;
+        [some metadata about foo]
+    };
+    
+    const struct foo *
+    Make_Foo([arguments])
+    {
+        struct real_foo *rf;
+    
+        rf = calloc(1, sizeof *rf);
+        if (rf == NULL)
+            return (rf);
+        [fill in rf]
+        return (&rf->foo);
+    }
+
+    void
+    Free_Foo(const struct foo **ptr)
+    {
+        const struct foo *fp;
+        struct real_foo *rfp;
+
+        assert(ptr != NULL);
+        fp = *ptr;
+        assert(fp != NULL);
+        *ptr = NULL;
+
+        rfp = (struct real_foo*)((uintptr_t)fp);
+        [clean stuff up]
+    }
+
+We pass a ``**ptr`` to ``Free_Foo()``, another varnish style-pattern,
+so we can NULL out the holding variable in the calling function,
+to avoid a dangling pointer to the now destroyed object from
+causing any kind of trouble later.
+
+In the calling function this looks like:
+
+.. code-block:: none
+
+    const struct foo *foo_ptr;
+    […]
+    Free_Foo(&foo_ptr);
+
+If we use CHERI to make the foo truly ``const`` for the users of
+the API, we cannot, as above, wash the ``const`` away with a trip through
+``uintptr_t`` and then write to the metadata.
+
+The CHERI C/C++ manual, a terse academic tome, laconically mention that:
+
+*»Therefore, some additional work may be required to derive
+a pointer to the allocation’s metadata via another global capability,
+rather than the one that has been passed to free().«*
+
+Despite the understatement, I am very much in favour of this, because
+this is *precisely* why my own
+`phkmalloc <https://papers.freebsd.org/1998/phk-malloc/>`_
+became a big hit twenty years ago:  By firmly separating the metadata
+from the allocated space, several new classes of mistakes using the
+``malloc(3)`` API could, and was, detected.
+
+But this *is* going to be an embuggerance for CHERI users, because
+with CHERI getting from one pointer to different one is actual work.
+
+The only "proper" solution is to build some kind of datastructure:
+List, tree, hash, DB2 database, pick any poison you prefer, and
+search out the metadata pointer using the impotent pointer as key.
+Given that CHERI pointers are huge, it may be a better idea to embed
+a numeric index in the object and use that the key,
+
+An important benefit of this »additional work« is that if your
+free-function get passed a pointer to something else, you will
+find out, because it is not in your data-structure.
+
+It would be a good idea if CHERI came with a quality implementation
+of "Find my other pointer", so that nobody is forced to crack The
+Art of Computer Programming open for this.
+
+When the API is "fire-and-forget" like VSA, in the sense that there
+is no metadata to clean up, we can leave the hard work to the
+``malloc(3)`` implementation.
+
+Ever since ``phkmalloc`` no relevant implementation of ``malloc(3)``
+has dereferenced the freed pointer, in order to find the metadata
+for the allocation.  Despite its non-const C prototype ``free(3)``,
+will therefore happily handle a ``const`` or even CHERIed read-only
+pointer.
+
+But you *will* have to scrub the ``const`` off with a ``uintptr_t``
+to satisfy the C-compiler:
+
+.. code-block:: none
+
+    void
+    VSA_free(const struct suckaddr **vsap)
+    {       
+        const struct suckaddr *vsa;
+         
+        AN(vsap);
+        vsa = *vsap;
+        *vsap = NULL;
+        free((char *)(uintptr_t)vsa);
+    }
+
+Or in varnish style:
+
+.. code-block:: none
+
+    void
+    VSA_free(const struct suckaddr **vsap)
+    {
+        const struct suckaddr *vsa;
+    
+        TAKE_OBJ_NOTNULL(vsa, vsap, SUCKADDR_MAGIC);
+        free(TRUST_ME(vsa));
+    }
+
+
+Having been all over this code now, I have decided to constify ``struct
+suckaddr`` in varnish, even without CHERI, it is sounder that way.
+
+It is not bug, but CHERI made it a lot simpler and faster for me
+to explore the consequences of this change, so I will forfeit
+a score of "half a bug" to CHERI at this point.
+
+*/phk*
diff --git a/doc/sphinx/phk/index.rst b/doc/sphinx/phk/index.rst
index 5d33fef4e..4b792547e 100644
--- a/doc/sphinx/phk/index.rst
+++ b/doc/sphinx/phk/index.rst
@@ -13,6 +13,8 @@ You may or may not want to know what Poul-Henning thinks.
 .. toctree::
 	:maxdepth: 1
 
+	cheri6.rst
+	cheri5.rst
 	cheri4.rst
 	cheri3.rst
 	cheri2.rst


More information about the varnish-commit mailing list