RFC for VIP17: unix domain sockets for listen and backend addresses

Tue Apr 25 15:46:26 CEST 2017

On Mon, Apr 24, 2017 at 1:53 PM, Geoff Simmons <geoff at uplex.de> wrote:
> By request at bugwash, this is to open a thread for commentary about
> VIP17:
>
> https://github.com/varnishcache/varnish-cache/wiki/VIP-17%3A-Enable-Unix-domain-sockets-for-listen-and-backend-addresses

Thanks for the thread, I will allow myself to copy the VIP in its
current state here before I comment because I have noticed at least 2
updates in the wiki before this thread started.

# Synopsis
Allow unix domain sockets as a listen address for Varnish (``-a``
option) and as addresses for backends, and for use in ACLs. Obtain
credentials of the peer process connected on a UDS, such as uid and
gid, for use in VCL.

# Why?
* Eliminate the overhead of TCP/loopback for connections with peers
that are colocated on a host with Varnish.
* Restrict who can send requests to Varnish by setting permissions on
the UDS path of the listen address.
  * (But see the discussion below about getting this right portably.)
* Make it possible for a backend peer to require restricted
credentials for the Varnish process by setting permissions on the UDS
path on which it listens.
* Peer credentials make it possible to:
  * Make information about the peer available in VCL and the log.
  * Extend ACLs to make it possible to place further restrictions on
peers connecting to the listen address.

An obvious application is the use of SSL offloaders connecting to the
listen address, and SSL "onloaders" as backends. UDS would eliminate
the TCP overhead, and the ability to restrict the credentials of peers
mitigates the risk of man-in-the-middle attacks. Both haproxy and
nginx/ProxyPass, among others, support UDS addresses in "both
directions", so they are candidates for this purpose. A notable
exception is hitch, which currently only supports TCP connections. I
would be happy to help the hitch project support UDS (shouldn't be
hard at all).

I would like to make this contribution for the September 2017 release.
With the VIP I'd like to clarify:

* Are there any changes planned for VTCP and VSA in the September
release that would make adding UDS to those interfaces less trivial
than it is now?
* Every platform has a way to get peer credentials from a UDS, but
there's no standard and it's highly platform-dependent. So how do we
want to handle that?
* Additions/changes to VCL and other changes in naming, such as the
``-a`` option and backend definitions.
* If someone knows a reason why we shouldn't do this at all, this is
the place to say so.

# How?
## Address notation
I suggest that we require a prefix such as ``unix:`` to identify UDS
addresses (nginx uses ``unix:``, haproxy uses ``unix@``):
```
varnishd -a unix:/path/to/uds
backend uds { .host = "unix:/path/to/uds"; }
```
That makes the interpretation unambiguous. We could simply interpret
paths as UDS addresses when they appear in those places, but then we
would need logic like: if the argument cannot be resolved as a host or
parsed as an IP address, then assume it's a path for UDS, but if the
path does not exist or cannot be accessed, then fail. So better to
just make it unambiguous.

Parsing UDS addresses would be an extension of ``VSS_Resolver``.
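
As a rough illustration of the prefix check, here is a minimal sketch in
C. The helper name ``uds_parse`` is hypothetical and is not part of
Varnish or ``VSS_Resolver``; requiring an absolute path is also an
assumption made only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper (not a Varnish API): fill *uds from an address of
 * the form "unix:/path". Returns 0 on success, or -1 with errno set if
 * the prefix is missing, the path is not absolute (an assumption of this
 * sketch), or the path does not fit into sun_path.
 */
static int
uds_parse(const char *addr, struct sockaddr_un *uds)
{
	static const char prefix[] = "unix:";
	const char *path;

	assert(addr != NULL);
	assert(uds != NULL);

	if (strncmp(addr, prefix, sizeof prefix - 1) != 0) {
		errno = EINVAL;		/* not a UDS address */
		return (-1);
	}
	path = addr + sizeof prefix - 1;
	if (path[0] != '/') {
		errno = EINVAL;		/* sketch only accepts absolute paths */
		return (-1);
	}
	if (strlen(path) >= sizeof uds->sun_path) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memset(uds, 0, sizeof *uds);
	uds->sun_family = AF_UNIX;
	strcpy(uds->sun_path, path);	/* length was checked above */
	return (0);
}
```

Anything without the prefix is rejected here, so the caller can fall
through to the existing host/IP resolution without guessing.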

The name ``.host`` in a backend definition becomes a bit peculiar if
its value can also be a UDS (we will see a number of examples like
this). We could:

* stay with the name ``.host``, and document the fact that it might
not identify a host in some cases
* replace ``.host`` with a name like ``.peer``, sacrificing backward
compatibility
* introduce ``.peer``, retain ``.host`` as a deprecated alias, and
remove ``.host`` in a future release

I suggest the last option, comments welcome.

``.port`` in a backend definition is already optional, and is
unnecessary for a UDS. Should it be an error to specify a port when a
UDS is specified, or should it be ignored? Comments welcome.

## Access permissions on the listen address
For the ``-a`` address, I suggest an optional means of specifying who
can access the UDS:
```
varnishd -a unix:/path/to/uds:uid=foo,gid=bar
```
There's an issue here in that the separator (``:`` in the example)
could not appear in any UDS path. We might just have to forbid a
certain character in UDS paths. Fortunately we don't have such a
problem with backend addresses (which are generated by another server,
so we have less freedom to impose restrictions on the path names).

``uid`` and ``gid`` can be specified as numeric or with names. Either,
both or none of uid and gid would be permitted. Enforcing access
permissions would be tricky to get right portably and reliably (and
might just not work). From what I surmise at the moment (and I might
be quite wrong):

* Ownership would have to be set on the directory containing the UDS --
``/path/to/`` in the example.
  * BSD-derived systems do not restrict connects to the UDS itself due
to its permissions (or so I've read). But you can make a UDS
inaccessible to a process that can't read its directory.
* Then chmod the directory to 0700 or 0770, depending on whether
access is set for user and/or group.
  * This should be done before bind, creating the directory if necessary.
* On Linux, peers connecting to the UDS must have read/write
permission, so we would also set uid/gid ownership on the UDS and set
permissions to 0600 or 0660, as the case may be. Might as well do that
on every platform.
  * Must be done after bind and before listen.
* ``mgt_acceptor.c`` would do all of this. Typically the management
process runs as root and is able to change permissions and ownership;
if the management process owner can't do these things, then varnishd
fails to start.

So the sequence for the management process would be (again, unless I'm
getting this all wrong):
* create the directory if necessary
* if access restrictions were requested then set uid/gid and
permissions on the directory accordingly
* bind (note that ``VTCP_bind`` will have to unlink the path before
bind for a UDS, if the path already exists)
* set permissions on the UDS, at least read/write in all cases, and
set ownership if requested

Then the socket can be handed off to the child process for listen.

If no access restrictions were requested, then don't manipulate
ownership, let bind create the UDS, and set its permissions to 0666.

Comments and corrections on this section are very much welcome.

## VSA and VTCP
Extending these interfaces, in their current form, to accommodate UDS
is a piece of cake.

VSA can just as easily encapsulate ``sockaddr_un`` as it currently
does for the ip4 and ip6 types.

For the most part, VTCP just works with sockets, so it doesn't matter
whether they are TCP or UDS sockets. There would have to be some
changes about naming (``VTCP_name``, ``_myname`` and ``_hisname``),
but I'd like to set that aside for a moment, and get to the subject of
naming further down. Some other changes would involve:

* Unlink the UDS path before bind in ``VTCP_bind``
* Some new kinds of errors may result from ``VTCP_connect``, such as
EPERM or ENOENT, but we may not have to change anything for that --
``VTCP_connect`` currently just fails on error and lets the caller
decide what to do with the errno.
* We'll have to investigate which of the socket options are compatible
with UDS. From a quick look I suspect that these are at least
irrelevant to UDS and may be errors:
  * httpready
  * ``TCP_DEFER_ACCEPT``
  * ``TCP_FASTOPEN``
  * disabling Nagle (``TCP_NODELAY``)
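
One way to keep TCP-only options away from UDS sockets is to check the
address family before setting them. The sketch below does that for
``TCP_NODELAY``; the function names are hypothetical and this is not how
VTCP is currently structured.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>

/*
 * Hypothetical helper: 1 if fd is an AF_UNIX socket, 0 if it is some
 * other kind of socket, -1 on error.
 */
static int
sock_is_uds(int fd)
{
	struct sockaddr_storage ss;
	socklen_t len = sizeof ss;

	assert(fd >= 0);
	if (getsockname(fd, (struct sockaddr *)&ss, &len) != 0)
		return (-1);
	return (ss.ss_family == AF_UNIX);
}

/*
 * Hypothetical helper: disable Nagle on TCP sockets and treat the
 * request as a no-op on a UDS, where the option does not apply.
 */
static int
sock_set_nodelay(int fd)
{
	int one = 1;

	switch (sock_is_uds(fd)) {
	case -1:
		return (-1);
	case 1:
		return (0);	/* nothing to do for a UDS */
	default:
		break;
	}
	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one));
}
```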

My main question about all this is: are there plans to significantly
revise VSA and VTCP for the September release? Or can I expect that
they will remain fairly easy to extend for UDS?

A minor issue is that the name ``VTCP`` (all of the ``VTCP_*``
functions, the source name ``vtcp.c``, etc.) becomes a misnomer if it
also covers UDS. We could just live with that. OTOH a single git
commit could change it all at once, although we might have to bikeshed
over a new name (``VSOCK``?).

## Peer credentials
The good news is that all of the platforms listed as level A and B in
"Picking Platforms" (the phk rant) have the means to obtain
credentials of the peer on a connected UDS.

The bad news is that there's no standard, they're all different, and
they encompass different information.

* FreeBSD
  * ``getpeereid`` returns the EUID and EGID. OpenBSD appears to have
``getpeereid`` as well.
  * ``getsockopt(LOCAL_PEERCRED)`` returns credentials in the
``xucred`` struct defined in ``<sys/ucred.h>``, which includes EUID
and all of the groups to which the peer belongs.
* Linux
  * ``getsockopt(SO_PEERCRED)`` returns the ``ucred`` struct defined
in ``<sys/socket.h>`` which includes pid, uid and gid. It's not clear
to me from the manuals whether it's EUID/EGID or RUID/RGID.
(Googled-up examples seem to assume EUID/EGID.)
  * For ``getpeereid`` we'd have to link to libbsd.
* Solaris
  * Appears to have nothing like any of the other platforms, but it
does have ``getpeerucred``, which fills in a ``ucred_t`` defined in
``<ucred.h>``. This is an opaque structure with a [family of accessor
functions](https://docs.oracle.com/cd/E53394_01/html/E54766/ucred-get-3c.html)
``ucred_get*``, which tell you almost anything you can think of.
* MacOS/Darwin
  * Appears to be just like FreeBSD: ``getpeereid`` and
``getsockopt(LOCAL_PEERCRED)``

All of these obtain the credentials that were true when the peer
called ``connect`` or ``listen``, and according to the docs they can't
be faked (unless there's a kernel bug).

Most or all of these platforms have ways to receive peer credentials
in ancillary messages, which may contain more information, but that
may require that the peer co-operates, and we can't rely on that.

So it appears that the least common denominator is EUID and EGID
(assuming that's what you get in Linux). I suggest that we just go
with that, to be used as described below.

Because of all of the platform dependencies, there will have to be
something like ``cred_compat.h`` full of ``#ifdef``s, and probably
some ``configure.ac`` logic to figure it all out. We'll also have to
decide what to do when Varnish is built on a platform where we find
none of the above.
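
To give an idea of what such a compatibility layer could look like, here
is a minimal sketch that reduces the per-platform calls listed above to
a uid and a gid. The wrapper name ``uds_peer_ids`` and the choice of
preprocessor tests are assumptions; real code would presumably rely on
``configure.ac`` checks rather than compiler-defined macros.

```c
#define _GNU_SOURCE		/* glibc: needed for struct ucred */
#include <sys/types.h>
#include <sys/socket.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#if defined(__sun)
#  include <ucred.h>
#endif

/*
 * Hypothetical wrapper: store the uid and gid of the peer connected on
 * the UDS fd. Returns 0 on success, -1 with errno set on failure, and
 * fails with ENOSYS where none of the known mechanisms is available.
 */
static int
uds_peer_ids(int fd, uid_t *uid, gid_t *gid)
{
	assert(uid != NULL);
	assert(gid != NULL);
#if defined(__linux__)
	struct ucred cred;
	socklen_t len = sizeof cred;

	if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0)
		return (-1);
	*uid = cred.uid;
	*gid = cred.gid;
	return (0);
#elif defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__)
	return (getpeereid(fd, uid, gid));
#elif defined(__sun)
	ucred_t *uc = NULL;

	if (getpeerucred(fd, &uc) != 0)
		return (-1);
	*uid = ucred_geteuid(uc);
	*gid = ucred_getegid(uc);
	ucred_free(uc);
	return (0);
#else
	(void)fd;
	errno = ENOSYS;		/* no known way to get peer credentials */
	return (-1);
#endif
}
```

Callers would then only need to handle the failure case, whatever we
decide that should mean on an unsupported platform.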

## Address naming
Getting back to ``VTCP_name``, ``_hisname`` and ``_myname``: these are
currently hard-wired in their signatures for an address and a port,
and they're spread out all over the place in Varnish.

IMO the least obtrusive way to adapt this for UDS would be to generate
the UDS path in the address position, and generate a string
``"<uid>:<gid>"`` where the port is currently generated. Or we could
bite the bullet by changing these three functions to something less
hard-wired, then go find all of the places where they are called and
figure out what to do. I suggest the less obtrusive option, at least
in an initial implementation, although admittedly the more difficult
option may be the right thing in the long run. Comments are welcome.

Assuming we go for ``"<uid>:<gid>"`` in the "port" position -- we
could generate that string always using the numeric IDs. Or should we
call getpwuid/getgrgid, and generate the names if we can get them?
Comments welcome.

We'd have to decide what to do on a platform where we don't have a way
(or haven't figured out how) to get the peer credentials. Generate
``":"`` or ``"?:?"``? Comments welcome again.
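
For illustration, generating the string for the "port" position could
look like this sketch. The helper ``uds_cred_string`` and its
``use_names`` flag are hypothetical; the name lookups use ``getpwuid``
and ``getgrgid``, which map numeric IDs to names, and fall back to the
numbers when no name is found.

```c
#include <sys/types.h>
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "<uid>:<gid>" into buf. With use_names set,
 * names are used where the lookup succeeds and numbers otherwise.
 */
static void
uds_cred_string(char *buf, size_t len, uid_t uid, gid_t gid, int use_names)
{
	char user[64], group[64];
	const struct passwd *pw;
	const struct group *gr;

	assert(buf != NULL);

	/* Copy each name right away; the lookups return static storage. */
	pw = use_names ? getpwuid(uid) : NULL;
	if (pw != NULL)
		snprintf(user, sizeof user, "%s", pw->pw_name);
	else
		snprintf(user, sizeof user, "%lu", (unsigned long)uid);

	gr = use_names ? getgrgid(gid) : NULL;
	if (gr != NULL)
		snprintf(group, sizeof group, "%s", gr->gr_name);
	else
		snprintf(group, sizeof group, "%lu", (unsigned long)gid);

	snprintf(buf, len, "%s:%s", user, group);
}
```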

## VCL/VRT
Additions and changes to VCL and VRT involve:
* VCL variables ``*.ip``: ``client.ip``, ``local.ip``, ``server.ip``,
``remote.ip`` and ``beresp.backend.ip``
* VCL data type IP
* introducing VMOD std functions to return the uid and gid for the
``*.ip`` objects, as numbers or names
* extending ACLs to specify UDSen and optionally peer credentials
* VRT: types ``VCL_IP`` and ``struct vrt_backend``, and the VRT
functions related to ``VCL_IP`` and ``suckaddr``

The ``*.ip`` variables essentially encapsulate suckaddrs, which we
don't have to change. For the string conversion, if the suckaddr wraps
a sockaddr_un, then return the UDS path.

Here again we have the problem that the names ``*.ip`` are
inappropriate, since the value could be a UDS. Again I suggest the
strategy of introducing a new name, in this case ``*.addr``, and
deprecating the old names, but leaving the old names around until a
future release.

``VCL_IP`` is just a suckaddr, so we don't have to change anything,
but we have another inappropriate name for UDSen. The same goes for
data type ``IP``. Again I suggest the strategy of introducing new
names, ``ADDR`` and ``VCL_ADDR`` (``VCL_ADDR`` defined as exactly the
same typedef as ``VCL_IP``), and deprecating the old names.

I suggest adding functions like these to VMOD std, with the obvious
implementations:
* ``INT uid_number(ADDR addr, INT fallback)``
* ``STRING uid_name(ADDR addr, STRING fallback)``
* ``INT gid_number(ADDR addr, INT fallback)``
* ``STRING gid_name(ADDR addr, STRING fallback)``

Of course these would always return the fallbacks for non-UDS addresses.

ACLs can be extended to include paths for a UDS and restrictions on the uid/gid:
```
acl foo {
    "/path/to/uds";
    "/path/with/a/*/wildcard";
    "/path/with/a/uid/restriction",uid=4711;
    "/path/with/more/r?strictions",uid=foo,gid=bar;
}
```
So we can: name UDS paths in an ACL, allow filename globbing, include
restrictions on the uid and gid, and allow both numbers and names for
uid/gid.
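
A rough sketch of how such an ACL could be matched at runtime, using
``fnmatch`` for the globbing. The entry layout and function name are
hypothetical, names are assumed to have been resolved to numeric IDs
when the VCL was compiled, and the use of ``FNM_PATHNAME`` (so that
``*`` and ``?`` never match a ``/``) is an assumption about the intended
wildcard semantics.

```c
#include <sys/types.h>
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical compiled form of one ACL entry. */
struct uds_acl_entry {
	const char	*pattern;	/* UDS path, may contain wildcards */
	int		check_uid;	/* nonzero if uid= was given */
	uid_t		uid;
	int		check_gid;	/* nonzero if gid= was given */
	gid_t		gid;
};

/*
 * Hypothetical matcher: returns 1 if some entry matches the UDS path
 * and the peer credentials, 0 otherwise.
 */
static int
uds_acl_match(const struct uds_acl_entry *acl, size_t n,
    const char *path, uid_t uid, gid_t gid)
{
	size_t i;

	assert(acl != NULL);
	assert(path != NULL);
	for (i = 0; i < n; i++) {
		if (fnmatch(acl[i].pattern, path, FNM_PATHNAME) != 0)
			continue;
		if (acl[i].check_uid && acl[i].uid != uid)
			continue;
		if (acl[i].check_gid && acl[i].gid != gid)
			continue;
		return (1);
	}
	return (0);
}
```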

I'm not sure what to do about ``struct vrt_backend``, which currently
has fields for IPv4 and IPv6 addresses, both as strings and suckaddrs.
I doubt that it makes sense just to add the same fields for UDS
addresses, since the point is that a backend may have both kinds of IP
addresses, but it won't also have a UDS address at the same time.

We might have to introduce something like this:
```
union addr {
    struct {
        char *ipv4_addr;
        char *ipv6_addr;
        struct suckaddr *ipv4_suckaddr;
        struct suckaddr *ipv6_suckaddr;
    } ip;
    struct {
        char *path;
        struct suckaddr *uds_suckaddr;
    } uds;
};
```
... and then use the union type for the "address" field of the backend
definition -- it's either an IP address, which can be one or both of
IPv4 and IPv6, or a UDS. Comments welcome.
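
Note that a bare union does not record which member is in use, so some
kind of discriminator would be needed next to it. A minimal sketch,
where every name (``backend_addr``, ``BACKEND_ADDR_*``,
``backend_addr_string``) is hypothetical and not a proposal for the
actual ``struct vrt_backend`` layout:

```c
#include <assert.h>
#include <stddef.h>

struct suckaddr;	/* opaque here; only pointers to it are stored */

enum backend_addr_kind {
	BACKEND_ADDR_IP,
	BACKEND_ADDR_UDS
};

/* Hypothetical tagged variant of the union. */
struct backend_addr {
	enum backend_addr_kind	kind;	/* selects the valid member of u */
	union {
		struct {
			const char		*ipv4_addr;
			const char		*ipv6_addr;
			const struct suckaddr	*ipv4_suckaddr;
			const struct suckaddr	*ipv6_suckaddr;
		} ip;
		struct {
			const char		*path;
			const struct suckaddr	*uds_suckaddr;
		} uds;
	} u;
};

/*
 * Hypothetical accessor: the string form of the address, preferring
 * IPv4 over IPv6 for IP backends and returning the path for a UDS.
 */
static const char *
backend_addr_string(const struct backend_addr *a)
{
	assert(a != NULL);
	switch (a->kind) {
	case BACKEND_ADDR_UDS:
		return (a->u.uds.path);
	case BACKEND_ADDR_IP:
		if (a->u.ip.ipv4_addr != NULL)
			return (a->u.ip.ipv4_addr);
		return (a->u.ip.ipv6_addr);
	}
	return (NULL);
}
```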

I think that the VRT functions that currently use ``VCL_IP`` and
suckaddrs can be adapted either without changes or very
straightforwardly, but again we'll want to introduce "addr" where "ip"
currently appears in the names, and deprecate the old names:
* ``VRT_acl_match``: use the ``VCL_ADDR`` type in the signature
* ``VRT_ipcmp``: no change
* ``VRT_IP_string``: introduce ``char *VRT_ADDR_string(VRT_CTX,
VCL_ADDR)`` with the same function, and deprecate the old one