Implementing the HTTP Vary: header efficiently

The HTTP header "Vary:" lists the request header fields that affected the content generation process.

Typical examples are choice of language based on the "Accept-Language:" header or compression based on the "Accept-Encoding:" header.

The relevant text from RFC 2616 says:

  13.6 Caching Negotiated Responses
  [...]
  When the cache receives a subsequent request whose Request-URI
  specifies one or more cache entries including a Vary header field,
  the cache MUST NOT use such a cache entry to construct a response to
  the new request unless all of the selecting request-headers present
  in the new request match the corresponding stored request-headers in
  the original request.

Allowed transforms

Certain trivial transforms, and a single very complex one, are allowed when determining if two headers match, but it is not obvious that implementing any of these will result in any benefit for Varnish at this point.

The cost of not implementing these transforms is multiple identical cached copies of the same object, because "Accept-Encoding: compress, gzip" is different from "Accept-Encoding: gzip, compress".

I think we can ignore the permitted Vary transforms for two reasons:

First: the set of possible header values we can encounter is, more or less, limited by the number of different user agents (times their relevant versions).

But second, and more importantly, we can mitigate this explosion in VCL by rewriting the relevant headers before we go to the backend.

We could for instance rewrite any Accept-Encoding: line that includes a non-zero q value for gzip to retain just the gzip part, so that the backend will see either no Accept-Encoding or exactly Accept-Encoding: gzip.
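The rewrite described above can be sketched as follows. This is only an illustration of the logic, written in Python for brevity; in practice it would be expressed in VCL. The function name normalize_accept_encoding is hypothetical, and the sketch assumes that a coding without an explicit q parameter counts as q=1 and ignores "*" wildcards.

```python
def normalize_accept_encoding(value):
    """Return "gzip" if the header accepts gzip with a non-zero q value, else None.

    None means "send no Accept-Encoding header to the backend".
    """
    if value is None:
        return None
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "gzip":
            continue
        q = 1.0  # assumption: no explicit q parameter means q=1
        for param in parts[1:]:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val.strip())
                except ValueError:
                    q = 0.0  # treat an unparsable q value as "not acceptable"
        if q > 0:
            return "gzip"
    return None
```

With this in place, "compress, gzip" and "gzip, compress" both collapse to "gzip", so the backend sees only two variants.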

For this strategy to work, the header rewrites must accurately reflect the backend's decisions, and it may well be a better idea to move the decision entirely to Varnish, by rewriting the URL accordingly or similar.

But by default, Varnish will perform no transforms on the headers singled out by Vary: for comparison.

Storage Concerns

The Varnish hash/storage facility already supports multiple objects at the same hash location. Provided the backend sends correct Vary: headers, listing not only the headers used for the content decision but also the headers which would have had an impact had they been present, the order of objects on the hash chain is not important for Vary: processing.

Hash Lookup processing

When we look up an object, instead of just validating the found object on hash match and TTL, we also need to ensure Vary: compatibility.

If there was no Vary: header on the object, it is acceptable and processing continues.

If there was a Vary: header on the object, an encoded byte-stream will contain instructions for matching the request's headers to determine a match or a miss.

The encoded byte-stream will essentially contain a sequence of (Header_name, Header_contents) tuples, and if they all match, the object is compatible with the request.
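A minimal sketch of that comparison, working on already-decoded tuples rather than on the byte-stream itself. The function name vary_matches and the use of None to mean "header was not present" are assumptions made for this illustration. Header names are compared case-insensitively, and contents are compared byte for byte, with no transforms applied.

```python
def vary_matches(stored, request_headers):
    """Check a request against the (name, contents) tuples stored with an object.

    stored: list of (header_name, contents); contents is None if the header
            was absent from the request that produced the object.
    request_headers: dict mapping header name to value for the new request.
    """
    lowered = {name.lower(): value for name, value in request_headers.items()}
    for name, contents in stored:
        # A missing header yields None, which only matches a stored None.
        if lowered.get(name.lower()) != contents:
            return False
    return True
```

An object stored with [("Accept-Encoding", "gzip")] is thus a hit for a request carrying exactly "gzip" and a miss for one carrying "gzip, compress" or no Accept-Encoding at all.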

Insert processing

At the time an object is inserted into the hash/storage, we must identify and process any Vary: header we find, and encode a byte-stream, built from the request we sent to the backend, to store with the object.

I stress "sent to the backend" because this may not be identical to what was in the original request if extensive rewrites have been going on. Furthermore, keeping that request around will require some remodelling of the memory usage, because currently it is overwritten by the response from the backend in struct vbe_conn.
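The insert step could look roughly like this, before any byte-level encoding is applied. The function name build_vary_tuples is hypothetical, None is used here to record that a header was absent, and the special value "Vary: *" is deliberately left out of the sketch.

```python
def build_vary_tuples(vary_value, backend_request):
    """Collect (name, contents) pairs for every header named in a Vary: value.

    vary_value: the contents of the Vary: header in the backend response.
    backend_request: dict of the headers actually sent to the backend,
                     which may differ from the client's original request.
    """
    lowered = {name.lower(): value for name, value in backend_request.items()}
    tuples = []
    for name in vary_value.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas
        # Record None when the header was not sent to the backend.
        tuples.append((name, lowered.get(name.lower())))
    return tuples
```

Recording absent headers explicitly matters, because a later request that does carry such a header must not match this object.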

Prefetch processing

When prefetch is implemented, the request sent to the backend must include the exact fields quoted in the Vary: header in the object, and we can reconstruct these from the encoded byte-stream.
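As a rough sketch, reconstructing those fields from decoded (name, contents) tuples is straightforward. The function name prefetch_headers is hypothetical, and None is assumed to mark a header that was absent from the original backend request, in which case it must stay absent from the prefetch request as well.

```python
def prefetch_headers(stored):
    """Rebuild the selecting request headers from stored (name, contents) pairs."""
    headers = {}
    for name, contents in stored:
        if contents is not None:  # None marks a header that was not sent
            headers[name] = contents
    return headers
```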

We must not assume that the new object will have the same Vary: header as the old one.

If we implement prefetch so that it can do TTL extensions on object identity, then the Vary: header will need special care. If prefetch always creates a new object, nothing magic needs to happen.

Encoded Byte-Stream

The encoded byte-stream is not a published interface, so the exact layout will likely remain isolated entirely in cache_vary.c, and the following should just be considered a rough sketch of the idea:

N * {
    1 byte    Header-name length + 1
    N bytes   Header
    1 byte    ':'
    1 byte    '\0'
    2 bytes   Big-Endian encoded length of content. 0xffff: not present.
    X bytes   Header content.
    }
1 byte        '\0'
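To make the layout concrete, here is a small encoder and decoder following the sketch above literally. It is written in Python purely for illustration and is not the contents of cache_vary.c. The function names, the use of None for "not present", and the choice of Latin-1 for header contents are all assumptions.

```python
import struct

ABSENT = 0xFFFF  # length value meaning "header not present"

def encode_vary(tuples):
    """Encode (name, contents) pairs; contents is None for an absent header."""
    out = bytearray()
    for name, contents in tuples:
        raw_name = name.encode("ascii")
        if not 0 < len(raw_name) < 255:
            raise ValueError("header name length must fit in one byte")
        out.append(len(raw_name) + 1)      # header-name length + 1
        out += raw_name + b":\0"           # name, ':' and '\0'
        if contents is None:
            out += struct.pack(">H", ABSENT)
        else:
            raw = contents.encode("latin-1")
            if len(raw) >= ABSENT:
                raise ValueError("header contents too long")
            out += struct.pack(">H", len(raw)) + raw
    out.append(0)                          # terminating '\0'
    return bytes(out)

def decode_vary(blob):
    """Decode a byte-stream produced by encode_vary back into pairs."""
    tuples = []
    pos = 0
    while blob[pos] != 0:
        n = blob[pos]                      # name length + 1
        name = blob[pos + 1:pos + n].decode("ascii")
        if blob[pos + n:pos + n + 2] != b":\0":
            raise ValueError("malformed entry")
        pos += n + 2
        (length,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        if length == ABSENT:
            tuples.append((name, None))
        else:
            tuples.append((name, blob[pos:pos + length].decode("latin-1")))
            pos += length
    return tuples
```

Because every entry starts with a non-zero length byte, the single trailing '\0' is enough to mark the end of the sequence.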