Implementing the HTTP Vary: header efficiently
The HTTP "Vary:" header lists the header fields which affected the content generation process.
Typical examples are choice of language based on the "Accept-Language:" header or compression based on the "Accept-Encoding:" header.
The relevant text from RFC2616 says:
13.6 Caching Negotiated Responses [...] When the cache receives a subsequent request whose Request-URI specifies one or more cache entries including a Vary header field, the cache MUST NOT use such a cache entry to construct a response to the new request unless all of the selecting request-headers present in the new request match the corresponding stored request-headers in the original request.
Allowed transforms
Certain trivial transforms, and a single very complex one, are allowed when determining if two headers match, but it is not obvious that implementing any of these will result in any benefit for Varnish at this point.
The cost of not implementing these transforms is multiple identical cached copies of the same object because Accept-Encoding: compress, gzip is different from Accept-Encoding: gzip, compress.
I think we can ignore the permitted Vary transforms for two reasons:
First: the set of possible headers we can meet is, more or less, limited by the number of different user agents (times their relevant versions).
But second, and more importantly, we can mitigate this explosion in VCL by rewriting the relevant headers before we go to the backend.
We could for instance rewrite any Accept-Encoding: line that includes a non-zero q value for gzip to retain just the gzip part, so that the backend will see either no Accept-Encoding or exactly Accept-Encoding: gzip.
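As a rough illustration of the decision such a rewrite would make, here is a minimal C sketch. In practice the rule would be written in VCL; the function names accepts_gzip() and normalize_accept_encoding() are hypothetical and not part of Varnish. The sketch only looks at the literal "gzip" token and ignores "*" and "x-gzip", which a real rule might want to handle.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>	/* strncasecmp (POSIX) */

/* Return 1 if the Accept-Encoding value lists gzip with a non-zero q. */
static int
accepts_gzip(const char *value)
{
	const char *p = value;

	assert(value != NULL);
	while (*p != '\0') {
		const char *end = p + strcspn(p, ",");	/* end of this list element */
		const char *q;

		while (p < end && isspace((unsigned char)*p))
			p++;
		if (end - p >= 4 && strncasecmp(p, "gzip", 4) == 0) {
			q = p + 4;
			while (q < end && isspace((unsigned char)*q))
				q++;
			if (q == end)
				return (1);	/* no parameter: q defaults to 1 */
			if (*q == ';') {
				q++;
				while (q < end && isspace((unsigned char)*q))
					q++;
				if (q < end && tolower((unsigned char)*q) == 'q') {
					q++;
					while (q < end && isspace((unsigned char)*q))
						q++;
					if (q < end && *q == '=' &&
					    strtod(q + 1, NULL) > 0.0)
						return (1);
				}
			}
		}
		p = (*end == ',') ? end + 1 : end;
	}
	return (0);
}

/*
 * Return the header value the backend should see: "gzip", or NULL
 * meaning "send no Accept-Encoding header at all".
 */
static const char *
normalize_accept_encoding(const char *value)
{
	if (value != NULL && accepts_gzip(value))
		return ("gzip");
	return (NULL);
}
```

With this, "compress, gzip" and "gzip, compress" both collapse to the same value, so only one compressed variant ends up in the cache.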
For this strategy to work, the header rewrites must accurately reflect the backend's decisions, and it may well be a better idea to move the decision entirely to Varnish, by rewriting the URL accordingly or similar.
But by default, Varnish will perform no transforms on the headers singled out by Vary: for comparison.
Storage Concerns
The Varnish hash/storage facility already supports multiple objects at the same hash-location. Provided the backend offers correct Vary: headers, listing not only the headers used for the content decision but also the headers which would have had an impact had they been present, the order of objects on the hash-chain is not important in the context of Vary: processing.
Hash Lookup processing
When we look an object up, instead of just validating the found object on hash-match and TTL, we also need to ensure Vary: compatibility.
If there was no Vary: header on the object, it is acceptable and processing continues.
If there was a Vary: header on the object, an encoded byte-stream stored with the object will contain instructions for matching the request's headers to determine a match or a miss.
The encoded byte-stream will essentially contain a sequence of (Header_name, Header_contents) tuples, and if they all match, the object is compatible with the request.
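The matching step could be sketched in C roughly as follows. This is only an illustration under stated assumptions: it uses the rough layout from the "Encoded Byte-Stream" section, reads the length byte as counting the name plus its ':', and compares contents byte for byte (no transforms). struct req_hdr, req_lookup() and vary_match() are hypothetical names, not existing Varnish code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp (POSIX) */

/* Hypothetical, simplified view of a request header. */
struct req_hdr {
	const char *name;	/* without the ':', e.g. "Accept-Encoding" */
	const char *value;
};

/* Find a request header by name, case-insensitively; NULL if absent. */
static const char *
req_lookup(const struct req_hdr *h, size_t nh,
    const unsigned char *name, size_t nlen)
{
	size_t i;

	for (i = 0; i < nh; i++)
		if (strlen(h[i].name) == nlen &&
		    strncasecmp(h[i].name, (const char *)name, nlen) == 0)
			return (h[i].value);
	return (NULL);
}

/* Return 1 if every tuple in the stream matches the request, else 0. */
static int
vary_match(const unsigned char *vary, const struct req_hdr *h, size_t nh)
{
	const unsigned char *p = vary;

	assert(vary != NULL);
	while (*p != 0) {			/* a zero length byte ends the stream */
		size_t nlen = (size_t)*p - 1;	/* length byte counts the ':' */
		const unsigned char *name = p + 1;
		const unsigned char *lenp = name + nlen + 2;	/* skip ':' and NUL */
		unsigned clen = ((unsigned)lenp[0] << 8) | lenp[1];
		const char *val = req_lookup(h, nh, name, nlen);

		p = lenp + 2;
		if (clen == 0xffff) {
			/* Header was absent originally: must be absent now. */
			if (val != NULL)
				return (0);
			continue;
		}
		if (val == NULL || strlen(val) != clen ||
		    memcmp(val, p, clen) != 0)
			return (0);
		p += clen;
	}
	return (1);
}
```

An object stored without a Vary: header could simply carry an empty stream (a single zero byte), which matches any request.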
Insert processing
At the time an object is inserted into the hash/storage, we must identify and process any Vary: header we find and encode, from the request we sent to the backend, a byte-stream to store with the object.
I stress "sent to the backend" because this may not be identical to what was in the original request if extensive rewrites have taken place. Furthermore, keeping that request around will require some remodelling of the memory usage, because currently it is overwritten by the response from the backend in struct vbe_conn.
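A minimal C sketch of the encoding step might look like this. It assumes the rough layout from the "Encoded Byte-Stream" section, with the length byte counting the name plus its ':'. The names struct req_hdr and vary_encode() are hypothetical, the request is modelled as a plain array of name/value pairs standing in for the request actually sent to the backend, and "Vary: *" is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp (POSIX) */

/* Hypothetical, simplified view of a header in the backend request. */
struct req_hdr {
	const char *name;	/* without the ':' */
	const char *value;
};

/*
 * Encode one tuple per field named in the Vary: value.
 * Returns the number of bytes written, or -1 if it does not fit.
 */
static long
vary_encode(const char *vary, const struct req_hdr *h, size_t nh,
    unsigned char *out, size_t outsz)
{
	const char *p = vary;
	size_t o = 0;

	assert(vary != NULL && out != NULL);
	while (*p != '\0') {
		const char *val = NULL;
		size_t nlen, vlen, i;

		p += strspn(p, ", \t");		/* skip separators */
		nlen = strcspn(p, ", \t");	/* length of one field name */
		if (nlen == 0)
			break;
		if (nlen > 254)
			return (-1);		/* must fit the length byte */
		for (i = 0; i < nh; i++)
			if (strlen(h[i].name) == nlen &&
			    strncasecmp(h[i].name, p, nlen) == 0) {
				val = h[i].value;
				break;
			}
		vlen = (val == NULL) ? 0 : strlen(val);
		if (vlen >= 0xffff)
			return (-1);		/* 0xffff is reserved */
		/* tuple size, plus one byte kept for the terminator */
		if (o + nlen + 5 + vlen + 1 > outsz)
			return (-1);
		out[o++] = (unsigned char)(nlen + 1);
		memcpy(out + o, p, nlen);
		o += nlen;
		out[o++] = ':';
		out[o++] = '\0';
		if (val == NULL) {
			out[o++] = 0xff;	/* header not present */
			out[o++] = 0xff;
		} else {
			out[o++] = (unsigned char)(vlen >> 8);
			out[o++] = (unsigned char)(vlen & 0xff);
			memcpy(out + o, val, vlen);
			o += vlen;
		}
		p += nlen;
	}
	if (o + 1 > outsz)
		return (-1);
	out[o++] = '\0';			/* end of stream */
	return ((long)o);
}
```

Note that a header named in Vary: but missing from the request still produces a tuple, marked 0xffff, so that a later request which does carry that header is treated as a miss.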
Prefetch processing
When prefetch is implemented, the request sent to the backend must include the exact fields quoted in the Vary: header in the object, and we can reconstruct these from the encoded byte-string.
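Reconstructing the request headers from the stored byte-stream could look roughly like the C sketch below. It assumes the rough layout from the "Encoded Byte-Stream" section, with the length byte counting the name plus its ':'. vary_to_headers() is a hypothetical name, and headers recorded as "not present" (0xffff) are simply left out of the prefetch request.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write "Name: value\r\n" lines for every header present in the stream.
 * Returns the length of the resulting string, or -1 if it does not fit.
 */
static long
vary_to_headers(const unsigned char *vary, char *out, size_t outsz)
{
	const unsigned char *p = vary;
	size_t o = 0;

	assert(vary != NULL && out != NULL);
	while (*p != 0) {			/* a zero length byte ends the stream */
		size_t nlen = *p;		/* name length including the ':' */
		const unsigned char *name = p + 1;
		const unsigned char *lenp = name + nlen + 1;	/* skip the NUL */
		unsigned clen = ((unsigned)lenp[0] << 8) | lenp[1];

		p = lenp + 2;
		if (clen == 0xffff)
			continue;		/* header was absent: send nothing */
		/* "Name:" + ' ' + value + CRLF, plus the final NUL */
		if (o + nlen + 1 + clen + 2 + 1 > outsz)
			return (-1);
		memcpy(out + o, name, nlen);
		o += nlen;
		out[o++] = ' ';
		memcpy(out + o, p, clen);
		o += clen;
		out[o++] = '\r';
		out[o++] = '\n';
		p += clen;
	}
	if (o + 1 > outsz)
		return (-1);
	out[o] = '\0';
	return ((long)o);
}
```

Because the name is stored together with its ':', it can be copied into the outgoing request as it stands.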
We must not assume that the new object will have the same Vary: header as the old one.
If we implement prefetch so that it can do TTL extensions on object identity, then the Vary: header will need special care. If prefetch always creates a new object, nothing magic needs to happen.
Encoded Byte-Stream
The encoded byte-stream is not a published interface, so the exact layout will likely remain isolated entirely in cache_vary.c, and the following should be considered just a rough sketch of the idea:
N * {
    1 byte   Header-name length + 1
    N bytes  Header
    1 byte   ':'
    1 byte   '\0'
    2 bytes  Big-Endian encoded length of content. 0xffff: not present.
    X bytes  Header content.
}
1 byte   '\0'
