Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic

Fri Dec 17 13:16:42 UTC 2021

> Easy one: you build the key from the request, while the vary header is a response property. Therefore, Varnish can't know to put the varied headers into the hash, because it doesn't have that information yet.

Of course! Makes perfect sense. But the documentation should make this crystal clear, if you ask me. Like a page titled "How is the cache key calculated?" which explains everything related to this. Like how the hash works, how the default implementation works, how to modify it, and how the backend response still can cause multiple different versions (like the Vary header, and maybe other so called "secondary keys"?), and how to modify that behavior. I also think it would make sense if this documentation also mentioned how all this information can be debugged using varnishlog or other tools.

Is there even an official word for this final "cache key"? "Hash" clearly isn't specific enough. I'm talking about a word that refers to the unique key that always corresponds to only a _single_ version of a cached object.

>Don't touch that guy! Varnish will ignore "accept-encoding" in "vary" because it handles compression internally, and always forces "accept-encoding: gzip" before entering vcl_backend_fetch. If your VCL mentions accept-encoding, it's almost always wrong.

Sorry, I'm confused now... Don't touch _which_ guy? Our VCL doesn't contain anything regarding "Accept-Encoding". All I said was that the Vary header in the response from the backend is "Accept-Encoding". And the way I see it, this shouldn't be the cause of the strange problems we are seeing, since even when factoring in this, there should exist a matching cached object for me, and it should be served regardless of TTL or backend health as long as the grace hasn't expired (which it hasn't). Or is my reasoning flawed here, based on the VCL snippet in my original post? Can you think of a scenario where our VCL would return the synthetic 500 page even when there exists a cached objekt matching the hash and vary logic?

> The default VSL space is 80MB, which is "only" worth a few (tens of) thousands requests, so yeah, it can be a short backlog.

Yeah, I think 80MB is a bit to small for us. Ideally we should be able to sit down on a Monday and troubleshoot problems that occured Friday evening, but that might require a way too big VSL space. But a few hundred MB should be fine.

> The problem is that you have to be recording when the initial request goes through. But, if you have then, cache hits will show the VXID of that first request in their "x-varnish" header, and you can find it this way ("varnishlog -r log.bin -q 'vxid == THE_VXID'")

Well, would it really be a cache hit? The main transaction I'm looking for is the first transaction for a specific path (in this case, "/") where Varnish served the synthetic 500 page. And then I would also like to see the closest preceding transaction for that same page, where the hash (and the Vary logic) matches the main transaction mentioned above.

Picture for example this scenario:

1. Request for "/", with with a certain hash and "secondary key" (from the "Accept-Encoding"). Data is fetched from backend and stored in cache.
2. Lots of other, irrelevant requests, including requests for "/" but with a different hash and/or "Accept-Encoding"
3. Request for "/" that uses the cached data from point 1 above
4. Lots of other, irrelevant requests, same as 2 above, but now in such big numbers that the backend servers start having problems
5. Request for "/" with same hash and/or "Accept-Encoding" as 1 above, but for some reason Varnish now returns the synthetic 500 page and puts it in the cache

The transaction log for request in step 5 is the main one I want to find, but also the transaction for step 3 and possibly step 1 as well. But the transaction for step 5 is the most interesting one, and I don't understand how that would be a cache hit while returning a fresh synthetic 500 error page.

Or are you saying that we can find "number 5" above by first finding some transaction later in the log, with a cache hit for the 500 page, and that transaction in turn points to "number 5" since that was the transaction that created the 500 response?

This makes me realize that when looking at past log data, ie using the "-d" parameter, or reading from a file, it looks for matching transactions/requests starting from the beginning. But in this kind troubleshooting I think it could be useful to also be able to search from the end and go backwards. Is that possible? So that if I use the "-" parameter to limit the output to a single transaction, it outputs the _last_ one instead of the first one.