From guillaume.quintard at gmail.com  Mon Dec  6 01:32:27 2021
From: guillaume.quintard at gmail.com (Guillaume Quintard)
Date: Sun, 5 Dec 2021 17:32:27 -0800
Subject: Writing vmods in Rust
Message-ID:

Hi everyone,

I've been working for a while on a little project that ended up taking quite a chunk of my time, and while it's nowhere near done, I feel like it's time to show the current progress. *In other words: that stuff ain't ready, don't put it in prod.*

Here we go: we can *build vmods using pure Rust*. For example, these:
https://github.com/gquintard/varnish-rs/tree/main/examples

And the crates `varnish` and `varnish-sys` that make this possible have been published on crates.io, and we have documentation!
https://docs.rs/varnish/latest/varnish/

Without trying to start a language war, there are a few benefits to using Rust over C:
- the language is designed to prevent a few classes of annoying bugs, and the compiler is very good at explaining where you went wrong
- a lot of the C practices (assert everywhere, miniobj, etc.) are basically baked into Rust
- the barrier between C and Rust is razor-thin, with almost zero overhead cost (curse you, null-terminated strings!)
- the tooling is absolutely stellar
- string manipulation is sane (yeah, easy stab at C, but I had to do it)

What you get at the moment:
- few dependencies: you only need cargo, python3 and the libvarnish dev files to get going
- write the same vmod.vcc as you would for C; the boilerplate is generated for you
- building is literally just one command: `cargo build`
- automatic type translation so you can write pure Rust and never need to see the VCL objects in their C form (https://docs.rs/varnish/latest/varnish/vcl/convert/index.html)
- support for PRIV/objects (https://github.com/gquintard/varnish-rs/blob/main/examples/vmod_object/vmod.vcc and https://github.com/gquintard/varnish-rs/blob/main/examples/vmod_object/src/lib.rs)
- you can iterate through and modify the HTTP objects easily (https://docs.rs/varnish/latest/varnish/vcl/ctx/struct.Ctx.html)

What you don't get for now:
- a comprehensive API; some bits are still missing, like WS_ReserveAll/WS_Release, but because of the type conversion you can possibly live without them
- a stable API; I'm working on it heavily, and things will change twenty times before the next Varnish release (but it doesn't matter, as the crates are version-locked)
- something based solely on the public library (libvarnishapi); I reimplemented some stuff using private bits to avoid copies
- all the types; I only implemented a subset of the VCL types, but fret not, the others will come
- sensible boilerplate code generation; the current approach involves embedding a Python script that imports `vmodtool.py` and running it from Rust, but there's a plan to fix that

In case you want to try it, I've also created a minimal example repo that you can play with:

    git clone https://github.com/gquintard/vmod-rs-template.git
    cd vmod-rs-template
    cargo build
    cargo test

I will continue working on expanding the API and documentation, and will start filing issues to come up with a roadmap. If you are interested in helping, have questions or feedback, please do ping me, I'd love to hear from you.

Cheers!

-- 
Guillaume Quintard
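For orientation, here is a minimal sketch of what the VCL side of such a vmod looks like once built; the vmod name `rustdemo`, the function `hello()`, and the library path are hypothetical placeholders, not names taken from the repositories above:

```
vcl 4.1;

# Hypothetical example: a Rust-built vmod is consumed from VCL exactly
# like a C one, by importing the shared object that `cargo build` produced.
import rustdemo from "/path/to/target/debug/libvmod_rustdemo.so";

sub vcl_deliver {
    # Call into Rust; the automatic type translation converts the VCL
    # STRING argument and the return value without any manual glue.
    set resp.http.X-Rust = rustdemo.hello("world");
}
```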
From batanun at hotmail.com  Thu Dec 16 20:12:21 2021
From: batanun at hotmail.com (Batanun B)
Date: Thu, 16 Dec 2021 20:12:21 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
Message-ID:

Hi,

One of our websites usually has quite a low but steady stream of visitors, but occasionally we get a sudden surge of requests over a very short time period (about 1-2 minutes). Varnish seems to handle the load fine, but the backends struggle with this. But I have noticed that Varnish doesn't serve the stale cached data, and instead shows our synthetic 500 page. This is true even for the start page, which definitely existed in the cache. And we have a grace period of 10 days, so it's quite annoying that we can't simply serve the stale cached data during this short period.

I have tried picturing the entire flow, following the logic in the VCL, but I can't see what we do wrong. And annoyingly, I can't reproduce the problem locally by simply shutting down the backends (or setting them to unhealthy), because whenever I do that I get the stale content served just as intended. Could the sheer volume itself cause this, making it impossible to reproduce by simply fetching the page a few times in the browser before and after disabling the backends? Or is there some edge case that I haven't thought of that is causing this?

A simplified version of our VCL is included below, with only the relevant parts. But unless I have some blatant problem with the VCL, I think it would be good if I learned how to troubleshoot this using the Varnish tools, like varnishlog, so that next time this happens I can use varnishlog etc. to see what's happening.

Is it possible, using varnishlog, to find the very first request for a specific path ("/" in our case) where it returned the synthetic 500 and put it in the cache? And is it also possible to find the request just before that one, for the same path? If I could extract those two requests (including all the metadata in those transactions) from the jungle of thousands of requests, then maybe I can find some explanation why the second request doesn't return the stale data.

-----------------------------

sub vcl_hit {
    if (obj.ttl > 0s) {
        // Regular cache hit
        return (deliver);
    } else if (req.restarts == 0 && std.healthy(req.backend_hint)) {
        // Graced cache hit, first attempt.
        // Force a cache miss to trigger the fetch in the foreground (i.e. a synchronous fetch).
        set req.http.X-forced-miss = "true";
        return (miss);
    } else {
        // Graced cache hit, previous attempts failed (or backend unhealthy).
        // Let the fetch happen in the background (i.e. an asynchronous fetch), and return the cached value.
        return (deliver);
    }
}

sub vcl_recv {
    if (req.http.X-cache-pass == "true") {
        return (pass);
    }

    set req.grace = 240h; // 10 days of grace
}

sub vcl_backend_response {
    if (bereq.http.X-cache-pass != "true") {
        if (beresp.status < 400) {
            set beresp.grace = 240h;
            set beresp.ttl = 30m; // Cache invalidation in the form of an xkey softpurge can put objects into grace before the TTL has passed.
        } else if (beresp.status < 500) {
            set beresp.ttl = 1m;
            return (deliver);
        } else if (beresp.status >= 500) {
            // In some cases we want to abandon the backend request on 500 errors, since it would otherwise overwrite the cached object that is still useful for grace.
            // This will make it jump to vcl_synth with a 503 status. There it will restart the request.
            if (bereq.is_bgfetch) {
                // In grace period.
                // Abandoning the 5xx request, since it would otherwise overwrite the cached object that is still useful for grace.
                return (abandon);
            } else if (bereq.http.X-forced-miss == "true") {
                return (abandon);
            }
            // Not a background fetch, i.e. no grace period (and no stale content available). Cache the error page for a few seconds.
            set beresp.ttl = 15s;
            return (deliver);
        }
    }
}

sub vcl_synth {
    if (req.http.X-forced-miss == "true" && resp.status >= 500) {
        return (restart);
    }
}

sub vcl_backend_error {
    if (bereq.is_bgfetch || bereq.http.X-forced-miss == "true") {
        // By calling abandon, it will jump to vcl_synth with a 503 status. There it will restart the request.
        return (abandon);
    }
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    set beresp.ttl = 15s;
    synthetic(std.fileread("/etc/varnish/500.html"));
    return (deliver);
}

-----------------------------

From guillaume.quintard at gmail.com  Thu Dec 16 20:23:15 2021
From: guillaume.quintard at gmail.com (Guillaume Quintard)
Date: Thu, 16 Dec 2021 12:23:15 -0800
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID:

Could be a vary issue, can you post the output of `varnishlog -d -g request -q 'RespStatus eq 500'`? Please anonymize the bits you don't want us to see by replacing them with "XXXXXX", but don't simply remove them.

In the meantime, here's a cheat sheet to get started with varnishlog: https://docs.varnish-software.com/tutorials/vsl-query/

-- 
Guillaume Quintard
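For readers who do run into vary-driven cache variants, a common mitigation is to normalize the varied request header before the cache lookup. This is a generic sketch using Accept-Language as a hypothetical varied header, not a confirmed fix for the setup in this thread:

```
sub vcl_recv {
    # Sketch: collapse Accept-Language to a small set of values so that a
    # backend sending "Vary: Accept-Language" only creates a few variants.
    # (Accept-Encoding needs no such treatment; Varnish normalizes it
    # internally, as noted later in this thread.)
    if (req.http.Accept-Language ~ "^sv") {
        set req.http.Accept-Language = "sv";
    } else {
        set req.http.Accept-Language = "en";
    }
}
```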
From batanun at hotmail.com  Thu Dec 16 21:30:20 2021
From: batanun at hotmail.com (Batanun B)
Date: Thu, 16 Dec 2021 21:30:20 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID:

> Could be a vary issue

Ah, I remember hearing about that earlier, and made a mental note to read up on it. But I forgot all about it. Now I did just that, and boy was it a cold shower for me! I definitely need to unset that header. But why, for the love of all that is holy, does Varnish not include the vary data in the hash? Why isn't the hash the _single_ key used when storing and looking up data in the cache?
Why does Varnish hide this stuff from us?

However, when checking the Vary header from the backend, it is set to "Accept-Encoding". And since I haven't changed anything in my browser, it should send the same "Accept-Encoding" request header whenever I surf the website. And since I have visited the start page multiple times in the last 10 days, it should have a cached version of it matching my "Accept-Encoding".

> can you post the output of `varnishlog -d -g request -q 'RespStatus eq 500'`?

Well, that gives me nothing that is relevant here, sadly. The last time this happened was a few days ago, and the buffer doesn't seem to be big enough to keep data that far back.

But maybe you could describe what you would look for? I would love to learn how to troubleshoot this.

> In the meantime, here's a cheat sheet to get started with varnishlog:
> https://docs.varnish-software.com/tutorials/vsl-query/

Thanks, although most of that stuff I already knew. And it doesn't really give any more advanced examples, like the problem I mentioned earlier. I really would like to know if it is possible to find the first request where it served the 500 page for the "/" URL, as well as the request just before that, for the same URL. Do you know how to construct a query that gives me that?

From guillaume.quintard at gmail.com  Thu Dec 16 22:42:34 2021
From: guillaume.quintard at gmail.com (Guillaume Quintard)
Date: Thu, 16 Dec 2021 14:42:34 -0800
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID:

> Ah, I remember hearing about that earlier, and made a mental note to read up on it. But I forgot all about it. Now I did just that, and boy was it a cold shower for me! I definitely need to unset that header. But why, for the love of all that is holy, does Varnish not include the vary data in the hash? Why isn't the hash the _single_ key used when storing and looking up data in the cache? Why does Varnish hide this stuff from us?

Easy one: you build the key from the request, while the vary header is a response property. Therefore, Varnish can't know to put the varied headers into the hash, because it doesn't have that information yet.

Basically, the vary header is backend information that allows Varnish to properly differentiate objects with the same hash, and the "hash collision" happens because you, the VCL writer, were not specific enough in vcl_hash (not necessarily your fault, though). (Technically, and to avoid the wrath of Varnish veterans: it's not a hash collision, because the objects represent the same resource and the varied headers are just secondary keys.)

> However, when checking the Vary header from the backend, it is set to "Accept-Encoding". And since I haven't changed anything in my browser, it should send the same "Accept-Encoding" request header whenever I surf the website. And since I have visited the start page multiple times in the last 10 days, it should have a cached version of it matching my "Accept-Encoding".

Don't touch that guy! Varnish will ignore "accept-encoding" in "vary" because it handles compression internally, and always forces "accept-encoding: gzip" before entering vcl_backend_fetch. If your VCL mentions accept-encoding, it's almost always wrong.

> Well, that gives me nothing that is relevant here, sadly.
> The last time this happened was a few days ago, and the buffer doesn't seem to be big enough to keep data that far back.
>
> But maybe you could describe what you would look for? I would love to learn how to troubleshoot this.

The default VSL space is 80MB, which is "only" worth a few (tens of) thousands of requests, so yeah, it can be a short backlog. You can instead start logging into a file:

    varnishlog -g request -q 'RespStatus eq 500' -w log.bin

Once the file grows, you can start looking at it using "varnishlog -r log.bin".

> Thanks, although most of that stuff I already knew. And it doesn't really give any more advanced examples, like the problem I mentioned earlier. I really would like to know if it is possible to find the first request where it served the 500 page for the "/" URL, as well as the request just before that, for the same URL. Do you know how to construct a query that gives me that?

The problem is that you have to be recording when the initial request goes through. But if you have been, cache hits will show the VXID of that first request in their "x-varnish" header, and you can find it this way ("varnishlog -r log.bin -q 'vxid == THE_VXID'").

Hope that helps

-- 
Guillaume Quintard
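To make that kind of log correlation easier in the first place, one option (a sketch using only standard VCL variables, not something proposed in the thread) is to stamp responses with hit/miss information at delivery time, so the varnishlog hunt can start from an obvious marker:

```
sub vcl_deliver {
    # Sketch: expose whether this delivery came from cache and how many
    # hits the object has served; the X-Varnish response header already
    # carries the VXIDs needed to find the populating request in the logs.
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
        set resp.http.X-Cache-Hits = obj.hits;
    } else {
        set resp.http.X-Cache = "MISS";
    }
}
```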
From batanun at hotmail.com  Fri Dec 17 13:16:42 2021
From: batanun at hotmail.com (Batanun B)
Date: Fri, 17 Dec 2021 13:16:42 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID:

> Easy one: you build the key from the request, while the vary header is a response property. Therefore, Varnish can't know to put the varied headers into the hash, because it doesn't have that information yet.

Of course! Makes perfect sense. But the documentation should make this crystal clear, if you ask me. Like a page titled "How is the cache key calculated?" which explains everything related to this: how the hash works, how the default implementation works, how to modify it, how the backend response can still cause multiple different versions (like the Vary header, and maybe other so-called "secondary keys"?), and how to modify that behavior. I also think it would make sense if this documentation mentioned how all this information can be debugged using varnishlog or other tools.

Is there even an official word for this final "cache key"? "Hash" clearly isn't specific enough. I'm talking about a word that refers to the unique key that always corresponds to only a _single_ version of a cached object.

> Don't touch that guy! Varnish will ignore "accept-encoding" in "vary" because it handles compression internally, and always forces "accept-encoding: gzip" before entering vcl_backend_fetch. If your VCL mentions accept-encoding, it's almost always wrong.

Sorry, I'm confused now... Don't touch _which_ guy? Our VCL doesn't contain anything regarding "Accept-Encoding". All I said was that the Vary header in the response from the backend is "Accept-Encoding". And the way I see it, this shouldn't be the cause of the strange problems we are seeing, since even when factoring in this, there should exist a matching cached object for me, and it should be served regardless of TTL or backend health as long as the grace hasn't expired (which it hasn't). Or is my reasoning flawed here, based on the VCL snippet in my original post? Can you think of a scenario where our VCL would return the synthetic 500 page even when there exists a cached object matching the hash and vary logic?

> The default VSL space is 80MB, which is "only" worth a few (tens of) thousands of requests, so yeah, it can be a short backlog.

Yeah, I think 80MB is a bit too small for us. Ideally we should be able to sit down on a Monday and troubleshoot problems that occurred Friday evening, but that might require a way too big VSL space. But a few hundred MB should be fine.

> The problem is that you have to be recording when the initial request goes through. But if you have been, cache hits will show the VXID of that first request in their "x-varnish" header, and you can find it this way ("varnishlog -r log.bin -q 'vxid == THE_VXID'")

Well, would it really be a cache hit? The main transaction I'm looking for is the first transaction for a specific path (in this case, "/") where Varnish served the synthetic 500 page. And then I would also like to see the closest preceding transaction for that same page, where the hash (and the Vary logic) matches the main transaction mentioned above.

Picture for example this scenario:

1. Request for "/", with a certain hash and "secondary key" (from the "Accept-Encoding"). Data is fetched from the backend and stored in the cache.
2. Lots of other, irrelevant requests, including requests for "/" but with a different hash and/or "Accept-Encoding".
3. Request for "/" that uses the cached data from point 1 above.
4. Lots of other, irrelevant requests, same as 2 above, but now in such big numbers that the backend servers start having problems.
5. Request for "/" with the same hash and/or "Accept-Encoding" as 1 above, but for some reason Varnish now returns the synthetic 500 page and puts it in the cache.

The transaction log for the request in step 5 is the main one I want to find, but also the transaction for step 3, and possibly step 1 as well. But the transaction for step 5 is the most interesting one, and I don't understand how that would be a cache hit while returning a fresh synthetic 500 error page. Or are you saying that we can find "number 5" above by first finding some transaction later in the log, with a cache hit for the 500 page, and that transaction in turn points to "number 5" since that was the transaction that created the 500 response?

This makes me realize that when looking at past log data, i.e. using the "-d" parameter or reading from a file, it looks for matching transactions/requests starting from the beginning. But in this kind of troubleshooting I think it could be useful to also be able to search from the end and go backwards. Is that possible? So that if I use the "-" parameter to limit the output to a single transaction, it outputs the _last_ one instead of the first one.
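On the "How is the cache key calculated?" question above: the built-in vcl_hash, which runs when your own vcl_hash doesn't return, hashes only the URL plus the Host header (or the server IP when there is no Host), so everything else that differentiates variants comes from the backend's Vary response header. For reference, the built-in logic is essentially:

```
sub vcl_hash {
    # Built-in behavior: the primary cache key is the URL plus the Host
    # header; vary-based "secondary keys" are applied on top of this
    # after the backend response is seen.
    hash_data(req.url);
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    return (lookup);
}
```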
From marco.dickert at evolver.de  Fri Dec 17 14:18:23 2021
From: marco.dickert at evolver.de (Marco Dickert - evolver group)
Date: Fri, 17 Dec 2021 15:18:23 +0100
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID: <20211217141823.GC4534@marco.evolver.de>

On 2021-12-17 13:16:42, Batanun B wrote:
> > The default VSL space is 80MB, which is "only" worth a few (tens of) thousands of requests, so yeah, it can be a short backlog.
>
> Yeah, I think 80MB is a bit too small for us. Ideally we should be able to sit down on a Monday and troubleshoot problems that occurred Friday evening, but that might require a way too big VSL space. But a few hundred MB should be fine.

I recommend permanent logging, if you want to be able to debug older incidents. We do it like this:

```
/usr/bin/varnishlog -w /var/log/varnish/varnish-500.log -a -D \
    -P /var/run/varnishlog-500.pid \
    -q 'RespStatus >= 500 or BerespStatus >= 500'
```

I attach our systemd unit file, in case you are interested.

-- 
Cheers
Marco Dickert

-------------- next part --------------
[Unit]
Description=Varnishlog for response codes > 500
After=varnish.service

[Service]
Type=forking
ExecStart=/usr/bin/varnishlog -w /var/log/varnish/varnish-500.log -a -D -P /var/run/varnishlog-500.pid -q 'RespStatus >= 500 and ReqHeader:Host !~ staging'
ExecReload=/usr/bin/kill -HUP $MAINPID
PIDFile=/var/run/varnishlog-500.pid
Restart=always

[Install]
WantedBy=multi-user.target

From batanun at hotmail.com  Fri Dec 17 15:25:31 2021
From: batanun at hotmail.com (Batanun B)
Date: Fri, 17 Dec 2021 15:25:31 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To: <20211217141823.GC4534@marco.evolver.de>
References: <20211217141823.GC4534@marco.evolver.de>
Message-ID:

> I recommend permanent logging, if you want to be able to debug older incidents.
> We do it like this:
>
> ```
> /usr/bin/varnishlog -w /var/log/varnish/varnish-500.log -a -D \
>     -P /var/run/varnishlog-500.pid \
>     -q 'RespStatus >= 500 or BerespStatus >= 500'
> ```
>
> I attach our systemd unit file, in case you are interested.

Thanks. I have thought about that too. But I think we might want to include non-error transactions as well. I mean, with the problems this post is about, we want to see when the cached version of the start page was generated and when it was last served from cache successfully. But maybe we could have permanent logging just for the start page, regardless of HTTP status. That should hopefully reduce the logging intensity enough that logging to disk isn't affecting the Varnish performance.

One thing though... If you log all "status: 500+" transactions to disk, isn't there a risk that your logging might exacerbate a situation where your site is overwhelmed with traffic? Where a large load causes your backends to start failing, and that triggers intense logging of those erroneous transactions, which might reduce the performance of Varnish, causing more timeouts, which cause more logging, and so on...

From marco.dickert at evolver.de  Fri Dec 17 16:01:20 2021
From: marco.dickert at evolver.de (Marco Dickert - evolver group)
Date: Fri, 17 Dec 2021 17:01:20 +0100
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References: <20211217141823.GC4534@marco.evolver.de>
Message-ID: <20211217160120.GA6203@marco.evolver.de>

On 2021-12-17 15:25:31, Batanun B wrote:
> Thanks. I have thought about that too. But I think we might want to include
> non-error transactions as well. I mean, with the problems this post is about,
> we want to see when the cached version of the start page was generated and
> when it was last served from cache successfully. But maybe we could have
> permanent logging just for the start page, regardless of HTTP status. That
> should hopefully reduce the logging intensity enough that logging to disk
> isn't affecting the Varnish performance.

Well, it depends on the performance of your storage and the amount of req/sec on the front page, but these logs can get very big very quickly. I'd suggest determining the correct delivery of the front page via external monitoring (e.g. icinga2 or a simple script). As far as I understand it, you don't need to know the exact request, but rather a rough point in time of when the requests start failing. So a monitoring script which curls every minute should be sufficient, and causes a lot less trouble.

> One thing though... If you log all "status: 500+" transactions to disk, isn't
> there a risk that your logging might exacerbate a situation where your site is
> overwhelmed with traffic? Where a large load causes your backends to start
> failing, and that triggers intense logging of those erroneous transactions,
> which might reduce the performance of Varnish, causing more timeouts, which
> cause more logging, and so on...

Indeed there is a risk of self-reinforcing effects, but it hasn't happened yet. We also do not plan to log 500s forever, but only until our problem is solved, which is an error in Varnish's memory handling. At the moment, our most concerning 500s are caused by Varnish itself, stating "Could not get storage", when the configured memory limit is reached.
-- 
Best regards

Marco Dickert
Administration and Technology

evolver services GmbH
Phone  +49 / (0)3 71 / 4 00 03 78 24
Fax    +49 / (0)3 71 / 4 00 03 79
E-Mail marco.dickert at evolver.de
Web    https://www.evolver.de

Registered office: Chemnitz
Commercial register: Amtsgericht Chemnitz, HRB 22649
Managing directors: Torsten Gramann and Mathias Möckel

From batanun at hotmail.com  Fri Dec 17 16:10:55 2021
From: batanun at hotmail.com (Batanun B)
Date: Fri, 17 Dec 2021 16:10:55 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To: <20211217160120.GA6203@marco.evolver.de>
References: <20211217141823.GC4534@marco.evolver.de> <20211217160120.GA6203@marco.evolver.de>
Message-ID:

> I'd suggest determining the correct delivery of the front page via external monitoring
> (e.g. icinga2 or a simple script). As far as I understand it, you don't need to
> know the exact request, but rather a rough point in time of when the requests
> start failing. So a monitoring script which curls every minute should be
> sufficient, and causes a lot less trouble.

Well, in our use case we would need the exact requests in order to understand why it happens. It might very well be a bug in our VCL, but I would like to have the logs show me something that could narrow it down.

From guillaume.quintard at gmail.com  Fri Dec 17 16:57:54 2021
From: guillaume.quintard at gmail.com (Guillaume Quintard)
Date: Fri, 17 Dec 2021 08:57:54 -0800
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To:
References:
Message-ID:

-- 
Guillaume Quintard

On Fri, Dec 17, 2021 at 5:18 AM Batanun B wrote:

> Is there even an official word for this final "cache key"? "Hash" clearly
> isn't specific enough. I'm talking about a word that refers to the unique
> key that always corresponds to only a _single_ version of a cached object.

I don't think there is at the moment.

> Sorry, I'm confused now... Don't touch _which_ guy? Our VCL doesn't contain
> anything regarding "Accept-Encoding". All I said was that the Vary header
> in the response from the backend is "Accept-Encoding". And the way I see
> it, this shouldn't be the cause of the strange problems we are seeing,
> since even when factoring in this, there should exist a matching cached
> object for me, and it should be served regardless of TTL or backend health
> as long as the grace hasn't expired (which it hasn't). Or is my reasoning
> flawed here, based on the VCL snippet in my original post? Can you think of
> a scenario where our VCL would return the synthetic 500 page even when
> there exists a cached object matching the hash and vary logic?

As you mentioned that your VCL was simplified, I didn't want to assume anything. So yes, I meant: do not worry about "accept-encoding", either as a header or as an entry in the vary header; Varnish will handle that properly.

So the vary hypothesis doesn't pan out. Could it be that your cache size is too small instead, and that the churn is pushing the object out?

> Yeah, I think 80MB is a bit too small for us. Ideally we should be able to
> sit down on a Monday and troubleshoot problems that occurred Friday evening,
> but that might require a way too big VSL space. But a few hundred MB should
> be fine.

I would just log on disk, and rotate every few days.
You mentioned that the traffic is fairly low, so the disk usage shouldn't be bad, especially if the backend is on another server. Varnish won't trip itself up by choking the disk; the worst-case scenario is that varnishlog will not write the file fast enough and will drop a few transactions.

> > The problem is that you have to be recording when the initial request
> > goes through. But if you have been, cache hits will show the VXID of that
> > first request in their "x-varnish" header, and you can find it this way
> > ("varnishlog -r log.bin -q 'vxid == THE_VXID'")
>
> Well, would it really be a cache hit? The main transaction I'm looking for
> is the first transaction for a specific path (in this case, "/") where
> Varnish served the synthetic 500 page. And then I would also like to see
> the closest preceding transaction for that same page, where the hash (and
> the Vary logic) matches the main transaction mentioned above.

In that case, you must log all the requests that could match, and once you have found your offender, walk your way up to find the previous request. I don't think there's another way here.

From dridi at varni.sh  Mon Dec 20 14:03:54 2021
From: dridi at varni.sh (Dridi Boukelmoune)
Date: Mon, 20 Dec 2021 14:03:54 +0000
Subject: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
In-Reply-To: <20211217160120.GA6203@marco.evolver.de>
References: <20211217141823.GC4534@marco.evolver.de> <20211217160120.GA6203@marco.evolver.de>
Message-ID:

On Fri, Dec 17, 2021 at 4:03 PM Marco Dickert - evolver group wrote:
> [...]
> Indeed there is a risk of self-reinforcing effects, but it hasn't happened
> yet. We also do not plan to log 500s forever, but only until our problem is
> solved, which is an error in Varnish's memory handling.
> At the moment, our most concerning 500s are caused by Varnish itself,
> stating "Could not get storage", when the configured memory limit is
> reached.

If you get a surge of 5XX responses from either Varnish or the backend, you can also rate-limit logs to the disk:

https://varnish-cache.org/docs/6.0/reference/varnishlog.html

See the -R option.

Cheers,
Dridi

From joh.hendriks at gmail.com  Thu Dec 30 08:16:46 2021
From: joh.hendriks at gmail.com (Johan Hendriks)
Date: Thu, 30 Dec 2021 09:16:46 +0100
Subject: health check varnish
Message-ID:

Hello all, first of all, thanks for a year full of Varnish happiness, like the developers page and the parallel ESI for the varnish-cache edition.

We use the following in the VCL for our Varnish health check from our haproxy load balancers:

    if (req.url == "/varnish_status") {
        if (!std.healthy(req.backend_hint)) {
            return (synth(500, "Varnish: unhealthy backend."));
        } else {
            return (synth(200, "Varnish: All good!"));
        }
    }

And this works fine if we run Varnish on the webserver itself with only one backend. But if we use the director with multiple backends, we get a 500 error if one of the backend servers is offline, even if the director itself has enough servers left and is still marked as healthy.

Is there a way to check if the director itself is down?

Like:

    Backend name       Admin    Probe    Health     Last change
    boot.web01         probe    5/5      healthy    Mon, 27 Dec 2021 08:42:58 GMT
    boot.web02         probe    5/5      healthy    Mon, 27 Dec 2021 08:42:58 GMT
    boot.web03         probe    5/5      healthy    Mon, 27 Dec 2021 08:42:58 GMT
    boot.webcluster    probe    2/2      healthy    Mon, 27 Dec 2021 08:42:58 GMT

returns a 200, as all is good.

    boot.web01         probe    2/5      sick       Wed, 29 Dec 2021 18:33:41 GMT
    boot.web02         probe    4/5      healthy    Wed, 29 Dec 2021 18:33:41 GMT
    boot.web03         probe    4/5      healthy    Wed, 29 Dec 2021 18:33:41 GMT
    boot.webcluster    probe    1/2      healthy    Wed, 29 Dec 2021 18:33:41 GMT

still returns a 200, as we still have a healthy webcluster backend.

    Backend name       Admin    Probe    Health     Last change
    boot.web01         probe    2/5      sick       Wed, 29 Dec 2021 18:34:40 GMT
    boot.web02         probe    2/5      sick       Wed, 29 Dec 2021 18:34:40 GMT
    boot.web02         probe    2/5      sick       Wed, 29 Dec 2021 18:34:40 GMT
    boot.webcluster    probe    0/2      sick       Wed, 29 Dec 2021 18:34:40 GMT

returns the 500 error code, as our webcluster backend is down.

So can we just get the health status of webcluster in this case?

regards
Johan

And in advance, a happy new year.

From dridi at varni.sh  Thu Dec 30 09:17:25 2021
From: dridi at varni.sh (Dridi Boukelmoune)
Date: Thu, 30 Dec 2021 09:17:25 +0000
Subject: health check varnish
In-Reply-To:
References:
Message-ID:

On Thu, Dec 30, 2021 at 8:17 AM Johan Hendriks wrote:
> [...]
> So can we just get the health status of webcluster in this case?

If you assign webcluster.backend() to req.backend_hint before performing the health check, it should work fine.

Dridi
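To make that suggestion concrete, here is a sketch of the health-check endpoint with the director resolved first. It assumes the backends web01-web03 are declared elsewhere, that the director is the round_robin flavor built in vcl_init (the thread does not say which kind is used), and that vmod_std and vmod_directors are imported:

```
import directors;
import std;

sub vcl_init {
    # Assumed setup, mirroring the backends shown in the thread.
    new webcluster = directors.round_robin();
    webcluster.add_backend(web01);
    webcluster.add_backend(web02);
    webcluster.add_backend(web03);
}

sub vcl_recv {
    if (req.url == "/varnish_status") {
        # Resolve the director itself before the health check, so the
        # check reflects the cluster's health rather than whichever
        # single backend req.backend_hint happened to point at.
        set req.backend_hint = webcluster.backend();
        if (!std.healthy(req.backend_hint)) {
            return (synth(500, "Varnish: unhealthy backend."));
        }
        return (synth(200, "Varnish: All good!"));
    }
}
```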
From joh.hendriks at gmail.com  Fri Dec 31 12:41:05 2021
From: joh.hendriks at gmail.com (Johan Hendriks)
Date: Fri, 31 Dec 2021 13:41:05 +0100
Subject: health check varnish
In-Reply-To:
References:
Message-ID: <7560aeeb-d367-5852-0955-bc33c2faf0a0@gmail.com>

On 30/12/2021 10:17, Dridi Boukelmoune wrote:
> On Thu, Dec 30, 2021 at 8:17 AM Johan Hendriks wrote:
>> [...]
>> So can we just get the health status of webcluster in this case?
> If you assign webcluster.backend() to req.backend_hint before
> performing the health check, it should work fine.
>
> Dridi

Thank you, that did the trick. I knew it had to be something simple that I overlooked.

Have a wonderful 2022, all.

regards
Johan