From guillaume at varnish-software.com Fri Feb 1 18:56:11 2019 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Fri, 1 Feb 2019 10:56:11 -0800 Subject: Locating connection reset issue / h2 vs http/1.1 In-Reply-To: <678C443B-1889-40EA-9835-9A2C7EA3091A@shee.org> References: <678C443B-1889-40EA-9835-9A2C7EA3091A@shee.org> Message-ID: Are you able to find some logs in varnishlog (-g session, filtering by port) to see what varnish is doing? -- Guillaume Quintard On Thu, Jan 31, 2019 at 9:45 AM wrote: > Am 31.01.2019 um 17:38 schrieb Guillaume Quintard < > guillaume at varnish-software.com>: > > > > On Thu, Jan 31, 2019 at 6:22 AM wrote: > >> > >> I have following stack: hitch-1.5 - varnish-5.2.0 - httpd-2.2/2.4 > >> > >> On a high traffic node I am observing a lot of "Socket error: > Connection reset by peer" log entries coming from hitch. > >> > >> I am trying to locate the cause of the issue (hitch or varnish site). > >> > >> So far I can say; that disabling h2 on hitch the "Connection resets" > doesn't appear anymore. > >> > >> Does this have to do with varnish-5.2.'s h2 implementation? > >> > >> Jan 30 19:02:37 srv-s01 hitch[4006]: ww.xx.yy.zz:59395 :0 10:11 > NPN/ALPN protocol: h2 > >> Jan 30 19:02:37 srv-s01 hitch[4006]: ww.xx.yy.zz:59395 :0 10:11 ssl end > handshake > >> Jan 30 19:02:37 srv-s01 hitch[4006]: ww.xx.yy.zz:59395 :42884 10:11 > backend connected > >> Jan 30 19:02:39 srv-s01 hitch[4006]: {backend} Socket error: Connection > reset by peer > >> Jan 30 19:02:39 srv-s01 hitch[4006]: ww.xx.yy.zz:59395 :42884 10:11 > proxy shutdown req=SHUTDOWN_CLEAR > >> Jan 30 19:02:39 srv-s01 hitch[4006]: {backend} Socket error: Broken pipe > >> Jan 30 19:02:39 srv-s01 hitch[4006]: ww.xx.yy.zz:59395 :42884 10:11 > proxy shutdown req=SHUTDOWN_CLEAR > >> Jan 30 19:02:39 srv-s01 hitch[4006]: ww.xx.yy.zz:59399 :0 10:11 proxy > connect > > > > Have you activated h2 support in Varnish? (it's not on by default) > > > > Sure, DAEMON_OPTS has -p feature=+http2 passed. The content is delivered > via h2 (verified in browsers) > but sometimes lot of assets (client view) produce ERR_CONNECTION_CLOSED > errors in the browser and on > server site the mentioned "connection reset by peer" log entries appears > ... > > -- > Leon -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Tue Feb 5 10:50:54 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 5 Feb 2019 11:50:54 +0100 Subject: varnish 5.0: varnish slow when backends do not respond? Message-ID: Hey there, i hope i'm right here... i have the following setup to deliver images: nginx: https -> forward request to varnish 5.0 if image is not in cache -> forward request to backend nginx backend nginx: delivers file to varnish if found on harddisk if backend nginx doesn't find: forward request to 2 backend tomcats to calculate the desired image The 2 backend tomcats do deliver another webapp (and are a varnish backend as well); at the moment they're quite busy and stop working due to heavy load (->restart), the result is that varnish sees/thinks that the backends are sick. Somehow then even the cached images are delivered after a quite long waiting period, e.g. a 5 KB image takes more than 7 seconds. Is this the normal behaviour that varnish does answer slowly if some backends are sick? If any other information is need i can provide the necessary stuff. 
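As an aside for context: the behaviour described here is what Varnish's grace mode is meant to cover. With a health probe defined on the backend and a grace period on cached objects, Varnish can keep serving the stale-but-cached copy instead of waiting on a sick or overloaded backend. A minimal VCL 4.0 sketch of that pattern for Varnish 5.x follows; the backend address, probe URL and grace value are placeholders, not taken from this setup:

    vcl 4.0;
    import std;

    backend img_backend {
        .host = "192.0.2.10";           # placeholder address
        .port = "8080";
        .probe = {
            .url = "/health.txt";       # any cheap, static URL
            .interval = 5s;             # probe often so sickness is noticed quickly
            .timeout = 2s;
            .window = 5;
            .threshold = 3;
        }
    }

    sub vcl_backend_response {
        # Keep objects well past their TTL so they can still be served
        # while the backend is sick or slow.
        set beresp.grace = 6h;
    }

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            return (deliver);           # fresh hit, business as usual
        }
        if (!std.healthy(req.backend_hint) && (obj.ttl + obj.grace > 0s)) {
            return (deliver);           # backend is sick: deliver the stale copy
        }
        return (miss);                  # backend healthy: refetch synchronously
    }

Without a probe, Varnish has no way of knowing the backend is down and will wait out connect_timeout/first_byte_timeout on every miss, which is exactly the multi-second stall described above.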
Thx in advance Hubert From guillaume at varnish-software.com Tue Feb 5 11:32:49 2019 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Tue, 5 Feb 2019 12:32:49 +0100 Subject: varnish 5.0: varnish slow when backends do not respond? In-Reply-To: References: Message-ID: Hi, Do you have probes set up? If you do, the backend will be declared sick and varnish will reply instantly without even trying to contact it. It sounds like that at the moment, varnish just tries to get whatever it can, waiting for as long as authorized. Cheers, On Tue, Feb 5, 2019, 11:51 Hu Bert Hey there, > > i hope i'm right here... i have the following setup to deliver images: > > nginx: https -> forward request to varnish 5.0 > if image is not in cache -> forward request to backend nginx > backend nginx: delivers file to varnish if found on harddisk > if backend nginx doesn't find: forward request to 2 backend tomcats to > calculate the desired image > > The 2 backend tomcats do deliver another webapp (and are a varnish > backend as well); at the moment they're quite busy and stop working > due to heavy load (->restart), the result is that varnish sees/thinks > that the backends are sick. Somehow then even the cached images are > delivered after a quite long waiting period, e.g. a 5 KB image takes > more than 7 seconds. > > Is this the normal behaviour that varnish does answer slowly if some > backends are sick? > > If any other information is need i can provide the necessary stuff. > > Thx in advance > Hubert > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Tue Feb 5 11:55:31 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 5 Feb 2019 12:55:31 +0100 Subject: varnish 5.0: varnish slow when backends do not respond? In-Reply-To: References: Message-ID: Hi Guillaume, the backend config looks like this (just questioning a simple file from tomcat); maybe params are wrong? : backend tomcat_backend1 { .host = "192.168.0.126"; .port = "8082"; .connect_timeout = 15s; .first_byte_timeout = 60s; .between_bytes_timeout = 15s; .probe = { .url = "/portal/info.txt"; .timeout = 10s; .interval = 1m; .window = 3; .threshold = 1; } } The backend is shown as 'sick', but the time until you get an answer from nginx/varnish differs, from below a second to 7 or more seconds - but the requested image is already in cache (hits >= 1). Imho the cache should work and deliver a cached file, independent from a (non) working backend. Maybe beresp.ttl messed up? else if (beresp.status<300) { [lots of rules] } else { # Use very short caching time for error messages - giving the system the chance to recover set beresp.ttl = 10s; unset beresp.http.Cache-Control; return(deliver); } Thx Hubert Am Di., 5. Feb. 2019 um 12:33 Uhr schrieb Guillaume Quintard : > > Hi, > > Do you have probes set up? If you do, the backend will be declared sick and varnish will reply instantly without even trying to contact it. > > It sounds like that at the moment, varnish just tries to get whatever it can, waiting for as long as authorized. > > Cheers, > > On Tue, Feb 5, 2019, 11:51 Hu Bert > >> Hey there, >> >> i hope i'm right here... 
i have the following setup to deliver images: >> >> nginx: https -> forward request to varnish 5.0 >> if image is not in cache -> forward request to backend nginx >> backend nginx: delivers file to varnish if found on harddisk >> if backend nginx doesn't find: forward request to 2 backend tomcats to >> calculate the desired image >> >> The 2 backend tomcats do deliver another webapp (and are a varnish >> backend as well); at the moment they're quite busy and stop working >> due to heavy load (->restart), the result is that varnish sees/thinks >> that the backends are sick. Somehow then even the cached images are >> delivered after a quite long waiting period, e.g. a 5 KB image takes >> more than 7 seconds. >> >> Is this the normal behaviour that varnish does answer slowly if some >> backends are sick? >> >> If any other information is need i can provide the necessary stuff. >> >> Thx in advance >> Hubert >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc From guillaume at varnish-software.com Tue Feb 5 15:10:09 2019 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Tue, 5 Feb 2019 16:10:09 +0100 Subject: varnish 5.0: varnish slow when backends do not respond? In-Reply-To: References: Message-ID: re-adding the list > I will reduce probe param 'interval' to, let's say, 10s. That sounds reasonable? I would definitely make for more reactive decision. Shameless plug: I would recommend reading on that topic: https://info.varnish-software.com/blog/backends-load-balancing (man vcl, the probes section is of course a must-read) -- Guillaume Quintard On Tue, Feb 5, 2019 at 3:05 PM Hu Bert wrote: > Hi, > i'll try these commands. No output so far, but i'll see. > > I will reduce probe param 'interval' to, let's say, 10s. That sounds > reasonable? > > > Hubert > > Am Di., 5. Feb. 2019 um 14:35 Uhr schrieb Guillaume Quintard > : > > > > Try something like that: varnishlog -q "Timestamp:Resp[2] > 7" -g request > > (man vsl-query for more info) > > > > I just think your probe definition is pretty bad (1 minute interval is > going to yield some wonky results) and you varnish sees the backend as > healthy, tries to fetch, fakes a long time, then the probe finally kicks in. > > -- > > Guillaume Quintard > > > > > > On Tue, Feb 5, 2019 at 1:47 PM Hu Bert wrote: > >> > >> Hi, > >> sry i can't reproduce, as i had to get the varnish running. Maybe i > >> have to explain... :-) > >> > >> We once had a server with nginx (frontend), varnish and some other > >> stuff, and as RAM became a tight resource, we got another server > >> (server2) running, separately for varnish. That server now cached all > >> the images and all other stuff (like css, js etc.) from the tomcat > >> backends. So the vcl file contained the images backends and all the > >> tomcat backends. > >> > >> We then moved the cache for "all the other stuff" to server3, and > >> server2 only cached images from then on. But the vcl file stayed > >> untouched, still containing all the backends&probes that actually > >> weren't necessary for images - and now 2 of these backends (due to > >> load) repeatedly answered 500/502 and have to be rebooted regularly > >> (nothing can be done here at the moment). > >> > >> To get the varnish on server2 (images) running i simply removed all > >> the unnecessary tomcat backends and restarted varnish, and now it's > >> running really good. 
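On the probe tuning discussed above, a sketch of a more reactive probe for the same backend might look like the following; host, port and URL are the ones from the thread, the other numbers are only illustrative:

    backend tomcat_backend1 {
        .host = "192.168.0.126";
        .port = "8082";
        .connect_timeout       = 5s;   # fail fast instead of hanging for 15s
        .first_byte_timeout    = 60s;
        .between_bytes_timeout = 15s;
        .probe = {
            .url       = "/portal/info.txt";
            .timeout   = 2s;           # a probe that needs 10s is already bad news
            .interval  = 5s;           # react within seconds, not once a minute
            .window    = 5;            # judge health on the last 5 probes...
            .threshold = 3;            # ...and require 3 good ones to count as healthy
        }
    }

Requiring 3 good probes out of the last 5 also damps the flapping: with the original window = 3 / threshold = 1, a single successful probe is enough to flip the backend back to healthy, which is likely why it keeps bouncing between sick and healthy.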
I still have the old vcl file on server3 running, > >> there i see that the 2 tomcat backends are changing between sick and > >> healthy. Don't know if it might work there as well - i tried it but > >> the output of 'varnishlog -g request' is massive. Something special i > >> should grep for? > >> > >> Alternatively i could provide the vcl file, but i'm afraid that your > >> eyes might explode ;-) > >> > >> Hubert > >> > >> Am Di., 5. Feb. 2019 um 13:14 Uhr schrieb Guillaume Quintard > >> : > >> > > >> > Hi, > >> > > >> > Can you try to set the backend health to sick using "varnishadm > backend.set_health" and try to reproduce? > >> > > >> > If you can reproduce, please pastebin the corresponding "varnishlog > -g request" block > >> > > >> > On Tue, Feb 5, 2019, 12:55 Hu Bert >> >> > >> >> Hi Guillaume, > >> >> > >> >> the backend config looks like this (just questioning a simple file > >> >> from tomcat); maybe params are wrong? : > >> >> > >> >> backend tomcat_backend1 { > >> >> .host = "192.168.0.126"; > >> >> .port = "8082"; > >> >> .connect_timeout = 15s; > >> >> .first_byte_timeout = 60s; > >> >> .between_bytes_timeout = 15s; > >> >> .probe = { > >> >> .url = "/portal/info.txt"; > >> >> .timeout = 10s; > >> >> .interval = 1m; > >> >> .window = 3; > >> >> .threshold = 1; > >> >> } > >> >> } > >> >> > >> >> The backend is shown as 'sick', but the time until you get an answer > >> >> from nginx/varnish differs, from below a second to 7 or more seconds > - > >> >> but the requested image is already in cache (hits >= 1). > >> >> > >> >> Imho the cache should work and deliver a cached file, independent > from > >> >> a (non) working backend. Maybe beresp.ttl messed up? > >> >> > >> >> else if (beresp.status<300) { > >> >> [lots of rules] > >> >> } else { > >> >> # Use very short caching time for error messages - giving the > >> >> system the chance to recover > >> >> set beresp.ttl = 10s; > >> >> unset beresp.http.Cache-Control; > >> >> return(deliver); > >> >> } > >> >> > >> >> Thx > >> >> Hubert > >> >> > >> >> Am Di., 5. Feb. 2019 um 12:33 Uhr schrieb Guillaume Quintard > >> >> : > >> >> > > >> >> > Hi, > >> >> > > >> >> > Do you have probes set up? If you do, the backend will be declared > sick and varnish will reply instantly without even trying to contact it. > >> >> > > >> >> > It sounds like that at the moment, varnish just tries to get > whatever it can, waiting for as long as authorized. > >> >> > > >> >> > Cheers, > >> >> > > >> >> > On Tue, Feb 5, 2019, 11:51 Hu Bert >> >> >> > >> >> >> Hey there, > >> >> >> > >> >> >> i hope i'm right here... i have the following setup to deliver > images: > >> >> >> > >> >> >> nginx: https -> forward request to varnish 5.0 > >> >> >> if image is not in cache -> forward request to backend nginx > >> >> >> backend nginx: delivers file to varnish if found on harddisk > >> >> >> if backend nginx doesn't find: forward request to 2 backend > tomcats to > >> >> >> calculate the desired image > >> >> >> > >> >> >> The 2 backend tomcats do deliver another webapp (and are a varnish > >> >> >> backend as well); at the moment they're quite busy and stop > working > >> >> >> due to heavy load (->restart), the result is that varnish > sees/thinks > >> >> >> that the backends are sick. Somehow then even the cached images > are > >> >> >> delivered after a quite long waiting period, e.g. a 5 KB image > takes > >> >> >> more than 7 seconds. 
> >> >> >> > >> >> >> Is this the normal behaviour that varnish does answer slowly if > some > >> >> >> backends are sick? > >> >> >> > >> >> >> If any other information is need i can provide the necessary > stuff. > >> >> >> > >> >> >> Thx in advance > >> >> >> Hubert > >> >> >> _______________________________________________ > >> >> >> varnish-misc mailing list > >> >> >> varnish-misc at varnish-cache.org > >> >> >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Mon Feb 18 09:58:24 2019 From: revirii at googlemail.com (Hu Bert) Date: Mon, 18 Feb 2019 10:58:24 +0100 Subject: strange temporary varnish outage Message-ID: Hello, we're using varnish v5 (debian stretch) for image caching; yesterday there was a strange outage where i'm somehow unable to find the reason as there are almost no log entries, besides one: Feb 17 09:03:47 rowlf kernel: [1047133.190149] cgroup: fork rejected by pids controller in /system.slice/varnish.service But the problems started a couple of minutes before that, so this message simply could be a result of previous problems. Some munin graphs: Backend traffic: strange spike in backend connection retry/success, decrease in recycle/reuse: https://abload.de/img/varnish_backend_traffqwj74.png Expunge: a similar spike in "Number of expired objects" https://abload.de/img/varnish_expunge-day5kk0l.png Threads: threads went up at that time; was lower before (restart was done on Feb 14th), and suddenly went up. day: https://abload.de/img/varnish_threads-dayzoken.png week: https://abload.de/img/varnish_threads-week7qjoo.png Backend graph: https://abload.de/img/nginx_status-day54jkd.png /etc/systemd/system/varnish.service : https://pastebin.com/aAhMHn4p Here's the (shortened) vcl file: https://pastebin.com/nVu5vVaa Anyone has an idea how to dig into this? Something horribly wrong in the vcl file? Thx, Hubert From revirii at googlemail.com Tue Feb 19 08:01:46 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 19 Feb 2019 09:01:46 +0100 Subject: strange temporary varnish outage In-Reply-To: References: Message-ID: Good morning, i think we solved the problem: we ran into a systemd limit (4915 tasks): https://github.com/varnishcache/varnish-cache/issues/2822 https://github.com/varnishcache/pkg-varnish-cache/blob/6c90eb775857573564dc1fe38424267143bb6b34/systemd/varnish.service#L19 It seems we hit that limit; i updated the (loooong outdated) v5 to v6 LTS and set TasksMax=infinity. systemctl status varnish.service now shows: Tasks: 7136 - so, yeah, solved :-) Thx for reading ;-) Hubert Am Mo., 18. Feb. 2019 um 10:58 Uhr schrieb Hu Bert : > > Hello, > > we're using varnish v5 (debian stretch) for image caching; yesterday > there was a strange outage where i'm somehow unable to find the reason > as there are almost no log entries, besides one: > > Feb 17 09:03:47 rowlf kernel: [1047133.190149] cgroup: fork rejected > by pids controller in /system.slice/varnish.service > > But the problems started a couple of minutes before that, so this > message simply could be a result of previous problems. 
Some munin > graphs: > > Backend traffic: strange spike in backend connection retry/success, > decrease in recycle/reuse: > https://abload.de/img/varnish_backend_traffqwj74.png > > Expunge: a similar spike in "Number of expired objects" > https://abload.de/img/varnish_expunge-day5kk0l.png > > Threads: threads went up at that time; was lower before (restart was > done on Feb 14th), and suddenly went up. > day: https://abload.de/img/varnish_threads-dayzoken.png > week: https://abload.de/img/varnish_threads-week7qjoo.png > Backend graph: https://abload.de/img/nginx_status-day54jkd.png > > /etc/systemd/system/varnish.service : https://pastebin.com/aAhMHn4p > Here's the (shortened) vcl file: https://pastebin.com/nVu5vVaa > > Anyone has an idea how to dig into this? Something horribly wrong in > the vcl file? > > > Thx, > Hubert From dridi at varni.sh Sat Feb 23 07:36:56 2019 From: dridi at varni.sh (Dridi Boukelmoune) Date: Sat, 23 Feb 2019 08:36:56 +0100 Subject: strange temporary varnish outage In-Reply-To: References: Message-ID: On Tue, Feb 19, 2019 at 9:03 AM Hu Bert wrote: > > Good morning, > i think we solved the problem: we ran into a systemd limit (4915 tasks): > > https://github.com/varnishcache/varnish-cache/issues/2822 > https://github.com/varnishcache/pkg-varnish-cache/blob/6c90eb775857573564dc1fe38424267143bb6b34/systemd/varnish.service#L19 > > It seems we hit that limit; i updated the (loooong outdated) v5 to v6 > LTS and set TasksMax=infinity. systemctl status varnish.service now > shows: Tasks: 7136 - so, yeah, solved :-) Thx for reading ;-) Happy to see that moving to 6.0 solved the problem! > Hubert > > Am Mo., 18. Feb. 2019 um 10:58 Uhr schrieb Hu Bert : > > > > Hello, > > > > we're using varnish v5 (debian stretch) for image caching; yesterday > > there was a strange outage where i'm somehow unable to find the reason > > as there are almost no log entries, besides one: > > > > Feb 17 09:03:47 rowlf kernel: [1047133.190149] cgroup: fork rejected > > by pids controller in /system.slice/varnish.service > > > > But the problems started a couple of minutes before that, so this > > message simply could be a result of previous problems. Some munin > > graphs: > > > > Backend traffic: strange spike in backend connection retry/success, > > decrease in recycle/reuse: > > https://abload.de/img/varnish_backend_traffqwj74.png > > > > Expunge: a similar spike in "Number of expired objects" > > https://abload.de/img/varnish_expunge-day5kk0l.png > > > > Threads: threads went up at that time; was lower before (restart was > > done on Feb 14th), and suddenly went up. > > day: https://abload.de/img/varnish_threads-dayzoken.png > > week: https://abload.de/img/varnish_threads-week7qjoo.png > > Backend graph: https://abload.de/img/nginx_status-day54jkd.png > > > > /etc/systemd/system/varnish.service : https://pastebin.com/aAhMHn4p > > Here's the (shortened) vcl file: https://pastebin.com/nVu5vVaa > > > > Anyone has an idea how to dig into this? Something horribly wrong in > > the vcl file? > > > > > > Thx, > > Hubert > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
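For anyone hitting the same ceiling: the TasksMax fix above can also be applied without editing the packaged unit file, through a systemd drop-in. A minimal sketch (any *.conf name under that directory works; the limit value is whatever suits the installation -- the thread simply used infinity):

    # /etc/systemd/system/varnish.service.d/tasksmax.conf
    [Service]
    TasksMax=infinity

followed by

    systemctl daemon-reload
    systemctl restart varnish

The symptom to watch for is the one quoted above, "cgroup: fork rejected by pids controller in /system.slice/varnish.service" in the kernel log, meaning the worker threads Varnish tried to spawn were refused by the pids cgroup controller.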