Child process recurrently being restarted

Wed Jun 28 09:12:46 CEST 2017

Sadly, nothing suspicious here, you can still try:
- bumping the cli_timeout
- changing your disk scheduler
- changing the advice option of the file storage

I'm still convinced this is due to Varnish getting stuck waiting for the
disk because of the file storage fragmentation.

Maybe you could look at SMF.*.g_alloc and compare it to the number of
objects. Ideally, we would have a 1:1 relation between objects and
allocations. If that number drops prior to a restart, that would be a good
clue.

-- 
Guillaume Quintard

On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo at gmail.com>
wrote:

> Hi Guillaume.
>
> It keeps restarting.
> Would you mind taking a quick look in the following VCL file to check if
> you find anything suspicious?
>
> Thank you very much.
>
> Best,
> Stefano
>
> vcl 4.0;
>
> import std;
>
> backend default {
>   .host = "sites-web-server-lb";
>   .port = "80";
> }
>
> include "/etc/varnish/bad_bot_detection.vcl";
>
> sub vcl_recv {
>   call bad_bot_detection;
>
>   if (req.url == "/nocache" || req.url == "/version") {
>     return(pass);
>   }
>
>   unset req.http.Cookie;
>   if (req.method == "PURGE") {
>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent
> !~ Googlebot");
>     return(synth(750));
>   }
>
>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
> }
>
> sub vcl_synth {
>   if (resp.status == 750) {
>     set resp.status = 200;
>     synthetic("PURGED => " + req.url);
>     return(deliver);
>   } elsif (resp.status == 501) {
>     set resp.status = 200;
>     set resp.http.Content-Type = "text/html; charset=utf-8";
>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>     return(deliver);
>   }
> }
>
> sub vcl_backend_response {
>   unset beresp.http.Set-Cookie;
>   set beresp.http.x-host = bereq.http.host;
>   set beresp.http.x-user-agent = bereq.http.user-agent;
>
>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>     || bereq.url == "/api/events/PAGEVIEW"
>     || bereq.url ~ "^\/assets\/img\/") {
>     set beresp.http.Cache-Control = "max-age=0";
>   } else {
>     unset beresp.http.Cache-Control;
>   }
>
>   if (beresp.status == 200 ||
>     beresp.status == 301 ||
>     beresp.status == 302 ||
>     beresp.status == 404) {
>       if (bereq.url ~ "\&ordenar=aleatorio$") {
>         set beresp.http.X-TTL = "1d";
>         set beresp.ttl = 1d;
>       } else {
>         set beresp.http.X-TTL = "1w";
>         set beresp.ttl = 1w;
>       }
>   }
>
>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
> {
>     set beresp.do_gzip = true;
>   }
> }
>
> sub vcl_pipe {
>   set bereq.http.connection = "close";
>   return (pipe);
> }
>
> sub vcl_deliver {
>   unset resp.http.x-host;
>   unset resp.http.x-user-agent;
> }
>
> sub vcl_backend_error {
>   if (beresp.status == 502 || beresp.status == 503 || beresp.status ==
> 504) {
>     set beresp.status = 200;
>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>     return (deliver);
>   }
> }
>
> sub vcl_hash {
>   if (req.http.User-Agent ~ "Google Page Speed") {
>     hash_data("Google Page Speed");
>   } elsif (req.http.User-Agent ~ "Googlebot") {
>     hash_data("Googlebot");
>   }
> }
>
> sub vcl_deliver {
>   if (resp.status == 501) {
>     return (synth(resp.status));
>   }
>   if (obj.hits > 0) {
>     set resp.http.X-Cache = "hit";
>   } else {
>     set resp.http.X-Cache = "miss";
>   }
> }
>
>
> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
> guillaume at varnish-software.com> wrote:
>
>> Nice! It may have been the cause, time will tell.can you report back in a
>> few days to let us know?
>> --
>> Guillaume Quintard
>>
>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo at gmail.com> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I think things will start to going better now after changing the bans.
>>> This is how my last varnishstat looked like moments before a crash
>>> regarding the bans:
>>>
>>> MAIN.bans                     41336          .   Count of bans
>>> MAIN.bans_completed           37967          .   Number of bans marked
>>> 'completed'
>>> MAIN.bans_obj                     0          .   Number of bans using
>>> obj.*
>>> MAIN.bans_req                 41335          .   Number of bans using
>>> req.*
>>> MAIN.bans_added               41336         0.68 Bans added
>>> MAIN.bans_deleted                 0         0.00 Bans deleted
>>>
>>> And this is how it looks like now:
>>>
>>> MAIN.bans                         2          .   Count of bans
>>> MAIN.bans_completed               1          .   Number of bans marked
>>> 'completed'
>>> MAIN.bans_obj                     2          .   Number of bans using
>>> obj.*
>>> MAIN.bans_req                     0          .   Number of bans using
>>> req.*
>>> MAIN.bans_added                2016         0.69 Bans added
>>> MAIN.bans_deleted              2014         0.69 Bans deleted
>>>
>>> Before the changes, bans were never deleted!
>>> Now the bans are added and quickly deleted after a minute or even a
>>> couple of seconds.
>>>
>>> May this was the cause of the problem? It seems like varnish was having
>>> a large number of bans to manage and test against.
>>> I will let it ride now. Let's see if the problem persists or it's gone!
>>> :-)
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>> guillaume at varnish-software.com> wrote:
>>>
>>>> Looking good!
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <stefanobaldo at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Guillaume,
>>>>>
>>>>> Can the following be considered "ban lurker friendly"?
>>>>>
>>>>> sub vcl_backend_response {
>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_recv {
>>>>>   if (req.method == "PURGE") {
>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>     return(synth(750));
>>>>>   }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>>   unset resp.http.x-url;
>>>>>   unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>>
>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>> guillaume at varnish-software.com> wrote:
>>>>>
>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>
>>>>>> I don't think you need to expand the VSL at all.
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> Thanks for answering.
>>>>>>
>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>> performance but it stills restarting.
>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>> signal of overhead.
>>>>>>
>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m
>>>>>> default size passing "-l 200m,20m" to varnishd and using
>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was
>>>>>> a problem here. After a couple of hours varnish died and I received a "no
>>>>>> space left on device" message - deleting the /var/lib/varnish solved the
>>>>>> problem and varnish was up again, but it's weird because there was free
>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>> what could have happened. I will try to stop increasing the
>>>>>> /var/lib/varnish size.
>>>>>>
>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>
>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url
>>>>>> + " && req.http.User-Agent !~ Googlebot");
>>>>>>
>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>> documentation and it looks like they're not.
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>
>>>>>>> Hi Stefano,
>>>>>>>
>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>
>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs
>>>>>>> take a bit more time to degrade) because of fragmentation. One solution to
>>>>>>> help the disks cope is to overprovision themif they're SSDs, and you can
>>>>>>> try different advices in the file storage definition in the command line
>>>>>>> (last parameter, after granularity).
>>>>>>>
>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>
>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> I am having a critical problem with Varnish Cache in production for
>>>>>>>> over a month and any help will be appreciated.
>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>
>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>> responding to CLI, killed it.
>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from
>>>>>>>> ping: 400 CLI communication error
>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died
>>>>>>>> signal=9
>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>>> Child starts
>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>
>>>>>>>> The following link is the varnishstat output just 1 minute before a
>>>>>>>> restart:
>>>>>>>>
>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>
>>>>>>>> Environment:
>>>>>>>>
>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>> packagecloud.io
>>>>>>>> CPU 2x2.9 GHz
>>>>>>>> Mem 3.69 GiB
>>>>>>>> Running inside a Docker container
>>>>>>>> NFILES=131072
>>>>>>>> MEMLOCK=82000
>>>>>>>>
>>>>>>>> Additional info:
>>>>>>>>
>>>>>>>> - I need to cache a large number of objets and the cache should
>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>> if this is a problem;
>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>> anything to do with it;
>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>> - No panic is reported, the only info I can retrieve is from syslog;
>>>>>>>> - During all the time, event moments before the crashes, everything
>>>>>>>> is okay and requests are being responded very fast.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano Baldo
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> varnish-misc mailing list
>>>>>>>> varnish-misc at varnish-cache.org
>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20170628/610a5a7a/attachment-0001.html>