Child process recurrently being restarted

Tue Jun 27 23:07:31 CEST 2017

Hi Guillaume.

It keeps restarting.
Would you mind taking a quick look in the following VCL file to check if
you find anything suspicious?

Thank you very much.

Best,
Stefano

vcl 4.0;

import std;

backend default {
  .host = "sites-web-server-lb";
  .port = "80";
}

include "/etc/varnish/bad_bot_detection.vcl";

sub vcl_recv {
  call bad_bot_detection;

  if (req.url == "/nocache" || req.url == "/version") {
    return(pass);
  }

  unset req.http.Cookie;
  if (req.method == "PURGE") {
    ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent
!~ Googlebot");
    return(synth(750));
  }

  set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
}

sub vcl_synth {
  if (resp.status == 750) {
    set resp.status = 200;
    synthetic("PURGED => " + req.url);
    return(deliver);
  } elsif (resp.status == 501) {
    set resp.status = 200;
    set resp.http.Content-Type = "text/html; charset=utf-8";
    synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
    return(deliver);
  }
}

sub vcl_backend_response {
  unset beresp.http.Set-Cookie;
  set beresp.http.x-host = bereq.http.host;
  set beresp.http.x-user-agent = bereq.http.user-agent;

  if (bereq.url == "/themes/basic/assets/theme.min.css"
    || bereq.url == "/api/events/PAGEVIEW"
    || bereq.url ~ "^\/assets\/img\/") {
    set beresp.http.Cache-Control = "max-age=0";
  } else {
    unset beresp.http.Cache-Control;
  }

  if (beresp.status == 200 ||
    beresp.status == 301 ||
    beresp.status == 302 ||
    beresp.status == 404) {
      if (bereq.url ~ "\&ordenar=aleatorio$") {
        set beresp.http.X-TTL = "1d";
        set beresp.ttl = 1d;
      } else {
        set beresp.http.X-TTL = "1w";
        set beresp.ttl = 1w;
      }
  }

  if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
    set beresp.do_gzip = true;
  }
}

sub vcl_pipe {
  set bereq.http.connection = "close";
  return (pipe);
}

sub vcl_deliver {
  unset resp.http.x-host;
  unset resp.http.x-user-agent;
}

sub vcl_backend_error {
  if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504)
{
    set beresp.status = 200;
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
    return (deliver);
  }
}

sub vcl_hash {
  if (req.http.User-Agent ~ "Google Page Speed") {
    hash_data("Google Page Speed");
  } elsif (req.http.User-Agent ~ "Googlebot") {
    hash_data("Googlebot");
  }
}

sub vcl_deliver {
  if (resp.status == 501) {
    return (synth(resp.status));
  }
  if (obj.hits > 0) {
    set resp.http.X-Cache = "hit";
  } else {
    set resp.http.X-Cache = "miss";
  }
}

On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
guillaume at varnish-software.com> wrote:

> Nice! It may have been the cause, time will tell.can you report back in a
> few days to let us know?
> --
> Guillaume Quintard
>
> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo at gmail.com> wrote:
>
>> Hi Guillaume.
>>
>> I think things will start to going better now after changing the bans.
>> This is how my last varnishstat looked like moments before a crash
>> regarding the bans:
>>
>> MAIN.bans                     41336          .   Count of bans
>> MAIN.bans_completed           37967          .   Number of bans marked
>> 'completed'
>> MAIN.bans_obj                     0          .   Number of bans using
>> obj.*
>> MAIN.bans_req                 41335          .   Number of bans using
>> req.*
>> MAIN.bans_added               41336         0.68 Bans added
>> MAIN.bans_deleted                 0         0.00 Bans deleted
>>
>> And this is how it looks like now:
>>
>> MAIN.bans                         2          .   Count of bans
>> MAIN.bans_completed               1          .   Number of bans marked
>> 'completed'
>> MAIN.bans_obj                     2          .   Number of bans using
>> obj.*
>> MAIN.bans_req                     0          .   Number of bans using
>> req.*
>> MAIN.bans_added                2016         0.69 Bans added
>> MAIN.bans_deleted              2014         0.69 Bans deleted
>>
>> Before the changes, bans were never deleted!
>> Now the bans are added and quickly deleted after a minute or even a
>> couple of seconds.
>>
>> May this was the cause of the problem? It seems like varnish was having a
>> large number of bans to manage and test against.
>> I will let it ride now. Let's see if the problem persists or it's gone!
>> :-)
>>
>> Best,
>> Stefano
>>
>>
>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>> guillaume at varnish-software.com> wrote:
>>
>>> Looking good!
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <stefanobaldo at gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume,
>>>>
>>>> Can the following be considered "ban lurker friendly"?
>>>>
>>>> sub vcl_backend_response {
>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>> }
>>>>
>>>> sub vcl_recv {
>>>>   if (req.method == "PURGE") {
>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>> obj.http.x-user-agent !~ Googlebot");
>>>>     return(synth(750));
>>>>   }
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>>   unset resp.http.x-url;
>>>>   unset resp.http.x-user-agent;
>>>> }
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>>
>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>> guillaume at varnish-software.com> wrote:
>>>>
>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>
>>>>> I don't think you need to expand the VSL at all.
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo at gmail.com> wrote:
>>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> Thanks for answering.
>>>>>
>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>> performance but it stills restarting.
>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>> signal of overhead.
>>>>>
>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m
>>>>> default size passing "-l 200m,20m" to varnishd and using
>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a
>>>>> problem here. After a couple of hours varnish died and I received a "no
>>>>> space left on device" message - deleting the /var/lib/varnish solved the
>>>>> problem and varnish was up again, but it's weird because there was free
>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>> what could have happened. I will try to stop increasing the
>>>>> /var/lib/varnish size.
>>>>>
>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>
>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url +
>>>>> " && req.http.User-Agent !~ Googlebot");
>>>>>
>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>> documentation and it looks like they're not.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>>
>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>> guillaume at varnish-software.com> wrote:
>>>>>
>>>>>> Hi Stefano,
>>>>>>
>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck
>>>>>> trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>
>>>>>> After some time, the file storage is terrible on a hard drive (SSDs
>>>>>> take a bit more time to degrade) because of fragmentation. One solution to
>>>>>> help the disks cope is to overprovision themif they're SSDs, and you can
>>>>>> try different advices in the file storage definition in the command line
>>>>>> (last parameter, after granularity).
>>>>>>
>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>
>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> I am having a critical problem with Varnish Cache in production for
>>>>>>> over a month and any help will be appreciated.
>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>
>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>> responding to CLI, killed it.
>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from
>>>>>>> ping: 400 CLI communication error
>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died
>>>>>>> signal=9
>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>> Child starts
>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>
>>>>>>> The following link is the varnishstat output just 1 minute before a
>>>>>>> restart:
>>>>>>>
>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>
>>>>>>> Environment:
>>>>>>>
>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>> Installed using pre-built package from official repo at
>>>>>>> packagecloud.io
>>>>>>> CPU 2x2.9 GHz
>>>>>>> Mem 3.69 GiB
>>>>>>> Running inside a Docker container
>>>>>>> NFILES=131072
>>>>>>> MEMLOCK=82000
>>>>>>>
>>>>>>> Additional info:
>>>>>>>
>>>>>>> - I need to cache a large number of objets and the cache should last
>>>>>>> for almost a week, so I have set up a 450G storage space, I don't know if
>>>>>>> this is a problem;
>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>> anything to do with it;
>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>> - No panic is reported, the only info I can retrieve is from syslog;
>>>>>>> - During all the time, event moments before the crashes, everything
>>>>>>> is okay and requests are being responded very fast.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano Baldo
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> varnish-misc mailing list
>>>>>>> varnish-misc at varnish-cache.org
>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20170627/0454f260/attachment-0001.html>