Child process recurrently being restarted

Wed Jun 28 15:58:55 CEST 2017

Transient is not limited I suppose, so the g_space == 0 is normal.

You can try running on bare metal, not sure there will be a difference

-- 
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:54 PM, Stefano Baldo <stefanobaldo at gmail.com>
wrote:

> Also, we are running varnish inside a docker container.
> The storage disk is attached to the same host, and mounted to the
> container via docker volume.
>
> Do you think it's worth a try to run varnish directly on the host,
> avoiding docker? I don't see how this could be a problem but I don't know
> what to do anymore.
>
> Best,
>
>
> On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <
> guillaume at varnish-software.com> wrote:
>
>> Yeah, I was wondering about Transient, but it seems under control.
>>
>> Apart from moving away from file storage, I have nothing at the moment :-/
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo at gmail.com>
>> wrote:
>>
>>> Hi.
>>>
>>> root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes
>>> SMA.Transient.g_bytes                    519022          .   Bytes
>>> outstanding
>>> SMF.s0.g_bytes                      23662845952          .   Bytes
>>> outstanding
>>>
>>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>>
>>>
>>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
>>> guillaume at varnish-software.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>>> the memory you are seeing?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>>
>>>>> This was 1 minute after the last restart:
>>>>>
>>>>> MAIN.n_object               3908216          .   object structs made
>>>>> SMF.s0.g_alloc              7794510          .   Allocations
>>>>> outstanding
>>>>>
>>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>>
>>>>> One interest thing I've found is about the memory usage.
>>>>>
>>>>> In the 1st minute of use:
>>>>> MemTotal:        3865572 kB
>>>>> MemFree:          120768 kB
>>>>> MemAvailable:    2300268 kB
>>>>>
>>>>> 1 minute before a restart:
>>>>> MemTotal:        3865572 kB
>>>>> MemFree:           82480 kB
>>>>> MemAvailable:      68316 kB
>>>>>
>>>>> It seems like the system is possibly running out of memory.
>>>>>
>>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>>> the memory usage by varnishd?
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>>
>>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>>> guillaume at varnish-software.com> wrote:
>>>>>
>>>>>> Sadly, nothing suspicious here, you can still try:
>>>>>> - bumping the cli_timeout
>>>>>> - changing your disk scheduler
>>>>>> - changing the advice option of the file storage
>>>>>>
>>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>>> the disk because of the file storage fragmentation.
>>>>>>
>>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>>> clue.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> It keeps restarting.
>>>>>>> Would you mind taking a quick look in the following VCL file to
>>>>>>> check if you find anything suspicious?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> vcl 4.0;
>>>>>>>
>>>>>>> import std;
>>>>>>>
>>>>>>> backend default {
>>>>>>>   .host = "sites-web-server-lb";
>>>>>>>   .port = "80";
>>>>>>> }
>>>>>>>
>>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>>   call bad_bot_detection;
>>>>>>>
>>>>>>>   if (req.url == "/nocache" || req.url == "/version") {
>>>>>>>     return(pass);
>>>>>>>   }
>>>>>>>
>>>>>>>   unset req.http.Cookie;
>>>>>>>   if (req.method == "PURGE") {
>>>>>>>     ban("obj.http.x-host == " + req.http.host + " &&
>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>     return(synth(750));
>>>>>>>   }
>>>>>>>
>>>>>>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_synth {
>>>>>>>   if (resp.status == 750) {
>>>>>>>     set resp.status = 200;
>>>>>>>     synthetic("PURGED => " + req.url);
>>>>>>>     return(deliver);
>>>>>>>   } elsif (resp.status == 501) {
>>>>>>>     set resp.status = 200;
>>>>>>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht
>>>>>>> ml"));
>>>>>>>     return(deliver);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>>   unset beresp.http.Set-Cookie;
>>>>>>>   set beresp.http.x-host = bereq.http.host;
>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>
>>>>>>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>>>     || bereq.url == "/api/events/PAGEVIEW"
>>>>>>>     || bereq.url ~ "^\/assets\/img\/") {
>>>>>>>     set beresp.http.Cache-Control = "max-age=0";
>>>>>>>   } else {
>>>>>>>     unset beresp.http.Cache-Control;
>>>>>>>   }
>>>>>>>
>>>>>>>   if (beresp.status == 200 ||
>>>>>>>     beresp.status == 301 ||
>>>>>>>     beresp.status == 302 ||
>>>>>>>     beresp.status == 404) {
>>>>>>>       if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>>>         set beresp.http.X-TTL = "1d";
>>>>>>>         set beresp.ttl = 1d;
>>>>>>>       } else {
>>>>>>>         set beresp.http.X-TTL = "1w";
>>>>>>>         set beresp.ttl = 1w;
>>>>>>>       }
>>>>>>>   }
>>>>>>>
>>>>>>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>>> {
>>>>>>>     set beresp.do_gzip = true;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_pipe {
>>>>>>>   set bereq.http.connection = "close";
>>>>>>>   return (pipe);
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>   unset resp.http.x-host;
>>>>>>>   unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_error {
>>>>>>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>>> == 504) {
>>>>>>>     set beresp.status = 200;
>>>>>>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>>>     return (deliver);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_hash {
>>>>>>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>>>     hash_data("Google Page Speed");
>>>>>>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>>>     hash_data("Googlebot");
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>   if (resp.status == 501) {
>>>>>>>     return (synth(resp.status));
>>>>>>>   }
>>>>>>>   if (obj.hits > 0) {
>>>>>>>     set resp.http.X-Cache = "hit";
>>>>>>>   } else {
>>>>>>>     set resp.http.X-Cache = "miss";
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Nice! It may have been the cause, time will tell.can you report
>>>>>>>> back in a few days to let us know?
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> I think things will start to going better now after changing the
>>>>>>>>> bans.
>>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>>> regarding the bans:
>>>>>>>>>
>>>>>>>>> MAIN.bans                     41336          .   Count of bans
>>>>>>>>> MAIN.bans_completed           37967          .   Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj                     0          .   Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req                 41335          .   Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added               41336         0.68 Bans added
>>>>>>>>> MAIN.bans_deleted                 0         0.00 Bans deleted
>>>>>>>>>
>>>>>>>>> And this is how it looks like now:
>>>>>>>>>
>>>>>>>>> MAIN.bans                         2          .   Count of bans
>>>>>>>>> MAIN.bans_completed               1          .   Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj                     2          .   Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req                     0          .   Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added                2016         0.69 Bans added
>>>>>>>>> MAIN.bans_deleted              2014         0.69 Bans deleted
>>>>>>>>>
>>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>>> Now the bans are added and quickly deleted after a minute or even
>>>>>>>>> a couple of seconds.
>>>>>>>>>
>>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>>> gone! :-)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Looking good!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>>
>>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_recv {
>>>>>>>>>>>   if (req.method == "PURGE") {
>>>>>>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>>>     return(synth(750));
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>>>   unset resp.http.x-url;
>>>>>>>>>>>   unset resp.http.x-user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to
>>>>>>>>>>>> increase performance but it stills restarting.
>>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is
>>>>>>>>>>>> no signal of overhead.
>>>>>>>>>>>>
>>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount.
>>>>>>>>>>>> There was a problem here. After a couple of hours varnish died and I
>>>>>>>>>>>> received a "no space left on device" message - deleting the
>>>>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but it's
>>>>>>>>>>>> weird because there was free memory on the host to be used with the tmpfs
>>>>>>>>>>>> directory, so I don't know what could have happened. I will try to stop
>>>>>>>>>>>> increasing the /var/lib/varnish size.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>>
>>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>>
>>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish
>>>>>>>>>>>>> gets stuck trying to push/pull data and can't make time to reply to the
>>>>>>>>>>>>> CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to
>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>>> The problem is that Varnish child process is recurrently
>>>>>>>>>>>>>> being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected
>>>>>>>>>>>>>> reply from ping: 400 CLI communication error
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> Started
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system
>>>>>>>>>>>>>> just before the last crash. I really don't know if this is too much or may
>>>>>>>>>>>>>> have anything to do with it;
>>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>>> varnish-misc at varnish-cache.org
>>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20170628/05eb2c56/attachment-0001.html>