Child process recurrently being restarted

Stefano Baldo stefanobaldo at gmail.com
Wed Jun 28 15:39:52 CEST 2017


Hi.

root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes
SMA.Transient.g_bytes                    519022          .   Bytes outstanding
SMF.s0.g_bytes                      23662845952          .   Bytes outstanding

You mean g_bytes from SMA.Transient? I haven't set any malloc storage.
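
From what I've read, if Transient growth turns out to be the issue it can
be capped by declaring that storage explicitly when starting varnishd - a
rough sketch, where the file path and the sizes are placeholders, not
recommendations:

varnishd -s file,/var/lib/varnish/storage.bin,450G \
         -s Transient=malloc,256m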


On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
guillaume at varnish-software.com> wrote:

> Hi,
>
> can you run "varnishstat -1 | grep g_bytes" and see if it matches
> the memory you are seeing?
>
> --
> Guillaume Quintard
>
> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo at gmail.com>
> wrote:
>
>> Hi Guillaume.
>>
>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>> anyway, which seems to indicate that the thread is really stalled.
>>
>> This was 1 minute after the last restart:
>>
>> MAIN.n_object               3908216          .   object structs made
>> SMF.s0.g_alloc              7794510          .   Allocations outstanding
>>
>> I've just changed the I/O scheduler to noop to see what happens.
>>
>> One interesting thing I've found concerns the memory usage.
>>
>> In the 1st minute of use:
>> MemTotal:        3865572 kB
>> MemFree:          120768 kB
>> MemAvailable:    2300268 kB
>>
>> 1 minute before a restart:
>> MemTotal:        3865572 kB
>> MemFree:           82480 kB
>> MemAvailable:      68316 kB
>>
>> It seems like the system is possibly running out of memory.
>>
>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>> see in some examples that it is common to use "-s file" AND "-s malloc"
>> together. Should I be passing "-s malloc" as well to somehow try to limit
>> the memory usage by varnishd?
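>>
>> Something like this, maybe? (The path and the malloc size here are just
>> a guess on my part.)
>>
>> varnishd -s file,/var/lib/varnish/storage.bin,450G -s malloc,512m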
>>
>> Best,
>> Stefano
>>
>>
>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>> guillaume at varnish-software.com> wrote:
>>
>>> Sadly, nothing suspicious here, you can still try:
>>> - bumping the cli_timeout
>>> - changing your disk scheduler
>>> - changing the advice option of the file storage
>>>
>>> I'm still convinced this is due to Varnish getting stuck waiting for the
>>> disk because of the file storage fragmentation.
>>>
>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>> objects. Ideally, we would have a 1:1 relation between objects and
>>> allocations. If that number drops prior to a restart, that would be a good
>>> clue.
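>>>
>>> Something like this should show the two counters side by side (an
>>> untested one-liner, adjust to taste):
>>>
>>> varnishstat -1 | egrep 'n_object|g_alloc'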
>>>
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo at gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> It keeps restarting.
>>>> Would you mind taking a quick look at the following VCL file to check
>>>> whether you find anything suspicious?
>>>>
>>>> Thank you very much.
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>> vcl 4.0;
>>>>
>>>> import std;
>>>>
>>>> backend default {
>>>>   .host = "sites-web-server-lb";
>>>>   .port = "80";
>>>> }
>>>>
>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>
>>>> sub vcl_recv {
>>>>   call bad_bot_detection;
>>>>
>>>>   if (req.url == "/nocache" || req.url == "/version") {
>>>>     return(pass);
>>>>   }
>>>>
>>>>   unset req.http.Cookie;
>>>>   if (req.method == "PURGE") {
>>>>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>>     return(synth(750));
>>>>   }
>>>>
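>>>>   # normalize the URL: collapse runs of "/" into a single "/",
>>>>   # unless they follow "http:" or "https"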
>>>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>> }
>>>>
>>>> sub vcl_synth {
>>>>   if (resp.status == 750) {
>>>>     set resp.status = 200;
>>>>     synthetic("PURGED => " + req.url);
>>>>     return(deliver);
>>>>   } elsif (resp.status == 501) {
>>>>     set resp.status = 200;
>>>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>     return(deliver);
>>>>   }
>>>> }
>>>>
>>>> sub vcl_backend_response {
>>>>   unset beresp.http.Set-Cookie;
>>>>   set beresp.http.x-host = bereq.http.host;
>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>
>>>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>     || bereq.url == "/api/events/PAGEVIEW"
>>>>     || bereq.url ~ "^\/assets\/img\/") {
>>>>     set beresp.http.Cache-Control = "max-age=0";
>>>>   } else {
>>>>     unset beresp.http.Cache-Control;
>>>>   }
>>>>
>>>>   if (beresp.status == 200 ||
>>>>     beresp.status == 301 ||
>>>>     beresp.status == 302 ||
>>>>     beresp.status == 404) {
>>>>       if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>         set beresp.http.X-TTL = "1d";
>>>>         set beresp.ttl = 1d;
>>>>       } else {
>>>>         set beresp.http.X-TTL = "1w";
>>>>         set beresp.ttl = 1w;
>>>>       }
>>>>   }
>>>>
>>>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>     set beresp.do_gzip = true;
>>>>   }
>>>> }
>>>>
>>>> sub vcl_pipe {
>>>>   set bereq.http.connection = "close";
>>>>   return (pipe);
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>>   unset resp.http.x-host;
>>>>   unset resp.http.x-user-agent;
>>>> }
>>>>
>>>> sub vcl_backend_error {
>>>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>     set beresp.status = 200;
>>>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>     return (deliver);
>>>>   }
>>>> }
>>>>
>>>> sub vcl_hash {
>>>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>>>     hash_data("Google Page Speed");
>>>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>     hash_data("Googlebot");
>>>>   }
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>>   if (resp.status == 501) {
>>>>     return (synth(resp.status));
>>>>   }
>>>>   if (obj.hits > 0) {
>>>>     set resp.http.X-Cache = "hit";
>>>>   } else {
>>>>     set resp.http.X-Cache = "miss";
>>>>   }
>>>> }
>>>>
>>>>
>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>> guillaume at varnish-software.com> wrote:
>>>>
>>>>> Nice! It may have been the cause, time will tell. Can you report back
>>>>> in a few days to let us know?
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo at gmail.com> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> I think things will start going better now after changing the bans.
>>>>>> This is what my varnishstat looked like moments before the last crash,
>>>>>> regarding the bans:
>>>>>>
>>>>>> MAIN.bans                     41336          .   Count of bans
>>>>>> MAIN.bans_completed           37967          .   Number of bans marked 'completed'
>>>>>> MAIN.bans_obj                     0          .   Number of bans using obj.*
>>>>>> MAIN.bans_req                 41335          .   Number of bans using req.*
>>>>>> MAIN.bans_added               41336         0.68 Bans added
>>>>>> MAIN.bans_deleted                 0         0.00 Bans deleted
>>>>>>
>>>>>> And this is how it looks now:
>>>>>>
>>>>>> MAIN.bans                         2          .   Count of bans
>>>>>> MAIN.bans_completed               1          .   Number of bans marked 'completed'
>>>>>> MAIN.bans_obj                     2          .   Number of bans using obj.*
>>>>>> MAIN.bans_req                     0          .   Number of bans using req.*
>>>>>> MAIN.bans_added                2016         0.69 Bans added
>>>>>> MAIN.bans_deleted              2014         0.69 Bans deleted
>>>>>>
>>>>>> Before the changes, bans were never deleted!
>>>>>> Now bans are added and then deleted within a minute, sometimes within
>>>>>> a couple of seconds.
>>>>>>
>>>>>> Could this have been the cause of the problem? It seems like Varnish
>>>>>> had a large number of bans to manage and test against.
>>>>>> I will let it ride now. Let's see if the problem persists or is
>>>>>> gone! :-)
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>
>>>>>>> Looking good!
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume,
>>>>>>>>
>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>
>>>>>>>> sub vcl_backend_response {
>>>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_recv {
>>>>>>>>   if (req.method == "PURGE") {
>>>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>     return(synth(750));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_deliver {
>>>>>>>>   unset resp.http.x-url;
>>>>>>>>   unset resp.http.x-user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>>
>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>> expressions. The easiest way is to stash the host, user-agent and url
>>>>>>>>> in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>
>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> Thanks for answering.
>>>>>>>>>
>>>>>>>>> I'm using an SSD. I've changed from ext4 to ext2 to increase
>>>>>>>>> performance but it still keeps restarting.
>>>>>>>>> Also, I checked the I/O performance of the disk and there is no
>>>>>>>>> sign of overload.
>>>>>>>>>
>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m
>>>>>>>>> default size by passing "-l 200m,20m" to varnishd and using
>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a
>>>>>>>>> problem here: after a couple of hours varnish died and I received a
>>>>>>>>> "no space left on device" message - deleting /var/lib/varnish solved
>>>>>>>>> the problem and varnish was up again. It's weird, though, because
>>>>>>>>> there was free memory on the host available to the tmpfs, so I don't
>>>>>>>>> know what could have happened. I will try again without increasing
>>>>>>>>> the /var/lib/varnish size.
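>>>>>>>>>
>>>>>>>>> For reference, the tmpfs entry in /etc/fstab looks roughly like this
>>>>>>>>> (reconstructed from the mount options above):
>>>>>>>>>
>>>>>>>>> tmpfs /var/lib/varnish tmpfs nodev,nosuid,noatime,size=256M 0 0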
>>>>>>>>>
>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>> lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>>
>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>
>>>>>>>>> Are they lurker friendly? I took a quick look at the documentation
>>>>>>>>> and it looks like they're not.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>> guillaume at varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Stefano,
>>>>>>>>>>
>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>
>>>>>>>>>> After some time, the file storage performs terribly on a hard drive
>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>> solution to help the disks cope is to overprovision them if they're
>>>>>>>>>> SSDs, and you can try different advice values in the file storage
>>>>>>>>>> definition on the command line (last parameter, after granularity).
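>>>>>>>>>>
>>>>>>>>>> For example (the path and numbers are made up, and if memory serves
>>>>>>>>>> "random" is the default, so "normal" would be the one to try):
>>>>>>>>>>
>>>>>>>>>> varnishd -s file,/var/lib/varnish/storage.bin,450G,4k,normal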
>>>>>>>>>>
>>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>>
>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
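>>>>>>>>>>
>>>>>>>>>> A quick way to inspect them (if I remember right, completed bans are
>>>>>>>>>> flagged with a "C" in the output):
>>>>>>>>>>
>>>>>>>>>> varnishadm ban.list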
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>> stefanobaldo at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello.
>>>>>>>>>>>
>>>>>>>>>>> I have been having a critical problem with Varnish Cache in
>>>>>>>>>>> production for over a month, and any help will be appreciated.
>>>>>>>>>>> The problem is that the Varnish child process is recurrently being
>>>>>>>>>>> restarted after 10~20h of use, with the following messages:
>>>>>>>>>>>
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>
>>>>>>>>>>> The following link shows the varnishstat output just 1 minute
>>>>>>>>>>> before a restart:
>>>>>>>>>>>
>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>
>>>>>>>>>>> Environment:
>>>>>>>>>>>
>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>> packagecloud.io
>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>> NFILES=131072
>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>
>>>>>>>>>>> Additional info:
>>>>>>>>>>>
>>>>>>>>>>> - I need to cache a large number of objects and the cache should
>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>>>>> if this is a problem;
>>>>>>>>>>> - I use bans a lot. There were about 40k bans in the system just
>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>> anything to do with it;
>>>>>>>>>>> - No registered CPU spikes (usage almost always around 30%);
>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>> syslog;
>>>>>>>>>>> - All the time, even moments before the crashes, everything is
>>>>>>>>>>> okay and requests are answered very fast.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>