Child process recurrently being restarted

Stefano Baldo stefanobaldo at gmail.com
Mon Jun 26 20:21:43 CEST 2017


Hi Guillaume.

I think things will start to go better now after changing the bans.
This is what my last varnishstat looked like moments before a crash,
as far as the bans are concerned:

MAIN.bans                     41336          .   Count of bans
MAIN.bans_completed           37967          .   Number of bans marked 'completed'
MAIN.bans_obj                     0          .   Number of bans using obj.*
MAIN.bans_req                 41335          .   Number of bans using req.*
MAIN.bans_added               41336         0.68 Bans added
MAIN.bans_deleted                 0         0.00 Bans deleted

And this is how it looks like now:

MAIN.bans                         2          .   Count of bans
MAIN.bans_completed               1          .   Number of bans marked 'completed'
MAIN.bans_obj                     2          .   Number of bans using obj.*
MAIN.bans_req                     0          .   Number of bans using req.*
MAIN.bans_added                2016         0.69 Bans added
MAIN.bans_deleted              2014         0.69 Bans deleted

Before the changes, bans were never deleted!
Now bans are added and deleted quickly, within a minute or even a couple
of seconds.
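
If I read the counters right, bans_completed is the number of bans the
lurker has already finished testing, so watching "varnishadm ban.list"
(completed bans should be flagged with a "C", if I'm not mistaken) is a
quick way to confirm that new bans keep moving through the list.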

Could this have been the cause of the problem? It seems like Varnish had a
large number of bans to manage and test against.
I will let it ride now. Let's see if the problem persists or is gone! :-)

Best,
Stefano


On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
guillaume at varnish-software.com> wrote:

> Looking good!
>
> --
> Guillaume Quintard
>
> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <stefanobaldo at gmail.com>
> wrote:
>
>> Hi Guillaume,
>>
>> Can the following be considered "ban lurker friendly"?
>>
>> sub vcl_backend_response {
>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>> }
>>
>> sub vcl_recv {
>>   if (req.method == "PURGE") {
>>     ban("obj.http.x-url == " + req.http.host + req.url + " &&
>> obj.http.x-user-agent !~ Googlebot");
>>     return(synth(750));
>>   }
>> }
>>
>> sub vcl_deliver {
>>   unset resp.http.x-url;
>>   unset resp.http.x-user-agent;
>> }
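>>
>> One detail I noticed while writing this: synth(750) needs a matching
>> branch in vcl_synth if the PURGE response should be friendlier than the
>> default error page. A minimal sketch (the 200 status and the body text
>> are just my placeholders):
>>
>> sub vcl_synth {
>>   if (resp.status == 750) {
>>     # Turn the internal 750 marker into a plain "ban added" reply.
>>     set resp.status = 200;
>>     set resp.http.Content-Type = "text/plain";
>>     synthetic("Ban added");
>>     return (deliver);
>>   }
>> }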
>>
>> Best,
>> Stefano
>>
>>
>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>> guillaume at varnish-software.com> wrote:
>>
>>> Not lurker friendly at all, indeed. You'll need to avoid req.*
>>> expressions. The easiest way is to stash the host, user-agent and url in
>>> beresp.http.* and ban against those (and unset them in vcl_deliver).
>>>
>>> I don't think you need to expand the VSL at all.
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo at gmail.com> wrote:
>>>
>>> Hi Guillaume.
>>>
>>> Thanks for answering.
>>>
>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase
>>> performance, but it still keeps restarting.
>>> Also, I checked the I/O performance of the disk and there is no sign
>>> of overload.
>>>
>>> I've changed /var/lib/varnish to a tmpfs and increased its default 80m
>>> size by passing "-l 200m,20m" to varnishd, using
>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a
>>> problem here: after a couple of hours varnish died and I received a "no
>>> space left on device" message - deleting /var/lib/varnish solved the
>>> problem and varnish was up again. It's weird, because there was free
>>> memory on the host available to the tmpfs, so I don't know what could
>>> have happened. I will try again without increasing the
>>> /var/lib/varnish size.
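>>>
>>> (One guess at the arithmetic: "-l 200m,20m" alone reserves 220M of
>>> VSL/VSM space under /var/lib/varnish, which leaves only about 36M of
>>> the 256M tmpfs for everything else, so it may simply have filled up.)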
>>>
>>> Anyway, I am worried about the bans. You asked me if the bans are lurker
>>> friendly. Well, I don't think so. My bans are created this way:
>>>
>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + "
>>> && req.http.User-Agent !~ Googlebot");
>>>
>>> Are they lurker friendly? I took a quick look at the documentation and
>>> it looks like they're not - as far as I can tell, the lurker walks
>>> objects in the background without any request context, so a ban that
>>> references req.* can only ever be tested when a client actually hits
>>> the object.
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>> guillaume at varnish-software.com> wrote:
>>>
>>>> Hi Stefano,
>>>>
>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck
>>>> trying to push/pull data and can't make time to reply to the CLI. I'd
>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
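>>>>
>>>> Something like "iostat -x 1" (from the sysstat package) will show both
>>>> at a glance, if you don't already have graphing in place.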
>>>>
>>>> After some time, the file storage behaves terribly on a hard drive
>>>> (SSDs take a bit longer to degrade) because of fragmentation. One
>>>> solution to help the disks cope is to overprovision them if they're
>>>> SSDs, and you can try different advice values in the file storage
>>>> definition on the command line (the last parameter, after granularity).
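>>>>
>>>> For example (the path is a placeholder, and if I remember right the
>>>> valid advice values are "normal" and "random"):
>>>>
>>>>   -s file,/var/lib/varnish/storage.bin,450G,4k,random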
>>>>
>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>
>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <stefanobaldo at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I have been having a critical problem with Varnish Cache in production
>>>>> for over a month, and any help will be appreciated.
>>>>> The problem is that the Varnish child process is recurrently being
>>>>> restarted after 10~20h of use, with the following messages:
>>>>>
>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>> responding to CLI, killed it.
>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from
>>>>> ping: 400 CLI communication error
>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died
>>>>> signal=9
>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child
>>>>> starts
>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>
>>>>> The following link shows the varnishstat output just 1 minute before a
>>>>> restart:
>>>>>
>>>>> https://pastebin.com/g0g5RVTs
>>>>>
>>>>> Environment:
>>>>>
>>>>> varnish-5.1.2 revision 6ece695
>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>> Installed using pre-built package from official repo at
>>>>> packagecloud.io
>>>>> CPU 2x2.9 GHz
>>>>> Mem 3.69 GiB
>>>>> Running inside a Docker container
>>>>> NFILES=131072
>>>>> MEMLOCK=82000
>>>>>
>>>>> Additional info:
>>>>>
>>>>> - I need to cache a large number of objects and the cache should last
>>>>> for almost a week, so I have set up 450G of storage space; I don't know
>>>>> if this is a problem;
>>>>> - I use bans a lot. There were about 40k bans in the system just before
>>>>> the last crash. I really don't know if this is too much or whether it
>>>>> has anything to do with the crashes;
>>>>> - No registered CPU spikes (usage is almost always around 30%);
>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>> - Up until the very moment of each crash, everything is okay and
>>>>> requests are answered very quickly.
>>>>>
>>>>> Best,
>>>>> Stefano Baldo
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> varnish-misc mailing list
>>>>> varnish-misc at varnish-cache.org
>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>>>>
>>>>
>>>>
>>>
>>>
>>
>