Varnish child killed

Jean-Francois Laurens jean-francois.laurens at rts.ch
Thu Apr 21 15:20:47 CEST 2011


Hi Geoff,

1. You¹re completely true about guesses !
In fact it¹s very hard for me to get anything valuable from the logs in
order to point a guilty !

I can just tell what I can see from cacti graphs and correlate with the
behavior of varnish.

What is the way in your opinion to have a better idea on what¹s going on ?
At this stage I¹m just open to any idea !
Please don¹t tell me to run varnish in debug mode in production ;-)

I am just trying to explain what I can see as there was some obvious paterns
(out of cacti graphs):
Impossible to go higher the 25Gb of cache used. You can clearly see that the
shared memory used is increasing gradually until 25G and can¹t go higher.
I believe there is some more objects to cache but varnish can¹t for any
reason.
Impossible to go above 1 million object cached. Same observation as the
shared memory.

2. hitratio
My hit ratio is pretty low around 50%.
It used to be around 70% with the version 2.0.6 and dropped down to 50%.
I can¹t explain it as what was changed was only :
In vcl_fetch, replace obj by beresp
Replace calls to pass or deliver by return(deliver) or return(pass)

Attached is a graph in order to give you an idea about hit ratio:


On the left side you see of the graph you see the hitrate when using 2.0.6
and on right side after upgrading.

Here is my vcl:
backend default {
    .host = "172.20.102.55";
    .port = "8080";
    .connect_timeout = 60s;
    .first_byte_timeout = 60s;
    .between_bytes_timeout = 600s;
}
sub vcl_recv {

    unset req.http.Cookie;
   remove req.http.X-Forwarded-For;
    set req.http.X-Forwarded-For = client.ip;
    set req.grace = 1m;
    if (req.url ~ "^/is-alive") {
        error 750 "Up";
    }

    if (req.http.Authenticate) {
        return (pass);
    }

    if (req.url ~ "^http://") {
        set req.url = regsub(req.url, "http://[^/]*", "");
    }

    if (req.http.host ~ "static.ece.tsr.ch") {
        return (pass);
    }

    if (req.http.host == "ece.tsr.ch" && req.url !~ "^/robots.txt") {
        set req.http.host = "www.tsr.ch";
    }

    if (req.url ~ "^/tsr/.+/index\.html") {
        set req.url = regsub(req.url, "^/tsr/.+/index\.html",
"/tsr/index.html");
    }

    if (req.url ~ "&") {
        set req.url = regsuball(req.url, "&", "&");
    }


    if (req.request == "GET" || req.request == "HEAD") {
        return (lookup);
    }
}
sub vcl_error {
    if (obj.status == 750) {
        set obj.http.Content-Type = "text/plain; charset=utf-8";
        synthetic {"escenic-up"};
        return (deliver);
    }

}
sub vcl_fetch {

    unset beresp.http.x-ece-cache;
    set beresp.http.X-ece-cache = server.hostname;
    set beresp.grace = 1m;

    unset beresp.http.X-ece-was-cached;

    # respect no-cache from the backend
    if (beresp.http.Cache-Control ~ "no-cache") {
        set beresp.http.X-ece-was-cached = "backend no-cache";
        return (pass);
    }
    #respect custom cache from the backend
    if (beresp.http.X-Ece-Cache-Control ~ "custom") {
       unset beresp.http.X-Ece-Cache-Control;
       set beresp.http.X-ece-was-cached = "backend custom cache";
       return (deliver);
    }

    unset beresp.http.cache-control;
    unset beresp.http.expires;
    unset beresp.http.pragma;
    unset beresp.http.Set-Cookie;

    # errors cache 5 min ttl + 2 min akamai
    if (beresp.status == 403 || beresp.status == 404 || beresp.status == 500
|| beresp.status == 503) {
        set beresp.http.Cache-Control = "max-age=120";
        set beresp.http.X-ece-was-cached = "errors: 5m ttl, max-age=120, age
0";
        set beresp.ttl = 5m;

    # short cache 60 sec ttl + 60 sec akamai
    } else if ( req.url == "/" || req.url == "/info/" || req.url ==
"/sport/" ) {
        set beresp.http.Cache-Control = "max-age=60";
        set beresp.http.X-ece-was-cached = "short: 60s, max-age=60, age 0";
        set beresp.ttl = 60s;

    # very long cache 90 days ttl + 1 week akamai
    } else if (
               beresp.status == 301 || # redirects
               req.url ~ "^/[0-9]{4}/[0-9]{2}/.+\.image($|\?)" || # images,
see CMS-4354
               req.url ~ "\?format=(css|js)&.*cKey=" # javascript or css
with cache key
              ) {
        set beresp.http.Cache-Control = "max-age=604800";
        set beresp.http.X-ece-was-cached = "very-long: 90d ttl,
max-age=604800, age 0";
        set beresp.ttl = 90d;
    } else if (req.url ~ "\?.*page=[0-9]+") {
        set beresp.http.Cache-Control = "max-age=7200";
        set beresp.http.X-ece-was-cached = "intermediate: 22h ttl,
max-age=7200, age 0";
        set beresp.ttl = 22h;
    
    # medium cache 24h ttl + 24h akamai
    } else if (req.url ~ "\.(image|gif|jpg|jpe|css|js|png|swf|ico)($|\?)") {
        set beresp.http.Cache-Control = "max-age=86400";
        set beresp.http.X-ece-was-cached = "medium: 1d ttl, max-age=86400,
age 0";
        set beresp.ttl = 1d;
    
    # default cache 10 min ttl + 5 min akamai
    } else {
        set beresp.http.Cache-Control = "max-age=300";
        set beresp.http.X-ece-was-cached = "default: 10m ttl, max-age 300,
age 0";
        set beresp.ttl = 10m;
    }    
    return (deliver);
}

Here are my varnish startup options:
 /usr/sbin/varnishd -P /var/run/varnish.pid -a :80 -f
/etc/varnish/default.vcl -T 0.0.0.0:6082 -t 120 -w 100,3000,120 -u varnish
-g varnish -s file,/var/lib/varnish/varnish_storage.bin,50G -p thread_pools
4 -p thread_pool_add_delay 2 -p cli_timeout 20 -p session_linger 50/100/150


3. high load situation:
* only varnish is running on this server no apache or whatever, there is no
other process using resources as varnish does.
* varnish was not restarted before, you would have see it in the graph sent
in my previous email because the number of objects would have dropped to
zero and for sure message in the logs
* command line settings: as specified earlier, I only have a 20s timeout,
should I specify something else or more parameters ?
* there is no panic message in the logs

You¹re still completely about this statement:
The real question is why the Varnish child was no longer responding to
pings.

In the end my real question is :
How could I trace child activity ?


Cheers,

Jef

Le 21/04/11 13:42, « Geoff Simmons » a écrit :

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> On 4/21/11 10:51 AM, Jean-Francois Laurens wrote:
>> > 
>> > We¹re run varnish 2.1.5 for some week now and we still do not understand
>> > some behavior regarding the shared memory activity.
> 
> There's not enough information here for anything better than guesses
> about what's going on.
> 
>> > We specified a ­sfile,/var/lib/varnish/varnish_storage.bin,50G in the
>> > configuration but it¹s impossible to go higher than 25G used by varnish.
> [...]
>> > 
>> > In addition I can see varnish doesn¹t seem to be able to handle more
>> > than 1 million objects:
> 
> It's not uncommon for Varnish to use significantly less memory than what
> was allocated, but not because Varnish can't "handle" it, but just
> because it works out that way. Due to a combination of factors like
> usage patterns, TTLs, your command line settings and your VCL, Varnish
> may decide that it doesn't need more than that.
> 
> What do your cache hit ratios say? Do the logs or varnishstat give any
> indication that objects are not being cached when you think they should
> be? Do you have objects that, semantically, could be cached, but aren't
> because, for example, they are unnecessarily setting cookies? You might
> be able to get more into the cache more by tweaking VCL, but as I said,
> that's just a guess.
> 
>> > When the child process get killed, the load of the system was very high:
>> > Apr 20 21:46:44 server-01-39 varnishd[21087]: Child (5372) not
>> > responding to CLI, killing it.
>> > ....
>> > Apr 20 21:49:57 server-01-39 nrpe[18101]: Command completed with return
>> > code 2 and output: CRITICAL -*load average: 159.00, 159.32,
>> > 77.02*|load1=159.000;15.000;30.000;0; load5=159.320;10.000;25.000;0;
>> > load15=77.020;5.000;20.000;0;
>> > ....
>> > Apr 20 21:48:43 server-01-39 varnishd[21087]: Child (5372) not
>> > responding to CLI, killing it.
> 
> It looks like the message about high load came after the Varnish
> processes died, and that might have happened, at least in part, because
> Varnish was restarted and was getting nothing but cache misses. Unless
> the high load was caused by something else. Which processes were showing
> the highest CPU usage?
> 
> The real question is why the Varnish child was no longer responding to
> pings. Do you have any panic messages from Varnish in your syslog, or
> anything else indicating the error? If the load was that high *before*
> the processes died, your system might have been under so much stress
> that the child processes just couldn't answer pings in time. In which
> case your real problem might be something other than Varnish.
> 
>> > All this makes me believe we have an issue with some kernel parameters
>> > that do not allow varnish to handle as many objects as we configured it.
> 
> It could be that, it could be another process that was causing heavy
> load, it could be your VCL or your command line settings. Too many open
> questions here.
> 
> 
> Best,
> Geoff
> - -- 
> UPLEX Systemoptimierung
> Schwanenwik 24
> 22087 Hamburg
> http://uplex.de/
> Mob: +49-176-63690917
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.14 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iQIcBAEBCAAGBQJNsBgKAAoJEOUwvh9pJNURzS0QAJOvVWr3Yi4DsA2x0Ck+/HTa
> pkL69dRhUskq5Ll6Ny+e0DBB9I3Dx48ZT9ZxzRcvIZQn4shPl1GPdQQRHCB0ek82
> o8lLCdS/ta2HZhQI96FSUBj5RYDrPd3B78cAlvDLYzHsZIUbg90WmizHE/x9vPOi
> z5TOS/0S3Ao7JIuqkMpkWYyVs4AH6aKIX1L9er9jYLbHp5s8R2ilzs3USeLdC8Kl
> spGAaSn4mcCVHmhR+ZQ2XQjaf2nxN7oXEIviGOZOWfZ1XX1hQpDtjhp1D9BoInBW
> oNZmamt6Hd+m00LCu88YhTiBMRDD7zbom9C0NWLf6n7LaCIQteM/KEo1z9tPLAS6
> qmQzv+EvBKG5Dpcp81v5TqiUyVDzsYFegoKR6FKCCXvTlCI6avBlik1AlXRhecsF
> 27da7zMVvoDC44Wo+zqRkwMrtzpmE/Y55wdkP3YBUg/m4nzvci1VYTy3W436NfMe
> ypjWJ+bQEL9erSURNVDZLl6+I/J4cdcRxPEn96/7vaoDnq9HlvSI9SbAGWj4TDhA
> ksyvDB2VBGyfaVPnmPy/4CdjbDFXB5lzF2PezUhChrehKoJXeKXPNqegKV89VAo9
> EH298HuxKO+xZkVMfO9g0kHdFp6VGSCU8Y+ddU2/tMhxHGMCoXOC/sdcuCHl5HRW
> G6cSzXYum2Y1ootALk7U
> =OksF
> -----END PGP SIGNATURE-----
> 
> Jean-Francois Laurens
> Ingénieur Système Unix
> Resources et Développement
> Secteur Backend
> RTS - Radio Télévision Suisse
> Quai Ernest-Ansermet 20
> Case postale 234 
> CH - 1211 Genève 8
> T +41 (0)58 236 81 63
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20110421/e34efce8/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 28088 bytes
Desc: not available
URL: <http://www.varnish-cache.org/lists/pipermail/varnish-misc/attachments/20110421/e34efce8/attachment-0001.png>


More information about the varnish-misc mailing list