Varnish crashing excessively

Wed Sep 10 10:47:35 CEST 2008

We've been having problems with Squid for a while, so we started trying
Varnish in our development environment.

Everything worked well there. After a lot of problems with Squid last
week, we decided to give the 2.0 beta a test drive in the production
environment. At first everything seemed to work great.

Our cache machine is a Dell PowerEdge 2950 (Xeon X5355 @ 2.66GHz) with
about 16GB of RAM and five 76GB SAS drives. We are running Ubuntu. The
server serves images from the backend storage servers. Image sizes vary
from about 10k to 200k. At peak hours we have about 4000 requests per
second.

I compiled Varnish with no options and used this config file:

backend images01 {
.host = "x.x.x.x";
.port = "80";
}
backend images02 {
.host = "x.x.x.x";
.port = "80";
}

backend lighty {
.host  = "x.x.x.x";
.port  = "80";
}

sub vcl_recv {
        set req.grace = 30s;
        if (req.request != "GET") {
                error 507 "Method not allowed";
        }
        if(req.url ~ "^/((imgs)|(stat)|(b))/") {
                set req.backend = lighty;
                lookup;
        } else if (req.url ~ "^/(([0-9]{1,2}/)|(avs))(.*)\.jpg$" ){
                if(req.url ~ 
"^/(([0-9]{1})|([1]{1}[0-9]{1})|([2]{1}[0-8]{1})|(avs))/") {
                        set req.backend = images01;
                } else if (req.url ~ 
"^/(([2]{1}[9]{1})|([3]{1}[0-9]{1})|([4]{1}[0-2]{1}))/") {
                        set req.backend = images02;
                } else {
                        error 508 "Storage not found";
                }
        } else {
                error 404 "Not Found";
        }
        if (req.http.host ~ "^xy.com$") {
                set req.http.host = "x.se";
                lookup;
        } else {
                set req.http.host = "x.se";
                if(req.http.Cookie ~ "viewer=ok") {
                        lookup;
                } else {
                       error 506 "Please visit x.se to view this image";
                }
        }
}
sub vcl_fetch {
    set req.grace = 30s;
    if (!obj.cacheable) {
        pass;
    }
    if (obj.http.Set-Cookie) {
        pass;  
    }
    set obj.prefetch =  -30s;
    deliver;
}
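
To sanity-check which backend a given URL ends up on, the backend
selection in vcl_recv can be mirrored outside Varnish. The following is
only a sketch: backend_for is a hypothetical helper (not part of
Varnish), it copies the regexes from the config verbatim into grep -E,
and it deliberately ignores the Host and Cookie checks that follow.

```shell
#!/bin/sh
# Hypothetical helper: mirrors only the URL-to-backend choice made in
# vcl_recv. The Host/Cookie checks from the config are not modelled.
matches() {
    # $1 = URL, $2 = POSIX extended regex copied from the VCL
    printf '%s\n' "$1" | grep -Eq "$2"
}

backend_for() {
    url=$1
    if matches "$url" '^/((imgs)|(stat)|(b))/'; then
        echo lighty
    elif matches "$url" '^/(([0-9]{1,2}/)|(avs))(.*)\.jpg$'; then
        if matches "$url" '^/(([0-9]{1})|([1]{1}[0-9]{1})|([2]{1}[0-8]{1})|(avs))/'; then
            echo images01
        elif matches "$url" '^/(([2]{1}[9]{1})|([3]{1}[0-9]{1})|([4]{1}[0-2]{1}))/'; then
            echo images02
        else
            echo "error 508"
        fi
    else
        echo "error 404"
    fi
}

# Example URLs are made up for illustration.
for u in /7/a.jpg /35/a.jpg /50/a.jpg /avs/me.jpg /imgs/logo.png /7/a.png; do
    printf '%s -> %s\n' "$u" "$(backend_for "$u")"
done
```

With these regexes, directories 0 through 28 and avs go to images01,
29 through 42 go to images02, and any other one- or two-digit directory
falls through to the 508 error.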

I started Varnish with the following options:

ulimit -n 500000
/usr/local/sbin/varnishd -a x.x.x.x:80 \
        -f /usr/local/etc/varnish/raptor.vcl \
        -T 127.0.0.1:2000 \
        -s file,/mnt/cache1/varnish_storage1.bin,80% \
        -s file,/mnt/cache2/varnish_storage2.bin,80% \
        -s file,/mnt/cache3/varnish_storage3.bin,80% \
        -s file,/mnt/cache4/varnish_storage4.bin,80% \
        -s file,/mnt/cache5/varnish_storage5.bin,80% \
        -p thread_pool_max=4000 \
        -p listen_depth=4096 \
        -p lru_interval=3600 \
        -h classic,800011 \
        -t 600

Then we ran into problems. Watching varnishstat, the server seemed to
stall every now and then (roughly every 30 seconds). We soon
established that it had to do with disk writes hogging all resources.
We've tried tweaking some of the /proc/sys/vm variables, but with no
luck so far.
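
For reference when reporting, the dirty-page writeback settings can be
dumped in one go. This is a sketch under an assumption: the post does
not say which VM variables were tweaked, so the four names below are
simply the standard Linux writeback knobs, and show_vm_dirty is a
hypothetical helper name.

```shell
#!/bin/sh
# Print the kernel's dirty-page writeback settings (Linux only).
# Which of these were actually changed on the cache machine is unknown;
# the list is an assumption.
show_vm_dirty() {
    for name in dirty_ratio dirty_background_ratio \
                dirty_expire_centisecs dirty_writeback_centisecs; do
        file=/proc/sys/vm/$name
        if [ -r "$file" ]; then
            printf '%s = %s\n' "$name" "$(cat "$file")"
        else
            printf '%s = (not readable)\n' "$name"
        fi
    done
}

show_vm_dirty
```

The usual default for dirty_expire_centisecs is 3000, i.e. 30 seconds,
which is at least suggestive given the stall interval, though that is a
guess and not a diagnosis.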

Problem #2 is that Varnish segfaults every now and then, sometimes many
times in a short period, but sometimes it runs for a couple of days
without problems. The only lead I have on this is
[105474.200474] varnishd[24170]: segfault at 00000000000004a0 rip 
000000000041ced0 rsp 00002aef65046af8 error 4
which I got from dmesg.
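
The kernel line carries a bit more information than it seems. The
"error" number is the x86 page-fault error code (bit 0: protection
violation vs. page not present, bit 1: write vs. read, bit 2: user vs.
kernel mode), and the rip value can be mapped to a function name with
addr2line. A minimal sketch, assuming varnishd was built as a normal
non-PIE executable and not stripped; decode_pf_error is a hypothetical
helper name:

```shell
#!/bin/sh
# Decode the low three bits of an x86 page-fault error code.
decode_pf_error() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then
        cause="protection violation"
    else
        cause="page not present"
    fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    printf '%s, %s, %s mode\n' "$cause" "$access" "$mode"
}

decode_pf_error 4

# Map the faulting instruction pointer to a function and source line.
# Skipped silently if addr2line or the binary is not available.
bin=/usr/local/sbin/varnishd
if command -v addr2line >/dev/null 2>&1 && [ -r "$bin" ]; then
    addr2line -f -e "$bin" 0x41ced0
fi
```

For the line above, decode_pf_error 4 prints "page not present, read,
user mode". Together with the fault address 0x4a0, that is consistent
with a read through a NULL pointer at a struct offset, but only a
backtrace or core dump would confirm it.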

We switched over to -s malloc,300G, formatted the disks as swap, and
added them all with the same priority.
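
Equal priority matters because the kernel allocates swap pages
round-robin across areas that share a priority, so the disks are used
in parallel. A quick way to confirm the setup is to count swap areas
per priority from /proc/swaps. This is a sketch; swap_priorities is a
hypothetical helper name, and it assumes the priority is the last
column, as in the standard /proc/swaps layout.

```shell
#!/bin/sh
# Count swap areas per priority. Prints one "count priority" line for
# each distinct priority; a single line means all areas share one.
swap_priorities() {
    # $1: file in /proc/swaps format (defaults to /proc/swaps itself)
    awk 'NR > 1 { print $NF }' "${1:-/proc/swaps}" | sort -n | uniq -c
}

if [ -r /proc/swaps ]; then
    swap_priorities
fi
```

If this prints more than one line, the swap areas are not all at the
same priority and the higher-priority ones will be filled first.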

It ran pretty well for a while, but then the segfaults began again, and
when it didn't segfault it used a lot of sys % CPU. The VM hangs were
gone, though.

I tried running with -d -d but didn't get any information about the
high sys %. The symptom was a pretty unresponsive cache server under
high load.

I am aware that this is probably not even close to all the information
you need, so I need your help to collect more data about my problems. I
would really like to replace Squid as soon as possible.