Bayron Guevara
Thu Mar 4 01:15:28 CET 2010


I'm running Varnish 2.0.5 on a server with the following specification:

   -  2 Quadcore Intel Xeon 2.00Ghz 64bits
   -  OS: RHEL 5 (64 bits)
   -  8MB RAM
   -  1GB Ethernet

I've set up my network infrastructure with a load balancer, a dedicated
Varnish server, and five web servers plus database servers. Requests follow
this path:
external client ---> Load Balancer (public VIP) ---> Varnish Proxy ---> Load Balancer (private VIP) ---> Web Servers

In this configuration, the load balancer is responsible for sending each
request to the appropriate server according to the domain. The Varnish
server is configured with the load balancer's private VIP as its only backend.

Now, let me explain the issue. Under low traffic, the websites are served
correctly, but sometimes a page comes up blank or only partially loaded. In
both cases a 200 OK response code is received along with the response body,
but the body is incomplete. I then checked the varnishstat and varnishlog
output and noticed two things: Varnish restarts frequently, and running
*varnishlog -i Debug -I* gives the following:
400 Debug        c "Write error, len = 34500/55022, errno = Success"
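
To make the log line above easier to read, here is a minimal Python sketch that pulls the numbers out of it. It assumes the two values after "len =" are bytes written and bytes expected, and that other Debug records follow the same layout as this single sample; the function name parse_write_error is only illustrative.

```python
import re

# Pattern for the Debug record quoted above; the field layout is assumed
# from that single sample line.
WRITE_ERROR = re.compile(r'Write error, len = (\d+)/(\d+), errno = (.+?)"?\s*$')


def parse_write_error(line):
    """Return (sent, total, errno_text), or None if the line does not match."""
    m = WRITE_ERROR.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)


sample = '400 Debug        c "Write error, len = 34500/55022, errno = Success"'
sent, total, errno_text = parse_write_error(sample)
# Share of the response that was written before the error was logged.
print(f"sent {sent} of {total} bytes ({100.0 * sent / total:.1f}%), errno = {errno_text}")
```

For the sample line this reports that about 62.7% of the response was written, which would be consistent with the partially loaded pages described above.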

I don't know exactly what it means, but a Google search gave me a clue: it
may be caused by an interruption during communication with the client. So
this error could point to the cause of the problem. I don't know what
triggers the error, but I suspect a network buffer overflow, so here are
some related OS values:

/proc/sys/net/ipv4/ip_local_port_range = 32768   61000
/proc/sys/net/core/rmem_max = 131071
/proc/sys/net/core/wmem_max = 131071
/proc/sys/net/ipv4/tcp_mem = 196608  262144  393216
/proc/sys/net/ipv4/tcp_wmem = 4096    16384   4194304
/proc/sys/net/ipv4/tcp_fin_timeout = 60
/proc/sys/net/core/netdev_max_backlog = 1000
/proc/sys/net/core/somaxconn = 128
/proc/sys/net/ipv4/tcp_syncookies = 1
/proc/sys/net/ipv4/tcp_max_orphans = 65536
/proc/sys/net/ipv4/tcp_max_syn_backlog = 1024
/proc/sys/net/ipv4/tcp_synack_retries = 5
/proc/sys/net/ipv4/tcp_syn_retries = 5
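
For anyone who wants to compare their own machine against the values above, a small Python sketch such as the following could collect them in one pass. The key list is copied from the list above; the helper name read_sysctls and its root argument are hypothetical, and the script assumes a Linux system that exposes these files under /proc/sys.

```python
from pathlib import Path

# Keys taken from the list above, relative to /proc/sys.
KEYS = [
    "net/ipv4/ip_local_port_range",
    "net/core/rmem_max",
    "net/core/wmem_max",
    "net/ipv4/tcp_mem",
    "net/ipv4/tcp_wmem",
    "net/ipv4/tcp_fin_timeout",
    "net/core/netdev_max_backlog",
    "net/core/somaxconn",
    "net/ipv4/tcp_syncookies",
    "net/ipv4/tcp_max_orphans",
    "net/ipv4/tcp_max_syn_backlog",
    "net/ipv4/tcp_synack_retries",
    "net/ipv4/tcp_syn_retries",
]


def read_sysctls(keys, root="/proc/sys"):
    """Return {key: value string}; missing or unreadable keys map to None."""
    values = {}
    for key in keys:
        try:
            # Collapse tabs and newlines so multi-field values print on one line.
            values[key] = " ".join((Path(root) / key).read_text().split())
        except OSError:
            values[key] = None
    return values


if __name__ == "__main__":
    for key, value in read_sysctls(KEYS).items():
        print(f"/proc/sys/{key} = {value}")
```

The root argument exists only so the function can be pointed at a copy of the tree; on the server itself the default should be enough.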

These same settings are discussed in this Varnish performance article:
My values seem very low, and that may be one of the causes. With average
traffic (around 500 concurrent users across all sites), the Varnish service
stops responding and the server load rises to 612. On the website side, a
Connection refused error (Code 503) is returned. On that occasion I wasn't
able to review the Varnish statistics.

Here are my Varnish parameters, in case they help:
200 2224
accept_fd_holdoff          50 [ms]
acceptor                   default (epoll, poll)
auto_restart               on [bool]
backend_http11             on [bool]
between_bytes_timeout      60.000000 [s]
cache_vbe_conns            off [bool]
cc_command                 "exec cc -fpic -shared -Wl,-x -o %o %s"
cli_buffer                 8192 [bytes]
cli_timeout                5 [seconds]
client_http11              off [bool]
clock_skew                 10 [s]
connect_timeout            0.400000 [s]
default_grace              10
default_ttl                180 [seconds]
diag_bitmap                0x0 [bitmap]
err_ttl                    0 [seconds]
esi_syntax                 0 [bitmap]
fetch_chunksize            128 [kilobytes]
first_byte_timeout         60.000000 [s]
group                      varnish (103)
listen_address             :80
listen_depth               1024 [connections]
log_hashstring             off [bool]
log_local_address          off [bool]
lru_interval               360 [seconds]
max_esi_includes           5 [includes]
max_restarts               4 [restarts]
obj_workspace              8192 [bytes]
overflow_max               100 [%]
ping_interval              3 [seconds]
pipe_timeout               60 [seconds]
prefer_ipv6                off [bool]
purge_dups                 on [bool]
purge_hash                 on [bool]
rush_exponent              3 [requests per request]
send_timeout               600 [seconds]
sess_timeout               5 [seconds]
sess_workspace             65536 [bytes]
session_linger             100 [ms]
session_max                100000 [sessions]
shm_reclen                 255 [bytes]
shm_workspace              8192 [bytes]
srcaddr_hash               1049 [buckets]
srcaddr_ttl                0 [seconds]
thread_pool_add_delay      2 [milliseconds]
thread_pool_add_threshold  2 [requests]
thread_pool_fail_delay     200 [milliseconds]
thread_pool_max            5000 [threads]
thread_pool_min            150 [threads]
thread_pool_purge_delay    1000 [milliseconds]
thread_pool_stack          unlimited [bytes]
thread_pool_timeout        120 [seconds]
thread_pools               8 [pools]
user                       varnish (100)
vcl_trace                  off [bool]
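
If it helps with comparing this dump against another server, the listing above can be turned into a dictionary with a short Python sketch like the one below. It assumes every parameter line has the layout "name, value, optional [unit]" seen above, and it treats the leading "200 2224" line as a status line to skip; parse_params is a hypothetical helper, not part of Varnish.

```python
import re

# Line layout assumed from the listing above: name, value, optional [unit].
PARAM_LINE = re.compile(r"^(\S+)\s+(.*?)(?:\s+\[([^\]]*)\])?\s*$")


def parse_params(text):
    """Return {name: (value, unit_or_None)} from parameter-listing text."""
    params = {}
    for line in text.splitlines():
        m = PARAM_LINE.match(line)
        # Skip blank lines and the numeric "200 2224" line at the top.
        if m is None or m.group(1).isdigit():
            continue
        params[m.group(1)] = (m.group(2), m.group(3))
    return params


# A few lines copied from the listing above.
sample = """200 2224
listen_address             :80
default_grace              10
thread_pool_max            5000 [threads]
thread_pool_min            150 [threads]
thread_pools               8 [pools]
"""

params = parse_params(sample)
for name in ("thread_pools", "thread_pool_min", "thread_pool_max"):
    value, unit = params[name]
    print(name, value, unit)
```

Values are kept as strings because several parameters (for example listen_address and cc_command) are not numeric.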

What are your suggestions?
Is this a Varnish or Operating System configuration problem?
