Varnish 503ing on ~1/100 POSTs
Kristian Grønfeldt Sørensen
ksorensen at nordija.com
Tue Apr 5 20:42:42 CEST 2011
On Tue, 2011-04-05 at 10:09 +0100, Ronan Mullally wrote:
> Increasing the Keepalive time on apache on the backends from 1 to 5
> seconds made the biggest impact. I suspect this suggests that the
> problem occurs when Varnish tries to direct a POST to a connection
> which apache has just closed.
That indicates to me that the hack that was implemented to fix
http://www.varnish-cache.org/trac/ticket/749 is not doing what it was
supposed to do. The earlier varnishlog snippet from your original post
includes a restart, which I assume is the restart added by the fix for
#749 - unless you are doing a manual restart in your VCL. It seems that
the backend connection that you get when the restart is done is also
closed before Varnish sends the request.
I had a similar issue (on 2.1.3 which does not include the fix for
#749), and "solved" it by setting the keepalive-timeout of my backends
insanely high (2 days; the default was 20 seconds). This of course only
works well if nothing other than Varnish talks directly to your backend
server, since otherwise those clients could hog resources on your
backend for a longer time, making it easier for anyone to launch a
denial-of-service attack on your backend.
We saw the issue when we had two load-spikes in succession, with the
interval between them closely matching the keepalive-timeout. The first
spike would make Varnish create a lot of backend connections; the second
spike would use all the available connections until it got a connection
that had been idle for very close to its timeout value, which would then
be closed just as Varnish tried to use it. So if you have load-spikes at
regular intervals, you will want to adjust the keepalive settings on the
backend so that they differ from the interval between the spikes.
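
To make the timing concrete, here is a minimal Python sketch. It is a toy
model, not Varnish code: the function name, the millisecond units and the
race_window_ms parameter (how close to the timeout a reuse has to be
before it can collide with the backend's close) are all assumptions made
up for illustration.

```python
def risky_connections(release_times_ms, second_spike_at_ms,
                      keepalive_timeout_ms, race_window_ms=50):
    """Count idle connections that would be reused within race_window_ms
    of the backend's keepalive timeout (hypothetical model)."""
    risky = 0
    for released in release_times_ms:
        # How long the connection has sat idle when the second spike hits.
        idle_age = second_spike_at_ms - released
        # Reuse just before the timeout can race with the backend's close.
        if keepalive_timeout_ms - race_window_ms <= idle_age < keepalive_timeout_ms:
            risky += 1
    return risky


# First spike leaves three connections idle at t = 0, 0.5 and 1.0 seconds.
released = [0, 500, 1000]
# Backend keepalive of 20 s; second spike arrives about 21 s after the first.
print(risky_connections(released, 20990, 20000))
```

With these made-up numbers, the connection released at 1.0 s has been idle
for 19.99 s when the second spike arrives, which is inside the window.
Moving the second spike well away from the timeout (say to 30 s) leaves
no connection in the window.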
I think the best way to solve this would be a configurable
keepalive-timeout of Varnish's backend connections, enabling you to set
it slightly lower than the keepalive-timeout of your backend. This would
ensure that Varnish would always be the one closing the connection.
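
A rough Python sketch of that idea follows. It only illustrates the
wished-for behaviour and does not describe an existing Varnish setting:
the IdlePool class and client_idle_timeout_ms are hypothetical names, and
times are plain integers in milliseconds.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdlePool:
    """Toy pool of idle backend connections kept by a proxy."""
    client_idle_timeout_ms: int  # hypothetical proxy-side keepalive limit
    idle_since_ms: List[int] = field(default_factory=list)

    def release(self, now_ms: int) -> None:
        # Connection goes back to the pool; remember when it became idle.
        self.idle_since_ms.append(now_ms)

    def acquire(self, now_ms: int) -> Optional[int]:
        """Return the idle age of a reused connection, or None if the
        caller should open a fresh connection instead."""
        # The proxy closes anything idle for too long before the backend does.
        self.idle_since_ms = [
            t for t in self.idle_since_ms
            if now_ms - t < self.client_idle_timeout_ms
        ]
        if not self.idle_since_ms:
            return None
        return now_ms - self.idle_since_ms.pop()


# Backend keepalive assumed to be 20 s; proxy limit set slightly lower.
pool = IdlePool(client_idle_timeout_ms=19000)
pool.release(0)
print(pool.acquire(19990))  # None: the proxy already dropped the connection
```

As long as the proxy-side limit is lower than the backend's
keepalive-timeout, a connection that is about to be closed by the backend
is never handed out for reuse; the request gets a fresh connection
instead.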
This issue was actually discussed at VUG3, and I added a wishlist entry
on PostTwoShoppingList for the feature a couple of weeks ago.