My random thoughts

Thu Feb 9 09:08:35 CET 2006

Here are my random thoughts on Varnish until now.  Some of it mirrors
what we talked about in the meeting, some if it is more detailed or
reaches further into speculation.

Poul-Henning

Notes on Varnish
----------------

Philosophy
----------

It is not enough to deliver a technically superior piece of software,
if it is not possible for people to deploy it usefully in a sensible
way and timely fashion.

Deployment scenarios
--------------------

There are two fundamental usage scenarios for Varnish: when the
first machine is brought up to offload a struggling backend and
when a subsequent machine is brought online to help handle the load.

The first (layer of) Varnish
----------------------------

Somebodys webserver is struggling and they decide to try Varnish.

Often this will be a skunkworks operation with some random PC
purloined from wherever it wasn't being used and the Varnish "HOWTO"
in one hand.

If they do it in an orderly fashion before things reach panic proportions,
a sensible model is to setup the Varnish box, test it out from your
own browser, see that it answers correctly.  Test it some more and
then add the IP# to the DNS records so that it takes 50% of the load
off the backend.

If it happens as firefighting at 3AM the backend will be moved to another
IP, the Varnish box given the main IP and things had better work real
well, really fast.

In both cases, it would be ideal if all that is necessary to tell
Varnish are two pieces of information:

	Storage location
		Alternatively we can offer an "auto" setting that makes
		Varnish discover what is available and use what it find.

	DNS or IP# of backend.

		IP# is useful when the DNS settings are not quite certain
		or when split DNS horizon setups are used.

Ideally this can be done on the commandline so that there is no
configuration file to edit to get going, just

	varnish -d /home/varnish -s backend.example.dom

and you're off running.

A text, curses or HTML based based facility to give some instant
feedback and stats is necessary.

If circumstances are not conductive to strucured approach, it should
be possible to repeat this process and set up N independent Varnish
boxes and get some sort of relief without having to read any further
documentation.

The subsequent (layers of) Varnish
----------------------------------

This is what happens once everybody has caught their breath,
and where we start to talk about Varnish clusters.

We can assume that at this point, the already installed Varnish
machines have been configured more precisely and that people
have studied Varnish configuration to some level of detail.

When Varnish machines are put in a cluster, the administrator should
be able to consider the cluster as a unit and not have to think and
interact with the individual nodes.

Some sort of central management node or facility must exist and
it would be preferable if this was not a physical but a logical
entity so that it can follow the admin to the beach.  Ideally it
would give basic functionality in any browser, even mobile phones.

The focus here is scaleability, we want to avoid per-machine
configuration if at all possible.  Ideally, preconfigured hardware
can be plugged into power and net, find an address with DHCP, contact
preconfigured management node, get a configuration and start working.

But we also need to think about how we avoid a site of Varnish
machines from acting like a stampeeding horde when the power or
connectivity is brought back after a disruption.  Some sort of
slow starting ("warm-up" ?) must be implemented to prevent them
from hitting all the backend with the full force.

An important aspect of cluster operations is giving a statistically
meaninful judgement of the cluster size, in particular answering
the question "would adding another machine help ?" precisely.

We should have a facility that allows the administrator to type
in a REGEXP/URL and have all the nodes answer with a checksum, age
and expiry timer for any documents they have which match.  The
results should be grouped by URL and checksum.

Technical concepts
------------------

We want the central Varnish process to be that, just one process, and
we want to keep it small and efficient at all cost.

Code that will not be used for the central functionality should not
be part of the central process.  For instance code to parse, validate
and interpret the (possibly) complex configuration file should be a
separate program.

Depending on the situation, the Varnish process can either invoke
this program via a pipe or receive the ready to use data structures
via a network connection.

Exported data from the Varnish process should be made as cheap as
possible, likely shared memory.  That will allow us to deploy separate
processes for log-grabbing, statistics monitoring and similar
"off-duty" tasks and let the central process get on with the
important job.

Backend interaction
-------------------

We need a way to tune the backend interaction further than what the
HTTP protocol offers out of the box.

We can assume that all documents we get from the backend has an
expiry timer, if not we will set a default timer (configurable of
course).

But we need further policy than that.  Amongst the questions we have
to ask are:

	How long time after the expiry can we serve a cached copy
	of this document while we have reason to belive the backend
	can supply us with an update ?

	How long time after the expiry can we serve a cached copy
	of this document if the backend does not reply or is
	unreachable.

	If we cannot serve this document out of cache and the backend
	cannot inform us, what do we serve instead (404 ?  A default
	document of some sort ?)

	Should we just not serve this page at all if we are in a
	bandwidth crush (DoS/stampede) situation ?

It may also make sense to have a "emergency detector" which triggers
when the backend is overloaded and offer a scaling factor for all
timeouts for when in such an emergency state.  Something like "If
the average response time of the backend rises above 10 seconds,
multiply all expiry timers by two".

It probably also makes sense to have a bandwidth/request traffic
shaper for backend traffic to prevent any one Varnish machine from
pummeling the backend in case of attacks or misconfigured
expiry headers.

Startup/consistency
-------------------

We need to decide what to do about the cache when the Varnish
process starts.  There may be a difference between it starting
first time after the machine booted and when it is subsequently
(re)started.

By far the easiest thing to do is to disregard the cache, that saves
a lot of code for locating and validating the contents, but this
carries a penalty in backend or cluster fetches whenever a node
comes up.  Lets call this the "transient cache model"

The alternative is to allow persistently cached contents to be used
according to configured criteria:

	Can expired contents be served if we can't contact the
	backend ?  (dangerous...)

	Can unexpired contents be served if we can't contact the
	backend ?  If so, how much past the expiry ?

It is a very good question how big a fraction of the persistent
cache would be usable after typical downtimes:

	After a Varnish process restart:  Nearly all.

	After a power-failure ?  Probably at least half, but probably
	not the half that contains the most busy pages.

And we need to take into consideration if validating the format and
contents of the cache might take more resources and time than getting
the content from the backend.

Off the top of my head, I would prefer the transient model any day
because of the simplicity and lack of potential consistency problems,
but if the load on the back end is intolerable this may not be
practically feasible.

The best way to decide is to carefully analyze a number of cold
starts and cache content replacement traces.

The choice we make does affect the storage management part of Varnish,
but I see that is being modular in any instance, so it may merely be
that some storage modules come up clean on any start while other
will come up with existing objects cached.

Clustering
----------

I'm somewhat torn on clustering for traffic purposes.  For admin
and management: Yes, certainly, but starting to pass objects from
one machine in a cluster to another is likely to be just be a waste
of time and code.

Today one can trivially fit 1TB into a 1U machine so the partitioning
argument for cache clusters doesn't sound particularly urgent to me.

If all machines in the cluster have sufficient cache capacity, the
other remaining argument is backend offloading, that would likely
be better mitigated by implementing a 1:10 style two-layer cluster
with the second level node possibly having twice the storage of
the front row nodes.

The coordination necessary for keeping track of, or discovering in
real-time, who has a given object can easily turn into a traffic
and cpu load nightmare.

And from a performance point of view, it only reduces quality:
First we send out a discovery multicast, then we wait some amount
of time to see if a response arrives only then should we start
to ask the backend for the object.  With a two-level cluster
we can ask the layer-two node right away and if it doesn't have
the object it can ask the back-end right away, no timeout is
involved in that.

Finally Consider the impact on a cluster of a "must get" object
like an IMG tag with a misspelled URL.  Every hit on the front page
results in one get of the wrong URL.  One machine in the cluster
ask everybody else in the cluster "do you have this URL" every
time somebody gets the frontpage.

If we implement a negative feedback protocol ("No I don't"), then
each hit on the wrong URL will result in N+1 packets (assuming multicast).

If we use a silent negative protocol the result is less severe for
the machine that got the request, but still everybody wakes up to
to find out that no, we didn't have that URL.

Negative caching can mitigate this to some extent.

Privacy
-------

Configuration data and instructions passed forth and back should
be encrypted and signed if so configured.  Using PGP keys is
a very tempting and simple solution which would pave the way for
administrators typing a short ascii encoded pgp signed message
into a SMS from their Bahamas beach vacation...

Implementation ideas
--------------------

The simplest storage method mmap(2)'s a disk or file and puts
objects into the virtual memory on page aligned boundaries,
using a small struct for metadata.  Data is not persistant
across reboots.  Object free is incredibly cheap.  Object
allocation should reuse recently freed space if at all possible.
"First free hole" is probably a good allocation strategy.
Sendfile can be used if filebacked.  If nothing else disks
can be used by making a 1-file filesystem on them.

More complex storage methods are object per file and object
in database models.  They are relatively trival and well
understood.  May offer persistence.

Read-Only storage methods may make sense for getting hold
of static emergency contents from CD-ROM etc.

Treat each disk arm as a separate storage unit and keep track of
service time (if possible) to decide storage scheduling.

Avoid regular expressions at runtime.  If config file contains
regexps, compile them into executable code and dlopen() it
into the Varnish process.  Use versioning and refcounts to
do memory management on such segments.

Avoid committing transmit buffer space until we have bandwidth
estimate for client.  One possible way:  Send HTTP header
and time ACKs getting back, then calculate transmit buffer size
and send object.  This makes DoS attacks more harmless and
mitigates traffic stampedes.

Kill all TCP connections after N seconds, nobody waits an hour
for a web-page to load.

Abuse mitigation interface to firewall/traffic shaping:  Allow
the central node to put an IP/Net into traffic shaping or take
it out of traffic shaping firewall rules.  Monitor/interface
process (not main Varnish process) calls script to config
firewalling.

"Warm-up" instructions can take a number of forms and we don't know
what is the most efficient or most usable.  Here are some ideas:

    Start at these URL's then...

	... follow all links down to N levels.

	... follow all links that match REGEXP no deeper than N levels down.

	... follow N random links no deeper than M levels down.

	... load N objects by following random links no deeper than
	    M levels down.

    But...

	... never follow any links that match REGEXP

	... never pick up objects larger than N bytes

	... never pick up objects older than T seconds

It makes a lot of sense to not actually implement this in the main
Varnish process, but rather supply a template perl or python script
that primes the cache by requesting the objects through Varnish.
(That would require us to listen separately on 127.0.0.1
so the perlscript can get in touch with Varnish while in warm-up.)

One interesting but quite likely overengineered option in the
cluster case is if the central monitor tracks a fraction of the
requests through the logs of the running machines in the cluster,
spots the hot objects and tell the warming up varnish what objects
to get and from where.

In the cluster configuration, it is probably best to run the cluster
interaction in a separate process rather than the main Varnish
process.  From Varnish to cluster info would go through the shared
memory, but we don't want to implement locking in the shmem so
some sort of back-channel (UNIX domain or UDP socket ?) is necessary.

If we have such an "supervisor" process, it could also be tasked
with restarting the varnish process if vitals signs fail:  A time
stamp in the shmem or kill -0 $pid.

It may even make sense to run the "supervisor" process in stand
alone mode as well, there it can offer a HTML based interface
to the Varnish process (via shmem).

For cluster use the user would probably just pass an extra argument
when he starts up Varnish:

	varnish -c $cluster_args $other_args
vs

	varnish $other_args

and a "varnish" shell script will Do The Right Thing.

Shared memory
-------------

The shared memory layout needs to be thought about somewhat.  On one
hand we want it to be stable enough to allow people to write programs
or scripts that inspect it, on the other hand doing it entirely in
ascii is both slow and prone to race conditions.

The various different data types in the shared memory can either be
put into one single segment(= 1 file) or into individual segments
(= multiple files).  I don't think the number of small data types to
be big enough to make the latter impractical.

Storing the "big overview" data in shmem in ASCII or HTML would
allow one to point cat(1) or a browser directly at the mmaped file
with no interpretation necessary, a big plus in my book.

Similarly, if we don't update them too often, statistics could be stored
in shared memory in perl/awk friendly ascii format.

But the logfile will have to be (one or more) FIFO logs, probably at least
three in fact:  Good requests, Bad requests, and exception messages.

If we decide to make logentries fixed length, we could make them ascii
so that a simple "sort -n /tmp/shmem.log" would put them in order after
a leading numeric timestamp, but it is probably better to provide a
utility to cat/tail-f the log and keep the log in a bytestring FIFO
format.  Overruns should be marked in the output.

*END*
-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.