My random thoughts

Dag-Erling Smørgrav des at linpro.no
Fri Feb 10 19:09:27 CET 2006


Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
> It is not enough to deliver a technically superior piece of software,
> if it is not possible for people to deploy it usefully in a sensible
> way and timely fashion.

I tend to favor usability over performance.  I believe you tend to
favor performance over usability.  Hopefully, our opposing tendencies
will combine and the result will be a perfect balance ;)

> In both cases, it would be ideal if all that is necessary to tell
> Varnish are two pieces of information:
>
> 	Storage location
> 		Alternatively we can offer an "auto" setting that makes
> 		Varnish discover what is available and use what it find.

I want Varnish to support multiple storage backends:

 - quick and dirty squid-like hashed directories, to begin with

 - fancy block storage straight to disk (or to a large preallocated
   file) like you suggested

 - memcached
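
To keep these interchangeable, the storage code would have to sit behind a
small internal interface.  A rough sketch of the kind of thing I mean, where
every name is hypothetical rather than actual Varnish API:

    /* Hypothetical pluggable storage interface; names are illustrative only. */
    #include <stddef.h>
    #include <sys/types.h>

    struct storage;                         /* opaque handle to one object body */

    struct storage_backend {
            const char      *name;          /* "hashdir", "block", "memcached", ... */

            /* set up from the command line argument, e.g. a path or host:port */
            int              (*init)(const char *spec);

            /* reserve space for an object body of (approximately) this size */
            struct storage  *(*alloc)(size_t size);

            /* store, retrieve and release an object body */
            ssize_t          (*write)(struct storage *st, const void *buf, size_t len);
            ssize_t          (*read)(struct storage *st, void *buf, size_t len, off_t off);
            void             (*release)(struct storage *st);
    };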

> Ideally this can be done on the commandline so that there is no
> configuration file to edit to get going, just
>
> 	varnish -d /home/varnish -s backend.example.dom

This would use hashed directories if /home/varnish is a directory, and
block storage if it's a file or device node.
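
A minimal sketch of how that detection could work, assuming a single storage
argument (not actual Varnish code):

    /* Sketch: choose a storage backend based on what the path points at. */
    #include <stdio.h>
    #include <sys/stat.h>

    const char *
    choose_backend(const char *path)
    {
            struct stat st;

            if (stat(path, &st) != 0) {
                    perror(path);
                    return (NULL);
            }
            if (S_ISDIR(st.st_mode))
                    return ("hashdir");     /* squid-like hashed directories */
            if (S_ISREG(st.st_mode) || S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))
                    return ("block");       /* direct block storage */
            return (NULL);                  /* unsupported file type */
    }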

> We need to decide what to do about the cache when the Varnish
> process starts.  There may be a difference between it starting
> first time after the machine booted and when it is subsequently
> (re)started.

This might vary depending on which storage backend is used.  With
memcached, for instance, Varnish may have restarted while memcached is
still running with a warm cache; and if memcached also restarted, it
will transparently obtain any cached objects from its peers.  The
disadvantage of memcached is that we can't sendfile() from it.

> By far the easiest thing to do is to disregard the cache, that saves
> a lot of code for locating and validating the contents, but this
> carries a penalty in backend or cluster fetches whenever a node
> comes up.  Lets call this the "transient cache model"

Another issue is that a persistent cache must store both data and
metadata on disk, rather than keeping only the data on disk and the
metadata in memory.  This complicates not only the logic but also the
storage format.
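
To make that concrete, this is roughly the kind of record header a
persistent format would need in front of each object body (all fields
hypothetical):

    /* Hypothetical on-disk record header for a persistent cache object. */
    #include <stdint.h>

    struct disk_object_hdr {
            uint32_t        magic;          /* detects torn or garbage records */
            uint32_t        http_hdr_len;   /* length of stored HTTP headers */
            uint64_t        body_len;       /* length of the object body */
            uint64_t        expires;        /* absolute expiry time */
            uint64_t        last_modified;  /* for revalidation against the backend */
            uint8_t         url_hash[16];   /* identifies the object */
    };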

> 	Can expired contents be served if we can't contact the
> 	backend ?  (dangerous...)

Dangerous, but highly desirable in certain circumstances.  I need to
locate the architecture notes I wrote last fall and place them online;
I spent quite some time thinking about and describing how this
could/should be done.

> It is a very good question how big a fraction of the persistent
> cache would be usable after typical downtimes:
>
> 	After a Varnish process restart:  Nearly all.
>
> 	After a power-failure ?  Probably at least half, but probably
> 	not the half that contains the most busy pages.

When using direct-to-disk storage, we can (fairly) easily design the
storage format in such a way that updates are atomic, and make liberal
use of fsync() or similar to ensure (to the extent possible) that the
cache is in a consistent state after a power failure.
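
For example, metadata updates can be written to a temporary file,
fsync()ed and then rename()d into place; rename() is atomic on POSIX
file systems, so a power failure leaves either the old or the new
version, never a torn one.  A sketch (not actual Varnish code):

    /* Sketch: atomically replace a metadata file on disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    atomic_replace(const char *path, const char *tmppath,
        const void *buf, size_t len)
    {
            int fd;

            fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
            if (fd < 0)
                    return (-1);
            if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                    close(fd);
                    unlink(tmppath);
                    return (-1);
            }
            if (close(fd) != 0 || rename(tmppath, path) != 0) {
                    unlink(tmppath);
                    return (-1);
            }
            return (0);
    }

The same pattern works for any on-disk index we keep: the visible state
flips in one step, so after a crash we see either the old or the new
index, never half of each.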

> Off the top of my head, I would prefer the transient model any day
> because of the simplicity and lack of potential consistency problems,
> but if the load on the back end is intolerable this may not be
> practically feasible.

How about this: we start with the transient model, and add persistence
later.

> If all machines in the cluster have sufficient cache capacity, the
> other remaining argument is backend offloading, that would likely
> be better mitigated by implementing a 1:10 style two-layer cluster
> with the second level node possibly having twice the storage of
> the front row nodes.

Multiple cache layers may give rise to undesirable and possibly
unpredictable interactions (compare tunneling TCP over TCP, where both
TCP layers fight each other's congestion control).

> Finally Consider the impact on a cluster of a "must get" object
> like an IMG tag with a misspelled URL.  Every hit on the front page
> results in one get of the wrong URL.  One machine in the cluster
> ask everybody else in the cluster "do you have this URL" every
> time somebody gets the frontpage.

Not if we implement negative caching, which we have to do anyway -
otherwise all those requests go to the backend, which gets bogged down
sending out 404s.
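
A negative cache entry needs very little state; something along these
lines would do (hypothetical, just to illustrate the idea):

    /* Hypothetical negative cache entry: remember that a URL recently
     * produced an error, so we ask neither the cluster nor the backend
     * again until the entry expires. */
    #include <time.h>

    struct negcache_entry {
            unsigned char   url_hash[16];   /* identifies the URL */
            time_t          expires;        /* when to forget the negative result */
            unsigned short  status;         /* 404, 410, ... */
    };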

> If we implement a negative feedback protocol ("No I don't"), then
> each hit on the wrong URL will result in N+1 packets (assuming
> multicast).

Or we can just ignore queries for documents we don't have; the
requesting node will simply request the document from the backend if no
reply arrives within a short timeout (~1s).
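
The timeout itself is cheap to implement, e.g. with poll(2) on the
socket the cluster replies arrive on (a sketch, assuming UDP and a
hypothetical socket already set up elsewhere):

    /* Sketch: wait up to timeout_ms for a positive reply from a peer;
     * on timeout, fall through to fetching from the backend. */
    #include <poll.h>

    int                                     /* 1 = reply waiting, 0 = timeout, -1 = error */
    peer_replied(int udp_sock, int timeout_ms)
    {
            struct pollfd pfd;

            pfd.fd = udp_sock;
            pfd.events = POLLIN;
            switch (poll(&pfd, 1, timeout_ms)) {
            case -1:
                    return (-1);
            case 0:
                    return (0);             /* nobody answered within ~1s */
            default:
                    return (1);             /* read and check the reply */
            }
    }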

> Configuration data and instructions passed forth and back should
> be encrypted and signed if so configured.  Using PGP keys is
> a very tempting and simple solution which would pave the way for
> administrators typing a short ascii encoded pgp signed message
> into a SMS from their Bahamas beach vacation...

Unfortunately, PGP is very slow, so it should only be used to
communicate with some kind of configuration server, not with the cache
itself.

> The simplest storage method mmap(2)'s a disk or file and puts
> objects into the virtual memory on page aligned boundaries,
> using a small struct for metadata.  Data is not persistant
> across reboots.  Object free is incredibly cheap.  Object
> allocation should reuse recently freed space if at all possible.
> "First free hole" is probably a good allocation strategy.
> Sendfile can be used if filebacked.  If nothing else disks
> can be used by making a 1-file filesystem on them.

Hmm, I believe you can sendfile() /dev/zero if you use that trick to
get a private mmap()ed arena.
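
For reference, a deliberately naive sketch of the arena you describe:
mmap() the file, keep a page map in memory, and allocate page-aligned
objects first-fit (error handling and object metadata omitted; not
actual Varnish code):

    /* Naive sketch: file-backed arena, page-aligned objects, first-fit. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct arena {
            unsigned char   *base;          /* start of the mmap()ed region */
            unsigned char   *map;           /* one byte per page: 0 free, 1 used */
            size_t           pages;
            size_t           pagesize;
    };

    struct arena *
    arena_init(const char *path, size_t size)
    {
            struct arena *a = calloc(1, sizeof *a);
            int fd = open(path, O_RDWR);    /* error handling omitted */

            a->pagesize = (size_t)sysconf(_SC_PAGESIZE);
            a->pages = size / a->pagesize;
            a->map = calloc(a->pages, 1);
            a->base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            return (a);
    }

    void *
    arena_alloc(struct arena *a, size_t len)        /* "first free hole" */
    {
            size_t need, i, j;

            need = (len + a->pagesize - 1) / a->pagesize;
            for (i = 0; i + need <= a->pages; i++) {
                    for (j = 0; j < need && a->map[i + j] == 0; j++)
                            ;
                    if (j == need) {
                            memset(a->map + i, 1, need);
                            return (a->base + i * a->pagesize);
                    }
                    i += j;                 /* skip past the used page we hit */
            }
            return (NULL);                  /* no hole large enough */
    }

    void
    arena_free(struct arena *a, void *p, size_t len)
    {
            size_t first = ((unsigned char *)p - a->base) / a->pagesize;

            memset(a->map + first, 0, (len + a->pagesize - 1) / a->pagesize);
    }

Free really is just clearing bytes in the page map, and sendfile() works
whenever the arena is backed by a regular file.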

> Avoid regular expressions at runtime.  If config file contains
> regexps, compile them into executable code and dlopen() it
> into the Varnish process.  Use versioning and refcounts to
> do memory management on such segments.

Unlike regexps, globs can be evaluated very efficiently.
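
For example, POSIX fnmatch(3) evaluates a glob in a single pass over the
string, with no separate compile step (trivial illustration; the pattern
and URL are made up):

    /* Trivial illustration: glob matching with fnmatch(3). */
    #include <fnmatch.h>
    #include <stdio.h>

    int
    main(void)
    {
            const char *pattern = "/img/*.png";     /* made-up config rule */
            const char *url = "/img/logo.png";

            if (fnmatch(pattern, url, 0) == 0)
                    printf("%s matches %s\n", url, pattern);
            return (0);
    }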

> It makes a lot of sense to not actually implement this in the main
> Varnish process, but rather supply a template perl or python script
> that primes the cache by requesting the objects through Varnish.
> (That would require us to listen separately on 127.0.0.1
> so the perlscript can get in touch with Varnish while in warm-up.)

This can easily be done with existing software like w3mir.

> One interesting but quite likely overengineered option in the
> cluster case is if the central monitor tracks a fraction of the
> requests through the logs of the running machines in the cluster,
> spots the hot objects and tell the warming up varnish what objects
> to get and from where.

You can probably do this in ~50 lines of Perl using Net::HTTP.

> In the cluster configuration, it is probably best to run the cluster
> interaction in a separate process rather than the main Varnish
> process.  From Varnish to cluster info would go through the shared
> memory, but we don't want to implement locking in the shmem so
> some sort of back-channel (UNIX domain or UDP socket ?) is necessary.

Distributed lock managers are *hard*...  but we don't need locking for
simple stuff like reading logs out of shmem.
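
To illustrate that point: with a single writer and read-only consumers,
the log can be a ring buffer in shmem where the writer just publishes a
running byte counter and readers chase it, detecting overruns instead of
locking.  A sketch under those assumptions (not the actual layout we
would use, and real code would also need memory barriers):

    /* Sketch: single-writer log ring in shared memory, lock-free readers. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>

    #define LOG_SIZE (1 << 20)

    struct shmlog {
            volatile uint64_t head;         /* total bytes ever written */
            unsigned char     ring[LOG_SIZE];
    };

    /* Copy up to len new bytes into buf; returns bytes copied, or -1 if
     * the reader was lapped and must resynchronize from log->head. */
    ssize_t
    log_read(const struct shmlog *log, uint64_t *tail, void *buf, size_t len)
    {
            uint64_t head = log->head;
            size_t n, off, first;

            if (head - *tail > LOG_SIZE)
                    return (-1);            /* lapped before we even started */
            n = (size_t)(head - *tail);
            if (n > len)
                    n = len;
            off = (size_t)(*tail % LOG_SIZE);
            first = LOG_SIZE - off < n ? LOG_SIZE - off : n;
            memcpy(buf, log->ring + off, first);
            memcpy((unsigned char *)buf + first, log->ring, n - first);
            if (log->head - *tail > LOG_SIZE)
                    return (-1);            /* writer overwrote what we copied */
            *tail += n;
            return ((ssize_t)n);
    }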

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no


