My random thoughts

Sun Feb 12 22:54:00 CET 2006

Good work guys. I had a great time reading the notes.

Here comes the sys.adm approach.

P.S. The sys.adm approach can easily be seen as an overengineered
solution, so don't take my approach as a must-have. More as a nice-to-have.

>Notes on Varnish
>----------------
>
>Philosophy
>----------
>
>It is not enough to deliver a technically superior piece of software,
>if it is not possible for people to deploy it usefully in a sensible
>way and timely fashion.
>[...]
>If circumstances are not conducive to a structured approach, it should
>be possible to repeat this process and set up N independent Varnish
>boxes and get some sort of relief without having to read any further
>documentation.

I think these are reasonable scenarios and solutions.

>
>The subsequent (layers of) Varnish
>----------------------------------
>
>[...]
>When Varnish machines are put in a cluster, the administrator should
>be able to consider the cluster as a unit and not have to think and
>interact with the individual nodes.

That would be great. IMHO far too little software acts like this.
There could be a good reason for that, but I wouldn't know.

>Some sort of central management node or facility must exist and
>it would be preferable if this was not a physical but a logical
>entity so that it can follow the admin to the beach.  Ideally it
>would give basic functionality in any browser, even mobile phones.

A web-browser interface and a CLI should cover 99% of use. An easy
protocol/API would make it possible for anybody to write their own
interface to the central management node.

>The focus here is scalability, we want to avoid per-machine
>configuration if at all possible.  Ideally, preconfigured hardware
>can be plugged into power and net, find an address with DHCP, contact
>preconfigured management node, get a configuration and start working.

This would ease many things. If one makes an image of some sort, one
does not have to change/make a new image for every config change (if that
happens more often than software updates).

>But we also need to think about how we avoid a site of Varnish
>machines from acting like a stampeding horde when the power or
>connectivity is brought back after a disruption.  Some sort of
>slow starting ("warm-up" ?) must be implemented to prevent them
>from hitting all the backend with the full force.

Yes. As you said in Oslo Poul, this could be a killer-app feature for some
sites.
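
A minimal sketch of one possible warm-up, assuming a simple linear ramp:
during the first minute after start a node limits its own backend fetch
rate to a growing fraction of the full rate. The function name, the 60
second window and the linear shape are all my assumptions, not something
from the notes.

```python
def warmup_limit(seconds_since_start, full_rate, warmup_s=60.0):
    """Allowed backend fetches per second while a node is warming up.

    Hypothetical policy: ramp linearly from 0 to full_rate over warmup_s.
    """
    fraction = min(1.0, max(0.0, seconds_since_start / warmup_s))
    return full_rate * fraction


for t in (0, 15, 30, 60, 120):
    print(t, warmup_limit(t, full_rate=100))
```

The ramp shape and length would of course be config choices.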

>An important aspect of cluster operations is giving a statistically
>meaningful judgement of the cluster size, in particular answering
>the question "would adding another machine help ?" precisely.

Is this possible? It would involve knowing how the backend is doing with
added load.
One thing is to measure how it's doing right now (response time), but
predicting added load is hard.
My guess is also that the only reason somebody would ask "would adding
another machine help ?" is that the CPU or bandwidth is exhausted on the
accelerator(s) in place, and one really needs to do something anyway. The
only other reason I can think of is response time from the accelerator, and
then we have the load prediction problem.

>We should have a facility that allows the administrator to type
>in a REGEXP/URL and have all the nodes answer with a checksum, age
>and expiry timer for any documents they have which match.  The
>results should be grouped by URL and checksum.

Not only the admin needs this. It's great when programmers/implementors
need to debug how "well" the new/old application caches.
In a world of rapid development, little or no time is often given to
making/checking the "cacheability" of the app.
A "check www.rapiddev.com/newapp/*" after a couple of clicks on the app
could save developers a huge amount of time, and reduce backend load
immensely.
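
A rough sketch of what such a "check" could return, assuming each node
answers with (node, URL, checksum, age, time to expiry) tuples. The report
format, the sample data and the check() function are made up for
illustration; only the "group by URL and checksum" part comes from the
notes.

```python
import re
from collections import defaultdict

# Hypothetical per-node answers: (node, url, checksum, age_s, expires_in_s).
REPORTS = [
    ("node1", "www.rapiddev.com/newapp/index", "a1f3", 12, 108),
    ("node2", "www.rapiddev.com/newapp/index", "a1f3", 40, 80),
    ("node3", "www.rapiddev.com/newapp/index", "9c2e", 300, 0),
    ("node1", "www.rapiddev.com/other", "77aa", 5, 55),
]


def check(pattern, reports):
    """Group all cache entries whose URL matches pattern by (URL, checksum)."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for node, url, checksum, age, expires_in in reports:
        if regex.search(url):
            groups[(url, checksum)].append((node, age, expires_in))
    return dict(groups)


result = check(r"^www\.rapiddev\.com/newapp/", REPORTS)
for (url, checksum), nodes in sorted(result.items()):
    print(url, checksum, nodes)
```

Two different checksums for the same URL, as for node3 here, is exactly
the kind of thing a developer would want to spot.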

>
>Technical concepts
>------------------
>
>We want the central Varnish process to be that, just one process, and
>we want to keep it small and efficient at all cost.

Yes. When you say 1 process, do you mean 1 process per CPU/core?

>Code that will not be used for the central functionality should not
>be part of the central process.  For instance code to parse, validate
>and interpret the (possibly) complex configuration file should be a
>separate program.

Let's list possible processes:

1. Varnish main.
2. Disk/storage process.
3. Config process/program.
4. Management process.
5. Logger/stats.

>Depending on the situation, the Varnish process can either invoke
>this program via a pipe or receive the ready to use data structures
>via a network connection.
>
>Exported data from the Varnish process should be made as cheap as
>possible, likely shared memory.  That will allow us to deploy separate
>processes for log-grabbing, statistics monitoring and similar
>"off-duty" tasks and let the central process get on with the
>important job.

Sounds great.

>
>Backend interaction
>-------------------
>
>We need a way to tune the backend interaction further than what the
>HTTP protocol offers out of the box.
>
>We can assume that all documents we get from the backend have an
>expiry timer, if not we will set a default timer (configurable of
>course).
>
>But we need further policy than that.  Amongst the questions we have
>to ask are:
>
>	How long time after the expiry can we serve a cached copy
>	of this document while we have reason to believe the backend
>	can supply us with an update ?
>
>	How long time after the expiry can we serve a cached copy
>	of this document if the backend does not reply or is
>	unreachable.
>
>	If we cannot serve this document out of cache and the backend
>	cannot inform us, what do we serve instead (404 ?  A default
>	document of some sort ?)
>
>	Should we just not serve this page at all if we are in a
>	bandwidth crush (DoS/stampede) situation ?

You are correct. Did you mean ask the user or did you mean questions to
answer in a specification?
I think the best approach is to ask the user, and let him answer in the
config. I can see as many answers to these questions (and more) as there
are websites :) Also a site might answer differently in different
scenarios.

>It may also make sense to have a "emergency detector" which triggers
>when the backend is overloaded and offer a scaling factor for all
>timeouts for when in such an emergency state.  Something like "If
>the average response time of the backend rises above 10 seconds,
>multiply all expiry timers by two".

Good idea. Once again I opt for a config choice on that one.
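
As a sketch of how small such a config choice could be, assuming the two
knobs are a response time threshold and a scaling factor. The names and
defaults below are mine, taken from Poul's "10 seconds, multiply by two"
example.

```python
def effective_ttl(ttl_s, avg_backend_response_s, threshold_s=10.0, factor=2.0):
    """Expiry timer to use, scaled while the backend is in "emergency" state.

    threshold_s and factor are hypothetical config values.
    """
    if avg_backend_response_s > threshold_s:
        return ttl_s * factor
    return ttl_s


print(effective_ttl(120, avg_backend_response_s=3.0))
print(effective_ttl(120, avg_backend_response_s=12.5))
```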

>It probably also makes sense to have a bandwidth/request traffic
>shaper for backend traffic to prevent any one Varnish machine from
>pummeling the backend in case of attacks or misconfigured
>expiry headers.

Good idea, but this one I am unsure about. The reason: it is one more thing
that can make the accelerator behave in a way you don't understand.
Say you are delivering stale documents from the accelerator. You start
"debugging". "Hmm, most of the requests are served from the backend in a
timely fashion..." You debug more and start examining the headers. I can see
myself going through loads of different stuff, and then: "Ahh, the traffic
shaper..."
As I said, I like the idea, but too many rules for backoffs will make the
sys.adm scratch his head even more.
Can we come up with a way for Varnish to tell the sys.adm: "Hey, you are
delivering stale documents here. Because ..."? Or is this overengineering?
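
To make the "Because ..." part concrete, here is a sketch of a backend
shaper that hands back a reason whenever it refuses a fetch. I am assuming
a plain token bucket; the class name, parameters and the reason string are
invented for illustration.

```python
class BackendShaper:
    """Token bucket: about `rate` backend fetches per second, bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return (allowed, reason); reason is None when the fetch may go ahead."""
        # Refill tokens for the time passed since the previous call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, None
        return False, "stale: backend traffic shaper limit reached"


shaper = BackendShaper(rate=1, burst=2)
for t in (0.0, 0.0, 0.0, 1.0):
    print(t, shaper.allow(t))
```

The reason string could then end up in a log line or a response header, so
the sys.adm finds the shaper first instead of last.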

>
>Startup/consistency
>-------------------
>
>We need to decide what to do about the cache when the Varnish
>process starts.  There may be a difference between it starting
>first time after the machine booted and when it is subsequently
>(re)started.
>
>By far the easiest thing to do is to disregard the cache, that saves
>a lot of code for locating and validating the contents, but this
>carries a penalty in backend or cluster fetches whenever a node
>comes up.  Lets call this the "transient cache model"

I agree with Dag here. Let's start with the "transient cache model" and add
more later.
We will discuss some scenarios at spec writing, and maybe come up with
some models for later implementation.
Better dig out those architecture notes, Dag :)

>The alternative is to allow persistently cached contents to be used
>according to configured criteria:
>[...]
>The choice we make does affect the storage management part of Varnish,
>but I see that is being modular in any instance, so it may merely be
>that some storage modules come up clean on any start while other
>will come up with existing objects cached.

Ironically, at VG the stuff that can be cached long (JPGs, GIFs etc.) can
be cached long, while the costly stuff is the documents that cost CPU
to make.
I would not be surprised if it's like that in many places.

>
>Clustering
>----------
>
>I'm somewhat torn on clustering for traffic purposes.  For admin
>and management: Yes, certainly, but starting to pass objects from
>one machine in a cluster to another is likely to just be a waste
>of time and code.
>
>Today one can trivially fit 1TB into a 1U machine so the partitioning
>argument for cache clusters doesn't sound particularly urgent to me.
>
>If all machines in the cluster have sufficient cache capacity, the
>other remaining argument is backend offloading, that would likely
>be better mitigated by implementing a 1:10 style two-layer cluster
>with the second level node possibly having twice the storage of
>the front row nodes.

I am also torn here.
A part of me says: hey, there is ICP v2 and such, let's use it, it's good
economy.
Another part is thinking that ICP works at its best when you have many
accelerators, and if Varnish can deliver what we hope, not many frontends
are needed for most sites in the world :) At that level, you can for sure
deliver the extra content ICP and such would save you from.
I know that in saying that I am sacrificing design because of
implementation, but there it is.

>The coordination necessary for keeping track of, or discovering in
>real-time, who has a given object can easily turn into a traffic
>and cpu load nightmare.
>
>And from a performance point of view, it only reduces quality:
>First we send out a discovery multicast, then we wait some amount
>of time to see if a response arrives only then should we start
>to ask the backend for the object.  With a two-level cluster
>we can ask the layer-two node right away and if it doesn't have
>the object it can ask the back-end right away, no timeout is
>involved in that.

A note. One of the reasons to be wary of two-level clusters, in my opinion,
is this: say you cache a document from the backend at the lowest level for
2 min. Then the level above comes and gets it 1 min. into those 2 min.,
looks up in its config and finds out this is a 2 min. cache document. The
document will then be 1 min. stale before a refresh. This could of course be
solved with Expires headers, but it makes sys.adms wary.
Dag also noted problems with this when we have a two-layer approach and
the first layer is in backoff-mode.
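
The arithmetic behind that 1 minute, as a small sketch. I am assuming the
lower level can report how old its copy is (HTTP/1.1 has an Age header for
this); the function and variable names are mine.

```python
def remaining_ttl(ttl_s, age_s):
    """Seconds a copy may still be served, given how old it already is."""
    return max(0, ttl_s - age_s)


TTL = 120          # both levels are configured for a 2 min. cache time
AGE_AT_FETCH = 60  # the upper level fetches 1 min. into those 2 min.

# Upper level restarts the clock: it serves the copy until t = 60 + 120.
naive_expiry = AGE_AT_FETCH + TTL
# Upper level honours the age: it serves the copy until t = 60 + (120 - 60).
age_aware_expiry = AGE_AT_FETCH + remaining_ttl(TTL, AGE_AT_FETCH)

stale_window = naive_expiry - age_aware_expiry
print(naive_expiry, age_aware_expiry, stale_window)
```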

>Finally Consider the impact on a cluster of a "must get" object
>like an IMG tag with a misspelled URL.  Every hit on the front page
>results in one get of the wrong URL.  One machine in the cluster
>asks everybody else in the cluster "do you have this URL" every
>time somebody gets the frontpage.
>[...]
>Negative caching can mitigate this to some extent.
>
>
>Privacy
>-------
>
>Configuration data and instructions passed forth and back should
>be encrypted and signed if so configured.  Using PGP keys is
>a very tempting and simple solution which would pave the way for
>administrators typing a short ascii encoded pgp signed message
>into a SMS from their Bahamas beach vacation...

Bahamas? Vacation? :)

>
>Implementation ideas
>--------------------
>
>The simplest storage method mmap(2)'s a disk or file and puts
>objects into the virtual memory on page aligned boundaries,
>using a small struct for metadata.  Data is not persistent
>across reboots.  Object free is incredibly cheap.  Object
>allocation should reuse recently freed space if at all possible.
>"First free hole" is probably a good allocation strategy.
>Sendfile can be used if filebacked.  If nothing else disks
>can be used by making a 1-file filesystem on them.
>
>More complex storage methods are object per file and object
>in database models.  They are relatively trivial and well
>understood.  May offer persistence.

Dag says:

>- quick and dirty squid-like hashed directories, to begin with
>
> - fancy block storage straight to disk (or to a large preallocated
>   file) like you suggested
>
> - memcached

As Poul later comments, squid is slow and dirty. Let's try to avoid it.
I am fine with fancy block storage, and I am tempted to suggest: Berkeley DB.
I have always pictured Varnish with a Berkeley DB backend. Why? I _think_
it is fast (only website info to go on here).

http://www.sleepycat.com/products/bdb.html

It's block storage, and wildcard purge could potentially be as easy as:
delete from table where URL like '%bye-bye%';
Another thing, which I am just going to base on my wildest fantasies: could
we use the Berkeley DB replication to bring a cache up-to-date after downtime?
Would be fun, wouldn't it? :)
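
If the store turns out to be a plain key/value store with no SQL layer on
top, the same wildcard purge becomes a scan over the keys. A sketch,
assuming URLs are the keys and using a Python dict as a stand-in for the
real store; purge() and the sample data are made up.

```python
import fnmatch


def purge(store, pattern):
    """Delete every key matching the glob pattern; return the removed keys."""
    doomed = [key for key in store if fnmatch.fnmatchcase(key, pattern)]
    for key in doomed:
        del store[key]
    return doomed


cache = {
    "/news/bye-bye-summer": b"...",
    "/sport/bye-bye.html": b"...",
    "/news/hello": b"...",
}
print(sorted(purge(cache, "*bye-bye*")))
print(sorted(cache))
```

Note that a full scan costs time proportional to the number of cached
objects, so on a big cache this is not free.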

I also like memcached, and I am excited to hear Poul suggest that we build
a "better" approach.
When I read that, I must admit that my first thought was that it would be
really nice if this were a daemon/shmem process that one can build a PHP (or
whatever) interface against. This is out of scope, but imagine you have
full access to the cache data in PHP, if only in RO mode. That means you
can build PHP apps with a superquick backend with loads of metadata. :)

>Read-Only storage methods may make sense for getting hold
>of static emergency contents from CD-ROM etc.

Nice feature.

>Treat each disk arm as a separate storage unit and keep track of
>service time (if possible) to decide storage scheduling.
>
>Avoid regular expressions at runtime.  If config file contains
>regexps, compile them into executable code and dlopen() it
>into the Varnish process.  Use versioning and refcounts to
>do memory management on such segments.

I smell a glob vs. compiled regexp showdown. Hehe.
My only contribution here would be: don't do it with Java regexps :)

>Avoid committing transmit buffer space until we have bandwidth
>estimate for client.  One possible way:  Send HTTP header
>and time ACKs getting back, then calculate transmit buffer size
>and send object.  This makes DoS attacks more harmless and
>mitigates traffic stampedes.

Yes. Are you thinking of writing a FreeBSD kernel module (accept_filter)
for this? Like accf_http.

>Kill all TCP connections after N seconds, nobody waits an hour
>for a web-page to load.
>
>Abuse mitigation interface to firewall/traffic shaping:  Allow
>the central node to put an IP/Net into traffic shaping or take
>it out of traffic shaping firewall rules.  Monitor/interface
>process (not main Varnish process) calls script to config
>firewalling.

This sounds like a really good feature. Hope it can be solved in Linux as
well. Not sure they have the fancy IPFW filters etc.

>"Warm-up" instructions can take a number of forms and we don't know
>what is the most efficient or most usable.  Here are some ideas:
>[...]
>
>One interesting but quite likely overengineered option in the
>cluster case is if the central monitor tracks a fraction of the
>requests through the logs of the running machines in the cluster,
>spots the hot objects and tell the warming up varnish what objects
>to get and from where.

>>This can easily be done with existing software like w3mir.
>>[...]
>>You can probably do this in ~50 lines of Perl using Net::HTTP.

>>>Sounds like you just won this bite :-)

Nice :) But I am not sure this is as "easy" as it sounds at first.

>In the cluster configuration, it is probably best to run the cluster
>interaction in a separate process rather than the main Varnish
>process.  From Varnish to cluster info would go through the shared
>memory, but we don't want to implement locking in the shmem so
>some sort of back-channel (UNIX domain or UDP socket ?) is necessary.
>
>If we have such a "supervisor" process, it could also be tasked
>with restarting the varnish process if vital signs fail:  A time
>stamp in the shmem or kill -0 $pid.

You got to like programs that keep themselves alive.
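
A sketch of the two vital sign checks mentioned, assuming a POSIX system.
os.kill(pid, 0) is the equivalent of "kill -0 $pid"; the heartbeat
function assumes the main process writes a timestamp somewhere the
supervisor can read, and its name and the 5 second limit are my own.

```python
import os


def is_alive(pid):
    """Like `kill -0 $pid`: signal 0 only checks that the process exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def heartbeat_fresh(stamp, now, max_age_s=5.0):
    """True if the timestamp left by the main process is recent enough."""
    return (now - stamp) <= max_age_s


print(is_alive(os.getpid()))
print(heartbeat_fresh(stamp=100.0, now=103.0))
print(heartbeat_fresh(stamp=100.0, now=110.0))
```

A supervisor would restart the main process when either check fails.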

>It may even make sense to run the "supervisor" process in stand
>alone mode as well, there it can offer a HTML based interface
>to the Varnish process (via shmem).
>
>For cluster use the user would probably just pass an extra argument
>when he starts up Varnish:
>
>	varnish -c $cluster_args $other_args
>vs
>
>	varnish $other_args
>
>and a "varnish" shell script will Do The Right Thing.

That's what we should aim at.

>Shared memory
>-------------
>
>The shared memory layout needs to be thought about somewhat.  On one
>hand we want it to be stable enough to allow people to write programs
>or scripts that inspect it, on the other hand doing it entirely in
>ascii is both slow and prone to race conditions.
>
>The various different data types in the shared memory can either be
>put into one single segment(= 1 file) or into individual segments
>(= multiple files).  I don't think the number of small data types to
>be big enough to make the latter impractical.
>
>Storing the "big overview" data in shmem in ASCII or HTML would
>allow one to point cat(1) or a browser directly at the mmaped file
>with no interpretation necessary, a big plus in my book.
>
>Similarly, if we don't update them too often, statistics could be stored
>in shared memory in perl/awk friendly ascii format.

That would be a big plus, with the stats either in HTML or at least in ASCII.

>But the logfile will have to be (one or more) FIFO logs, probably at least
>three in fact:  Good requests, Bad requests, and exception messages.

And a debug log. The squid model is not too bad there. Only poorly
documented.
In short it's a "binary configuration": 1=some part a, 4=some part b, ...,
128=some part i.
Debug=133=a, b and i.
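
A sketch of how such a debug value decodes, using only the three example
values above (133 = 128 + 4 + 1). The part names are placeholders, not
real squid debug sections.

```python
# Hypothetical bit values, following the example: 1=a, 4=b, 128=i.
DEBUG_PARTS = {1: "a", 4: "b", 128: "i"}


def decode(level):
    """Return the names of the parts switched on in a debug value."""
    return [name for bit, name in sorted(DEBUG_PARTS.items()) if level & bit]


print(decode(133))
print(decode(5))
```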

I mentioned at the meeting some URLs that would provide some relevant
reading:

http://www.web-cache.com/

is old but good. It lists all relevant protocols:

http://www.web-cache.com/Writings/protocols-standards.html

and other written things:

http://www.web-cache.com/writings.html

There is also the Hypertext Caching Protocol, an alternative and improvement
to ICP, which I referred to as WCCP at the last meeting.
Another RFC to take a look at might be: Web Cache Invalidation Protocol
(WCIP)
Here is what ESI.org has to say about WCIP: http://www.esi.org/tfaq.html#q8
And here is their approach: http://www.esi.org/invalidation_protocol_1-0.html

Sorry about all the text :)

P.S. I was not on the list when Poul wrote the first post, so I don't have
the ID either. My post will come as a separate one.

Anders Berg