Expiring a million items
phk at phk.freebsd.dk
Tue Jun 20 23:41:16 CEST 2006
Warning: algorithm-wanking in progress :-)
Imagine a varnish running and having a million objects in the cache.
Each of those objects have a TTL and consequently an expiry time,
but before that time, we want the prefetcher to get a shot at the object.
Our only requirement to the data structure apart from speed is that
we can pick the lowest expiry time out fast. What order the remaining
elements are in, or how they are organized is not really relevant.
A sorted list
The simple solution of a sorted list will not work, as every insert
of a long TTL object will have to traverse the majority of the list
to find the right spot.
A callout wheel is a circle of A sorted linked lists. An object
which expires at time B is inserted into the ordered list number
(B mod A).
The prefetcher will run around the circle like a hamster in a threadmill,
taking one bucket per second and looking for all those items which are
about to expire.
Basically, this reduces the work by around a factor A.
But A is the trouble. If we have 100 objects, A=10 is a good choice.
If we have a million objects, A should probably be 10000 or maybe
Changing A on the fly is not fun and takes extra CPU work if it has
to be perfect. If we allow some objects to miss the prefetcher
opportunity when A changes, it is less troublesome.
A Tree (of some sort)
The main problem with using a tree structure is locking.
To avoid cross-object locks, it needs to be a tree where branch
nodes are distinct from objects. Or said another way: the tree
holds pointer(s) to the object, the object holds pointer(s) to the
tree but not to any other object.
Most tree types can be implemented this way, but seldom are.
One exception to this rule is the binary heap: It is often implemented
as an array of pointers to leaf elements where element [i] is the
parent of elements [2i+1] and [2i+2].
The binary heap has the extra bonus that the root is always the
extreeme key according to the chosen sort order, this matches
our requirement perfectly.
We already have object store for large sequential chunks of data
in Varnish. If we need to extend the array-object, we will incur
a memcpy hit in order to move to a larger object, but I think this
will be lost in the noise performance wise. Alternatives are
to use an array of equalsized objects or a linked list of objects
to implement the array.
After studying the search-trees for an evening, I have not found
any that beat a minimum binary heap for our needs: We have no need
to merge trees, so the advanced B trees families are not relevant.
A plain B-tree would keep getting skewed as time passes, that means
a lot of one-side rebalancing effort. I doubt the caching of the
least recently used node will buy us anything at all so splay trees
are not interesting and the complexity of the red-black tree also
seems surplus to requirements.
I don't think I know of any binary heap implemenations we can
adopt so I will probably be writing my own.
Unless there are better suggestions ?
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
More information about the varnish-dev