hash algorithms...

Poul-Henning Kamp phk at phk.freebsd.dk
Tue Apr 11 10:14:30 CEST 2006


In message <ujrpsjov9tn.fsf at cat.linpro.no>, Dag-Erling =?iso-8859-1?Q?Sm=F8rgrav?= wri
tes:
>Poul-Henning Kamp <phk at phk.freebsd.dk> writes:
>> Between the phone and my email, I have been playing with hash
>> algorithms today.
>
>I'd like to point out the difference between a hash function and a
>hash table...

Sorry for sloppy terminology here.

>hash table...  I agree that we need a hash function, but if we use
>md5, it's probably simpler and faster to keep the data in a red-black
>tree with the hash value as lookup key than in a hash table where we
>need to further hash the md5 digest and are guaranteed to get
>collisions.

The number of URLs you're managing will have a lot to do with which
algorithm is best.  If you have a million URLS, the search will be
quite deep and the cost of using a hign-entropy hash like MD5 would
quickly pay off, whereas if you have only 100 URLs it would never
pay off compared to a plain linked list.

I know red-black trees are all the rage these days, but I'm not
convinced they are the silver bullet for us.

If we use a high-entropy hash function like MD5, there is no need
to do red-black, a simple binary tree will be better.  Red-black
tress shine when your keyspace is lumpy and/or sparse, but since
md5 hashes are dense, a strict two-branch binary tree has less than
a quarter single-branch nodes when populated with more than a few
hundred MD5 outputs (I tested this up to 10M).

But mostly, a red-black tree does not offer us a way to distribute
locking over the tree unless we put a lock in every node, which on
the other hand would be way too expensive.

One very interesting and simple idea we may want to try out at some
point is a tree (of some kind) indexed by the URL read back to
front.  The idea is that the closer we get to the tail of the url,
the more information there will be.

For instance, if practically all your URLS look like:
	http://www.myplace.xxx/article.php/id=#######
there would be very little point in bothering with a hash in the
first place: what we need is right there at the end.

I fear that the object lookup may be a significantly interesting
performance hotspot, and I can see so many creative things one could
do, so I have decided to wrap the "URL lookup" facility in a named
API (struct hash_slinger :-), that way all you have to write is
three functions (maybe four if you want to report stats) to try
out a new idea.


-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.



More information about the varnish-dev mailing list