how to...accelerate random access to millions of images?

Sascha Ottolski ottolski at web.de
Mon Mar 17 19:19:08 CET 2008


Michael,

Thanks a lot for taking the time to give me such a detailed answer.
Please see my replies below.


On Sunday, 16 March 2008 at 18:00:42, Michael S. Fischer wrote:
> On Fri, Mar 14, 2008 at 1:37 PM, Sascha Ottolski <ottolski at web.de>
> wrote:
> >  The challenge is to serve 20+ million image files, I guess with
> > up to 1500 req/sec at peak.
>
> A modern disk drive can service 100 random IOPS (@ 10ms/seek, that's
> reasonable).  Without any caching, you'd need 15 disks to service
> your peak load, with a bit over 10ms I/O latency (seek + read).
>
> > The files tend to be small, most of them in a
> >  range of 5-50 k. Currently the image store is about 400 GB in size
> > (and growing every day). The access pattern is very random, so it
> > will be very unlikely that any size of RAM will be big enough...
>
> Are you saying that the hit ratio is likely to be zero?  If so,
> consider whether you want to have caching turned on in the first place.
> There's little sense buying extra RAM if it's useless to you.

Well, so far I have analyzed one week of webserver logs, and this
indicates that there would indeed be at least some cache hits. We have
about 20 million images on our storage, and in one week about 3.5
million images were repeatedly requested. To be more precise:

272,517,167 requests made to a total of
  7,489,059 different URLs

  3,226,150 URLs were requested at least 10 times, accounting for
257,306,351 "repeated" requests

So, if my analysis was not too lousy, I guess there is a good chance
that a cache will help.
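
For what it's worth, the analysis was nothing fancy; roughly the
following sketch, assuming combined-format logs with the request URL
in the seventh field ("access.log" is of course a placeholder name):

    from collections import Counter

    # Count how often each URL shows up in one week of access logs.
    hits = Counter()
    with open("access.log") as log:   # placeholder filename
        for line in log:
            url = line.split()[6]     # request URL in combined log format
            hits[url] += 1

    repeated = {u: n for u, n in hits.items() if n >= 10}
    print(sum(hits.values()), "requests made to a total of")
    print(len(hits), "different URLs")
    print(len(repeated), "URLs requested at least 10 times, accounting for")
    print(sum(repeated.values()), '"repeated" requests')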

Roughly, the current 20 million images use 400 GB of storage, so those
3.5 million images may account for 17.5% of the 400 GB, i.e. about
70 GB. 70 GB of RAM is still a lot, but a mix of "enough" RAM and fast
disks may be the way to go, perhaps combined with content-based load
balancing to several caches (say, one for thumbnails, one for larger
images); see the sketch below.
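
The load balancing I have in mind would just inspect the URL; a
minimal sketch of the idea (the /thumbs/ prefix and the hostnames are
made up):

    # Give each image class its own cache, so the many small
    # thumbnails don't evict the larger images (names made up).
    THUMB_CACHE = "cache-thumbs.internal"
    LARGE_CACHE = "cache-large.internal"

    def pick_cache(url_path):
        if url_path.startswith("/thumbs/"):
            return THUMB_CACHE
        return LARGE_CACHE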

Currently, at peak times we only serve about 350 images/sec, due to
the bottleneck of the storage backend. So the target of 1500 req/sec
may be a bit of wishful thinking, as I don't know what the real peak
would look like without the bottleneck; it may very well be more like
500-1000 req/sec, but of course I'd like to leave room for growth :-)


Thanks a lot,

Sascha


>
> >  Now my question is: what kind of hardware would I need? Lots of
> > RAM seems to be obvious, whatever "a lot" may be... What about the
> > disk subsystem? Should I look into something like RAID-0 with many
> > disks to push the I/O performance?
>
> You didn't say what your failure tolerance requirements were.  Do you
> care if you lose data?   Do you care if you're unable to serve some
> requests while a machine is down?

Well, it's a cache, after all. The real image store is in place,
highly available, backed up and all the like. But the webservers can't
get the images off the storage fast enough. We just enabled Apache's
mod_cache, which seems to help a bit, but I suspect a dedicated tool
like Varnish could perform better (plus, mod_cache gives you no runtime
information about how efficient the cache is, whereas Varnish ships
with varnishstat, which shows hit and miss counters live).


>
> Consider dividing up your image store onto multiple machines.  Not
> only would you get better performance, but you would be able to
> survive hardware failures with fewer catastrophic effects (i.e., you'd
> lose only 1/n of service).
>
> If I were designing such a service, my choices would be:
>
> (1) 4 machines, each with 4-disk RAID 0 (fast, but dangerous)
> (2) 4 machines, each with 5-disk RAID 5 (safe, fast reads, but slow
> writes for your file size - also, RAID 5 should be battery backed,
> which adds cost)
> (3) 4 machines, each with 4-disk RAID 10 (will meet workload
> requirement, but won't handle peak load in degraded mode)
> (4) 5 machines, each with 4-disk RAID 10
> (5) 9 machines, each with 2-disk RAID 0
>
> Multiply each of these machine counts by 2 if you want to be
> resilient to failures other than disk failures.
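
Sanity-checking these options against my 1500 req/sec target with the
100-IOPS-per-disk figure from above (my own rough math; it assumes
every spindle in an array can serve reads, which holds for RAID 0 and
RAID 10 and roughly for RAID 5):

    # Aggregate random-read capacity per option, at ~100 IOPS/spindle.
    options = {
        "(1) 4 machines x 4-disk RAID 0":  4 * 4,
        "(2) 4 machines x 5-disk RAID 5":  4 * 5,
        "(3) 4 machines x 4-disk RAID 10": 4 * 4,
        "(4) 5 machines x 4-disk RAID 10": 5 * 4,
        "(5) 9 machines x 2-disk RAID 0":  9 * 2,
    }
    for name, spindles in options.items():
        print(name, "->", spindles * 100, "read IOPS")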
>
> You can then put a Varnish proxy layer in front of your image storage
> servers, and direct incoming requests to the appropriate backend
> server.
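
Understood; I suppose that mapping would live in the proxy layer's
configuration. As a Python sketch of the idea only (the hostnames are
invented), so that losing one machine costs just 1/n of the images:

    import zlib

    # Deterministically map each URL to one of n backends, so every
    # image has exactly one home (hostnames are placeholders).
    BACKENDS = ["img-be1.internal", "img-be2.internal",
                "img-be3.internal", "img-be4.internal"]

    def pick_backend(url_path):
        return BACKENDS[zlib.crc32(url_path.encode("utf-8")) % len(BACKENDS)]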
>
> --Michael




