Controlling memory usage

Mon Jul 17 21:02:20 CEST 2006

In message <B06E690C-CA93-4547-AD2A-61246521142D at vgnett.no>, Anders Berg writes:

>The reason Anders N. asks about this is how Squid works today. The
>squid.conf file leaves you with an option to specify how much RAM you
>wanna use for Squid.

And this is right where the trouble starts.

Squid is written for a machine model that has not existed since
1980 when 3BSD was released.

In that model, a process has some amount of "memory" and either all
of that "memory" is present in RAM or none of it is.  When RAM grew
short, an entire process was swapped out (hence the name: "swap out
one process for another")

In that environment, it makes great sense to tell Squid how much
RAM it can use, because there is some magic size where the best
performance compromise for the entire machine is reached.

We spent a lot of time tuning stuff like that in the 1980s: we told
sort(1) how many records to sort in memory, and to switch to merging
temporary files if it found more, etc.

Virtual memory, on the other hand, means that the kernel "fakes"
things such that the process has access to the entire address space
(i.e. 2^32 bytes or 2^64 bytes).  It does this by tracking which
pages are used, which are modified and all that stuff.

In a VM system, what you think of as "RAM" is not RAM in the hardware
sense.  You may in fact have all of it accessible in hardware RAM,
but if the system is short of memory, you won't: some of it will
be "paged out to disk" or, because we sloppily adopted the old
terminology, "swapped out".

The real trouble starts when Squid decides that an object in its
"RAM" should be purged to disk.  Quite likely, the operating system
already found that out earlier so the "RAM" is already on disk,
somewhere in the paging- (or swap-)partition.

So what happens is that first we do a disk read to pull in the RAM,
then we write it to disk some other place.

Twice as much I/O for no gain.  The same pattern happens all over
Squid, and that is responsible for the observed "once squid starts
paging, it goes straight south".

It doesn't help in this context that Squid stores headers and body
in the same place.  That means that if the "RAM" of some object has
been paged out, we have to page it in to see the headers, even for
a conditional request which ends up not transferring the object's
body.
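
To illustrate the point about headers and bodies, here is a minimal
sketch of keeping the small, frequently consulted header data in its
own structure, so a conditional request can be answered without
touching the pages that hold the body.  The names struct obj_head and
not_modified() are hypothetical and are not taken from Squid or
Varnish; a real cache would compare more than one timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical header record: small, and stored apart from the body. */
struct obj_head {
	time_t		last_modified;	/* consulted on every conditional request */
	size_t		body_len;
	const char	*body;		/* body lives elsewhere, on its own pages */
};

/*
 * Answer an If-Modified-Since style check using only the header record.
 * The body pointer is never dereferenced, so the body's pages need not
 * be resident.  Returns 1 if the client's copy is still current.
 */
static int
not_modified(const struct obj_head *h, time_t if_modified_since)
{
	return (h->last_modified <= if_modified_since);
}
```

With this split, a "not modified" answer touches only the page holding
the header record, wherever the body happens to be at the time.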

>Your answer was detailed Poul-Henning, but
>what will prevent this from happening in Varnish? Let's say you have
>2 applications running on a Varnish box, and both use the memory
>model Varnish uses, what will happen in the long run with a lot of
>traffic?

Every program running in a VM system has a function which describes
how fast it reaches its goal, for a given number of pages of
hardware memory it has access to.

Unfortunately the function also has other variables: the input to
the program, the timing of its interaction with the world (how long
it must wait for disk I/O etc.) and the state of all sorts of kernel
caches come into play.

There is no way to predict the function realistically.  You can
measure it under some set of circumstances and get an idea of how
it looks.

The only trick there really is to writing a VM kernel is being
good at estimating this function on the fly.

If two processes run at the same time, and they both need more
hw-RAM pages than the system has, the kernel will be flipping some
pages between them.

When a program accesses a page which is not "resident", the kernel
will hunt around for a page that doesn't look used (ideally: doesn't
look like it _will be_ used (soon)), write that page to disk and
reassign the page to the faulting process, possibly after filling
it from disk first.

In the meantime the process (or at least: thread) cannot do anything.

If you're just one page short, there is undoubtedly some page in
the process which is seldom used: the first bit of the program
which is only used during startup, some table of error messages
that are only accessed when there is an error, etc.

As memory pressure increases, more and more such pages will be paged
out.  At some point, we get to pages which are in fact used every
so often, and then it starts hurting performance.

The thing to remember in writing programs for virtual memory systems
is therefore not to be careful about how much memory you allocate,
but instead be careful about how much of it you use.
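
The difference between allocating and using can be sketched as
follows, assuming a POSIX system with demand paging and anonymous
mmap(2).  The helper names reserve_region() and touch_pages() are
made up for this illustration.  Reserving the region only sets up
address space; hardware pages are needed only for the pages that are
actually written.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS with strict -std= on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON	/* older BSD spelling */
#endif

/*
 * Reserve len bytes of zero-filled address space.
 * Returns NULL on failure.
 */
static char *
reserve_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return (p == MAP_FAILED ? NULL : (char *)p);
}

/*
 * Write one byte into each of the first npages pages of the region.
 * Only these pages become "used"; the rest of the region stays
 * untouched.  Returns the number of pages written.
 */
static size_t
touch_pages(char *base, size_t npages, size_t pagesize)
{
	size_t i;

	for (i = 0; i < npages; i++)
		base[i * pagesize] = 1;
	return (i);
}
```

Reserving 64 MB and touching three pages this way leaves the process
using roughly three pages of the region, not 64 MB, although the
exact accounting depends on the kernel.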

Something as simple as variable order in the source code can affect
this:

	int	busy_integer_variable;
	char	seldomly_used_error_string[5000];
	double	often_used_number;

With a pagesize of 4096, this will take up two pages, both of which
will be busy.  Flipping the order:

	int	busy_integer_variable;
	double	often_used_number;
	char	seldomly_used_error_string[5000];

means that one page will seldom be used, and the other will be
used all the time.  (The example also improves CPU-cache hit rates,
but forget that for now).
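
The page arithmetic can be checked with offsetof().  This sketch
assumes the variables are members of a struct that starts on a page
boundary (a compiler is free to place standalone globals in any
order, whereas struct members keep their declaration order) and
assumes a page size of 4096.  The struct names and page_of() are
invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u	/* assumption; matches the text above */

/* First ordering: the rarely used string sits between the busy members. */
struct interleaved {
	int	busy_integer_variable;
	char	seldomly_used_error_string[5000];
	double	often_used_number;
};

/* Second ordering: busy members first, rarely used string last. */
struct grouped {
	int	busy_integer_variable;
	double	often_used_number;
	char	seldomly_used_error_string[5000];
};

/* Which page (counting from the start of the struct) holds this offset? */
static size_t
page_of(size_t offset)
{
	return (offset / ASSUMED_PAGE_SIZE);
}
```

In the first layout page_of(offsetof(struct interleaved,
often_used_number)) is 1 while the integer is on page 0, so both pages
are busy.  In the second layout both busy members are on page 0, and
the second page holds nothing but the tail of the error string.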

What Varnish does is to rely on the kernel to do this work.  Instead
of trying to control how much memory we use and partition our data
into the fast stuff which should be in RAM and the slow stuff which
we can put on the disk, we simply operate on one data set, but make
sure to arrange our data such that the kernel can easily deport
data which we don't use, without us needing to get involved.

Therefore all object storage in Varnish is allocated on a page-aligned
boundary.  That means that entire objects can be paged out, without
affecting the neighboring objects.  Yes, this may waste up to 4095
bytes per object for padding, but you'd be surprised what you save
in performance.
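
A minimal sketch of page-aligned allocation, assuming a POSIX system
with posix_memalign(3).  This is not Varnish's actual allocator; the
names round_up_to_page() and alloc_object() are hypothetical.  The
size is rounded up to a whole number of pages so that no two objects
ever share a page.

```c
#define _DEFAULT_SOURCE		/* for posix_memalign with strict -std= on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round len up to the next multiple of pagesize. */
static size_t
round_up_to_page(size_t len, size_t pagesize)
{
	return ((len + pagesize - 1) / pagesize * pagesize);
}

/*
 * Allocate storage for one object, starting on a page boundary and
 * padded to a whole number of pages.  Returns NULL on failure.
 * Release with free().
 */
static void *
alloc_object(size_t len)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	void *p = NULL;

	if (posix_memalign(&p, pagesize, round_up_to_page(len, pagesize)) != 0)
		return (NULL);
	return (p);
}
```

With 4096-byte pages, an object of 4097 bytes is padded to 8192 bytes,
which is the worst case of 4095 wasted bytes.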

>http://www.tns-gallup.no/index.asp? 
>type=tabelno_url&did=185235&sort=uv&sort_ret=desc&UgeSelect=&path_by_id= 
>/12000/12003/12077/12266&aid=12266
>
>will show what I mean. They have "few" users and sessions, but loads  
>of pageviews, also at any given time many thousand "wares" are "hot"  
>for the user, not a few popular articles. This does something to mem  
>usage and I/O.

You're thinking about memory in old-fashioned terms here :-)

Try this:  Imagine the disk is the real memory and the RAM is
only a cache.
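
One way to make that picture concrete is to map a file into memory,
sketched here under the assumption of a POSIX system with mmap(2).
The function name map_file() is hypothetical, and nothing above says
this is how Varnish itself lays out its storage.  The file on disk is
the "real memory"; the kernel keeps whichever pages are in use in RAM
and writes modified pages back to the file.

```c
#define _DEFAULT_SOURCE		/* for pread/mkstemp with strict -std= on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map len bytes of the file at path as shared, writable memory,
 * creating and extending the file as needed.
 * Returns NULL on failure.  Release with munmap().
 */
static char *
map_file(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return (NULL);
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return (NULL);
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close */
	return (p == MAP_FAILED ? NULL : (char *)p);
}
```

Stores into the returned region are ordinary memory writes; the
kernel decides when the dirty pages reach the disk, and msync(2) can
force the issue when that matters.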

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.