Avoiding big objects

Wed Apr 27 02:25:40 CEST 2011

I was working on something in my quest to keep big (eventually
uncacheable) objects from wreaking havoc on my cache. Even if I employ
a scheme to call "restart" from vcl_fetch, after adding a header that
tells vcl_recv to call 'pipe', the object still gets fetched from the
origin server. And if it's 1.5 gig, it can be pretty painful.

So I was hoping to run this by you guys, especially the Varnish devs.
Mainly I wanted to hear if anyone thinks this is a tremendously bad
idea. I wrote this about 45 minutes ago, so it's not particularly
well tested, and if you guys say it's the worst idea ever, I'd rather
know before putting a lot more time into perfecting it. There are
likely to be big corner cases here. There was another recent thread
about this subject, so I know some other people are looking for a
similar solution, and I thought I'd throw this out there for them too.
This doesn't protect me from 1.5 gig JPEG files, but it does most of
the job. And yes, I'm OK with all the extra backend requests, provided
they're HEADs.

Mainly what it's doing is this:

1. Huge files won't ever be HITs in my environment, since I'll have piped them.
2. If a MISS (as it should be), rewrite backend method from GET (I
don't do POSTs on varnish) to HEAD in vcl_miss if it's a file
extension likely to be a biggish file and matches other conditions.
3. In vcl_fetch, if it's a rewritten HEAD, do a size check. If it's
too big, add the header that tells vcl_recv to drop immediately to
'pipe'.
4. In either case, in vcl_fetch, rewrite the method back to GET and
call 'restart'.

Here's the essence of the VCL (imagine regularly-working VCL alongside
it). I typed this out so ignore dumb typos:

sub vcl_recv {
   ....
   # If we've got the header that says to pipe this request, pipe it
   # (thanks Tollef)
   if ( req.http.X-PIPEME && req.restarts > 0 ) {
        return( pipe );
   }
   ....
}

# The URLs in this regex are some sample ones that are often huge in
# size; the eventual list would be bigger and have others like 'mpg'
# etc. Note that I don't send POSTs over Varnish, so ignore lack of POST
sub vcl_miss {
        # If no headcheck header and GET and type is on big list,
        # rewrite to HEAD
        if ( ! req.http.X-HEADCHECK && bereq.request == "GET" &&
             req.url ~ "\.(gz|wmv|zip|flv|avi)$" && req.restarts == 0 ) {
                set req.http.X-HEADCHECK = "1";
                set bereq.request = "HEAD";
                set bereq.http.User-Agent = "HEAD Check";
                log "DEBUG: Rewriting to HEAD";
        }
}

sub vcl_fetch {
        # If this used to be a GET request that we changed to HEAD, do
        # length check. But try to avoid restart loops.
        if ( req.http.X-HEADCHECK && req.request == "GET" &&
             bereq.request == "HEAD" &&
             req.url ~ "\.(gz|wmv|zip|flv|avi)$" && req.restarts < 1 ) {
                unset req.http.X-HEADCHECK;
                set bereq.request = "GET";
                log "DEBUG: [fetch] Rewriting to GET";

                # If content is over 10 meg, pipe it
                if ( beresp.http.Content-Length ~ "[0-9]{8,}" ) {
                        set req.http.X-PIPEME = "1";
                }

                restart;
        }
        ....
}

Mainly I'm just looking for whether the Varnish devs think that this
would cause something to completely explode and/or melt down or this
is the worst security hole ever. It seems to work ok so far. For reqs
that match 'beresp.http.Content-Length ~ "[0-9]{8,}"', the "SMA bytes
allocated" counter never budges, where it normally does for anything
fetched (memory backend).
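
A note on that size test, as an untested sketch rather than a
recommendation: the pattern is unanchored and only counts digits, so
"[0-9]{8,}" matches any Content-Length of 10,000,000 bytes or more
(roughly 9.5 MiB). A HEAD response that carries no Content-Length
header at all would not match, so such an object would be fetched
normally. A variant with the pattern anchored and the cutoff moved to
9 digits could look like the following; the 9-digit cutoff is an
assumed example value, and the header name is the same X-PIPEME used
above:

        # Untested sketch; the 9-digit cutoff is an assumed example value.
        # "^[0-9]{9,}$" matches a Content-Length of 100,000,000 bytes
        # or more, and nothing but digits.
        if ( beresp.http.Content-Length ~ "^[0-9]{9,}$" ) {
                set req.http.X-PIPEME = "1";
        }

Counting digits only allows power-of-ten thresholds, which seems
acceptable for a coarse "is this huge" check.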

Thanks! Hope someone else can benefit from this too. If someone else
uses this (after thorough testing), be sure to remove the 'log' calls
in production.