[Varnish] #897: sess_mem "leak" on hyper-threaded cpu

Varnish varnish-bugs at varnish-cache.org
Tue Apr 12 00:21:49 CEST 2011


#897: sess_mem "leak" on hyper-threaded cpu
-------------------------------------------------+--------------------------
 Reporter:  askalski                             |        Type:  defect
   Status:  new                                  |    Priority:  normal
Milestone:                                       |   Component:  build 
  Version:  trunk                                |    Severity:  major 
 Keywords:  sess_mem leak n_sess race condition  |  
-------------------------------------------------+--------------------------

Comment(by askalski):

 I did some testing to figure out how this bug relates to abnormally high
 memory usage on production varnish servers (the original issue that got me
 looking into this).  I generated synthetic load against a varnishd
 (malloc,64M) running on my laptop (Ubuntu 10.10, kernel 2.6.35) until the
 n_sess_mem counter maxed out at 100000.  Varnish memory usage reached 2GB
 resident.

 {{{
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 nobody   23388  5.7 52.7 8369944 2032540 ?     Sl   15:07   9:24
 /usr/sbin/varnishd \
         -P /var/run/varnishd.pid -a :6081 -T localhost:6082 \
         -f /etc/varnish/default.vcl -S /etc/varnish/secret \
         -s malloc,64M
 }}}

 I then ran a program which attached to the varnishd process via ptrace,
 scanned its mapped memory for sess_mem objects (magic number 0x555859c5),
 and tallied up the number of dirty pages belonging to each, scanning 17
 pages per struct (3288-byte sess_mem/http + 65536-byte workspace).  The
 scanner found all 100000 of the structs.  (Note: the 64MB of SMA-cached
 data had all expired by the time I was able to scan the process memory; I
 used short TTLs.)

 {{{
 sess_mem_count = 100000
 dirty_pages    = 504430
 memory_used    = 2017720kB (1970 MB)
 }}}
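
 For reference, here is a rough sketch of such a scanner (not the actual
 tool I used).  It assumes a Linux/x86-64 target with 4kB pages and
 approximates "dirty" with the "page present" bit from /proc/<pid>/pagemap,
 since untouched malloc'd pages are not yet backed by RAM; the 0x555859c5
 magic and the 3288+65536 byte span are the numbers quoted above.

 {{{
 /* sessmem_scan.c - sketch of a sess_mem / dirty-page scanner */
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ptrace.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>

 #define SESSMEM_MAGIC   0x555859c5U
 #define SESSMEM_SPAN    (3288 + 65536)  /* sess_mem/http + workspace */
 #define PAGE_SZ         4096UL

 /* "Dirty" approximation: bit 63 of the pagemap entry = page present. */
 static int
 page_present(int pm_fd, unsigned long vaddr)
 {
     uint64_t e;

     if (pread(pm_fd, &e, sizeof e, (vaddr / PAGE_SZ) * 8) != sizeof e)
         return (0);
     return ((int)(e >> 63) & 1);
 }

 int
 main(int argc, char **argv)
 {
     char path[64], line[512], buf[PAGE_SZ];
     unsigned long lo, hi, addr, off, p, nstruct = 0, ndirty = 0;
     int mem_fd, pm_fd;
     uint32_t v;
     pid_t pid;
     FILE *maps;

     if (argc != 2)
         return (1);
     pid = (pid_t)atoi(argv[1]);

     /* Stop the target so its memory does not change while we scan. */
     if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) != 0)
         return (1);
     waitpid(pid, NULL, 0);

     snprintf(path, sizeof path, "/proc/%d/mem", pid);
     mem_fd = open(path, O_RDONLY);
     snprintf(path, sizeof path, "/proc/%d/pagemap", pid);
     pm_fd = open(path, O_RDONLY);
     snprintf(path, sizeof path, "/proc/%d/maps", pid);
     maps = fopen(path, "r");
     if (mem_fd < 0 || pm_fd < 0 || maps == NULL)
         return (1);

     while (fgets(line, sizeof line, maps) != NULL) {
         /* Only scan private writable mappings (heap, anon memory). */
         if (sscanf(line, "%lx-%lx", &lo, &hi) != 2 ||
             strstr(line, " rw-p ") == NULL)
             continue;
         for (addr = lo; addr < hi; addr += PAGE_SZ) {
             if (!page_present(pm_fd, addr) ||
                 pread(mem_fd, buf, PAGE_SZ, addr) != (ssize_t)PAGE_SZ)
                 continue;
             /* The magic is 4-byte aligned within the page. */
             for (off = 0; off < PAGE_SZ; off += 4) {
                 memcpy(&v, buf + off, 4);
                 if (v != SESSMEM_MAGIC)
                     continue;
                 nstruct++;
                 /* Count resident pages covering this struct. */
                 for (p = (addr + off) & ~(PAGE_SZ - 1);
                     p < addr + off + SESSMEM_SPAN; p += PAGE_SZ)
                     ndirty += page_present(pm_fd, p);
             }
         }
     }
     printf("sess_mem_count = %lu\n", nstruct);
     printf("dirty_pages    = %lu\n", ndirty);
     printf("memory_used    = %lukB\n", ndirty * PAGE_SZ / 1024);
     ptrace(PTRACE_DETACH, pid, NULL, NULL);
     return (0);
 }
 }}}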

 The reason so much session workspace memory was dirtied was not my
 generated request/response load (the requests used relatively little of
 the workspace).  Rather, it was because the workspaces were allocated by
 malloc at memory locations that had previously been dirtied by the SMA
 stevedore (objects that had either expired or been LRU-evicted).
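
 To make that concrete: glibc malloc hands recently freed, already-resident
 heap memory back out to new allocations, so the session workspaces inherit
 pages the stevedore dirtied rather than fresh zero-backed ones.  A small
 stand-alone demonstration (not Varnish code; the sizes and counts are
 arbitrary, and the exact numbers depend on the glibc allocator):

 {{{
 /* malloc_reuse.c - freed-but-dirty heap pages are handed out again */
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define NALLOC  4096
 #define SZ      65536           /* mirrors the 64kB session workspace */

 static void
 print_rss(const char *tag)
 {
     char line[256];
     FILE *f = fopen("/proc/self/status", "r");

     if (f == NULL)
         return;
     while (fgets(line, sizeof line, f) != NULL)
         if (strncmp(line, "VmRSS:", 6) == 0)
             printf("%-20s %s", tag, line);
     fclose(f);
 }

 int
 main(void)
 {
     static void *ptr[NALLOC];
     void *pin;
     int i;

     print_rss("start:");

     /* Phase 1: the "stevedore" dirties ~256MB of heap memory ... */
     for (i = 0; i < NALLOC; i++) {
         ptr[i] = malloc(SZ);
         memset(ptr[i], 'x', SZ);
     }
     /* Pin the top of the heap so free() below cannot sbrk-trim it. */
     pin = malloc(SZ);
     print_rss("after dirtying:");

     /* ... and frees it again (objects expire or are LRU-evicted). */
     for (i = 0; i < NALLOC; i++)
         free(ptr[i]);
     print_rss("after free:");

     /* Phase 2: "workspaces" are allocated but only lightly written;
      * they still land on the already-resident, dirty pages. */
     for (i = 0; i < NALLOC; i++) {
         ptr[i] = malloc(SZ);
         memset(ptr[i], 0, 64);
     }
     print_rss("after re-allocating:");

     free(pin);
     return (0);
 }
 }}}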

 I have a few production machines where varnishd is using 6.5GB more than
 SMA was configured to use.  Unfortunately, I cannot perform the same
 analysis on the memory of those processes, because ptrace() causes the
 varnish child to exit on RHEL5/2.6.18 (the poll() in CLS_Poll returns
 EINTR, which is interpreted as an error; kernel 2.6.24+ is able to
 restart sys_poll() without userspace intervention).
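
 (For what it is worth, the usual userspace workaround on such kernels is
 to treat EINTR from poll() as "try again" rather than as an error.  A
 generic illustration, not the actual CLS_Poll code:)

 {{{
 /* EINTR-tolerant poll() wrapper; a fully correct version would also
  * recompute the remaining timeout after each interruption. */
 #include <errno.h>
 #include <poll.h>

 static int
 poll_retry(struct pollfd *fds, nfds_t nfds, int timeout_ms)
 {
     int i;

     do {
         i = poll(fds, nfds, timeout_ms);
     } while (i == -1 && errno == EINTR);
     return (i);
 }
 }}}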

 Here's the post-mortem varnishstat:

 {{{
 client_conn           3846502       362.50 Client connections accepted
 client_drop                96         0.01 Connection dropped, no sess/wrk
 client_req            3813266       359.37 Client requests received
 cache_hit                   0         0.00 Cache hits
 cache_hitpass               0         0.00 Cache hits for pass
 cache_miss             185362        17.47 Cache misses
 backend_conn             1833         0.17 Backend conn. success
 backend_unhealthy            0         0.00 Backend conn. not attempted
 backend_busy                0         0.00 Backend conn. too many
 backend_fail                0         0.00 Backend conn. failures
 backend_reuse          183098        17.26 Backend conn. reuses
 backend_toolate             0         0.00 Backend conn. was closed
 backend_recycle        183527        17.30 Backend conn. recycles
 backend_unused              0         0.00 Backend conn. unused
 fetch_head                  0         0.00 Fetch head
 fetch_length                0         0.00 Fetch with Length
 fetch_chunked          185362        17.47 Fetch chunked
 fetch_eof                   0         0.00 Fetch EOF
 fetch_bad                   0         0.00 Fetch had bad headers
 fetch_close                 0         0.00 Fetch wanted close
 fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
 fetch_zero                  0         0.00 Fetch zero len
 fetch_failed                0         0.00 Fetch failed
 n_sess_mem             100000          .   N struct sess_mem
 n_sess                 100442          .   N struct sess
 n_object                    0          .   N struct object
 n_vampireobject             0          .   N unresurrected objects
 n_objectcore                3          .   N struct objectcore
 n_objecthead                3          .   N struct objecthead
 n_smf                       0          .   N struct smf
 n_smf_frag                  0          .   N small free smf
 n_smf_large                 0          .   N large free smf
 n_vbe_conn                  1          .   N struct vbe_conn
 n_wrk                      10          .   N worker threads
 n_wrk_create              502         0.05 N worker threads created
 n_wrk_failed                0         0.00 N worker threads not created
 n_wrk_max                1829         0.17 N worker threads limited
 n_wrk_queue                 0         0.00 N queued work requests
 n_wrk_overflow          94955         8.95 N overflowed work requests
 n_wrk_drop                303         0.03 N dropped work requests
 n_backend                   1          .   N backends
 n_expired                1001          .   N expired objects
 n_lru_nuked            184361          .   N LRU nuked objects
 n_lru_saved                 0          .   N LRU saved objects
 n_lru_moved                 1          .   N LRU moved objects
 n_deathrow                  0          .   N objects on deathrow
 losthdr                     0         0.00 HTTP header overflows
 n_objsendfile               0         0.00 Objects sent with sendfile
 n_objwrite            3808607       358.93 Objects sent with write
 n_objoverflow               0         0.00 Objects overflowing workspace
 s_sess                3846406       362.49 Total Sessions
 s_req                 3813266       359.37 Total Requests
 s_pipe                      0         0.00 Total pipe
 s_pass                      0         0.00 Total pass
 s_fetch                185362        17.47 Total fetch
 s_hdrbytes          877580576     82704.79 Total header bytes
 s_bodybytes       13580918034   1279890.49 Total body bytes
 sess_closed           3831452       361.08 Session Closed
 sess_pipeline               0         0.00 Session Pipeline
 sess_readahead              0         0.00 Session Read Ahead
 sess_linger            185362        17.47 Session Linger
 sess_herd              411446        38.78 Session herd
 shm_records         129034970     12160.49 SHM records
 shm_writes           16142937      1521.34 SHM writes
 shm_flushes                25         0.00 SHM flushes due to overflow
 shm_cont              1069825       100.82 SHM MTX contention
 shm_cycles                 37         0.00 SHM cycles through buffer
 sm_nreq                     0         0.00 allocator requests
 sm_nobj                     0          .   outstanding allocations
 sm_balloc                   0          .   bytes allocated
 sm_bfree                    0          .   bytes free
 sma_nreq               555085        52.31 SMA allocator requests
 sma_nobj                    0          .   SMA outstanding allocations
 sma_nbytes                  0          .   SMA outstanding bytes
 sma_balloc        24435160288          .   SMA bytes allocated
 sma_bfree         24435160288          .   SMA bytes free
 sms_nreq              3627904       341.90 SMS allocator requests
 sms_nobj                    0          .   SMS outstanding allocations
 sms_nbytes                  0          .   SMS outstanding bytes
 sms_balloc         1443905792          .   SMS bytes allocated
 sms_bfree          1443905792          .   SMS bytes freed
 backend_req            184726        17.41 Backend requests made
 n_vcl                       1         0.00 N vcl total
 n_vcl_avail                 1         0.00 N vcl available
 n_vcl_discard               0         0.00 N vcl discarded
 n_purge                     1          .   N total active purges
 n_purge_add                 1         0.00 N new purges added
 n_purge_retire              0         0.00 N old purges deleted
 n_purge_obj_test            0         0.00 N objects tested
 n_purge_re_test             0         0.00 N regexps tested against
 n_purge_dups                0         0.00 N duplicate purges removed
 hcb_nolock             183710        17.31 HCB Lookups without lock
 hcb_lock               185362        17.47 HCB Lookups with lock
 hcb_insert             185360        17.47 HCB Inserts
 esi_parse                   0         0.00 Objects ESI parsed (unlock)
 esi_errors                  0         0.00 ESI parse errors (unlock)
 accept_fail                 0         0.00 Accept failures
 client_drop_late          207         0.02 Connection dropped late
 uptime                  10611         1.00 Client uptime
 backend_retry               0         0.00 Backend conn. retry
 dir_dns_lookups             0         0.00 DNS director lookups
 dir_dns_failed              0         0.00 DNS director failed lookups
 dir_dns_hit                 0         0.00 DNS director cached lookups hit
 dir_dns_cache_full            0         0.00 DNS director full dnscache
 fetch_1xx                   0         0.00 Fetch no body (1xx)
 fetch_204                   0         0.00 Fetch no body (204)
 fetch_304                   0         0.00 Fetch no body (304)
 }}}

-- 
Ticket URL: <http://varnish-cache.org/trac/ticket/897#comment:2>
Varnish <http://varnish-cache.org/>
The Varnish HTTP Accelerator



