4.56. What's with all these Mailman/python/qrunner processes? Why are they eating up all the memory on my server? (performance-tuning tips)

From the mailman-users mailing list (see http://mail.python.org/pipermail/mailman-users/2004-November/040809.html):

Mailman 2.1.x will have eight or nine qrunner processes constantly in memory, but they sit idle (sleeping) unless they have actual work to do.
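
To see these runner processes on your own system, something like the following should do it (a sketch; the exact "ps" options and the process names vary by OS and by how Mailman was installed):

 % ps auxww | egrep '[m]ailmanctl|[R]unner'    # the brackets keep egrep from matching itself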

Depending on what OS you're using, the system may try to keep everything in memory that it can, so as to make maximum use of what is available. Anything that is not currently running is liable to be paged out in favour of other processes, filesystem/disk caching, and so on.

On many systems I'm familiar with, it is not at all uncommon to see what appears to be just a few KB "free", but on closer inspection you discover that most of the memory that is "used" is actually "cache" or "inactive", and therefore available for immediate page-out and re-use by other processes.
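
On Linux, for example, the "free" command makes this distinction explicit (a sketch; the exact layout varies between procps versions, but the buffers/cache figures are memory the kernel will give back to processes on demand):

 % free -m     # on older procps, the "-/+ buffers/cache" line shows what's really available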

Here's a FreeBSD 5.2.1 system I help administer:

 USER       PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED      TIME COMMAND
 mailman  53524  0.0  0.0  7928   12  ??  Ss   Wed02PM   0:00.46 mailmanctl
 mailman  54142  0.0  1.5  8544 3828  ??  S    Wed02PM   0:57.26 VirginRunner
 mailman  54143  0.0  0.7  7892 1844  ??  S    Wed02PM   0:55.32 CommandRunner
 mailman  54144  0.0  1.7  8592 4252  ??  S    Wed02PM   1:03.93 IncomingRunner
 mailman  54145  0.0  0.4  7892 1064  ??  S    Wed02PM   0:00.69 RetryRunner
 mailman  54146  0.0  0.7  8328 1888  ??  S    Wed02PM   0:57.06 NewsRunner
 mailman  54147  0.0  0.8  8512 2036  ??  S    Wed02PM   0:59.44 BounceRunner
 mailman  54148  0.0  1.1 10180 2784  ??  S    Wed02PM   1:18.16 ArchRunner
 mailman  54149  0.0  1.7  8940 4332  ??  S    Wed02PM   1:47.38 OutgoingRunner

On this machine, the first few lines of "top" show:

 last pid: 75984;  load averages:  0.00,  0.00,  0.00   up 10+08:30:45  02:13:03
 79 processes:  3 running, 76 sleeping
 CPU states: 14.3% user,  0.0% nice, 23.8% system,  0.0% interrupt, 61.9% idle
 Mem: 132M Active, 20M Inact, 60M Wired, 6280K Cache, 34M Buf, 25M Free
 Swap: 513M Total, 73M Used, 440M Free, 14% Inuse

Here's a Debian Linux (kernel 2.4.26) machine I help administer:

 USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
 mailman   5130  0.0  0.1  5828  2088 ?       S    Jul06   0:00 mailmanctl
 mailman   5131  2.0  1.6 54028 34896 ?       S    Jul06 3807:00 ArchRunner
 mailman   5132  0.3  0.7 25740 15252 ?       S    Jul06 606:43 BounceRunner
 mailman   5133  0.0  0.7 19328 15608 ?       S    Jul06  73:38 CommandRunner
 mailman   5134  0.1  0.7 18696 16040 ?       S    Jul06 305:05 IncomingRunner
 mailman   5135  0.0  0.3  9212  6840 ?       S    Jul06  43:38 NewsRunner
 mailman   5136  2.4  1.1 25316 22816 ?       S    Jul06 4528:36 OutgoingRunner
 mailman   5137  0.1  0.7 16828 14500 ?       S    Jul06 307:12 VirginRunner
 mailman   5138  0.0  0.0  9624  1848 ?       S    Jul06   0:03 RetryRunner
 mailman  19970  0.0  0.0 10184  1592 ?       S    Aug21   0:00 gate_news

Top shows:

   11:16:51 up 129 days, 14:07,  1 user,  load average: 0.08, 0.18, 0.22
 145 processes: 142 sleeping, 2 running, 1 zombie, 0 stopped
 CPU states:  71.4% user,  33.0% system,   0.9% nice,  2333.9% idle
 Mem:   2069316K total,  2020500K used,    48816K free,    53956K buffers
 Swap:  1951888K total,   191316K used,  1760572K free,   935268K cached
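
If you want to capture just these summary lines non-interactively, say from a monitoring script, something like the following should work (a sketch; the batch-mode flags differ between the Linux procps top and the BSD top):

 % top -b -n 1 | head -5    # Linux (procps)
 % top -b | head -8         # FreeBSD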

Both of these machines are effectively completely idle at the moment (2333% !?!), and yet, at first glance, neither of them appears to have much free memory. If you really want to find out whether you're tight on memory that is actively being used (that is, whether the system is thrashing, constantly stealing pages back from processes that are fighting over the same resources), you need other tools to investigate. Two good ones are "iostat" and "vmstat".
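
For example, to take a sample every five seconds (heavy, sustained traffic on the swap device is the classic symptom of thrashing), you might run something like:

 % iostat 5    # disk throughput, repeated every 5 seconds until interrupted
 % vmstat 5    # virtual-memory statistics at the same interval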

Looking at that FreeBSD machine again, vmstat shows:

 % vmstat 1 20
  procs      memory      page                    disks     faults      cpu
  r b w     avm    fre  flt  re  pi  po  fr  sr da0 da1   in   sy  cs us sy id
  1 0 0  500148  49132   58   1   0   0  36  95   0   0  363    0 317  2  1 97
  0 0 0  500148  49132    5   0   0   0   5   0   0   0  365    0 304  0  4 96
  0 0 0  500148  49132    0   0   0   0   1   0   0   0  357    0 291  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  358    0 284  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  357    0 284  1  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  361    0 298  1  2 97
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  369    0 312  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   3   0  369    0 341  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  364    0 296  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  357    0 287  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  361    0 301  2  2 97
  0 0 0  500148  49132    0   0   0   0   4   0   9   0  386    0 344  1  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  367    0 301  1  4 95
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  360    0 288  1  3 96
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  357    0 286  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   0   0  361    0 301  0  2 98
  0 0 0  500148  49132    0   0   0   0   0   0   2   0  368    0 312  1  2 98
  2 0 0  500148  49132    0   0   0   0   1   0   0   0  358    0 290  2  2 96
  2 0 0  500148  49132    0   0   0   0   0   0   0   0  357    0 285  1  3 96
  2 0 0  500148  49132    0   0   0   0   0   0   0   0  358    0 289  0  3 97

Looking at the Linux box, vmstat shows:

 % vmstat 1 20
    procs                      memory    swap          io     system         cpu
  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
  0  0  0 191316  51020  54252 936864   0   0    14    15    9    19   2  10  18
  0  0  0 191316  50940  54280 936880   0   0     4   256  251   329   0   0  99
  0  0  0 191316  50828  54284 936900   0   0     8     0  263   422   0   0  99
  0  0  0 191316  50780  54288 936916   0   0     8     0  241   414   0   1  99
  0  0  0 191316  50016  54292 936932   0   0     4     0  216   345   0   1  99
  0  0  0 191316  49244  54292 936940   0   0     0     0  195   248   0   0 100
  0  0  0 191316  50588  54316 936956   0   0     0  1160  300   644   3   2  95
  0  0  0 191316  50516  54328 936968   0   0     8    64  222   252   1   0  99
  0  0  0 191316  50360  54344 936996   0   0    24     0  236   324   0   1  99
  0  0  0 191316  49708  54352 936980   0   0    16     0  246   433   1   0  98
  0  0  0 191316  50364  54360 936992   0   0    12     0  315   466   0   1  99
  0  0  0 191316  48820  54376 937004   0   0     0   272  220   314   0   1  99
  0  0  0 191316  49112  54380 936936   0   0     4     0  225   343   1   0  99
  0  0  0 191316  50836  54388 936920   0   0     4     0  206   304   3   1  96
  0  0  0 191316  50772  54388 936932   0   0     0     0  174   171   0   0 100
  0  1  0 191316  50728  54392 936944   0   0    12     0  237   435   0   0 100
  0  0  0 191316  50696  54412 936956   0   0     0   220  193   221   0   1  99
  0  0  0 191316  50640  54416 936968   0   0     8     0  187   120   0   0 100
  0  0  0 191316  49928  54416 936984   0   0     0     0  214   362   1   0  99
  0  0  0 191316  50576  54416 936992   0   0     0     0  215   344   0   0  99

For the FreeBSD box, look at the columns for "pi" (page in) and "po" (page out). This machine isn't doing any paging at all, which means that there is no memory pressure. It may appear to be short on memory, but that's only because the system is keeping everything in memory that it can; it hasn't needed to page out anything that it currently has loaded. You can also look at the columns for "fr" (pages freed per second) and "sr" (pages scanned by the clock algorithm per second). Both fields show that this system is doing very little in either category, which confirms the conclusion drawn from the pi/po columns.
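
If you want to watch for paging continuously rather than eyeballing the output, a quick-and-dirty filter such as the following works (a sketch; the field numbers match this FreeBSD vmstat layout, where "pi" is field 8 and "po" is field 9, and will differ on other systems):

 % vmstat 5 | awk '$8 ~ /^[0-9]+$/ && ($8 > 0 || $9 > 0) { print "paging:", $0 }'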

Doing a comparison/contrast on this same box at a later time, with a newer version of vmstat (note the slightly different default output format, including the new "wa" column for I/O wait) and some Linux-specific command-line options, we can see the following:

 % vmstat 1 20
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  0  2 975916  32988 140552 920632    0    0     1     1    0     1 22 12 66  0
  0  0 975916  32792 140560 920752    0    0   144   556  647  1015  4  1 95  0
  0  0 975916  31928 140560 920820    0    0    64     0 1150   724  3  2 94  0
  1  0 975916  31764 140568 920832    0    0    16     0  694   523  2  1 97  0
  0  0 975916  32508 140572 920852    0    0     8   672 1243  1068  5  3 92  0
  0  0 975916  31192 140584 920844    0    0     0   400  550   834  7  3 90  0
  0  0 975916  31064 140588 920852    0    0     8     0  403   593  3  1 96  0
  0  0 975916  30892 140588 920856    0    0     0     0  447   594  3  1 96  0
  0  0 975916  30868 140588 920860    0    0     0     0  463   779  3  1 95  0
  0  0 975916  30656 140592 920956    0    0    92     0  417   503  3  1 96  0
  0  0 975916  30192 140616 920980    0    0    24   416  370   518  4  1 95  0
  0  0 975916  30176 140616 920992    0    0     0   256  364   570  3  1 96  0
  1  0 975916  30128 140620 920992    0    0     4     0  292   375  1  1 97  0
  0  0 975916  30120 140620 920996    0    0     0     0  350   665  1  1 98  0
  0  0 975916  30072 140620 921000    0    0     4     0  282   439  2  2 96  0
  0  0 975916  30004 140636 921020    0    0    16   780  237   494  4  2 94  0
  0  0 975916  29892 140636 921024    0    0     4     0  235   325  3  0 97  0
  0  0 975916  30012 140640 921040    0    0    20     0  322   497  3  2 95  0
  0  0 975916  29984 140648 921056    0    0    20     0  360   666  3  1 96  0
  0  0 975916  30036 140656 921128    0    0    80   116  410   791  3  1 95  0

And here's what vmstat looks like when given the "-a" argument:

 % vmstat -a 1 20
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free  inact active   si   so    bi    bo   in    cs us sy id wa
  0  0 975916  30164 1082752 229380    0    0     1     1    0     1 22 12 66  0
  0  0 975916  30068 1082836 229384    0    0    80     0  333   533  3  1 96  0
  0  0 975916  30060 1082856 229388    0    0    12   220  293   507  3  1 97  0
  0  0 975916  30080 1082860 229392    0    0     4     0  220   312  3  1 96  0
  0  0 975916  30220 1082700 229392    0    0     0     0  198   294  3  1 96  0
  1  0 975916  29376 1083532 229396    0    0    44     0  338   654  1  0 99  0
  0  0 975916  30336 1082604 229396    0    0     0     0  285   507  4  2 94  0
  0  0 975916  30300 1082640 229404    4    0    12   572  275   445  3  2 95  0
  0  0 975916  30432 1082496 229408    0    0    12     0  272   440  3  1 96  0
  0  0 975916  30396 1082044 229896    0    0     4     0  612   356  3  1 96  0
  0  0 975916  31292 1080800 230252    0    0   140     0  781   682  3  2 95  0
  0  4 975916  31208 1080892 230260    0    0    76   444  356   572  3  0 97  0
  0  0 975916  31352 1080748 230264    0    0    16    40  227   305  3  0 97  0
  0  0 975916  31324 1080764 230264    0    0     0     0  337   721  4  2 95  0
  0  0 975916  31520 1080584 230264    0    0     0     0  266   442  4  0 96  0
  1  0 975916  31520 1080592 230264    0    0     8     0  308   653  2  0 98  0
  0  0 975916  31660 1080452 230268    0    0     8   544  269   371  2  1 98  0
  0  0 975916  31660 1080452 230284    0    0     4     0  242   366  3  1 97  0
  0  0 975916  31840 1080276 230284    0    0     0     0  187   224  3  0 96  0
  0  0 975916  31820 1080260 230316    0    0    16     0  289   429  3  1 95  0

In particular, by looking at the "inact" versus "active" columns, you can see that this machine has no memory pressure: almost all of the memory that is "used" is actually inactive. Adding up the free, inact, and active columns accounts for about 1.3GB; the rest of this machine's 2GB (the "vmstat -s" output below reports 2069316K total) is tied up in kernel buffers and caches, such as the slab caches shown by "vmstat -m" below. Roughly 1GB is simply sitting inactive.
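
On Linux you can also read these figures directly from /proc/meminfo instead of adding up vmstat's columns (a sketch; the field names vary across kernel versions, and 2.4 kernels split the inactive figure into Inact_dirty and Inact_clean):

 % grep -E '^(MemTotal|MemFree|Active|Inact)' /proc/meminfo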

Some additional Linux-specific options to vmstat show very detailed information:

 % vmstat -m
 Cache                       Num  Total   Size  Pages
 kmem_cache                   80     80    244      5
 ip_conntrack               1963   6513    288    382
 tcp_tw_bucket               710   1020    128     34
 tcp_bind_bucket             388    678     32      6
 tcp_open_request            720    720     96     18
 inet_peer_cache              59     59     64      1
 ip_fib_hash                   9    226     32      2
 ip_dst_cache               1344   2352    160     93
 arp_cache                     2     30    128      1
 blkdev_requests            4096   4160     96    104
 journal_head                730   2028     48     20
 revoke_table                  3    253     12      1
 revoke_record               226    226     32      2
 dnotify_cache                 0      0     20      0
 file_lock_cache             455    520     96     13
 fasync_cache                  0      0     16      0
 uid_cache                    18    452     32      4
 skbuff_head_cache           756    888    160     37
 sock                        682    864    960    215
 sigqueue                    522    522    132     18
 kiobuf                        0      0     64      0
 Cache                       Num  Total   Size  Pages
 cdev_cache                  973   1062     64     18
 bdev_cache                    4    177     64      3
 mnt_cache                    14    177     64      3
 inode_cache              833119 833119    512 119017
 dentry_cache             1289340 1289340    128  42978
 filp                      12297  12360    128    412
 names_cache                  64     64   4096     64
 buffer_head              267637 325280     96   8132
 mm_struct                   666    720    160     30
 vm_area_struct             7463  11720     96    292
 fs_cache                    661    767     64     13
 files_cache                 344    441    416     49
 signal_act                  306    306   1312    102
 size-131072(DMA)              0      0 131072      0
 size-131072                   0      0 131072      0
 size-65536(DMA)               0      0  65536      0
 size-65536                    0      0  65536      0
 size-32768(DMA)               0      0  32768      0
 size-32768                    1      2  32768      1
 size-16384(DMA)               0      0  16384      0
 size-16384                    0      1  16384      0
 Cache                       Num  Total   Size  Pages
 size-8192(DMA)                0      0   8192      0
 size-8192                     2      6   8192      2
 size-4096(DMA)                0      0   4096      0
 size-4096                   179    179   4096    179
 size-2048(DMA)                0      0   2048      0
 size-2048                   218    338   2048    130
 size-1024(DMA)                0      0   1024      0
 size-1024                   454    516   1024    129
 size-512(DMA)                 0      0    512      0
 size-512                    560    560    512     70
 size-256(DMA)                 0      0    256      0
 size-256                    540    540    256     36
 size-128(DMA)                 0      0    128      0
 size-128                    961   1230    128     41
 size-64(DMA)                  0      0     64      0
 size-64                  150332 150332     64   2548
 size-32(DMA)                  0      0     32      0
 size-32                  170140 179218     32   1586

 % vmstat -s
       2069316  total memory
       2038880  used memory
        232384  active memory
       1080640  inactive memory
         30436  free memory
        142724  buffer memory
        937524  swap cache
       1951888  total swap
        975916  used swap
        975972  free swap
     826138426 non-nice user cpu ticks
      28477042 nice user cpu ticks
     466997502 system cpu ticks
    2583888858 idle cpu ticks
             0 IO-wait cpu ticks
             0 IRQ cpu ticks
             0 softirq cpu ticks
    1453923144 pages paged in
    1620774295 pages paged out
        317133 pages swapped in
        445086 pages swapped out
     131794970 interrupts
     245776829 CPU context switches
    1130916810 boot time
     115549581 forks

To go any further into this topic, you really have to know more about your OS and how to do proper performance monitoring, analysis, and tuning for it. Of course, that is really beyond the scope of this mailing list.
