3.36. Processing old mbox archives with procmail/formail

My mailman-2.1.4 breaks my old mbox archives when processing them into HTML (pipermail). Some messages get merged together, and the resulting HTML versions often have no Subject line or other headers. It's a mess.

It helps a bit to clean up the old archives first. Get procmail from http://www.procmail.org and install it. You will get the procmail(1) and formail(1) binaries.

Create /usr/local/mailman/etc/clean-mbox-file.rc as follows:

  # formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
  # /usr/local/mailman/archives/private/$listname.mbox
  # The cleaned-up old archive file is written to
  # $HOME/new-mbox-archive.mbox. The original file is left untouched.

  |formail -Y -q- -I"Posted-Date:" -I"Received-Date:" -I"Old-From:" \
  -I "Old-To:" -I "Old-Subject:" -I "Old-Date:" -I "Old-Reply-To:" \
  -I"X-Mailer:" -I"Received:" -I"Status:" -I"X-Status:" \
  -I"X-MIME-Autoconverted:" -I"Organization:" -I"Sender:" \
  -I"Reply-To:" -I"List-Id:" -I"List-Unsubscribe:" -I"List-Archive:" \
  -I"List-Post:" -I"List-Help:" -I"List-Subscribe:" \
  -I"X-List-Received-Date:" -I"Errors-To:" -I"X-Errors-To:" \
  -I"X-Listname:" -I"Listname:" -I"Status:" -I"Content-Length:" \
  -I"Lines:" -I"X-Loop:" -I"X-Mailing-List:" -I"List-Info:" \
  -I"X-Info:" -I"Precedence:"

  # http://www.natur.cuni.cz/~mmokrejs/procmail/mobiles
  # to be sure that quoted-printable in headers got converted to 8 bit
  # De-mangle RFC 2047 header mangling
  * =\?[^?]+\?[qb]\?[^?]+\?=

  # Fix dates like below
  #From cita@xxxxx.cz  Thu Jan 20 03:27:35 2000
  #From: Ctirad Hruby <cita@xxxxx.cz>
  #Date: Thu, 13 May 1999 08:40:57 +0200

  # convert the date extracted from the From_ line to Date: format
  # From line: Fri Jan 14 11:22:23 2000
  # Date: line: Fri, 14 Jan 2000 11:21:31 +0100
  * ^From +[a-zA-Z0-9_@.-]+[       ]+\/[^  ].*
  {
    # keep the matched From_ date so the perl one-liner below can reorder it
    FR1=$MATCH
    FR11=`echo $FR1 | perl -e 'while (<>) { $l=$_; $l =~ m/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+/; print "$1, $3 $2 $5 $4\n";}'`
  }

  * ^Date: +\/[^ ].*
  { FR2=$MATCH }

  * ^Date: +\/[a-zA-Z\,]+

  # TODO
  # Figure out whether both dates differ significantly; if so, recreate
  # Date:
  # I'm lazy now, so I blindly overwrite the Date: field with the
  # date extracted from the From_ line. This is particularly bad because
  # an email sent just before midnight might end up dated on the next day
  # (it could have been delivered to the mailserver just after midnight),
  # and if that next day falls in a new month, the message is mistakenly
  # shifted into that month.
  | formail -I "Date: $FR11"

  # the next line will store all emails into a file in current directory

  # the next line will store all emails into a file in $HOME/
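
The TODO in the recipe file above (rewrite Date: only when it differs significantly from the From_ line date) is not implemented there. As a rough illustration only, the comparison could be sketched in Python as follows. The function names and the two-day tolerance are assumptions made for this sketch and are not part of the original recipe. The From_ line carries no time zone, so the comparison uses wall-clock times and is deliberately coarse.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

# Layout of the date at the end of an mbox From_ line,
# e.g. "Fri Jan 14 11:22:23 2000".
ASCTIME_FORMAT = '%a %b %d %H:%M:%S %Y'


def from_line_date(from_line):
    """Return the delivery date found on a 'From addr date' line."""
    fields = from_line.split()
    return datetime.strptime(' '.join(fields[2:7]), ASCTIME_FORMAT)


def dates_differ(from_line, date_header, tolerance=timedelta(days=2)):
    """True if the Date: value is further than `tolerance` from the From_ date.

    The tolerance is a hypothetical default; pick whatever suits the archive.
    """
    delivered = from_line_date(from_line)
    # Drop the time zone: the From_ line has none to compare against.
    written = parsedate_to_datetime(date_header).replace(tzinfo=None)
    return abs(delivered - written) > tolerance


def replacement_date(from_line):
    """Format the From_ date like a Date: value (without a time zone)."""
    return from_line_date(from_line).strftime('%a, %d %b %Y %H:%M:%S')
```

With the example from the recipe comments, the From_ date (20 Jan 2000) and the Date: header (13 May 1999) are months apart, so dates_differ() reports a mismatch and only such messages would get a new Date: value.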

Then run the command:

  $ formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < /usr/local/mailman/archives/private/$listname.mbox

You will receive the cleaned-up old archive file in your home directory, called mylistname, or in $HOME/new-mbox-archive.mbox. The original file is left untouched. Append/prepend the archive file to
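
Since merged messages are the symptom this cleanup is meant to cure, it may be worth comparing the number of messages in the original and the cleaned archive. A minimal sketch using the mailbox module from the Python standard library; the function name count_messages is made up for this example. Note that this module treats every line starting with "From " as a separator, so unescaped "From " lines in message bodies can inflate the count.

```python
import mailbox


def count_messages(path):
    """Return the number of messages Python's mailbox module finds in an mbox."""
    # create=False: fail instead of silently creating a missing file.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Calling count_messages() on both the original archive and the cleaned file gives two numbers to compare; a large difference suggests that some separators were lost or misread.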


It would be nice if someone described how to define procmail as an external archiver. It could be used to mask email addresses in raw archives or to run pipermail, MHonArc, glimpse, or any other program after each email is archived.
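
One possible starting point, offered as an untested sketch: Mailman 2.1's Defaults.py describes the PUBLIC_EXTERNAL_ARCHIVER and PRIVATE_EXTERNAL_ARCHIVER settings, which take a shell command that receives each message on standard input, with %(listname)s and %(hostname)s substituted. Assuming those settings behave as described there, something like the following in mm_cfg.py might hand each message to procmail. The rc file name external-archiver.rc and the LISTNAME variable are hypothetical; check Defaults.py in your installation before relying on this.

```python
# mm_cfg.py (sketch). The rc file path below is hypothetical.
# Mailman substitutes %(listname)s and %(hostname)s and pipes each
# message to the command on standard input.
PRIVATE_EXTERNAL_ARCHIVER = (
    'procmail -m LISTNAME=%(listname)s '
    '/usr/local/mailman/etc/external-archiver.rc'
)
PUBLIC_EXTERNAL_ARCHIVER = PRIVATE_EXTERNAL_ARCHIVER
```

Inside the rc file, $LISTNAME would then be available to the recipes, for example to choose a per-list output file.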

Other notes:

One user had an mbox file exported by Lyris with From_ lines of the form

 From user@example.com Thu, 01 Dec 2005 15:02:08 -0500

which need to be converted to the form

 From user@example.com Thu Dec  1 15:02:08 2005

This conversion can be done by

 python script.py <old_mbox >new_mbox

where script.py contains

 import sys
 import time

 for line in sys.stdin.readlines():
     if line.startswith('From '):
         try:
             fields = line.split()
             # fields[2:7] is e.g. ['Thu,', '01', 'Dec', '2005', '15:02:08']
             date = ' '.join(fields[2:7])
             t = time.strptime(date, '%a, %d %b %Y %H:%M:%S')
             newtime = time.asctime(t)
             line = ' '.join(fields[0:2] + [newtime] + ['\n'])
         except ValueError:
             # Not a Lyris-style From_ line; leave it unchanged.
             pass
     sys.stdout.write(line)


Also, the bin/cleanarch script in the distribution can escape unescaped 'From ' lines in the bodies of messages so that bin/arch can do a better job. This script requires that the legitimate 'From ' separators in the input mailbox have times in the 'Thu Dec 1 15:02:08 2005' format.
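
To see in advance which 'From ' lines do not carry a date in that format, a small check along the following lines may help. This is only a sketch: it is not the test bin/cleanarch itself performs, and the function names are invented for this example. Lines it reports are either separators in another date format (such as the Lyris one above) or unescaped 'From ' lines in message bodies.

```python
import time

# Date layout expected on separator lines, e.g. "Thu Dec  1 15:02:08 2005".
ASCTIME_FORMAT = '%a %b %d %H:%M:%S %Y'


def has_asctime_date(line):
    """True if a 'From ' line carries a date like 'Thu Dec  1 15:02:08 2005'."""
    fields = line.split()
    if len(fields) < 7 or fields[0] != 'From':
        return False
    try:
        time.strptime(' '.join(fields[2:7]), ASCTIME_FORMAT)
    except ValueError:
        return False
    return True


def suspicious_from_lines(lines):
    """Yield (line_number, line) for 'From ' lines without an asctime date."""
    for number, line in enumerate(lines, start=1):
        if line.startswith('From ') and not has_asctime_date(line):
            yield number, line.rstrip('\n')
```

For example, iterating over suspicious_from_lines(open(path)) for an archive lists the line numbers worth inspecting before running bin/cleanarch.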

