MailmanWiki

3.36. Processing old mbox archives with procmail/formail

My mailman-2.1.4 installation breaks my old mbox archives when it converts them to HTML (pipermail). Some messages get merged together, and the resulting HTML pages often have no Subject line, among other problems.

It helps a bit to clean up the old archives first. Get procmail from http://www.procmail.org and install it; this provides the procmail(1) and formail(1) binaries.

Create /usr/local/mailman/etc/clean-mbox-file.rc as follows:

  :0fh
  #
  # Usage:
  # formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
  # /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox
  #
  # The cleaned-up archive is written by the delivery recipe at the end
  # of this file. The original file is left untouched.

  |formail -Y -q- -I"Posted-Date:" -I"Received-Date:" -I"Old-From:" \
  -I"Old-To:" -I"Old-Subject:" -I"Old-Date:" -I"Old-Reply-To:" \
  -I"X-Mailer:" -I"Received:" -I"Status:" -I"X-Status:" \
  -I"X-MIME-Autoconverted:" -I"Organization:" -I"Sender:" \
  -I"Reply-To:" -I"List-Id:" -I"List-Unsubscribe:" -I"List-Archive:" \
  -I"List-Post:" -I"List-Help:" -I"List-Subscribe:" \
  -I"X-List-Received-Date:" -I"Errors-To:" -I"X-Errors-To:" \
  -I"X-Listname:" -I"Listname:" -I"Content-Length:" \
  -I"Lines:" -I"X-Loop:" -I"X-Mailing-List:" -I"List-Info:" \
  -I"X-Info:" -I"Precedence:"

  # http://www.natur.cuni.cz/~mmokrejs/procmail/mobiles
  # to be sure that quoted-printable in headers got converted to 8 bit
  # De-mangle RFC 2047 header mangling
  :0Hhfw
  * =\?[^?]+\?[qb]\?[^?]+\?=
  |/usr/local/bin/dmmh

  #
  # Fix dates like below
  #From cita@xxxxx.cz  Thu Jan 20 03:27:35 2000
  #From: Ctirad Hruby <cita@xxxxx.cz>
  #Date: Thu, 13 May 1999 08:40:57 +0200

  # convert the date extracted from the From_ line to Date: format
  # From line: Fri Jan 14 11:22:23 2000
  # Date: line: Fri, 14 Jan 2000 11:21:31 +0100
  :0
  * ^From +[a-zA-Z0-9_@.-]+[       ]+\/[^  ].*
  {
    FR1=$MATCH
    FR11=`echo $FR1 | perl -e 'while (<>) { $l=$_; $l =~ m/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+/; print "$1, $3 $2 $5 $4\n";}'`
  }

  :0
  * ^Date: +\/[^ ].*
  { FR2=$MATCH }

  :0
  * ^Date: +\/[a-zA-Z\,]+
  { OVERLAP=$MATCH }

  #
  # TODO
  # Check whether the two dates differ significantly and, only if they
  # do, recreate Date:.
  #
  # I'm lazy for now, so I blindly overwrite the Date: field with the
  # date extracted from the From_ line. This is particularly bad because
  # an email sent just before midnight may end up dated the next day
  # (it could have been delivered to the mail server just after
  # midnight), and if that next day falls in a new month the message
  # is mistakenly moved into the wrong month.
  #
  :0f
  | formail -I "Date: $FR11"

  # the next line will store all emails into a file in current directory
  :0:
  mylistname

  # the next line will store all emails into a file in $HOME/
  #:0:
  #$HOME/new-mbox-archive.mbox

Then run the command:

  $ formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox

The cleaned-up archive is written to a file called mylistname in the current directory, or to $HOME/new-mbox-archive.mbox if you enable the alternative delivery recipe at the end of the rc file. The original file is left untouched. Append or prepend the cleaned file to

  $prefix/archives/private/$listname.mbox/$listname.mbox.
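
The TODO in the rc file (only rewrite Date: when it disagrees badly with the From_ line) is easier to express outside procmail. The following is a minimal Python sketch of that check, not part of the recipe above. The function names and the one-day threshold are assumptions chosen for illustration, and it assumes the From_ line ends in an asctime-style date without a time zone.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def from_line_date(from_line):
    """Parse the asctime-style date at the end of an mbox From_ line."""
    # 'From addr Fri Jan 14 11:22:23 2000': the last five fields are the date.
    fields = from_line.split()
    return datetime.strptime(' '.join(fields[-5:]), '%a %b %d %H:%M:%S %Y')


def date_header_from(from_line):
    """Format the From_ date the way a Date: header expects (no zone)."""
    return from_line_date(from_line).strftime('%a, %d %b %Y %H:%M:%S')


def dates_differ(from_line, date_header, limit=timedelta(days=1)):
    """Return True when From_ and Date: disagree by more than limit."""
    delivered = from_line_date(from_line)
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # An unparsable Date: header counts as "needs rewriting".
        return True
    # The From_ line has no zone, so compare wall-clock times only.
    sent = sent.replace(tzinfo=None)
    return abs(delivered - sent) > limit
```

With the example headers quoted in the rc file, the message delivered on Jan 20 2000 but dated May 13 1999 would be flagged, while a message whose two dates are less than a minute apart would keep its original Date: header.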

It would be nice if someone described how to set up procmail as an external archiver. It could then be used to mask email addresses in the raw archives or to run pipermail, MHonArc, glimpse, or any other program after each message is archived.
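
As a starting point only: Mailman 2.1 has the settings PUBLIC_EXTERNAL_ARCHIVER and PRIVATE_EXTERNAL_ARCHIVER, which name a shell command that receives each archived message on standard input, with %(listname)s substituted in the command string. An untested sketch for mm_cfg.py might look like the following. The rc file name archive-hook.rc, the procmail path, and the LISTNAME variable are assumptions made for illustration.

```python
# Hypothetical addition to Mailman/mm_cfg.py (Mailman 2.1); untested.
# archive-hook.rc is an assumed procmail rc file that you would write
# yourself; LISTNAME is an assumed variable name passed to that rc file.
PUBLIC_EXTERNAL_ARCHIVER = (
    '/usr/bin/procmail -m LISTNAME=%(listname)s '
    '/usr/local/mailman/etc/archive-hook.rc'
)
# Use the same command for lists with private archives.
PRIVATE_EXTERNAL_ARCHIVER = PUBLIC_EXTERNAL_ARCHIVER
```

Check the comments in Defaults.py of your own installation before relying on this.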

Other notes:

One user had an mbox file exported by Lyris with From_ lines of the form

 From user@example.com Thu, 01 Dec 2005 15:02:08 -0500

which need to be converted to the form

 From user@example.com Thu Dec  1 15:02:08 2005

This conversion can be done by

 python script.py <old_mbox >new_mbox

where script.py contains the code between the dashed lines

 ----------------------------------------------------------
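 import sys
 import time

 # Rewrite Lyris-style From_ lines ("Thu, 01 Dec 2005 15:02:08 -0500")
 # into the asctime form ("Thu Dec  1 15:02:08 2005"); copy everything else.
 for line in sys.stdin:
     if line.startswith('From '):
         fields = line.split()
         # fields[2:7] is the date without the trailing UTC offset
         date = ' '.join(fields[2:7])
         try:
             t = time.strptime(date, '%a, %d %b %Y %H:%M:%S')
             newtime = time.asctime(t)
             # Join without leaving a trailing space before the newline
             line = ' '.join(fields[0:2] + [newtime]) + '\n'
         except ValueError:
             # Not a Lyris-style separator (e.g. body text); leave it alone
             pass

     sys.stdout.write(line)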
 ----------------------------------------------------------

Also, the distribution includes the bin/cleanarch script, which escapes unescaped 'From ' lines in message bodies so that bin/arch can do a better job. This script requires that the legitimate 'From ' separators in the input mailbox have times in the 'Thu Dec 1 15:02:08 2005' format.
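
Before running bin/cleanarch it can help to see which 'From ' lines in a mailbox do not look like separators in that format. The following is a rough Python sketch of such a pre-check. The pattern is an approximation written for this page, not the one bin/cleanarch uses, and the function name is made up for illustration.

```python
import re

# Approximate shape of 'From addr Thu Dec  1 15:02:08 2005'.
# The day of month may be space-padded or a single digit.
SEPARATOR = re.compile(
    r'^From \S+ +[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]?\d '
    r'\d\d:\d\d:\d\d \d{4}\s*$'
)


def suspicious_from_lines(lines):
    """Yield (line_number, line) for 'From ' lines that are not separators."""
    for number, line in enumerate(lines, start=1):
        if line.startswith('From ') and not SEPARATOR.match(line):
            yield number, line.rstrip('\n')
```

Passing an open mbox file to suspicious_from_lines() lists both unescaped body lines and separators in another date format, such as the Lyris one above, so the output shows which fix is needed first.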

Converted from the Mailman FAQ Wizard


MailmanWiki: DOC/Processing old mbox archives with procmail-formail (last edited 2019-06-22 13:51:22 by msapiro)