3.36. Processing old mbox archives with procmail/formail
My mailman-2.1.4 breaks my old mbox archives when processing them into HTML (pipermail). As a result, some messages get merged together, and the HTML versions often lack a Subject line and other headers. It's a mess.
It helps a bit to clean up the old archives first. Get procmail from http://www.procmail.org and install it. This gives you the procmail(1) and formail(1) binaries.
Create /usr/local/mailman/etc/clean-mbox-file.rc as follows:
#
# formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
# /usr/local/mailman/archives/private/$listname.mbox
#
# You will receive cleaned up old archive file in
# $HOME/new-mbox-archive.mbox. The original file is left untouched.

:0fh
|formail -Y -q- -I"Posted-Date:" -I"Received-Date:" -I"Old-From:" \
  -I "Old-To:" -I "Old-Subject:" -I "Old-Date:" -I "Old-Reply-To:" \
  -I"X-Mailer:" -I"Received:" -I"Status:" -I"X-Status:" \
  -I"X-MIME-Autoconverted:" -I"Organization:" -I"Sender:" \
  -I"Reply-To:" -I"List-Id:" -I"List-Unsubscribe:" -I"List-Archive:" \
  -I"List-Post:" -I"List-Help:" -I"List-Subscribe:" \
  -I"X-List-Received-Date:" -I"Errors-To:" -I"X-Errors-To:" \
  -I"X-Listname:" -I"Listname:" -I"Status:" -I"Content-Length:" \
  -I"Lines:" -I"X-Loop:" -I"X-Mailing-List:" -I"List-Info:" \
  -I"X-Info:" -I"Precedence:"

# http://www.natur.cuni.cz/~mmokrejs/procmail/mobiles
# to be sure that quoted-printable in headers got converted to 8 bit
# De-mangle RFC 2047 header mangling
:0Hhfw
* =\?[^?]+\?[qb]\?[^?]+\?=
|/usr/local/bin/dmmh

#
# Fix dates like below
#From cita@xxxxx.cz Thu Jan 20 03:27:35 2000
#From: Ctirad Hruby <cita@xxxxx.cz>
#Date: Thu, 13 May 1999 08:40:57 +0200

# convert the date extracted from the From_ line to Date: format
# From line: Fri Jan 14 11:22:23 2000
# Date: line: Fri, 14 Jan 2000 11:21:31 +0100
:0
* ^From +[a-zA-Z0-9_@.-]+[ ]+\/[^ ].*
{
  FR1=$MATCH
  FR11=`echo $FR1 | perl -e 'while (<>) { $l=$_; $l =~ m/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+/; print "$1, $3 $2 $5 $4\n";}'`
}

:0
* ^Date: +\/[^ ].*
{
  FR2=$MATCH
}

:0
* ^Date: +\/[a-zA-Z\,]+
{
  OVERLAP=$MATCH
}

#
# TODO
# figure out both dates differ significantly, if yes then recreate
# Date:
#
# I'm lazy now, so I'll blindly overwrite the Date: field with the
# date extracted from the From_ line. This is particularly bad as an
# email sent just before midnight might end up dated now on next day
# (as it could have been delivered on mailserver just after midnight
# ... and if *that* very next day was in new month we shift the
# message mistakenly ...).
#
:0f
| formail -I "Date: $FR11"

# the next line will store all emails into a file in current directory
:0:
mylistname

# the next line will store all emails into a file in $HOME/
#:0:
#$HOME/new-mbox-archive.mbox
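The TODO in the rc file (rewrite Date: only when it differs significantly from the From_ date) is left unimplemented there. The check itself could be sketched outside procmail roughly as follows. The helper name dates_differ and the two-day threshold are assumptions made for illustration; they are not part of the original recipe.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def dates_differ(from_date, date_header, threshold=timedelta(days=2)):
    """Return True if the From_ date and the Date: header disagree by more
    than `threshold` (an assumed value), or if Date: cannot be parsed."""
    # From_ dates look like "Thu Jan 20 03:27:35 2000" and carry no time zone.
    # %a and %b are locale-dependent; this assumes English day/month names.
    envelope = datetime.strptime(from_date.strip(), '%a %b %d %H:%M:%S %Y')
    try:
        header = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable Date: header, so treat it as needing replacement.
        return True
    # Drop the zone: an offset of a few hours is small next to the threshold.
    header = header.replace(tzinfo=None)
    return abs(envelope - header) > threshold


# The two examples quoted in the rc file comments:
print(dates_differ('Thu Jan 20 03:27:35 2000', 'Thu, 13 May 1999 08:40:57 +0200'))
print(dates_differ('Fri Jan 14 11:22:23 2000', 'Fri, 14 Jan 2000 11:21:31 +0100'))
```

The first pair is months apart and would be flagged; the second differs by under a minute and would be left alone.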
Then run the command:
$ formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < /usr/local/mailman/archives/private/$listname.mbox
You will receive the cleaned-up archive as a file called mylistname in the current directory or, if you enable the commented-out recipe at the end of the rc file instead, as $HOME/new-mbox-archive.mbox. The original file is left untouched. Append/prepend the archive file to
$prefix/archives/private/$listname.mbox/$listname.mbox.
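Before merging the cleaned file into the live archive, it may be worth checking that the messages were split as expected. A minimal sketch using Python's standard mailbox module follows; the function name summarize_mbox is hypothetical, and the script only reads the file.

```python
import mailbox
import sys


def summarize_mbox(path):
    """Count messages in an mbox file and how many lack Subject: or Date:."""
    total = no_subject = no_date = 0
    # create=False: fail instead of silently creating an empty mailbox
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            total += 1
            if msg['Subject'] is None:
                no_subject += 1
            if msg['Date'] is None:
                no_date += 1
    finally:
        box.close()
    return total, no_subject, no_date


if __name__ == '__main__' and len(sys.argv) > 1:
    total, no_subject, no_date = summarize_mbox(sys.argv[1])
    print('%d messages, %d without Subject:, %d without Date:'
          % (total, no_subject, no_date))
```

Run it once on the original mbox and once on the cleaned file; a large change in the message count suggests that messages were merged or split differently.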
It would be nice if someone described how to define procmail as an external archiver. It could then be used to mask email addresses in raw archives or to run pipermail, MHonArc, glimpse, or any other program after each email is archived.
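As a possible starting point, Mailman 2.1's Defaults.py provides the PUBLIC_EXTERNAL_ARCHIVER and PRIVATE_EXTERNAL_ARCHIVER settings, which name a shell command that receives each message on standard input. The fragment below is an untested sketch for mm_cfg.py; the rc file name is hypothetical, and you should check the comments in your own Defaults.py before relying on it.

```python
# Addition to mm_cfg.py (untested sketch, not a verified configuration).
# Mailman pipes each archived message to the command's standard input and
# substitutes %(listname)s with the list's name.
# The rc file path is hypothetical; write one rc file per list.
PUBLIC_EXTERNAL_ARCHIVER = 'procmail -m /usr/local/mailman/etc/archive-%(listname)s.rc'
PRIVATE_EXTERNAL_ARCHIVER = 'procmail -m /usr/local/mailman/etc/archive-%(listname)s.rc'
```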
Other notes:
One user had an mbox file exported by Lyris with From_ lines of the form
From user@example.com Thu, 01 Dec 2005 15:02:08 -0500
which need to be converted to the form
From user@example.com Thu Dec 1 15:02:08 2005
This conversion can be done by
python script.py <old_mbox >new_mbox
where script.py is the text between the dashed lines:
----------------------------------------------------------
import sys
import time

for line in sys.stdin.readlines():
    if line.startswith('From '):
        fields = line.split()
        # fields[2:7] is the Lyris date without its trailing zone offset,
        # e.g. 'Thu, 01 Dec 2005 15:02:08'
        date = ' '.join(fields[2:7])
        try:
            t = time.strptime(date, '%a, %d %b %Y %H:%M:%S')
            newtime = time.asctime(t)
            # Rebuild the separator as 'From address asctime-date'
            line = ' '.join(fields[0:2] + [newtime]) + '\n'
        except ValueError:
            # Not a Lyris-style From_ line; leave it unchanged
            pass
    sys.stdout.writelines([line])
----------------------------------------------------------
Also, there is the bin/cleanarch script in the distribution, which can escape unescaped 'From ' lines in the bodies of messages so that bin/arch can do a better job. This script requires that the legitimate 'From ' separators in the input mailbox have times in the 'Thu Dec 1 15:02:08 2005' format.
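To illustrate what escaping means here, the following simplified sketch prefixes '>' to any line that starts with 'From ' but does not look like a separator in the format above. This is not the actual bin/cleanarch code; the regular expression and the function name escape_body_from are assumptions for illustration only.

```python
import re

# Assumed pattern for a legitimate separator, e.g.
# "From user@example.com Thu Dec  1 15:02:08 2005"
SEPARATOR = re.compile(
    r'^From \S+ +[A-Z][a-z]{2} +[A-Z][a-z]{2} +\d{1,2} '
    r'+\d{2}:\d{2}:\d{2} +\d{4}\s*$')


def escape_body_from(lines):
    """Yield lines with '>' prefixed to 'From ' lines that are not separators."""
    for line in lines:
        if line.startswith('From ') and not SEPARATOR.match(line):
            yield '>' + line
        else:
            yield line


sample = [
    'From user@example.com Thu Dec  1 15:02:08 2005\n',
    'Subject: hello\n',
    '\n',
    'From what I can tell, this works.\n',
]
print(''.join(escape_body_from(sample)), end='')
```

In the sample, the separator on the first line is kept as-is, while the body line beginning 'From what I can tell' comes out as '>From what I can tell', so it can no longer be mistaken for the start of a new message.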
Converted from the Mailman FAQ Wizard