3.36. Processing old mbox archives with procmail/formail
My mailman-2.1.4 breaks my old mbox archives when processing them into HTML (pipermail): some messages get merged together, and the resulting HTML pages often lack a Subject line and other headers. It's a mess.
It helps a bit to clean up the old archives first. Get procmail from http://www.procmail.org and install it; this gives you the procmail(1) and formail(1) binaries.
Create /usr/local/mailman/etc/clean-mbox-file.rc as follows:
:0fh
#
# formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
# /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox
#
# You will receive the cleaned-up old archive in a file called mylistname
# in the current directory (or in $HOME/new-mbox-archive.mbox if you use
# the alternative recipe at the end). The original file is left untouched.
|formail -Y -q- -I"Posted-Date:" -I"Received-Date:" -I"Old-From:" \
-I "Old-To:" -I "Old-Subject:" -I "Old-Date:" -I "Old-Reply-To:" \
-I"X-Mailer:" -I"Received:" -I"Status:" -I"X-Status:" \
-I"X-MIME-Autoconverted:" -I"Organization:" -I"Sender:" \
-I"Reply-To:" -I"List-Id:" -I"List-Unsubscribe:" -I"List-Archive:" \
-I"List-Post:" -I"List-Help:" -I"List-Subscribe:" \
-I"X-List-Received-Date:" -I"Errors-To:" -I"X-Errors-To:" \
-I"X-Listname:" -I"Listname:" -I"Status:" -I"Content-Length:" \
-I"Lines:" -I"X-Loop:" -I"X-Mailing-List:" -I"List-Info:" \
-I"X-Info:" -I"Precedence:"
# http://www.natur.cuni.cz/~mmokrejs/procmail/mobiles
# to be sure that quoted-printable in headers got converted to 8 bit
# De-mangle RFC 2047 header mangling
:0Hhfw
* =\?[^?]+\?[qb]\?[^?]+\?=
|/usr/local/bin/dmmh
#
# Fix dates like below
#From cita@xxxxx.cz Thu Jan 20 03:27:35 2000
#From: Ctirad Hruby <cita@xxxxx.cz>
#Date: Thu, 13 May 1999 08:40:57 +0200
# convert the date extracted from the From_ line to Date: format
# From line: Fri Jan 14 11:22:23 2000
# Date: line: Fri, 14 Jan 2000 11:21:31 +0100
:0
* ^From +[a-zA-Z0-9_@.-]+[ ]+\/[^ ].*
{
FR1=$MATCH
FR11=`echo $FR1 | perl -e 'while (<>) { $l=$_; $l =~ m/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+/; print "$1, $3 $2 $5 $4\n";}'`
}
:0
* ^Date: +\/[^ ].*
{ FR2=$MATCH }
:0
* ^Date: +\/[a-zA-Z\,]+
{ OVERLAP=$MATCH }
#
# TODO
# figure out both dates differ significantly, if yes then recreate
# Date:
#
# I'm lazy now, so I'll blindly overwrite the Date: field with the
# date extracted from the From_ line. This is particularly bad because
# an email sent just before midnight might end up dated the next day
# (it could have been delivered to the mailserver just after midnight
# ... and if *that* very next day was in a new month, we mistakenly
# shift the message into the wrong month ...).
#
:0f
| formail -I "Date: $FR11"
# the next line will store all emails into a file in current directory
:0:
mylistname
# the next line will store all emails into a file in $HOME/
#:0:
#$HOME/new-mbox-archive.mbox
Then run the command:
$ formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox
You will receive the cleaned-up old archive in a file called mylistname in the current directory, or in $HOME/new-mbox-archive.mbox if you enabled the commented-out recipe instead. The original file is left untouched. Append/prepend the new archive file to $prefix/archives/private/$listname.mbox/$listname.mbox.
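The TODO in the rc file (recreate Date: only when the two dates differ significantly) is left open there. A minimal sketch of that check in Python might look like the following. The function names and the one-day threshold are assumptions made for illustration only; they are not part of the rc file or of Mailman. The first function mirrors the perl one-liner in the rc file, and a date without a timezone is treated as local time.

```python
from email.utils import parsedate_tz, mktime_tz


def from_date_to_header(from_line):
    """Turn 'From addr Fri Jan 14 11:22:23 2000' into
    'Fri, 14 Jan 2000 11:22:23' (same field shuffle as the perl one-liner)."""
    # Drop 'From' and the address; what remains is the delivery date.
    stamp = from_line.split(None, 2)[2]
    wday, month, day, clock, year = stamp.split()[:5]
    return '%s, %s %s %s %s' % (wday, day, month, year, clock)


def dates_differ(from_line, date_header, max_seconds=86400):
    """Return True if the From_ date and the Date: value are more than
    max_seconds apart (hypothetical threshold), or if either is unparsable."""
    from_parsed = parsedate_tz(from_date_to_header(from_line))
    date_parsed = parsedate_tz(date_header)
    if from_parsed is None or date_parsed is None:
        return True
    # mktime_tz() assumes local time when no timezone offset is present.
    gap = abs(mktime_tz(from_parsed) - mktime_tz(date_parsed))
    return gap > max_seconds
```

With a check like this, the Date: field would only be overwritten when dates_differ() returns True, which avoids the midnight problem described in the rc file comments for messages whose Date: is already plausible.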
It would be nice if someone described how to define procmail as an external archiver. It could then be used to mask email addresses in raw archives, or to run pipermail, MHonArc, glimpse, or any other program after each email is archived.
Other notes:
One user had an mbox file exported by Lyris with From_ lines of the form
From user@example.com Thu, 01 Dec 2005 15:02:08 -0500
which need to be converted to the form
From user@example.com Thu Dec 1 15:02:08 2005
This conversion can be done by
python script.py <old_mbox >new_mbox
where script.py is between the dashed lines
----------------------------------------------------------
import sys
import time

for line in sys.stdin:
    if line.startswith('From '):
        fields = line.split()
        # fields[2:7] is e.g. ['Thu,', '01', 'Dec', '2005', '15:02:08']
        date = ' '.join(fields[2:7])
        try:
            t = time.strptime(date, '%a, %d %b %Y %H:%M:%S')
            newtime = time.asctime(t)
            # Rebuild the separator as 'From address asctime-date'
            line = ' '.join(fields[0:2] + [newtime]) + '\n'
        except ValueError:
            # Not a Lyris-style date; leave the line unchanged
            pass
    sys.stdout.write(line)
----------------------------------------------------------
Also, the distribution includes the bin/cleanarch script, which can escape unescaped 'From ' lines in the bodies of messages so that bin/arch can do a better job. This script requires that the legitimate 'From ' separators in the input mailbox have times in the 'Thu Dec 1 15:02:08 2005' format.
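Before running bin/cleanarch, it may help to see which 'From ' lines in the mbox do not have that date format. The following is only a rough sketch: the regular expression is my own approximation of the 'Thu Dec 1 15:02:08 2005' form, not the pattern cleanarch itself uses, and the function name is hypothetical. Lines it reports are either malformed separators (such as the Lyris ones above) or body lines that happen to start with 'From '.

```python
import re

# Approximate pattern for 'From addr Thu Dec  1 15:02:08 2005'
# (an assumption for illustration, not taken from cleanarch).
SEPARATOR = re.compile(
    r'^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) '
    r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +'
    r'\d{1,2} \d{2}:\d{2}:\d{2} \d{4}\s*$')


def suspicious_from_lines(lines):
    """Yield (line_number, text) for every line that starts with 'From '
    but does not look like a separator with an asctime-style date."""
    for number, line in enumerate(lines, 1):
        if line.startswith('From ') and not SEPARATOR.match(line):
            yield number, line.rstrip('\n')
```

It could be used, for example, as list(suspicious_from_lines(open('old_mbox'))), where old_mbox is the file name used in the conversion example above.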
Converted from the Mailman FAQ Wizard