#pragma page-filename DOC/versions/4030689

== 3.36. Processing old mbox archives with procmail/formail ==

My mailman-2.1.4 breaks my old mbox archives when processing them into HTML (pipermail). As a result, some messages get merged together, and the HTML versions often have no Subject line and so on. It's a mess. It helps a bit to clean up the old archives first.

Get procmail from [[http://www.procmail.org|http://www.procmail.org]] and install it. You will get the procmail(1) and formail(1) binaries.

Create /usr/local/mailman/etc/clean-mbox-file.rc as follows:

{{{
#
# formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
#   /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox
#
# The cleaned-up archive is stored by the delivery recipe at the end of
# this file. The original file is left untouched.
:0fh
|formail -Y -q- -I"Posted-Date:" -I"Received-Date:" -I"Old-From:" \
 -I "Old-To:" -I "Old-Subject:" -I "Old-Date:" -I "Old-Reply-To:" \
 -I"X-Mailer:" -I"Received:" -I"Status:" -I"X-Status:" \
 -I"X-MIME-Autoconverted:" -I"Organization:" -I"Sender:" \
 -I"Reply-To:" -I"List-Id:" -I"List-Unsubscribe:" -I"List-Archive:" \
 -I"List-Post:" -I"List-Help:" -I"List-Subscribe:" \
 -I"X-List-Received-Date:" -I"Errors-To:" -I"X-Errors-To:" \
 -I"X-Listname:" -I"Listname:" -I"Status:" -I"Content-Length:" \
 -I"Lines:" -I"X-Loop:" -I"X-Mailing-List:" -I"List-Info:" \
 -I"X-Info:" -I"Precedence:"

# http://www.natur.cuni.cz/~mmokrejs/procmail/mobiles
# to be sure that quoted-printable in headers got converted to 8 bit
# De-mangle RFC 2047 header mangling
:0Hhfw
* =\?[^?]+\?[qb]\?[^?]+\?=
|/usr/local/bin/dmmh

#
# Fix dates like below
#From cita@xxxxx.cz Thu Jan 20 03:27:35 2000
#From: Ctirad Hruby <cita@xxxxx.cz>
#Date: Thu, 13 May 1999 08:40:57 +0200

# convert the date extracted from the From_ line to Date: format
# From line:  Fri Jan 14 11:22:23 2000
# Date: line: Fri, 14 Jan 2000 11:21:31 +0100
:0
* ^From +[a-zA-Z0-9_@.-]+[ ]+\/[^ ].*
{
  FR1=$MATCH
  FR11=`echo $FR1 | perl -e 'while (<>) { $l=$_; $l =~ m/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+/; print "$1, $3 $2 $5 $4\n";}'`
}

:0
* ^Date: +\/[^ ].*
{
  FR2=$MATCH
}

:0
* ^Date: +\/[a-zA-Z\,]+
{
  OVERLAP=$MATCH
}

#
# TODO
# figure out whether both dates differ significantly, if yes then
# recreate Date:
#
# I'm lazy now, so I'll blindly overwrite the Date: field with the
# date extracted from the From_ line. This is particularly bad as an
# email sent just before midnight might end up dated on the next day
# (as it could have been delivered to the mailserver just after midnight
# ... and if *that* very next day was in a new month we shift the
# message mistakenly ...).
#
:0f
| formail -I "Date: $FR11"

# the next recipe will store all emails into a file in the current directory
:0:
mylistname

# the next recipe will store all emails into a file in $HOME/
#:0:
#$HOME/new-mbox-archive.mbox
}}}

Then run the command:

{{{
$ formail -b -s procmail -m /usr/local/mailman/etc/clean-mbox-file.rc < \
    /usr/local/mailman/archives/private/$listname.mbox/$listname.mbox
}}}

You will receive the cleaned-up archive in a file called mylistname in the current directory, or in $HOME/new-mbox-archive.mbox if you enable the commented-out recipe instead. The original file is left untouched. Append/prepend the cleaned-up archive file to

{{{
$prefix/archives/private/$listname.mbox/$listname.mbox
}}}

It would be nice if someone described how to define procmail as an external archiver. It could be used to mask email addresses in raw archives, or to run pipermail/MHonArc/glimpse or any other program after each email is archived.

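The TODO in the recipe above (recreate Date: only when it differs significantly from the From_ line date) could also be handled outside procmail. The following is only a minimal sketch in Python, not something shipped with procmail or Mailman. The function names and the two-day `MAX_DRIFT` threshold are hypothetical, and it assumes the From_ line is already in the 'Thu Jan 20 03:27:35 2000' format and that an English/C locale is in effect.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

# Hypothetical threshold: smaller differences are treated as "the same date".
# It is generous because the From_ line carries no timezone information.
MAX_DRIFT = timedelta(days=2)


def from_line_date(from_line):
    """Parse the date of a 'From addr Thu Jan 20 03:27:35 2000' separator."""
    fields = from_line.split()
    # fields[2:7] is: weekday, month, day, time, year
    return datetime.strptime(' '.join(fields[2:7]), '%a %b %d %H:%M:%S %Y')


def choose_date(from_line, date_header):
    """Return the Date: value to keep, as a string.

    The existing header is kept unless it cannot be parsed or differs
    from the From_ line date by more than MAX_DRIFT.
    """
    envelope = from_line_date(from_line)
    # Same layout the perl one-liner in the recipe produces (no timezone).
    fallback = envelope.strftime('%a, %d %b %Y %H:%M:%S')
    try:
        header = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return fallback
    # Compare wall-clock values only, ignoring the header's UTC offset.
    header = header.replace(tzinfo=None)
    if abs(envelope - header) > MAX_DRIFT:
        return fallback
    return date_header
```

With the two examples quoted in the recipe comments, the message dated May 1999 but delivered in January 2000 would get its Date: rebuilt from the From_ line, while the message whose two dates differ by about a minute would keep its original Date: header, timezone included.
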
Other notes:

One user had an mbox file exported by Lyris with From_ lines of the form

{{{
From user@example.com Thu, 01 Dec 2005 15:02:08 -0500
}}}

which need to be converted to the form

{{{
From user@example.com Thu Dec 1 15:02:08 2005
}}}

This conversion can be done by

{{{
python script.py <old_mbox >new_mbox
}}}

where script.py is

{{{
import sys
import time

for line in sys.stdin:
    if line.startswith('From '):
        fields = line.split()
        # fields[2:7] is e.g. 'Thu, 01 Dec 2005 15:02:08' (timezone dropped)
        date = ' '.join(fields[2:7])
        try:
            t = time.strptime(date, '%a, %d %b %Y %H:%M:%S')
            newtime = time.asctime(t)
            # Rebuild the separator without a trailing space before the newline
            line = ' '.join(fields[0:2] + [newtime]) + '\n'
        except ValueError:
            # Not a Lyris-style From_ line; leave it unchanged
            pass
    sys.stdout.write(line)
}}}

Also, there is the bin/cleanarch script in the distribution, which can escape unescaped 'From ' lines in the bodies of messages so that bin/arch can do a better job. This script requires that the legitimate 'From ' separators in the input mailbox have times in the 'Thu Dec 1 15:02:08 2005' format.

Converted from the Mailman FAQ Wizard

This is one of many [[../Frequently Asked Questions|Frequently Asked Questions]].