Differences between revisions 3 and 4

Stable URLs

There are many reasons to want a message to be addressed by a stable URL, i.e. one that does not change if the message is edited (for the most part), moved to a different archive, or referred to by different access methods. Further, this stable URL should be calculable from minimal information, given access to only a non-list copy of the message. Some use cases for a message's stable URL include:

Archive URLs that survive regeneration, even if messages are deleted from the archive or edited by a list administrator.
The ability to pre-calculate the stable URL for inclusion in a message footer, without having to talk to the archiver (which may be a remote system).
The ability for programmatic access to a message in a message store when all you have is the off-list copy of the message.
The ability of a 3rd party archiver to implement a "mail me this message" function.

Stephen Turnbull starts the conversation off with this thread from the mailman-developers mailing list. I suggest you read the entire thread, but below is my counter proposal.

Message-IDs

RFC 2822 describes the Message-ID header. Everyone assumes that Message-ID is globally unique, so why can't it just be used as the stable URL?

Well, maybe it can, but not in its raw form. It's worth noting that RFC 2822 does not require the header, instead specifying that the header SHOULD be included. The header also SHOULD be globally unique but of course, because this is supplied by the client, it may not actually be unique. The Message-ID header certainly isn't very user-friendly and it is not url-friendly because it may contain characters that would have to be url-encoded.

Jeff Breidenbach of The Mail Archive did some analysis of their very large corpus of messages and makes a convincing argument that Message-ID is unique enough to rely on in the real world. It still suffers from lack of user- and url-friendliness. The following proposal uses Message-ID while supporting these other constraints.

RFC 5064

RFC 5064 (draft) is a specification for the Archived-At header. This is a very interesting proposal which should be honored by Mailman. It specifies where the message url should be included in the list copy, though we'll probably also include it in the message footer (the same url will be used in both places). Section 3.2 describes implementation considerations which mirror the same goals we're trying to achieve here, but the draft suffers from the same problems of user- and url-friendliness we've described here.

Thus, you can consider the following specification as a replacement for section 3.2 in the draft RFC 5064.

Specification

Here then is an informal specification for stable URL generation, one that could be used in a web service to provide messages on demand, or included in message footers without requiring communication with the archive.

Headers

We will use the RFC 5064 Archived-At header to contain the full url to the archived message. We'll also introduce a new header called X-Message-ID-Hash which will contain a user- and url-friendly token calculated from the Message-ID and provided as the last component of the Archived-At header. The X-Message-ID-Hash header is provided as a convenience only and is not required for this algorithm to work.

X-Message-ID-Hash is calculated from the Base 32 encoded SHA 1 hash of the complete Message-ID header (including angle bracket delimiters) of the original message. If the incoming message is missing its Message-ID or Date header, the mailing list manager SHOULD add its own version of the header, with the understanding that the non-list copy of the message will not contain this header. If the incoming message has more than one such header, the mailing list manager MUST use the first header found and MAY delete subsequent such headers. A mail server feeding the mailing list manager MAY reject messages with duplicate or missing Message-ID headers.
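The header selection rules above can be sketched in Python as follows. This is only an illustration, not Mailman's implementation: the function name canonical_message_id is hypothetical, and it assumes msg is an email.message.Message parsed with the standard library's default policy.

```python
from email import message_from_string
from email.utils import make_msgid


def canonical_message_id(msg):
    """Return the Message-ID to hash, normalizing msg in place.

    Hypothetical helper, not part of Mailman.
    """
    ids = msg.get_all('message-id', [])
    if not ids:
        # SHOULD: add our own header. The off-list copy will not have it.
        msg['Message-ID'] = make_msgid()
        return msg['message-id']
    first = ids[0]
    if len(ids) > 1:
        # MUST use the first header found; MAY delete the others.
        del msg['message-id']        # removes every occurrence
        msg['Message-ID'] = first    # put back only the first one
    return first
```

A mail server that prefers to reject such messages outright would never reach this code, so the two approaches can coexist.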

Archived-At is calculated using the following template: <baseurl>/<listname>/<midhash>. So for example, a message for the Mailman Developers mailing list at mail.python.org with the Message-ID value of <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> might have an Archived-At value like the one in the complete example under "Stable URL calculation".

Stable URL calculation

The use of X-Message-ID-Hash, List-Archive, and Archived-At headers provides a unique, stable, easily calculated location for the message. Here's a more complete example of a message as posted through the mailing list.

Subject: An important message
Date: Wed, 04 Jul 2007 16:49:58 +0900
Message-ID: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
X-Message-ID-Hash: AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35
List-Archive: http://mail.python.org/archives/mailman-developers
Archived-At: http://mail.python.org/archives/mailman-developers/AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35
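On the list manager side, producing these headers could look roughly like the sketch below. The function name add_archive_headers is hypothetical, and the code assumes the Message-ID is plain ASCII and that list_archive is the list's base archive url without a trailing slash.

```python
from base64 import b32encode
from email import message_from_string
from hashlib import sha1


def add_archive_headers(msg, list_archive):
    """Add the stable URL headers to msg and return the hash token.

    Hypothetical helper; assumes msg has exactly one Message-ID header.
    """
    # Hash the complete header value, angle brackets included.
    message_id = msg['message-id'].encode('ascii')
    token = b32encode(sha1(message_id).digest()).decode('ascii')
    # Replace any existing copies rather than adding duplicates.
    for name in ('X-Message-ID-Hash', 'List-Archive', 'Archived-At'):
        del msg[name]
    msg['X-Message-ID-Hash'] = token
    msg['List-Archive'] = list_archive
    msg['Archived-At'] = '%s/%s' % (list_archive, token)
    return token
```

Because the token depends only on the Message-ID, the same url can be written into the message footer without contacting the archiver.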

Off-list copy

What if you receive an off-list copy of the message? How could you locate this message in the archive? Let's assume you know that the list's base archive url is at http://mail.python.org/archives/mailman-developers, and let's assume that the off-list copy was well-formed, with one unique Message-ID header. You could calculate the X-Message-ID-Hash easily, for example with the following bit of Python:

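>>> from hashlib import sha1
>>> from base64 import b32encode
>>> base_url = 'http://mail.python.org/archives/mailman-developers'
>>> # msg is the parsed off-list copy (an email.message.Message).
>>> # On Python 3, sha1() needs bytes and b32encode() returns bytes.
>>> message_id = msg['message-id'].encode('ascii')
>>> token = b32encode(sha1(message_id).digest()).decode('ascii')
>>> archive_url = '%s/%s' % (base_url, token)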

Why Base 32?

Base 32 was chosen because of its limited alphabet and because it consists of only ASCII numbers and letters. This makes it easy (if slightly verbose) for humans to read and pronounce. For example, the numbers 0 and 1 are omitted from the alphabet to reduce confusion between them and the letters O and I in some fonts. Base 32 also uses upper case letters only; however, an archive MAY treat urls as case insensitive (accepting any combination of upper and lower case letters). An archive MAY also accept 0 for O and 1 for I in the X-Message-ID-Hash part only. The base32 hash is also completely url-safe, requiring no encoding in web applications.
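As a minimal sketch of the leniency described above, an archive could normalize a user-supplied token before looking it up. The names normalize_token and is_valid_token are hypothetical, and the 32-character length check assumes the token is an unpadded Base 32 encoding of a 20-byte SHA 1 digest.

```python
import re

# Standard Base 32 alphabet: A-Z and 2-7. A SHA 1 digest is 20 bytes
# (160 bits), which Base 32 encodes as exactly 32 characters, unpadded.
_TOKEN_RE = re.compile(r'[A-Z2-7]{32}')


def normalize_token(token):
    """Map a user-typed hash onto the canonical Base 32 alphabet."""
    # Accept any letter case, plus 0 for O and 1 for I.
    return token.upper().replace('0', 'O').replace('1', 'I')


def is_valid_token(token):
    """Return True if token looks like a canonical X-Message-ID-Hash."""
    return _TOKEN_RE.fullmatch(token) is not None
```

The substitution is safe because 0 and 1 never appear in canonical Base 32 output, so it cannot turn one valid token into another.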

Base 64 was rejected because, while it provides minimal space savings, its expanded alphabet and case sensitivity were deemed less adaptable to human mistakes (i.e. "be liberal in what you accept").

Comments

Barry Warsaw

An alternative for the Mail Archive is described here: http://www.mail-archive.com/faq.html#listserver

Brad Knowles

If you're going to generate a hash, it should be generated across all the required RFC-2822 headers (in addition to the others being discussed), including From:, To:, Subject:, and Date:. This would help to guarantee the uniqueness of the hash, even if the message-id were to collide.

Second, although you want to use a hashing algorithm that is considered reasonably secure today (e.g., SHA-256), you also want to explicitly include an extension/alternative mechanism up front, so that in the future, when SHA-256 gets thrown onto the same scrapheap as MD5 and SHA-1, you can easily switch to a replacement.

- Revision 3 as of 2007-10-02 22:13:30
  Size: 8148
  Editor: barry
  Comment:
+ Revision 4 as of 2008-01-13 17:37:04
  Size: 7060
  Editor: barry
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-#pragma page-filename DEV/versions/3047444
+#pragma page-filename DEV/versions/3047445
 Line 9:
-Stephen Turnbull starts the conversation off with [[http://mail.python.org/pipermail/mailman-developers/2007-July/019634.html|this thread]] from the mailman-developers mailing list.  I suggest you read the entire thread, but below is my counter proposal, which the Mailman 3 code base currently implements.
+Stephen Turnbull starts the conversation off with [[http://mail.python.org/pipermail/mailman-developers/2007-July/019634.html|this thread]] from the mailman-developers mailing list.  I suggest you read the entire thread, but below is my counter proposal.
 Line 14:
-There are several problems in practice with relying only on {{{Message-ID}}}.  First, RFC 2822 does not require the header, although messages SHOULD include them.  In a real-world analysis of several python.org mailing lists, some small number of messages (approximately 0.2%) simply had no {{{Message-ID}}} header.  Second, while the recommendation is that these be globally unique, in practice there is a small percentage of collisions (somewhere in the < 1% range) for messages that are otherwise different.  These collisions could be because spam sending agents don't care about RFCs, or because some MTAs are broken on resends.  Of course, because {{{Message-ID}}}'s are generated by untrusted systems, we can't, er, trust them (completely).
+Well, maybe it can, but not in its raw form.  It's worth noting that RFC 2822 does not require the header, instead specifying that the header SHOULD be included.  The header also SHOULD be globally unique but of course, because this is supplied by the client, it may not actually be unique.  The {{{Message-ID}}} header certainly isn't very user-friendly and it is not url-friendly because it may contain characters that would have to be url-encoded.
 Line 16:
-Aside from those problems, the {{{Message-ID}}} is not very user friendly.  I'd rather not require a human to type in all the funky dots, at-signs, and angle brackets.
+Jeff Breidenbach of [[http://mail-archive.com|The Mail Archive]] did some [[http://mail.python.org/pipermail/mailman-developers/2007-August/019708.html|analysis]] of their very large corpus of messages and makes a convincing argument that {{{Message-ID}}} is unique enough to rely on in the real world.  It still suffers from lack of user- and url-friendliness.  The following proposal uses {{{Message-ID}}} while supporting these other constraints.
 Line 18:
-It's tempting to say that we should just ignore any message that has a broken or colliding {{{Message-ID}}} but I think we should instead be liberal in what we accept, and recognize that even legitimate messages may have missing or colliding {{{Message-ID}}}'s, so we should be prepared to handle that case.
+== RFC 5064 ==
[[http://tools.ietf.org/html/rfc5064|RFC 5064]] (draft) is a specification for the {{{Archived-At}}} header.  This is a very interesting proposal which should be honored by Mailman.  It specifies ''where'' the message url should be included in the list copy, though we'll probably also include it in the message footer (the same url will be used in both places).  Section 3.2 describes implementation considerations which mirror the same goals we're trying to achieve here, but the draft suffers from the same problems of user- and url-friendliness we've described here.

Thus, you can consider the following specification as a replacement for section 3.2 in the draft RFC 5064.
-Line 24:
+Line 27:
-Two new headers are proposed, currently called {{{X-List-ID-Hash}}} and {{{X-List-Sequence-Number}}}.  These are named in honor of [[http://www.faqs.org/rfcs/rfc2369.html|RFC 2369]] but named with a leading {{{X-}}} until this proposal becomes an internet standard.  These headers have the following definition:
+We will use the RFC 5064 {{{Archived-At}}} header to contain the full url to the archived message.  We'll also introduce a new header called {{{X-Message-ID-Hash}}} which will contain a user- and url-friendly token calculated from the {{{Message-ID}}} and provided as the last component of the {{{Archived-At}}} header.  The {{{X-Message-ID-Hash}}} header is provided as a convenience only and is not required for this algorithm to work.
-Line 26:
+Line 29:
- * {{{X-List-ID-Hash}}} is calculated from the [[http://www.faqs.org/rfcs/rfc3548.html|Base 32]] encoded [[http://www.faqs.org/rfcs/rfc3174.html|SHA 1]] hash of the {{{Message-ID}}} and {{{Date}}} headers of the original message.  If the incoming message is missing its {{{Message-ID}}} or {{{Date}}} header, the mailing list manager SHOULD add its own version of the header, with the understanding that the non-list copy of the message will not contain this header.  If the incoming message has more than one such header, the mailing list manager MUST use the first header found and MAY delete subsequent such headers.
 * {{{X-List-Sequence-Number}}} is a unique integer assigned by the mailing list software to distinguish between messages which otherwise collide in their {{{X-List-ID-Hash}}} value.  The mailing list manager MAY assign this sequential ID globally across all message, or it MAY keep a separate counter each colliding hash values.  The only requirement is that within a hash value in a particular message store (see below), the sequence number is unique.  A message need not be addressable by its sequence number alone.
+ * {{{X-Message-ID-Hash}}} is calculated from the [[http://www.faqs.org/rfcs/rfc3548.html|Base 32]] encoded [[http://www.faqs.org/rfcs/rfc3174.html|SHA 1]] hash of the complete {{{Message-ID}}} header (including angle bracket delimiters) of the original message.  If the incoming message is missing its {{{Message-ID}}} or {{{Date}}} header, the mailing list manager SHOULD add its own version of the header, with the understanding that the non-list copy of the message will not contain this header.  If the incoming message has more than one such header, the mailing list manager MUST use the first header found and MAY delete subsequent such headers.  A mail server feeding the mailing list manager MAY reject messages with duplicate or missing {{{Message-ID}}} headers.
 * {{{{
} is calculated using the following template: <baseurl>/<listname>/<midhash>.  So for example, a message for the Mailman Developers mailing list at {{{mail.python.org}}} with the {{{Message-ID}}} value of {{{<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>}}} might have an
}}}}
-Line 29:
+Line 35:
-The combination of the {{{X-List-ID-Hash}}} and the {{{X-List-Sequence-Number}}} provides a unique address for the message, ''relative to the message store's base URL''.  To determine the stable, globally unique address, you must also consult RFC 2369's {{{List-Archive}}} header.  Thus, for a message with the following headers:
+The use of {{{X-Message-ID-Hash}}}, {{{List-Archive}}}, and {{{Archived-At}}} headers provides a unique, stable, easily calculated location for the message. Here's a more complete example of a message as posted through the mailing list.
-Line 35:
+Line 41:
-X-List-ID-Hash: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
X-List-Sequence-Number: 801
List-Archive: http://archive.example.com
}}}

the stable URL to the message would be:

{{{
http://archive.example.com/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
+X-Message-ID-Hash: AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35
List-Archive: http://mail.python.org/archives/mailman-developers
Archived-At: http://mail.python.org/archives/mailman-developers/AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35
 Line 47:
-What if you receive an off-list copy of the message?  How could you locate this message in the archive?  Let's assume you know that the archive is kept at {{{http://archive.example.com}}}, and let's assume that the off-list copy was well-formed, with one {{{Message-ID}}} header and one {{{Date}}} header.  You could calculate the {{{X-List-ID-Hash}}} easily, for example with the following bit of Python:
+What if you receive an off-list copy of the message?  How could you locate this message in the archive?  Let's assume you know that the list's base archive url is at {{{http://mail.python.org/archives/mailman-developers}}}, and let's assume that the off-list copy was well-formed, with one unique {{{Message-ID}}} header.  You could calculate the {{{X-Message-ID-Hash}}} easily, for example with the following bit of Python:
 Line 50:
->>> import hashlib
>>> shaobj = hashlib.sha1(msg['message-id'])
>>> shaobj.update(msg['date])
>>> hash32 = base64.b32encode(shaobj.digest())
+>>> from hashlib import sha1
>>> from base64 import b32encode
>>> base_url = 'http://mail.python.org/archives/mailman-developers'
>>> token = b32encode(sha1(msg['message-id']).digest())
>>> archive_url = '%s/%s' % (base_url, token)
-Line 56:
+Line 57:
-Because the off-list copy won't have the {{{X-List-Sequence-Number}}}, the best you can now do is visit this url in your web browser or REST client:

{{{
http://archive.example.com/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
}}}

One of two things will be returned (assuming the message hasn't been deleted from the archive):

 * The mailing list manager has received only one message containing the same {{{Message-ID}}} and {{{Date}}} combination.  IOW, there is no collision, and the URL returns the message.
 * The mailing list manager has received multiple messages with a colliding hash.  In that case, the URL above will return a list of all matching messages.  The user (or program) could then present (or visit) each link to find the original message.  The links would of course include the sequence number tail path component.
What this means is that when there's been only a single message received with a particular {{{Message-ID}}} and {{{Date}}} combination, there are two addresses which point to the same resource, e.g.

{{{
http://archive.example.com/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://archive.example.com/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
}}}

but when multiple matching messages have been received, only the latter address points directly to the message; the former returns a list of all the matching messages (or more correctly, a list of links to all the matching messages).
-Line 76:
+Line 58:
-Base 32 was chosen because of its limited alphabet and because it consists of only ASCII numbers and letters, so it should be easy (if slightly verbose) for humans to read it or pronounce it.  For example, the numbers 0 and 1 are omitted from the alphabet to reduce confusion between them and the letters O and I in some fonts.  Also, base 32 contains upper case letters only, however an archive MAY treat urls as case insensitive (accepting any combination of upper and lower case letters).  An archive MAY also accept 0 for O and 1 for I in the {{{X-List-ID-Hash}}} part only.
+Base 32 was chosen because of its limited alphabet and because it consists of only ASCII numbers and letters.  This makes it easy (if slightly verbose) for humans to read and pronounce.  For example, the numbers 0 and 1 are omitted from the alphabet to reduce confusion between them and the letters O and I in some fonts.  Also, base 32 contains upper case letters only, however an archive MAY treat urls as case insensitive (accepting any combination of upper and lower case letters).  An archive MAY also accept 0 for O and 1 for I in the {{{X-Message-ID-Hash}}} part only.  The base32 hash is also completely url-safe, requiring no encoding in web applications.
-Line 79:
+Line 61:
-=== Tools ===
[[attachment:^scan.py]] is the Python program I've been using to gather statistics from sample mbox files.