= Stable URLs =

There are many reasons to want a message to be addressed by a stable URL, i.e. one that does not change if the message is edited (for the most part), moved to a different archive, or referred to by different access methods. Further, this stable URL should be calculable with just minimal information, and with access to only a non-list copy of the message.

Some use cases for a message's stable URL include:

 * Archive URLs that survive regeneration, even if messages are deleted from the archive or edited by a list administrator.
 * The ability to pre-calculate the stable URL for inclusion in headers and footers, without having to talk to the archiver (which may be a remote system).
 * The ability for programmatic access to a message in a message store when all you have is the off-list copy of the message.
 * The ability of a third-party archiver to implement a "mail this message to me" function.

Stephen Turnbull starts the conversation off with [[http://mail.python.org/pipermail/mailman-developers/2007-July/019634.html|this thread]] from the mailman-developers mailing list. I suggest you read the entire thread, but below is my counterproposal.

== Message-IDs ==

[[http://www.faqs.org/rfcs/rfc2822.html|RFC 2822]] describes the `Message-ID` header. Everyone assumes that `Message-ID` is globally unique, so why can't it just be used as the stable URL? Well, maybe it can, but not in its raw form.

It's worth noting that RFC 2822 does not require the header, instead specifying that the header SHOULD be included. The header also SHOULD be globally unique but, because it is supplied by the client, it may not actually be unique. The `Message-ID` header certainly isn't very user-friendly, and it is not url-friendly because it may contain characters that would have to be url-encoded.
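To make the url-friendliness problem concrete, here is a small sketch that percent-encodes a raw `Message-ID` with Python's standard `urllib.parse.quote`. The sample value is invented for illustration and is not taken from a real message.

```python
from urllib.parse import quote

# Hypothetical Message-ID with the angle brackets already removed.
# '+', '/', '=' and '@' are all legal in a Message-ID but awkward in a URL path.
message_id = 'CAF+abc/123=xyz@mail.example.com'

# safe='' asks quote() to percent-encode '/' as well as the other characters.
encoded = quote(message_id, safe='')
print(encoded)  # CAF%2Babc%2F123%3Dxyz%40mail.example.com
```

The encoded form is valid in a URL, but it is longer and harder to read or transcribe than the original, which is the problem a hashed token is meant to avoid.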
Jeff Breidenbach of [[http://mail-archive.com|The Mail Archive]] did some [[http://mail.python.org/pipermail/mailman-developers/2007-August/019708.html|analysis]] of their very large corpus of messages and makes a convincing argument that `Message-ID` is unique enough to rely on in the real world. It still suffers from a lack of user- and url-friendliness. The following proposal uses `Message-ID` while addressing these other constraints.

== RFC 5064 ==

[[http://tools.ietf.org/html/rfc5064|RFC 5064]] (draft) is a specification for the `Archived-At` header. This is a very interesting proposal which will be honored by Mailman. It specifies ''where'' the message url should be included in the list copy, though we'll probably also include it in the message footer (the same url will be used in both places). Section 3.2 describes implementation considerations which mirror the goals we're trying to achieve here, but the draft suffers from the same problems of user- and url-friendliness described above. Thus, you can consider the following specification as a replacement for section 3.2 in the draft RFC 5064.

== Primary Specification ==

Here then is an informal specification for stable URL generation, such that it could be used in a web service to provide messages on demand, or included in message footers without requiring communication with the archive.

=== Headers ===

We will use the RFC 5064 `Archived-At` header to contain the full url to the archived message. We'll also introduce a new header called `Message-ID-Hash`[1] which will contain a user- and url-friendly token calculated from the `Message-ID` and provided as the last component of the `Archived-At` header. The `Message-ID-Hash` header is provided as a convenience only and is not required for this algorithm to work.

 * `Message-ID-Hash` is calculated from the [[http://www.faqs.org/rfcs/rfc3548.html|Base 32]] encoded [[http://www.faqs.org/rfcs/rfc3174.html|SHA 1]] hash of the `Message-ID` header.
   As with RFC 2822, the angle bracket delimiters are '''not''' considered part of the Message-ID and MUST NOT contribute to the hash[2].
 * If the incoming message is missing its `Message-ID` header, the mailing list manager SHOULD add its own version of the header, with the understanding that the non-list copy of the message will not contain this header.
 * If the incoming message's `Message-ID` is malformed, say because it is missing one or both angle bracket delimiters, the mailing list manager MUST use the entire `Message-ID` verbatim, without removing any delimiters which are present[3].
 * If the incoming message has more than one `Message-ID` header, the mailing list manager MUST use the first header found and MAY delete subsequent such headers. A mail server feeding the mailing list manager MAY reject messages with duplicate or missing `Message-ID` headers.
 * The mailing list manager MUST strip any existing `Message-ID-Hash` header (and `X-Message-ID-Hash` header) from the original message.
 * `Archived-At` is calculated using the following template: '''base-url'''/'''hash''', where '''base-url''' is the base url to the archiving service. The mailing list manager MAY use the value given in the RFC 2369 `List-Archive` header. The '''hash''' is the value given in the `Message-ID-Hash` header.

So for example, a message for the Mailman Developers mailing list at `mail.python.org` with the `Message-ID` value of `<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>` might have a `List-Archive` header of `http://mail.python.org/archives/mailman-developers`, and an `Archived-At` header value of `http://mail.python.org/archives/mailman-developers/JJIGKPKB6CVDX6B2CUG4IHAJRIQIOUTP`.

[1] Previously, this specification proposed `X-Message-ID-Hash` as the header name, but this has since been modified to remove the `X-` prefix.

[2] As long as both the leading and trailing angle brackets are present.
[3] Previously, this specification required repair of the malformed `Message-ID` header, but this isn't what was actually implemented, and such repair is problematic anyway. Now, both delimiters must be present for them to be removed; otherwise the `Message-ID` MUST be used verbatim.

=== Stable URL calculation ===

The use of `Message-ID-Hash` and `Archived-At` headers provides a unique, stable, easily calculated location for the message. Here's a more complete example of a message as posted through the mailing list.

{{{
Subject: An important message
Date: Wed, 04 Jul 2007 16:49:58 +0900
Message-ID: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID-Hash: JJIGKPKB6CVDX6B2CUG4IHAJRIQIOUTP
List-Archive: http://mail.python.org/archives/mailman-developers
Archived-At: http://mail.python.org/archives/mailman-developers/JJIGKPKB6CVDX6B2CUG4IHAJRIQIOUTP
}}}

=== Off-list copy ===

What if you receive an off-list copy of the message? How could you locate this message in the archive? Let's assume you know that the list's base archive url is `http://mail.python.org/archives/mailman-developers`, and let's assume that the off-list copy is well-formed, with one unique, compliant `Message-ID` header. You could calculate the `Message-ID-Hash` easily, for example with the following bit of Python:

{{{
>>> from hashlib import sha1
>>> from base64 import b32encode
>>> base_url = 'http://mail.python.org/archives/mailman-developers'
>>> # msg is the parsed off-list copy, e.g. an email.message.Message.
>>> mid = msg['message-id']
>>> # Strip the angle brackets only if both are present.
>>> if mid.startswith('<') and mid.endswith('>'):
...     mid = mid[1:-1]
...
>>> # Python 3: hash the encoded bytes, then decode the Base 32 token to str.
>>> token = b32encode(sha1(mid.encode('utf-8')).digest()).decode('ascii')
>>> archive_url = '%s/%s' % (base_url, token)
}}}

=== Why Base 32? ===

Base 32 was chosen because of its limited alphabet and because it consists of only ASCII numbers and letters. This makes it easy (if slightly verbose) for humans to read and pronounce.
For example, the numbers 0 and 1 are omitted from the alphabet to reduce confusion between them and the letters O and I in some fonts. Also, base 32 contains upper case letters only; however, an archive MAY treat urls as case insensitive (accepting any combination of upper and lower case letters). An archive MAY also accept 0 for O and 1 for I in the `Message-ID-Hash` part only (note: we could take it further and go to [[http://www.red-bean.com/kfogel/numbersandletters.txt|base 22]], to tolerate a number of common mis-transcriptions). Base 32 is also completely url-safe, requiring no encoding in web applications.

Base 64 was rejected because, while providing minimal space savings, its expanded alphabet and case sensitivity were deemed to be less tolerant of human mistakes (i.e. "be liberal in what you accept").

=== Security ===

There is a possibility of denial-of-service attacks against the archive if the `Message-ID` is guessable. An attacker could inject a future `Message-ID` into the system, which would cause a `Message-ID-Hash` collision, thus causing future legitimate messages to get discarded from the archive. This is defended against by relying on good, random `Message-ID` calculation in the sender's mail system. Poorly generated `Message-ID` headers can have other adverse effects on messages from that sender, so this does not add any additional burden or vulnerability.

When the messages are archived locally to the mail server, there is a high degree of trust between the two. In this architecture, it should be nearly impossible to bypass the list server and inject messages directly into the archive.

When the archiver is remote from the list server though, it will be possible to inject attacks directly into the archiver. Because none of the RFC 2822 headers are reliable, the archiver will have to use other means to verify the message. It could use [[http://www.dkim.org|DKIM]] headers (though Mailman itself will never sign headers).
If archive links are easily guessed, then DoS attacks are made easier.

== Alternate Specification ==

=== Rationale ===

The primary specification is oriented around the idea of universally unique identifiers (UUIDs). This approach has several powerful advantages. However, Steve Huston points out that the concept of [[http://www.mail-archive.com/mailman-developers@python.org/msg10397.html|sequence numbers]] has usability benefits; people are very good at remembering and reasoning about decimal numbers up to about [[http://www.psypress.com/pip/resources/slp/popup.asp?popup=ch09-rs-01|seven digits]]. Also, sequence numbers are already in widespread use by archiving software such as pipermail and [[http://www.mhonarc.org|Mhonarc]]. This specification is designed to allow the mailing list manager to operate tightly with, and synchronize, multiple sequence-number-based archiving systems without creating a redundant UUID namespace.

Sequence numbers have the disadvantage that they are less stable in the face of archive regeneration or move. The sequence number is maintained by the mailing list manager, so if a message is deleted and the archive is regenerated, there may be no way to reproduce the sequence number. Also, the sequence number cannot be calculated from off-list message content.

=== Headers ===

 * `X-List-Sequence` is a simple message counter. This header was [[http://www.nisto.com/listspec/header-fields.html|considered]] for RFC 2369 but not included; it will be promoted to `List-Sequence` if this specification ever gets standards tracked by the IETF. Compliant mailing list managers MUST strip any such headers from inbound messages and insert the header on outbound messages.
 * `Archived-At` represents an out of band agreement between the mailing list manager and a particular archiving agent.
   The mailing list manager is expected to construct a URL solely from information present in headers. The corresponding archiving agent is expected to fulfill that obligation upon receipt of the message. The sequence number SHOULD be included in the URL. The archiving agent SHOULD be able to fulfill its obligation without reading the `Archived-At` header.

In this example, there are three `Archived-At` headers included in outbound messages. Each is handled by a different archiving agent. They are composed of fixed strings, the sequence number, the contents of the `Date` header, the contents of the `List-Post` header, and minor transformations like zero padding.

{{{
X-List-Sequence: 72
Archived-At:
Archived-At:
Archived-At:
}}}

=== Discussion ===

There are several best practices associated with this specification. For example, it is considered poor form for the mailing list manager to make promises that the archiving agent cannot fulfill. At the very least this is guaranteed to create broken URLs, and in the worst case it can prevent archiving entirely. One easy way to get in trouble is for the mailing list manager to reset sequence numbers. This might happen when mailing list manager software is replaced, upgraded, or reconfigured. For this reason, arbitrary communication between the mailing list manager and the archiving agent is permitted to determine the initial value of the sequence number. At no other time is two-way communication encouraged.

Compliant archiving agents SHOULD respect `X-List-Sequence` even if there are no relevant `Archived-At` headers. For lists with multiple archives, this will synchronize sequence numbers, e.g. "message 72" will have the same content across multiple archives. In general, archiving agents are not expected to parse the `Archived-At` header; instead they fulfill their out-of-band agreement with the mailing list manager by dead reckoning.

=== Security ===

Sequence numbers are more prone to denial-of-service attacks on the archiver.
If messages can be injected into the archiver that do not come from the mailing list manager, future `X-List-Sequence` headers can be easily guessed, and the archiver could be easily populated by forgeries.

== Open issues ==

Mailman's message scrubber also uses the archiver in a way not handled by the current specification. When a message reaches Mailman containing attachments, these attachments can be stripped and stored in the archive. The list copy of the message is then altered to contain just a url to the attachment as it appears in the archive. This is called ''scrubbing'' the message. How can we handle scrubbed attachments?

The primary specification can lead to unusably long urls. For example, the [[http://www.mail-archive.com|Mail-Archive]] requires the use of the list name in the url. This can lead to urls such as [[http://www.mail-archive.com/mailman-developers@python.org/AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35|http://www.mail-archive.com/mailman-developers@python.org/AGDWSNXXKCWEILKKNYTBOHRDQGOX3Y35]] which, at 90 characters, no sane person would want to see in a message body footer. Even switching to a short domain name, you'd still only get the header down to 70-some-odd characters, which is still a lot. Omitting the list name gets you into a more reasonable 40-50 character range, but requires a mapping to get from message hash to mailing list and has problems when messages with the same `Message-ID` are cross-posted. Choosing Base64 instead of Base32 drops you down some 5 characters, at the ''possible'' cost of much worse readability (collisions between 0/1 and O/I, and upper/lower-case confusion). This readability claim has not been verified by user testing, and may be specious.
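The length figures in the last open issue can be checked with a short sketch. It hashes the example `Message-ID` from the primary specification and compares the Base 32 token with a Base 64 one. The Base 64 variant used here (url-safe alphabet, padding stripped) is an assumption made for the comparison; this page does not say which Base 64 form was considered.

```python
from base64 import b32encode, urlsafe_b64encode
from hashlib import sha1

# Example Message-ID from the primary specification, angle brackets removed.
message_id = '87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp'
digest = sha1(message_id.encode('utf-8')).digest()  # SHA-1 digests are 20 bytes

# 20 bytes is 160 bits, which Base 32 encodes as 32 characters with no padding.
b32_token = b32encode(digest).decode('ascii')

# Assumption: url-safe Base 64 with the trailing '=' padding stripped.
b64_token = urlsafe_b64encode(digest).decode('ascii').rstrip('=')

# Base url of the Mail-Archive example, including the list name.
base_url = 'http://www.mail-archive.com/mailman-developers@python.org'
b32_url = '%s/%s' % (base_url, b32_token)

print(len(b32_token), len(b64_token))  # 32 27
print(len(b32_url))                    # 90
```

Under these assumptions Base 64 saves five characters per url, which matches the "some 5 characters" estimate, while most of the remaining length comes from the base url rather than from the token.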