
Parsing forwarded emails

I'm writing some code to parse forwarded emails. What I'm not sure about is whether there is a Python library, an RFC I could stick to, or some other resource that would let me automate the task.

To be precise, I don't know if the "layout" of forwarded emails is covered by some standard or recommendation, or if it has just evolved over the years so now most email clients produce similar output for the text part:

    Begin forwarded message: 

    > From: Me <me@me.me>
    > Date: January 30, 2010 18:26:33 PM GMT+02:00
    > To: Other Me <other-me@me.me>
    > Subject: Unwise question

-- and go wild for attachments (and whatever other MIME sections can be there).

If it's still not precise enough I'll clarify it; it's just that I'm not 100% sure what to ask about (RFC, Python lib, convention or something else).


Contrary to what many other people said, there is a standard for forwarded emails: RFC 2046, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", which is more than ten years old. See especially its section 5.2, "Message Media Type".

The basic idea behind RFC 2046 is to encapsulate one message in a MIME part of another, using the (unfortunately named) type message/rfc822. Never forget that MIME is recursive. Python's MIME library can handle it fine.
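As a rough illustration of that encapsulation, here is a minimal sketch using the standard library's email package. The helper names (parse, wrap_as_forward, embedded_messages) are made up for this example, and it assumes messages are parsed with policy.default so that parts are EmailMessage objects.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def parse(raw: bytes) -> EmailMessage:
    """Parse raw bytes; policy.default yields EmailMessage objects."""
    return message_from_bytes(raw, policy=policy.default)


def wrap_as_forward(original: EmailMessage, sender: str, recipient: str) -> EmailMessage:
    """Encapsulate `original` in a new message as a message/rfc822 part."""
    outer = EmailMessage()
    outer["From"] = sender
    outer["To"] = recipient
    outer["Subject"] = "Fwd: " + str(original["Subject"])
    outer.set_content("Forwarded message attached.")
    # Passing a message object makes the new part message/rfc822.
    outer.add_attachment(original)
    return outer


def embedded_messages(msg):
    """Yield every message encapsulated as message/rfc822, at any depth."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # Such a part holds the inner message as its only payload item.
            yield part.get_payload(0)
```

Because walk() descends into the encapsulated message as well, a forward of a forward is found without extra code, which is the recursion mentioned above.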

I did not downvote the other answers because they are right in one respect: the standard is not followed by every mailer. For instance, the mutt mailer can forward a message in RFC 2046 format but also in an ad hoc format. So, in practice, a program probably cannot handle only RFC 2046; it also has to parse the various other, underspecified syntaxes.
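For the ad hoc case, something like the following could pull the header-like lines out of the text layout shown in the question. It is only a sketch: the "Begin forwarded message" marker and the list of field names are assumptions taken from that one sample, other clients use different markers, and parse_inline_forward is a made-up name.

```python
import re

# Matches lines such as "> From: Me <me@me.me>"; quote markers are optional.
HEADER_LINE = re.compile(r"^\s*(?:>\s?)*(From|Date|To|Cc|Subject):\s*(.*)$")


def parse_inline_forward(body: str) -> dict:
    """Collect header-like lines that follow a 'Begin forwarded message' marker."""
    headers = {}
    seen_marker = False
    for line in body.splitlines():
        if not seen_marker:
            # Assumption: the marker text from the sample in the question.
            seen_marker = line.strip().lower().startswith("begin forwarded message")
            continue
        match = HEADER_LINE.match(line)
        if match:
            headers[match.group(1)] = match.group(2).strip()
        elif headers and line.strip(" >"):
            # First non-blank, non-header line after the headers: stop.
            break
    return headers
```

Each additional client format would need its own marker and probably its own field names (localized clients translate "From" and "Subject"), which is why this approach never becomes complete.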


In my experience just about every email client forwards and replies differently. Typically you'll have a plain-text version and an HTML version as MIME parts at the bottom of the mail. Mail headers do have an RFC (RFC 2822, http://www.faqs.org/rfcs/rfc2822.html), but unfortunately the layout of the message body is outside its scope.

Not only do you have to contend with the variance between mail clients, but also with the variance in user preferences. As an example: Lotus Notes puts replies at the top and Thunderbird puts them at the bottom. So when a Thunderbird user is replying to a Lotus Notes user's reply they might insert their reply at the top and leave their signature at the bottom.

Another pitfall may be contending with the word wrapping of reply chains:

    >>>> The outer reply that goes over the limit and is word wrapped by
    the middle replier's mail client\n
    >> The message body of a middle reply
    > Previous reply
    Newest reply
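To make the pitfall concrete, here is a small sketch that assigns a nesting depth to each line by counting leading > markers. The function names are made up for the example, and it assumes quote markers are > optionally followed by one space.

```python
import re

# One or more ">" markers at the start of a line, each optionally followed by a space.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def quote_depth(line: str) -> int:
    """Return how many ">" markers prefix the line (0 for unquoted text)."""
    match = QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def split_by_depth(body: str):
    """Return (depth, text) pairs, with the quote markers stripped from text."""
    return [(quote_depth(line), QUOTE_PREFIX.sub("", line))
            for line in body.splitlines()]
```

Run on the chain above, the wrapped continuation line gets depth 0, exactly like the newest reply, so counting markers alone cannot tell them apart.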

I wouldn't parse the message at all; I'd leave it to the users to parse in their heads. Or I'd borrow the code from another project.


As the other answers already indicate: there is no standard, and your program is not going to be flawless.

You could have a look at the headers, in particular the User-Agent header, to see what kind of client was used, and code specifically for the most common clients.
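A rough sketch of that header check follows. The substring table is an assumption for illustration only (real header values vary by version and platform), and the fallback to X-Mailer is my addition, since a number of clients set that header instead of User-Agent.

```python
from email import message_from_string, policy

# Hypothetical substring -> client label table; extend it from real samples.
KNOWN_CLIENTS = [
    ("thunderbird", "thunderbird"),
    ("apple mail", "apple-mail"),
    ("outlook", "outlook"),
    ("lotus notes", "lotus-notes"),
    ("mutt", "mutt"),
]


def detect_client(msg) -> str:
    """Guess the sending client from User-Agent, falling back to X-Mailer."""
    agent = msg.get("User-Agent") or msg.get("X-Mailer") or ""
    agent = str(agent).lower()
    for needle, label in KNOWN_CLIENTS:
        if needle in agent:
            return label
    return "unknown"


def detect_client_from_text(raw: str) -> str:
    """Convenience wrapper for a message held as a string."""
    return detect_client(message_from_string(raw, policy=policy.default))
```

The returned label could then select a client-specific parser for the body, with a generic fallback for "unknown".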

To find out which clients you should consider supporting, have a look at this popularity study. Various Outlooks, Yahoo!, Hotmail, Mail.app, iPhone mail, Gmail and Lotus Notes rank highly. About 11% of the mail is classified as "undetectable", but using headers from the forwarded e-mail you might be able to do better than that. Note that the statistics were gathered by placing an image inside the e-mail, so results may be skewed.

Another problem is HTML mail, which may or may not include a plain-text version. I'm not sure about clients' usual behaviour in this respect.
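Whatever the clients do, the parsing side can at least prefer the plain-text part and fall back to HTML when there is none. A minimal sketch, assuming the message was created or parsed with policy.default so that get_body() is available (best_text_body is a made-up name):

```python
from email.message import EmailMessage


def best_text_body(msg: EmailMessage):
    """Return (subtype, text) for the preferred body part, or (None, None)."""
    # Prefer text/plain; accept text/html only if no plain part exists.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return None, None
    return part.get_content_subtype(), part.get_content()
```

If the result is "html", the markup still has to be stripped or converted before any of the quote-based heuristics can be applied.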


The convention for a reply or forward is to prepend > to each line, once for every level of nesting. Whether to include who sent the initial e-mail is up to the client to sort out. So what you need to do in Python is simply add > to the start of each line.

    imap Test <imap@gazler.com> Wrote:
    >
    >twice
    >imap Test wrote:
    >> nested
    >>
    >> imap@gazler.com wrote:
    >>> test
    >>>
    >>> -- 
    >>> Message sent via AHEM.
    >>>   
    >>
    >

Attachments simply need to be attached to the message or, as you put it, you can 'go wild'.

I am not familiar with Python, but I believe the code would be:

    # Prefix the first line as well; replace() alone only handles later lines.
    string = ">" + string.replace("\n", "\n>")