Parsing forwarded emails

Posted: 2022-12-19 03:19

I'm writing some code to parse forwarded emails. What I'm not sure about is whether there is a Python library, an RFC I could stick to, or some other resource that would let me automate the task.

To be precise, I don't know whether the "layout" of forwarded emails is covered by some standard or recommendation, or whether it has just evolved over the years so that most email clients now produce similar output for the text part:

    Begin forwarded message: 

    > From: Me <me@me.me>
    > Date: January 30, 2010 18:26:33 PM GMT+02:00
    > To: Other Me <other-me@me.me>
    > Subject: Unwise question

-- and go wild for attachments (and whatever other MIME sections can be there).

If that's still not precise enough I'll clarify; it's just that I'm not 100% sure what to ask about (an RFC, a Python library, a convention, or something else).

Unlike what many other people said, there is a standard for forwarded emails: RFC 2046, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", which is more than ten years old. See especially its section 5.2, "Message Media Type".

The basic idea behind RFC 2046 is to encapsulate one message in a MIME part of another, using the (unfortunately named) type message/rfc822. Never forget that MIME is recursive. Python's MIME library (the email package) can handle it fine.
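As a minimal sketch of that idea, the standard library's email package can walk a parsed message and pick out any message/rfc822 parts. The function name find_forwarded_messages and the sample message are made up for illustration, and real mailboxes will contain forwards that do not follow this layout.

```python
from email import message_from_string, policy

# Hypothetical sample: a forward that carries the original as message/rfc822.
RAW = """\
From: Me <me@me.me>
To: Other Me <other-me@me.me>
Subject: Fwd: Unwise question
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain

See the forwarded message.
--BOUNDARY
Content-Type: message/rfc822

From: Someone <someone@example.com>
Subject: Unwise question

Original body.
--BOUNDARY--
"""


def find_forwarded_messages(msg):
    """Return every message encapsulated as a message/rfc822 part."""
    found = []
    # walk() also descends into encapsulated messages, so forwards of
    # forwards are collected too.
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of such a part is a one-element list holding
            # the encapsulated message.
            found.append(part.get_payload(0))
    return found


outer = message_from_string(RAW, policy=policy.default)
for inner in find_forwarded_messages(outer):
    print(inner["From"], "|", inner["Subject"])
```

Each returned object is itself a full message, so the same function can be applied to it again if needed.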

I did not downvote the other answers because they are right in one respect: the standard is not followed by every mailer. For instance, the mutt mailer can forward a message in RFC 2046 format but also in an ad hoc format. So, in practice, a mailer probably cannot handle RFC 2046 alone; it also has to parse the various other, underspecified syntaxes.
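For the ad hoc case, one possible approach is a pair of regular expressions tuned to a single client's wording. The sketch below assumes the "Begin forwarded message:" marker and the "> Field: value" lines shown in the question; the names MARKER, HEADER and parse_inline_forward are hypothetical, and other clients will need different patterns.

```python
import re

# Assumed marker text, taken from the sample in the question.
MARKER = re.compile(r"^[ \t]*Begin forwarded message:[ \t]*$", re.MULTILINE)
# One quoted header line such as "> From: Me <me@me.me>".
HEADER = re.compile(r"^\s*>?\s*(From|Date|To|Cc|Subject):\s*(.*?)\s*$")


def parse_inline_forward(text):
    """Return the quoted header fields after the marker, or None."""
    match = MARKER.search(text)
    if match is None:
        return None
    headers = {}
    for line in text[match.end():].splitlines():
        if not line.strip():
            if headers:
                break  # a blank line after the header block ends it
            continue  # skip blank lines between marker and headers
        field = HEADER.match(line)
        if field is None:
            break  # first non-header line: stop
        headers[field.group(1)] = field.group(2)
    return headers


SAMPLE = """\
Begin forwarded message:

> From: Me <me@me.me>
> Date: January 30, 2010 18:26:33 PM GMT+02:00
> To: Other Me <other-me@me.me>
> Subject: Unwise question
"""
print(parse_inline_forward(SAMPLE))
```

Keeping one such pattern pair per supported client makes it easier to add or drop formats as they turn up.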

In my experience just about every email client forwards and replies differently. Typically you'll have a plain-text version and an HTML version as MIME parts at the bottom of the message. Mail headers do have an RFC (RFC 2822, http://www.faqs.org/rfcs/rfc2822.html), but unfortunately the content of the message body is outside its scope.

Not only do you have to contend with variance between mail clients, but also with variance in user preferences. For example, Lotus Notes puts replies at the top and Thunderbird puts them at the bottom. So when a Thunderbird user replies to a Lotus Notes user's reply, they might insert their reply at the top and leave their signature at the bottom.

Another pitfall may be contending with the word wrapping of reply chains.

    >>>> The outer reply that goes over the limit and is word wrapped by
    the middle replier's mail client\n
    >> The message body of a middle reply
    > Previous reply
    Newest reply

I wouldn't parse the message at all; I'd leave it to the user to parse in their head. Otherwise, I'd borrow the code from another project.
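If you do attempt it, a rough way to see the wrapping problem is to count the leading > markers on each line. This is only a sketch under the assumption that quoting uses plain > prefixes; split_quote_depth and THREAD are names invented for the example.

```python
import re

# Leading run of ">" markers, each optionally followed by one space.
QUOTE_PREFIX = re.compile(r"^(?:> ?)*")


def split_quote_depth(line):
    """Return (depth, text) for one line of a plain-text reply."""
    prefix = QUOTE_PREFIX.match(line).group(0)
    return prefix.count(">"), line[len(prefix):]


THREAD = """\
>>>> The outer reply that goes over the limit and is word wrapped by
the middle replier's mail client
>> The message body of a middle reply
> Previous reply
Newest reply"""

for line in THREAD.splitlines():
    depth, text = split_quote_depth(line)
    print(depth, text)
```

The wrapped second line comes out at depth 0, the same as the newest reply, which is exactly the misattribution described above.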

As the other answers already indicate: there is no standard, and your program is not going to be flawless.

You could have a look at the headers, in particular the User-Agent header, to see what kind of client was used, and code specifically for the most common clients.
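A small sketch of that header check, using the standard email package. Besides User-Agent it also looks at X-Mailer, which some clients send instead; that second header and the name guess_client are additions made for this example, and many messages carry neither header.

```python
from email import message_from_string, policy

# Header names that clients commonly use to identify themselves. Treat the
# list as an assumption: neither is mandatory, and many messages have none.
CLIENT_HEADERS = ("User-Agent", "X-Mailer")


def guess_client(msg):
    """Return the first client-identifying header value found, or None."""
    for name in CLIENT_HEADERS:
        value = msg.get(name)
        if value:
            return str(value)
    return None


# Hypothetical sample messages.
mutt = message_from_string(
    "User-Agent: Mutt/1.5.20\nSubject: hi\n\nbody\n", policy=policy.default
)
unknown = message_from_string("Subject: hi\n\nbody\n", policy=policy.default)
print(guess_client(mutt))
print(guess_client(unknown))
```

The returned string could then be used to choose a client-specific parser, falling back to a generic one when it is None.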

To find out which clients you should consider supporting, have a look at this popularity study. Various Outlooks, Yahoo!, Hotmail, Mail.app, iPhone mail, Gmail and Lotus Notes rank highly. About 11% of the mail is classified as "undetectable", but using headers from the forwarded e-mail you might be able to do better than that. Note that the statistics were gathered by placing an image inside the e-mail, so results may be skewed.

Another problem is HTML mail, which may or may not include a plain-text version. I'm not sure about clients' usual behaviour in this respect.
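One way to cope with both cases is to ask the email package for the plain-text body first and fall back to HTML. This sketch assumes messages are parsed with policy.default, so that get_body() is available; preferred_text_body and the two sample messages are made up for illustration.

```python
from email import message_from_string, policy


def preferred_text_body(msg):
    """Return (subtype, text) for the best body part, or None."""
    # Prefer text/plain; accept text/html when no plain part exists.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return None
    return part.get_content_subtype(), part.get_content()


# Hypothetical samples: one with both versions, one with HTML only.
BOTH = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="ALT"

--ALT
Content-Type: text/plain

Hello
--ALT
Content-Type: text/html

<p>Hello</p>
--ALT--
"""
HTML_ONLY = "MIME-Version: 1.0\nContent-Type: text/html\n\n<p>Hello</p>\n"

for raw in (BOTH, HTML_ONLY):
    msg = message_from_string(raw, policy=policy.default)
    subtype, text = preferred_text_body(msg)
    print(subtype, text.strip())
```

When only HTML comes back, the quoting conventions differ again (blockquote elements rather than > prefixes), so it probably needs its own parser.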

The convention for a reply or forward is to prepend > to each line, once for every level of nesting; details such as noting who sent the original e-mail are up to the client to sort out. So all you need to do in Python is add > to the start of each line.

    imap Test <imap@gazler.com> Wrote:
    >
    >twice
    >imap Test wrote:
    >> nested
    >>
    >> imap@gazler.com wrote:
    >>> test
    >>>
    >>> -- 
    >>> Message sent via AHEM.
    >>>   
    >>
    >

Attachments simply need to be attached to the message, or, as you put it, 'go wild.'

I am not familiar with Python, but I believe the code would be:

    # Prefix every line, including the first, with ">"
    string = "\n".join(">" + line for line in string.split("\n"))
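For the attachment side, a hedged sketch of one option: instead of quoting the text, attach the whole original message, which the standard email package stores as a message/rfc822 part. The addresses and message contents below are invented for the example.

```python
from email.message import EmailMessage

# Hypothetical original message.
original = EmailMessage()
original["From"] = "Me <me@me.me>"
original["To"] = "Other Me <other-me@me.me>"
original["Subject"] = "Unwise question"
original.set_content("Original body.")

# The forward: a short note plus the original as an attachment.
forward = EmailMessage()
forward["From"] = "Other Me <other-me@me.me>"
forward["To"] = "Someone <someone@example.com>"
forward["Subject"] = "Fwd: " + original["Subject"]
forward.set_content("See the attached message.")
# Passing a message object makes the new part message/rfc822 and turns
# the forward into multipart/mixed.
forward.add_attachment(original)

for part in forward.iter_attachments():
    print(part.get_content_type())
```

Any attachments the original carried travel along inside that part, so nothing has to be re-attached by hand.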
