Regex to extract new content from email body

2023-04-06 02:59

Given a string representing the entire text body of an email, I would like to extract only the part that the sender composed if it is only one contiguous block of text. For example:

Dear Sir:
That is a good point.

On Wednesday, June 1, John wrote:
> Hello world.

Would extract:

Dear Sir:
That is a good point.

By contiguous, I mean that the block may contain single newlines but not consecutive newlines. So this would not match:

Dear Sir:

That is a good point.

On Wednesday, June 1, John wrote:
> Hello world.

By 'the part the sender composed', I mean that the email body may contain replied or forwarded text, or a signature, all of which I want to exclude (let's call it "non-original content"). While there may be lots of variation in the wild, it would be sufficient (for now) to handle just the following cases:

1) a line starting with two dashes (e.g. ----- Forwarded message -----), since signatures also often have two dashes at the beginning of a line

2) a line starting with "On " followed by a line starting with a ">" to catch this kind of format:

On Wednesday, June 1, John wrote:
> Hello world.

If there is nothing (no non-white-space) above a non-original block, then there should be no match.

Finally, keep in mind that there may be any amount of white space at the beginning of the message as well as between the targeted text block and the end of the message or between the targeted text block and the beginning of the non-original content. Also, keep in mind that line endings in email may be just a linefeed (LF) or a CRLF pair.

This is my first attempt, which gets closer than I thought when I started writing this; it uses the s flag:

^\s*(\S[^(?:\n\n|\r\n\r\n)]*\S)\s*(?:$|(?:$|\-\-.*|On [^\n]*\n\>.*))

From my testing so far, it appears to work if the targeted text is just one line, but not if it's more than one line. So the main flaw appears to be in this part:

_______[^(?:\n\n|\r\n\r\n)]*________________________________________

UPDATE: this is the solution I'm using:

'/\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On .+:\n\>.*))/s'

Note that the "On" line may wrap onto multiple lines (e.g. if the date and email address are long), but in general there will be a ":\n>" in there.

In the part you flagged:

[^(?:\n\n|\r\n\r\n)]*

Square brackets mean a character class, and the caret inverts the set of characters to match. So I imagine the regular expression engine is building a character class that doesn't match a (, doesn't match a ?, doesn't match a :, and so on. In other words, the group and alternation syntax has no special meaning inside the brackets.
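To make the character-class point concrete, here is a small Python sketch that runs the fragment from the question on its own. The helper name leading_run is made up for illustration; the pattern is copied verbatim.

```python
import re

# The question's fragment, verbatim. Inside [...] the (?: ... | ... ) syntax
# is not special, so this is a negated set of single characters:
# '(', '?', ':', '|', ')', '\n' and '\r'.
BROKEN_CLASS = re.compile(r'[^(?:\n\n|\r\n\r\n)]*')

def leading_run(text):
    """Return what the fragment matches at the start of text (hypothetical helper)."""
    return BROKEN_CLASS.match(text).group()

print(repr(leading_run("Dear Sir: hello")))     # stops at the colon
print(repr(leading_run("line one\nline two")))  # stops at the first newline
```

Because a single newline is already excluded, the fragment can never span two lines, which would explain why the original attempt only worked for one-line messages.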

Here's a regular expression that I believe does what you want for this part:

((?:[^\r\n]+\r?\n)*)

This means "match anything but a CR or LF, any number but at least one, followed optionally by a CR and then definitely by an LF." Then, when the * repeats it (zero or more times), it won't match two line endings in a row, because each repetition must begin with something other than a line ending. The whole thing is wrapped in parentheses to make a capture group.
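A quick way to check that behaviour is to run the group on the sample message with both LF and CRLF line endings. This is only a sketch; first_block is a hypothetical helper name.

```python
import re

# One or more non-line-ending characters, then an optional CR and an LF,
# repeated. A blank line cannot start a repetition, so the match stops there.
LINES = re.compile(r'((?:[^\r\n]+\r?\n)*)')

def first_block(text):
    """Return the run of non-empty lines at the start of text (hypothetical helper)."""
    return LINES.match(text).group(1)

lf_message = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)
crlf_message = lf_message.replace("\n", "\r\n")

print(repr(first_block(lf_message)))
print(repr(first_block(crlf_message)))
```

Note that because of the *, a message that starts with a blank line yields an empty group rather than no match.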

Now, we need to anchor this so that it comes right where you want it. It looks like you are expecting three anchor cases: end of string, the "On wrote" line, or a signature line ("--\n"). Your regular expression is more complicated than it really needs to be to anchor these three cases; this would do:

(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)

It's longer than yours because I wanted to make sure it wouldn't anchor on actual email message text that happens to start with the word "On" at the beginning of a line.

And you allow any number of blank lines between the match group and the anchor:

(?:\r?\n)*

Put these together:

((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)

I tested these with an actual email message from my inbox, using Python's re module to test the regexp.
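For reference, a test along those lines might look like the sketch below. The sample message is invented (it is not the email from my inbox), and its "On" line is written in the exact date format the strict pattern expects.

```python
import re

# The combined pattern from above, split over three lines for readability.
STRICT = re.compile(
    r'((?:[^\r\n]+\r?\n)*)'   # the block of non-empty lines (group 1)
    r'(?:\r?\n)*'             # any number of blank lines
    r'(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)'
)

# Invented sample message; the date format is an assumption.
sample = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On 06/01/2011 10:15 AM, John wrote:\n"
    "> Hello world.\n"
)

m = STRICT.match(sample)
print(repr(m.group(1)))       # the sender's own text
print(repr(sample[m.end():])) # what is left after the "On ... wrote:" line
```

The sample deliberately has a blank line before the "On" line. Without one, the greedy group appears to run straight through the "On" line and the quoted text, and the pattern then anchors on the end of the string instead.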

NOTE: Actually, now that I think about it, I don't recommend using such a rigorous regexp to match the "On" line. The "On" line is inserted by the email client that the sender was using, and you have no control over it. What if the user's email client inserts 24-hour time instead of AM/PM? (I have even seen French people's email clients insert French text instead of "On", so the whole line wouldn't match at all!) So you might want a looser match pattern for the "On" line, but beware that if it's too loose and an email contains a line that happens to start with "On", you might chop early.

Here's a simple pattern that should work:

On \d[^\n]+\n>

On, followed by a digit and then whatever until end of line, but the next line must start with >. That ought to work, except for the pathological case where an email body has a line starting with "On" and a number and then the very next line starts with the word "From" so the email client inserts a > before "From".
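Here is a small Python sketch of how the looser pattern behaves on a few inputs. The helper name has_reply_header and the sample strings are made up for illustration.

```python
import re

# "On ", a digit, the rest of the line, then a next line starting with ">".
ON_LINE = re.compile(r'On \d[^\n]+\n>')

def has_reply_header(text):
    """True if text contains an "On <digit>..." line followed by a quoted line."""
    return ON_LINE.search(text) is not None

# A numeric date followed by quoted text is detected.
print(has_reply_header("On 06/01/2011 10:15 AM, John wrote:\n> Hello world."))
# Ordinary prose starting with "On 5 ..." is left alone: no ">" on the next line.
print(has_reply_header("On 5 occasions I tried this.\nIt failed."))
# A header that starts with a weekday has no digit after "On ", so it is missed.
print(has_reply_header("On Wednesday, June 1, John wrote:\n> Hello world."))
```

The last case is the format used in the question's own example, so if your mail looks like that you would need to drop the \d, at the cost of a looser match.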

Anyway, putting it all together:

((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d[^\n]+\n>)

EDIT: You asked me to do a quick edit and update it with your final pattern, so here you go:

/\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On [^\n]+\n\>.*))/s
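If you want to try that final pattern outside PHP, the following is a rough Python port, offered as a sketch rather than a drop-in replacement. Two things are my own adaptation: Python's re module has traditionally spelled "very end of string" as \Z rather than \z, and the /s flag becomes re.DOTALL. The function name extract_new_content is hypothetical.

```python
import re

# Port of the final pattern: \z is written \Z here, /s becomes re.DOTALL.
NEW_CONTENT = re.compile(
    r'\A\s*((?:[^\r\n]+\r?(?:\n|\Z))+)\s*(?:\Z|(--.*|On [^\n]+\n>.*))',
    re.DOTALL,
)

def extract_new_content(body):
    """Return the sender's single block of text, or None (hypothetical helper)."""
    m = NEW_CONTENT.match(body)
    return m.group(1).strip() if m else None

reply = (
    "\n\nDear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)
two_blocks = (
    "Dear Sir:\n"
    "\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)

print(repr(extract_new_content(reply)))       # the two-line block
print(repr(extract_new_content(two_blocks)))  # None: two separate blocks
```

As far as I can tell from tracing it, this relies on a blank line separating the new text from the "On" or "--" line; when they are directly adjacent, the greedy group also swallows those lines.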

/^(?!>|On|--)(.*)+/m should match any line not starting with On, > or --
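A sketch of that line-filtering idea in Python follows. One liberty taken: the trailing (.*)+ is simplified to .+, which also skips blank lines. The helper name original_lines is made up.

```python
import re

# Keep lines that do not start with ">", "On" or "--".
# Simplification (an assumption, not the pattern above verbatim): ".+"
# replaces "(.*)+", so empty lines are dropped too.
KEEP_LINE = re.compile(r'^(?!>|On|--).+', re.MULTILINE)

def original_lines(body):
    """Return the lines that survive the filter (hypothetical helper)."""
    return KEEP_LINE.findall(body)

message = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)
print(original_lines(message))
print(original_lines("Once more:\nthanks\n"))
```

The second call shows the cost of the bare "On" test: any line that merely begins with those two letters, such as "Once more:", is dropped as well. This approach also filters line by line, so it does not enforce the "single contiguous block" requirement.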

Using JavaScript .match() this should match all your test cases:

/((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]\>|--)/

Which means: the regex starts (/), followed by any character or line break (.|[\r\n]), one or more times (+), non-greedily (?), followed by one of three terminators ([\r\n][\r\n]|On.+[\r\n]\>|--): two consecutive line-break characters, or 'On' followed by the rest of its line, a line break and '>', or '--'. Then the regex ends (/).

First grouping is the string you are after.
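The same pattern can be exercised in Python to see what the first group captures. This is a sketch, not the jsfiddle demo; first_chunk is a hypothetical helper, and the only change to the pattern is that > is left unescaped.

```python
import re

# Lazy "anything, including line breaks", then one of the three terminators.
JS_STYLE = re.compile(r'((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]>|--)')

def first_chunk(body):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    m = JS_STYLE.search(body)
    return m.group(1) if m else None

message = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
)
print(repr(first_chunk(message)))

# With CRLF line endings, a single "\r\n" already satisfies [\r\n][\r\n],
# so the capture stops after the first line.
print(repr(first_chunk("line one\r\nline two\r\n\r\n")))
```

So for mail with CRLF line endings you would probably want to normalise the line endings first, or replace [\r\n][\r\n] with something like (?:\r?\n){2}.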

See demo here: http://jsfiddle.net/57L5t/
