Regex to extract new content from email body
Given a string representing the entire text body of an email, I would like to extract only the part that the sender composed if it is only one contiguous block of text. For example:
Dear Sir:
That is a good point.
On Wednesday, June 1, John wrote:
> Hello world.
Would extract:
Dear Sir:
That is a good point.
By contiguous, I mean that the block may contain single newlines but not consecutive newlines. So this would not match:
Dear Sir:

That is a good point.
On Wednesday, June 1, John wrote:
> Hello world.
By 'the part the sender composed', I mean that the email body may contain replied or forwarded text, or a signature, all of which I want to exclude (let's call it "non-original content"). While there may be lots of variation in the wild, it would be sufficient (for now) to handle just the following cases:
1) a line starting with two dashes (e.g. ----- Forwarded message -----), since signatures also often have two dashes at the beginning of a line
2) a line starting with "On " followed by a line starting with a ">" to catch this kind of format:
On Wednesday, June 1, John wrote:
> Hello world.
If there is nothing (no non-white-space) above a non-original block, then there should be no match.
Finally, keep in mind that there may be any amount of white space at the beginning of the message, between the targeted text block and the end of the message, and between the targeted text block and the beginning of the non-original content. Also, keep in mind that line endings in email may be either a bare linefeed (LF) or a CRLF.
This is my first attempt, which gets closer than I thought when I started writing this; it uses the s flag:
^\s*(\S[^(?:\n\n|\r\n\r\n)]*\S)\s*(?:$|(?:$|\-\-.*|On [^\n]*\n\>.*))
From my testing so far, it appears to work if the targeted text is just one line, but not if it's more than one line. So the main flaw appears to be in this part:
_______[^(?:\n\n|\r\n\r\n)]*________________________________________
UPDATE: this is the solution I'm using:
'/\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On .+:\n\>.*))/s'
Note that the "On" line may wrap onto multiple lines (e.g. if the date and email address are long), but in general there will be a ":\n>" in there.
In the part you flagged:
[^(?:\n\n|\r\n\r\n)]*
Square brackets denote a character class, and the caret negates it. So I imagine the regular expression engine is building a character class that doesn't match a (, doesn't match a ?, doesn't match a :, and so on. The group and alternation syntax inside the brackets is taken as literal characters.
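A quick way to check this is to run the bracketed part on its own. The sketch below uses Python's re module; the helper name leading_run and the sample strings are made up for illustration. It shows the match stopping at the first colon, parenthesis, or single newline, not at a blank line.

```python
import re

# The bracketed part of the first attempt, on its own.
# Inside [...] the "(?:", "|" and ")" are literal characters, not grouping syntax,
# so this is just "any character except ( ? : | ) CR LF", repeated.
negated_class = re.compile(r"[^(?:\n\n|\r\n\r\n)]*")

def leading_run(text):
    """Return the prefix of text that the negated class accepts (hypothetical helper)."""
    return negated_class.match(text).group(0)

print(repr(leading_run("Dear Sir: hello")))  # stops before ':'
print(repr(leading_run("one line\nnext")))   # stops before the first newline
print(repr(leading_run("a (b) c")))          # stops before '('
```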
Here's a regular expression that I believe does what you want for this part:
((?:[^\r\n]+\r?\n)*)
This means "match anything but a CR or LF, any number of times but at least once, followed optionally by a CR and then definitely by an LF." When that repeats via the * (zero or more times), it won't match two line endings in a row, because the pattern begins with anything but a line ending. Then that whole thing is in parens to make a match group.
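As a rough check of that behaviour, here is a small Python sketch. The sample message and the helper name first_block are invented for the example. It runs the group against the same body with LF and with CRLF line endings; in both cases the match should stop at the first blank line.

```python
import re

# One or more non-line-ending characters, then an LF or CRLF, repeated.
block = re.compile(r"((?:[^\r\n]+\r?\n)*)")

def first_block(text):
    """Return the run of non-empty lines at the start of text (hypothetical helper)."""
    return block.match(text).group(1)

lf_body = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "\n"
    "On 06/01/2011 10:15 AM, John wrote:\n"
    "> Hello world.\n"
)
crlf_body = lf_body.replace("\n", "\r\n")

print(repr(first_block(lf_body)))
print(repr(first_block(crlf_body)))
```

Because the group is repeated with *, a body that starts with a blank line yields an empty string rather than a failed match.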
Now, we need to anchor this so that it comes right where you want it. It looks like you are expecting three anchor cases: end of string, the "On wrote" line, or a signature line ("--\n"). Your regular expression is more complicated than it really needs to be to anchor these three cases; this would do:
(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)
It's longer than yours because I wanted to make sure it wouldn't anchor on actual email message text that happens to start with the word "On" at the beginning of a line.
And you allow any number of blank lines between the match group and the anchor:
(?:\r?\n)*
Put these together:
((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n)
I tested these with an actual email message from my inbox, using Python's re module to test the regexp.
NOTE: Actually, now that I think about it, I don't recommend using such a strict regexp to match the "On" line. The "On" line is inserted by the email client that the sender was using, and you have no control over it. What if the user's email client inserts 24-hour time instead of AM/PM? (I have even seen French people's email clients insert French text instead of "On", so the whole line wouldn't match at all!) So you might want a looser match pattern for the "On" line, but beware that if it's too loose and an email contains a line that happens to start with "On", you might chop the text early.
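To make the concern concrete, here is a small Python sketch that runs the strict "On" line from the combined pattern above against three header lines. The header lines are invented examples, not output captured from real mail clients.

```python
import re

# The strict "On" line from the combined pattern above.
strict_on = re.compile(r"On \d\d/\d\d/\d\d\d\d \d\d:\d\d [AP]M, .*wrote:\r?\n")

# Hypothetical reply headers in three styles.
samples = {
    "12-hour": "On 06/01/2011 10:15 AM, John wrote:\n",
    "24-hour": "On 06/01/2011 22:15, John wrote:\n",
    "French": "Le 01/06/2011 22:15, Jean a écrit :\n",
}

# True where the strict pattern recognises the header, False where it misses it.
results = {name: bool(strict_on.match(line)) for name, line in samples.items()}
print(results)
```

Only the 12-hour header is recognised; the other two would be treated as ordinary message text.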
Here's a simple pattern that should work:
On \d[^\n]+\n>
That is: On, followed by a digit and then anything up to the end of the line, but the next line must start with >. That ought to work, except for the pathological case where an email body has a line starting with "On" and a number, and the very next line starts with the word "From", so the email client inserts a > before "From".
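A short Python sketch of how this looser pattern behaves; the sample strings and the helper name has_reply_header are assumptions for illustration. Note that a header beginning with a weekday name, as in the question's example, has no digit right after "On " and so is not recognised.

```python
import re

# "On ", a digit, the rest of the line, then a line starting with ">".
loose_on = re.compile(r"On \d[^\n]+\n>")

def has_reply_header(text):
    """True if the looser "On" pattern is found anywhere in text (hypothetical helper)."""
    return loose_on.search(text) is not None

numeric_date = "On 6/1/2011, John wrote:\n> Hello world.\n"
weekday_date = "On Wednesday, June 1, John wrote:\n> Hello world.\n"
plain_prose = "On 3 occasions this happened.\nIt was fine.\n"

print(has_reply_header(numeric_date))   # digit after "On ", quoted line follows
print(has_reply_header(weekday_date))   # no digit after "On "
print(has_reply_header(plain_prose))    # next line does not start with ">"
```

Since [^\n]+ also consumes a trailing CR, the same pattern should recognise the numeric header when the message uses CRLF line endings.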
Anyway, putting it all together:
((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d[^\n]+\n>)
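Here is a minimal Python sketch of the combined pattern in use. The sample messages and the helper name original_text are invented for illustration. It also surfaces one behaviour worth knowing: because the group is greedy and the end-of-string alternative is allowed, a reply header that follows the text with no blank line in between is swallowed into the group.

```python
import re

combined = re.compile(
    r"((?:[^\r\n]+\r?\n)*)(?:\r?\n)*(?:$|--\r?\n|On \d[^\n]+\n>)"
)

def original_text(body):
    """Return group 1 of the combined pattern, or None (hypothetical helper)."""
    m = combined.match(body)
    return m.group(1) if m else None

reply = "Dear Sir:\nThat is a good point.\n\nOn 6/1/2011, John wrote:\n> Hello world.\n"
signed = "Thanks for the update.\n\n--\nJohn\n"
# Same as `reply`, but with no blank line before the "On" header.
no_gap = "Dear Sir:\nThat is a good point.\nOn 6/1/2011, John wrote:\n> Hello world.\n"

print(repr(original_text(reply)))
print(repr(original_text(signed)))
print(repr(original_text(no_gap)))
```

In the first two cases the group holds only the sender's text; in the third it holds the whole message, so this version relies on a blank line separating the reply from the quoted material.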
EDIT: You asked me to do a quick edit and update it with your final pattern, so here you go:
/\A\s*((?:[^\r\n]+\r?(?:\n|\z))+)\s*(?:\z|(--.*|On [^\n]+\n\>.*))/s
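For anyone trying this outside PCRE, here is a hedged Python port of that final pattern. It assumes that \z can be written as \Z (Python's end-of-string anchor) and that the s modifier corresponds to re.DOTALL; the helper name extract_reply and the sample messages are made up for the example.

```python
import re

# Python port of the final pattern: \z becomes \Z, and the /s modifier
# becomes re.DOTALL. The backslash before ">" is dropped because ">" needs no escape.
final = re.compile(
    r"\A\s*((?:[^\r\n]+\r?(?:\n|\Z))+)\s*(?:\Z|(--.*|On [^\n]+\n>.*))",
    re.DOTALL,
)

def extract_reply(body):
    """Return the single contiguous block the sender wrote, or None (hypothetical helper)."""
    m = final.match(body)
    return m.group(1) if m else None

single_block = (
    "\n  Dear Sir:\nThat is a good point.\n\n"
    "On Wednesday, June 1, John wrote:\n> Hello world.\n"
)
two_blocks = (
    "Dear Sir:\n\nThat is a good point.\n\n"
    "On Wednesday, June 1, John wrote:\n> Hello world.\n"
)

print(repr(extract_reply(single_block)))  # the two-line block
print(repr(extract_reply(two_blocks)))    # None: the text is not one contiguous block
```

The captured group keeps its trailing line ending, so a caller may want to strip it before use.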
/^(?!>|On|--)(.*)+/m should match any line not starting with On, > or --
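The same negative lookahead can be applied one line at a time. The sketch below is a line-filtering variant in Python, not the multiline regex itself; the helper name keep_lines and the sample text are assumptions. It also shows two side effects of filtering on line prefixes alone: text under a signature delimiter is kept, and any ordinary line that happens to begin with "On" (such as "Only ...") is dropped.

```python
import re

# Succeeds (with an empty match) only if the line does not start with ">", "On" or "--".
not_quoted = re.compile(r"(?!>|On|--)")

def keep_lines(body):
    """Return the lines of body that pass the lookahead (hypothetical helper)."""
    return [line for line in body.splitlines() if not_quoted.match(line)]

sample = (
    "Dear Sir:\n"
    "That is a good point.\n"
    "On Wednesday, June 1, John wrote:\n"
    "> Hello world.\n"
    "--\n"
    "John"
)

print(keep_lines(sample))
print(keep_lines("Only one thing to add."))
```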
Using JavaScript .match(), this should match all your test cases:
/((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]\>|--)/
Which means: the regex starts (/), followed by any character or newline ((.|[\r\n])) one or more times (+), non-greedily (?), followed by either two newline characters ([\r\n][\r\n]), or "On", the rest of that line, a newline and a > (On.+[\r\n]\>), or two dashes (--), and then the regex ends (/).
First grouping is the string you are after.
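For comparison with the other answers, here is the same pattern ported to Python as a rough sketch, using re.search as the counterpart of a non-global .match(). The helper name first_group and the sample message are invented. One thing it shows: since [\r\n][\r\n] also matches a single CRLF pair, a message with CRLF line endings is cut at the end of its first line.

```python
import re

# Python port of the JavaScript pattern (the escape before ">" is not needed).
js_style = re.compile(r"((.|[\r\n])+?)([\r\n][\r\n]|On.+[\r\n]>|--)")

def first_group(body):
    """Return the first capture group of the first match, or None (hypothetical helper)."""
    m = js_style.search(body)
    return m.group(1) if m else None

lf_reply = (
    "Dear Sir:\nThat is a good point.\n\n"
    "On Wednesday, June 1, John wrote:\n> Hello world."
)
crlf_reply = lf_reply.replace("\n", "\r\n")

print(repr(first_group(lf_reply)))    # both lines of the reply
print(repr(first_group(crlf_reply)))  # only the first line
```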
See demo here: http://jsfiddle.net/57L5t/