How to remove the quoted text from an email and only show the new text

I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append the new text to the previous email (even if it's a reply).

Typically, you'll see this:

1st email (start of conversation)

This is the first email

2nd email (reply to first)

This is the second email

Tim said:
This is the first email

The output of this would be "This is the second email" only. Although different email clients quote text differently, if there were some way to get mostly the new email text only, that would also be acceptable.


I use the following regexes to match the lead-in for quoted text (the last one is the one that counts):

  /** general spacers for time and date */
  private static final String spacers = "[\\s,/\\.\\-]";

  /** matches times */
  private static final String timePattern  = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";

  /** matches day of the week */
  private static final String dayPattern   = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";

  /** matches day of the month (number and st, nd, rd, th) */
  private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";

  /** matches months (numeric and text) */
  private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
                                              "|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";

  /** matches years (only 1000's and 2000's, because we are matching emails) */
  private static final String yearPattern  = "(?:[1-2]?[0-9])[0-9][0-9]";

  /** matches a full date */
  private static final String datePattern     = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
                                                "(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
                                                 spacers + "+" + yearPattern;

  /** matches a date and time combo (in either order) */
  private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
                                                "(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";

  /** matches a leading line such as
   * ----Original Message----
   * or simply
   * ------------------------
   */
  private static final String leadInLine    = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";

  /** matches a header line indicating the date */
  private static final String dateLine    = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";

  /** matches a subject or address line */
  private static final String subjectOrAddressLine    = "((?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";

  /** matches gmail style quoted text beginning, i.e.
   * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
   */
  private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";


  /** matches the start of a quoted section of an email */
  private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
                                                                        "(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
                                                                        gmailQuotedTextBeginning + ")"
                                                                      );

I know that in some ways this is overkill (and might be slow!) but it works pretty well. Please let me know if you find anything that doesn't match this so I can improve it!


Check out the Google patent on this: http://www.google.com/patents/US7222299

In summary, they hash portions of the text (presumably something like sentences) and then look for matching hashes in the previous messages. That is very fast, and they probably use it as input to the threading algorithm too. What a great idea!
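A minimal sketch of that idea in Java, not the patented method itself. It assumes line-level granularity (the answer above only guesses at sentences), uses `String.hashCode` on whitespace-normalized lines, and the class and method names are hypothetical. A real implementation would want a stronger hash, since a collision here would silently drop a new line.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashQuoteFilter {

    /** Normalizes a line: drops leading '>' quote markers and collapses whitespace. */
    static String normalize(String line) {
        return line.replaceFirst("^(?:\\s*>)+", "").trim().replaceAll("\\s+", " ");
    }

    /** Collects hashes of all non-blank normalized lines of the earlier messages. */
    static Set<Integer> hashLines(List<String> previousMessages) {
        Set<Integer> hashes = new HashSet<>();
        for (String message : previousMessages) {
            for (String line : message.split("\\R")) {
                String key = normalize(line);
                if (!key.isEmpty()) {
                    hashes.add(key.hashCode());
                }
            }
        }
        return hashes;
    }

    /** Keeps only the lines of the reply whose hash was not seen in earlier messages. */
    static String newTextOnly(String reply, List<String> previousMessages) {
        Set<Integer> seen = hashLines(previousMessages);
        List<String> kept = new ArrayList<>();
        for (String line : reply.split("\\R")) {
            String key = normalize(line);
            // Blank lines are kept so paragraph breaks in the new text survive.
            if (key.isEmpty() || !seen.contains(key.hashCode())) {
                kept.add(line);
            }
        }
        return String.join("\n", kept).trim();
    }
}
```

Note that an attribution line such as "Tim said:" never appeared in the earlier message, so it survives this filter; a lead-in pattern would still be needed to drop it.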


When the previous emails are stored on disk, or are otherwise available, you could check all mails sent by a specific sender to determine which part is the response text.

You could also try to determine the quote character by checking the first character of the last lines. Normally the last lines all start with the same character.

When the last 2 lines start with different characters, you could try the first lines instead, because sometimes the answer is appended at the end of the text.

Once you have detected this character, you could delete the last lines that start with it, until an empty line or a line starting with another character is found.

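Not tested; treat this as a sketch rather than finished code:

    import java.util.List;

    /** Returns the lines without a trailing block that starts with the quote character. */
    static List<String> stripTrailingQuote(List<String> lines) {
        final int minQuotedLines = 2; // threshold, you can decide
        if (lines.size() <= 2 || lines.get(lines.size() - 1).isEmpty()) {
            return lines;
        }
        char startingChar = lines.get(lines.size() - 1).charAt(0);

        // Walk up from the last line while each line starts with the same character;
        // an empty line or a different first character ends the quoted block.
        int firstQuoted = lines.size() - 1;
        while (firstQuoted > 0) {
            String line = lines.get(firstQuoted - 1);
            if (line.isEmpty() || line.charAt(0) != startingChar) {
                break;
            }
            --firstQuoted;
        }

        // Only strip when enough consecutive trailing lines share the character.
        int foundCounter = lines.size() - firstQuoted;
        return foundCounter > minQuotedLines ? lines.subList(0, firstQuoted) : lines;
    }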


The regex works fine, except that it only starts matching at "Subject" and ignores everything that comes before "Subject":

Text
-------- Original Message -------- 
<TABLE border="0" cellpadding="0" cellspacing="0">
  <TBODY>
    <TR>
      <TH align="right" valign="baseline">
      // the matcher starts working from here


From observing Gmail's behavior in this regard, their strategy appears to be:

  1. Write the complete 2nd mail.
  2. Append text like: On [timestamp], [first email sender name] <[first email sender email address]> wrote:
  3. Append the complete first email.
     a. If the email is in plain text, prepend '>' to every line of the first email.
     b. If it's in HTML, Gmail puts the first email's text in a blockquote with a left-side border and margin like:

    border-left: 1px solid #CCC; margin: 0px 0px 0px 0.8ex; padding-left: 1ex;

You can reverse engineer this when parsing emails sent from a Gmail address. I haven't looked into other clients, but they should behave similarly.
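For the plain-text case, reversing those steps could look roughly like the sketch below. It assumes the attribution line sits on a single line that starts with "On" and ends with "wrote:", and that the next line is quoted with '>'. The class and method names are hypothetical, and an attribution line wrapped across two lines would be missed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GmailReplyCutter {

    // Assumed shape: a line "On <anything> wrote:" immediately followed by a '>' line.
    private static final Pattern ATTRIBUTION = Pattern.compile(
            "(?m)^On\\b.*\\bwrote:[ \\t]*\\r?\\n(?=[ \\t]*>)");

    /** Returns the text before the first attribution line, or the whole body if none is found. */
    static String stripQuotedReply(String body) {
        Matcher m = ATTRIBUTION.matcher(body);
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }
}
```

Requiring a '>' line right after the attribution keeps ordinary sentences that happen to start with "On" and end with "wrote:" from being treated as the start of a quote.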


You'll get it almost right with a couple of lines of code:

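    // emailLines: the message body, already split into lines
    StringBuilder newMessage = new StringBuilder();
    for (String line : emailLines) {
      // Skip lines quoted with a leading '>'; keep line breaks for the rest
      if (!line.startsWith(">")) {
        newMessage.append(line).append('\n');
      }
    }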

If necessary, you could add other regex checks for e-mail clients which leave different quoted text signatures.
