One regular expression to rule them all (efficiently)?

Hey guys, I've been parsing HTML files to scrape text from them, and every so often I get some really weird characters like à€œ. I determined that it's the "smart quotes" or curly punctuation that is causing all of my problems, so my temporary fix has been to search for each of these characters and replace it with its corresponding HTML code individually. My question: is there a way to use one regular expression (or something else) to search through the string only once and replace what it needs to based on what is there? My solution right now looks like this:

line = line.replaceAll( "“", "&ldquo;" ).replaceAll( "”", "&rdquo;" );
line = line.replaceAll( "–", "&ndash;" ).replaceAll( "—", "&mdash;" );
line = line.replaceAll( "‘", "&lsquo;" ).replaceAll( "’", "&rsquo;" );

For some reason or another, it just seems like there could be a better and possibly more efficient way of doing this. Any input is greatly appreciated.

Thanks,

-Brett


As others have stated, the recommended way to take care of those characters is to configure your encoding settings.

For comparison, here is a method to re-code UTF-8 sequences as HTML entities using regex:

import java.util.regex.*;

public class UTF8Fixer {
    static String fixUTF8Characters(String str) {
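        // Matches 2-, 3-, and 4-byte UTF-8 sequences that were decoded one char
        // per byte (as in ISO-8859-1): a lead byte followed by continuation bytes.
        // Note: windows-1252 maps some bytes in 0x80-0x9F to other characters
        // (e.g. 0x80 becomes U+20AC), and those will not match this pattern.
        Pattern utf8Pattern = Pattern.compile("[\\xC0-\\xDF][\\x80-\\xBF]|[\\xE0-\\xEF][\\x80-\\xBF]{2}|[\\xF0-\\xF7][\\x80-\\xBF]{3}");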

        Matcher utf8Matcher = utf8Pattern.matcher(str);
        StringBuffer buf = new StringBuffer();

        // Search for matches
        while (utf8Matcher.find()) {
            // Decode the character
            String encoded = utf8Matcher.group();
            int codePoint = encoded.codePointAt(0);
            if (codePoint >= 0xF0) {
                codePoint &= 0x07;
            }
            else if (codePoint >= 0xE0) {
                codePoint &= 0x0F;
            }
            else {
                codePoint &= 0x1F;
            }
            for (int i = 1; i < encoded.length(); i++) {
                codePoint = (codePoint << 6) | (encoded.codePointAt(i) & 0x3F);
            }
            // Recode it as an HTML entity
            encoded = String.format("&#%d;", codePoint);
            // Add it to the buffer
            utf8Matcher.appendReplacement(buf,encoded);
        }
        utf8Matcher.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        String subject = "String with \u00E2\u0080\u0092strange\u00E2\u0080\u0093 characters";
        String result = UTF8Fixer.fixUTF8Characters(subject);
        System.out.printf("Subject: %s%n", subject);
        System.out.printf("Result: %s%n", result);
    }
}

Output:

Subject: String with “strange” characters
Result: String with &#8210;strange&#8211; characters


There's a huge thread over here that shows you why it is a bad idea to use regex to parse HTML.

Look for an external library to do this task, such as JSoup. Its website also includes a tutorial that you can use.


Your file appears to be UTF-8 encoded, but you're reading it as though it were in a single-byte encoding like windows-1252. UTF-8 uses three bytes to encode each of those characters, but when you decode it as windows-1252, each byte is treated as a separate character.
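To make that concrete, here is a small sketch that reproduces the effect on purpose: it encodes a left curly quote as UTF-8 and then decodes the bytes as windows-1252. The class and method names are made up for illustration, and it assumes your JRE ships the windows-1252 charset (it is not one of the charsets Java guarantees).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes with another charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrong);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JRE.
        Charset cp1252 = Charset.forName("windows-1252");

        // U+201C (left curly quote) is the three bytes E2 80 9C in UTF-8.
        String garbled = misdecode("\u201C", cp1252);
        System.out.println(garbled.length()); // one character became three

        // For this particular character the mistake can be reversed.
        String repaired = new String(garbled.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println(repaired.equals("\u201C"));
    }
}
```

Reversing the damage like this is not reliable in general, because some byte values have no windows-1252 character assigned, so reading the file with the right encoding in the first place is the safer fix.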

When working with text, you should always specify an encoding if possible; don't let the system use its default encoding. In Java, that means using InputStreamReader and OutputStreamWriter instead of FileReader and FileWriter. Any reasonably good text editor should let you specify an encoding as well.
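A minimal sketch of what that looks like, assuming the input really is UTF-8. The helper name decodeAll is hypothetical; the point is only that the charset is passed to InputStreamReader explicitly rather than left to the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class ExplicitEncodingReader {
    // Reads the whole stream, decoding its bytes with the given charset
    // instead of whatever the platform default happens to be.
    static String decodeAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }
}
```

For a file you would call it with something like decodeAll(new FileInputStream("page.html"), StandardCharsets.UTF_8), where page.html is a placeholder file name.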

As for your actual question, no, Java doesn't have a built-in facility for dynamic replacements (unlike most other regex flavors). But it's not too difficult to write your own, or even better, use one that someone else wrote. I posted one from Elliott Hughes in this answer.
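If you want to roll your own, a single-pass version can be sketched with a lookup table plus Matcher.appendReplacement. This is only an illustration: the class name, the map contents, and the choice of named entities such as &ldquo; are my assumptions about what you want to emit. (On Java 9 and later, Matcher.replaceAll also accepts a function, which shortens the loop.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityReplacer {
    // Lookup table from curly punctuation to an HTML entity (assumed targets).
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("\u201C", "&ldquo;");
        ENTITIES.put("\u201D", "&rdquo;");
        ENTITIES.put("\u2018", "&lsquo;");
        ENTITIES.put("\u2019", "&rsquo;");
        ENTITIES.put("\u2013", "&ndash;");
        ENTITIES.put("\u2014", "&mdash;");
    }

    // One character class covering exactly the keys of the table.
    private static final Pattern CURLY =
            Pattern.compile("[\u201C\u201D\u2018\u2019\u2013\u2014]");

    // Scans the string once and substitutes each match via the table.
    static String replaceCurly(String line) {
        Matcher m = CURLY.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement guards against '$' or '\' in a replacement string.
            m.appendReplacement(out, Matcher.quoteReplacement(ENTITIES.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The pattern is compiled once and the input is scanned once, instead of six full passes with six separate replaceAll() calls.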

One last thing: In your sample code you use replaceAll() to do the replacements, which is overkill and a possible source of bugs. Since you're matching literal text and not regexes, you should be using replace(CharSequence,CharSequence) instead. That way you never have to worry about accidentally including a regex metacharacter and going blooey.
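Here is a tiny illustration of the difference, using a dot as the search text. The class and method names are just for the example.

```java
class ReplaceVsReplaceAll {
    // replace() treats the search text literally: only real dots are replaced.
    static String dotsLiteral(String s) {
        return s.replace(".", "-");
    }

    // replaceAll() compiles the search text as a regex, where "." matches any character.
    static String dotsRegex(String s) {
        return s.replaceAll(".", "-");
    }

    public static void main(String[] args) {
        System.out.println(dotsLiteral("a.b.c")); // a-b-c
        System.out.println(dotsRegex("a.b.c"));   // -----
    }
}
```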


Don't use regular expressions for HTML. Use a real parser.

This will also help you deal with any character encodings you might encounter.
