One regular expression to rule them all (efficiently)?

Posted 2023-01-14 07:21

Hey guys, I've been parsing HTML files to scrape text from them, and every so often I get some really weird characters like à€œ. I determined that it's the "smart quotes" or curly punctuation that is causing all of my problems, so my temporary fix has been to search for each of these characters and replace it with its corresponding HTML code individually. My question: is there a way to use one regular expression (or something else) to search through the string only once and replace what it needs to based on what is there? My solution right now looks like this:

line = line.replaceAll( "“", "&#8220;" ).replaceAll( "”", "&#8221;" );
line = line.replaceAll( "–", "&#8211;" ).replaceAll( "—", "&#8212;" );
line = line.replaceAll( "‘", "&#8216;" ).replaceAll( "’", "&#8217;" );

For some reason or another, it just seems like there could be a better and possibly more efficient way of doing this. Any input is greatly appreciated.

Thanks,

-Brett

As stated by others, the recommended way to take care of those characters is to configure your encoding settings.

For comparison, here is a method to re-code UTF-8 sequences as HTML entities using regex:

import java.util.regex.*;

public class UTF8Fixer {
    static String fixUTF8Characters(String str) {
        // Pattern to match most UTF-8 sequences:
        Pattern utf8Pattern = Pattern.compile("[\\xC0-\\xDF][\\x80-\\xBF]{1}|[\\xE0-\\xEF][\\x80-\\xBF]{2}|[\\xF0-\\xF7][\\x80-\\xBF]{3}");

        Matcher utf8Matcher = utf8Pattern.matcher(str);
        StringBuffer buf = new StringBuffer();

        // Search for matches
        while (utf8Matcher.find()) {
            // Decode the character
            String encoded = utf8Matcher.group();
            int codePoint = encoded.codePointAt(0);
            if (codePoint >= 0xF0) {
                codePoint &= 0x07;
            }
            else if (codePoint >= 0xE0) {
                codePoint &= 0x0F;
            }
            else {
                codePoint &= 0x1F;
            }
            for (int i = 1; i < encoded.length(); i++) {
                codePoint = (codePoint << 6) | (encoded.codePointAt(i) & 0x3F);
            }
            // Recode it as an HTML entity
            encoded = String.format("&#%d;", codePoint);
            // Add it to the buffer
            utf8Matcher.appendReplacement(buf,encoded);
        }
        utf8Matcher.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        String subject = "String with \u00E2\u0080\u0092strange\u00E2\u0080\u0093 characters";
        String result = UTF8Fixer.fixUTF8Characters(subject);
        System.out.printf("Subject: %s%n", subject);
        System.out.printf("Result: %s%n", result);
    }
}

Output:

Subject: String with “strange” characters
Result: String with &#8210;strange&#8211; characters

There's a huge thread over here that shows you why it is a bad idea to use regex to parse HTML.

Look for an external library to do this task. One example is jsoup. There's also a tutorial on their website that you can use.

Your file appears to be UTF-8 encoded, but you're reading it as though it were in a single-byte encoding like windows-1252. UTF-8 uses three bytes to encode each of those characters, but when you decode it as windows-1252, each byte is treated as a separate character.
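To make that concrete, here is a small sketch of the misdecoding. The class and method names are made up for illustration, and it assumes your JVM ships the windows-1252 charset (not formally guaranteed, though common JDKs include it). It encodes a left curly quote as UTF-8 and then decodes those three bytes as windows-1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode the bytes as windows-1252.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+201C (left double quotation mark) is the bytes E2 80 9C in UTF-8.
        String garbled = misdecode("\u201C");
        // Each byte became its own character, so one quote turned into three.
        System.out.println(garbled.length()); // prints 3
    }
}
```

The three resulting characters are U+00E2, U+20AC and U+0153, which is the kind of junk you are seeing in your scraped text.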

When working with text, you should always specify an encoding if possible; don't let the system use its default encoding. In Java, that means using InputStreamReader and OutputStreamWriter instead of FileReader and FileWriter. Any reasonably good text editor should let you specify an encoding as well.
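As a rough sketch of what that looks like (the class and method names here are hypothetical, and I'm assuming the input really is UTF-8), you wrap the raw stream in an InputStreamReader and pass the charset explicitly:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ExplicitEncodingDemo {
    // Read everything from the stream, decoding it as UTF-8 rather than
    // relying on the platform default encoding.
    static String readAllAsUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of U+201C, standing in for file contents.
        byte[] utf8Bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0x9C};
        String text = readAllAsUtf8(new ByteArrayInputStream(utf8Bytes));
        System.out.println(text.length()); // prints 1
    }
}
```

For a real file you would pass a FileInputStream instead of the ByteArrayInputStream; the point is only that the charset is named in the code.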

As for your actual question, no, Java doesn't have a built-in facility for dynamic replacements (unlike most other regex flavors). But it's not too difficult to write your own, or even better, use one that someone else wrote. I posted one from Elliott Hughes in this answer.
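To give an idea of what "write your own" can look like, here is a minimal sketch. It is not the Elliott Hughes code mentioned above; the class name, the map and the method name are all made up for illustration. It matches any of the six characters from the question with one character class and looks up the replacement per match, so the string is scanned once:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SmartPunctuationEncoder {
    // Lookup table: the characters from the question and their HTML codes.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("\u201C", "&#8220;"); // left double quote
        ENTITIES.put("\u201D", "&#8221;"); // right double quote
        ENTITIES.put("\u2013", "&#8211;"); // en dash
        ENTITIES.put("\u2014", "&#8212;"); // em dash
        ENTITIES.put("\u2018", "&#8216;"); // left single quote
        ENTITIES.put("\u2019", "&#8217;"); // right single quote
    }

    // One character class covering every key in the table.
    private static final Pattern SMART =
            Pattern.compile("[\u201C\u201D\u2013\u2014\u2018\u2019]");

    // Single pass over the input: each match is replaced by its table entry.
    static String encode(String line) {
        Matcher matcher = SMART.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String entity = ENTITIES.get(matcher.group());
            // quoteReplacement guards against '$' or '\' in a replacement.
            matcher.appendReplacement(out, Matcher.quoteReplacement(entity));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\u201Chi\u201D"));
        // prints &#8220;hi&#8221;
    }
}
```

If you are on Java 9 or later, Matcher.replaceAll also accepts a Function<MatchResult, String>, which shortens the loop to a single call.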

One last thing: In your sample code you use replaceAll() to do the replacements, which is overkill and a possible source of bugs. Since you're matching literal text and not regexes, you should be using replace(CharSequence,CharSequence) instead. That way you never have to worry about accidentally including a regex metacharacter and going blooey.
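A tiny illustration of the difference, using a period as the search text (class and method names are hypothetical):

```java
final class LiteralReplaceDemo {
    // replace() treats its first argument as literal text.
    static String literal(String s) {
        return s.replace(".", ",");
    }

    // replaceAll() treats its first argument as a regex, where '.' matches
    // any character.
    static String asRegex(String s) {
        return s.replaceAll(".", ",");
    }

    public static void main(String[] args) {
        System.out.println(literal("1.5")); // prints 1,5
        System.out.println(asRegex("1.5")); // prints ,,,
    }
}
```

With the curly quotes in your sample both calls happen to behave the same, but replace() stays correct when the search text contains something like a period or a parenthesis.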

Don't use regular expressions for HTML. Use a real parser.

This will also help you get around any character encoding issues you might encounter.
