Regex and ISO-8859-1 charset in Java

2023-01-10 20:09 Q&A author:

I have some text encoded in ISO-8859-1, from which I then extract some data using regex.

The problem is that the strings I get from the matcher object are in the wrong encoding, scrambling chars like "ÅÄÖ".

How do I stop the regex library from scrambling my chars?

Edit: Here's some code:

private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];

    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }

    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff

This is probably the immediate cause of your problem, and it's definitely an error:

builder.append(new String(tmp), 0, read-1);

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
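As a rough sketch of reading the charset name from the headers, assuming you already have the raw Content-Type header value as a plain string: the class name ContentTypeCharset, the helper charsetOf, and the choice of fallback are all hypothetical, made up for illustration and not part of any HTTP library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: pulls the charset parameter out of a Content-Type
    // value such as "text/html; charset=ISO-8859-1". Returns the fallback
    // when the parameter is missing or names a charset the JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name()); // prints ISO-8859-1
    }
}
```

The resulting Charset can then be passed to whatever does the decoding instead of relying on the platform default.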

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
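A minimal sketch of the InputStreamReader approach, assuming the charset is already known (ISO-8859-1 here). The class name ReadBody and the method readAll are made up for the example, and a ByteArrayInputStream stands in for the HTTP response stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReadBody {
    // Decodes the whole stream with the given charset. The reader keeps any
    // incomplete multi-byte sequence internally until the rest arrives, so a
    // character split across two reads is not corrupted.
    static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader reader = new InputStreamReader(input, charset)) {
            int read;
            while ((read = reader.read(buf)) != -1) {
                // Append exactly the chars that were read, no more, no fewer
                builder.append(buf, 0, read);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes for "ÅÄÖ" so the example does not depend on the
        // encoding of the source file itself.
        String expected = "\u00C5\u00C4\u00D6";
        byte[] latin1 = expected.getBytes(StandardCharsets.ISO_8859_1);
        String text = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(expected)); // prints true
    }
}
```

In the question's code, something like this would take the place of getResponseBody, with response.getEntity().getContent() as the input stream.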

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
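To make the difference concrete, here is a small sketch using the overloads that do take a Charset. The class and method names (ExplicitCharset, toLatin1, fromLatin1) are hypothetical, and UTF-8 is only an assumed example of a mismatched encoding, not necessarily what your platform default is.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Encode with an explicit charset instead of the platform default
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decode with an explicit charset instead of the platform default
    static String fromLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00C5\u00C4\u00D6"; // "ÅÄÖ" written as Unicode escapes
        byte[] bytes = toLatin1(original);      // one byte per character

        String right = fromLatin1(bytes);
        // Same bytes decoded with a mismatched charset (UTF-8 as an example)
        String wrong = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(right.equals(original)); // prints true
        System.out.println(wrong.equals(original)); // prints false
    }
}
```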

It's html from a website.

Use an HTML parser and this problem and all future potential problems will disappear.

I can recommend picking Jsoup for the job.
