Regex and ISO-8859-1 charset in Java

I have some text encoded in ISO-8859-1, from which I then extract some data using regex.

The problem is that the strings I get from the matcher object are in the wrong format, with characters like "ÅÄÖ" scrambled.

How do I stop the regex library from scrambling my chars?

Edit: Here's some code:

private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];

    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }

    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff


This is probably the immediate cause of your problem, and it's definitely an error:

builder.append(new String(tmp), 0, read-1);

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
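As a rough illustration of reading the charset name from a response header, here is a minimal sketch that parses a Content-Type value such as "text/html; charset=ISO-8859-1" using only the standard library. The class name ContentTypeCharset and the helper charsetFromContentType are hypothetical names made up for this example, not part of HttpClient; how you obtain the header string from your HttpResponse depends on the client version you use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value, returning the fallback if it is missing or unsupported.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFromContentType("text/html; charset=ISO-8859-1",
                                            StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

The fallback parameter is a design choice for this sketch: servers do not always declare a charset, so the caller decides what to assume in that case.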

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
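One way the InputStreamReader approach might look is sketched below. It assumes the charset is already known (ISO-8859-1 here); the class name ResponseDecoder and the method readBody are illustrative stand-ins for the getResponseBody method in the question. The reader decodes the byte stream as a whole, so a multi-byte character split across two chunks is still decoded correctly, and the loop appends exactly the number of chars read rather than read-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseDecoder {

    // Reads the whole stream as text, decoding with the given charset.
    static String readBody(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (Reader reader = new InputStreamReader(input, charset)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Append only the chars actually read in this pass.
                builder.append(buffer, 0, read);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // The bytes 0xC5, 0xC4, 0xD6 are "ÅÄÖ" in ISO-8859-1.
        byte[] latin1 = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};
        String text = readBody(new ByteArrayInputStream(latin1),
                               StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("\u00C5\u00C4\u00D6")); // prints "true"
    }
}
```

In the question's code, the stream passed in would be response.getEntity().getContent().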

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
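For comparison, here is a small sketch of the Charset-accepting overloads of the String constructor and String.getBytes(). The class name ExplicitCharset and its two helper methods are made up for this illustration; the overloads themselves are part of java.lang.String.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {

    // Decodes bytes with an explicit charset instead of the platform default.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes a string with an explicit charset instead of the platform default.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeLatin1("\u00C5\u00C4\u00D6"); // "ÅÄÖ"
        System.out.println(bytes.length);                  // prints 3
        System.out.println(decodeLatin1(bytes).equals("\u00C5\u00C4\u00D6")); // prints "true"
    }
}
```

With the charset spelled out, the result no longer depends on the machine the code happens to run on.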


It's html from a website.

Use an HTML parser, and this problem and all potential future problems will disappear.

I can recommend picking Jsoup for the job.

See also:

  • Regular Expressions - Now you have two problems
  • Parsing HTML - The Cthulhu way
  • Pros and cons of HTML parsers in Java