
Reading EUC encoded HTML using Java on Windows

I am trying to read an HTML file, encoded in EUC-KR, from a URL. When I run the code inside the IDE I get the desired output, but when I build a jar and run it, the data I read is shown as question marks ("????" instead of the Korean characters). I assume the encoding is being lost somewhere.

The meta of the site says the following:

 <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

Here is my code:

  String line;
  URL u = new URL("link to the site");
  InputStream in = u.openConnection().getInputStream();
  BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
  while ((line = r.readLine()) != null) {
    /* send the string to a text area */ // this works fine now
    /* take the string and pass it through a ByteArrayInputStream */ // this is where I believe the encoding is lost

    InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
    Reader reader = new InputStreamReader(xin);
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);

    while (it.isValid()) {
      chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
      //chaps is a arraylist<string>
      it.next();
    }
  }

I would appreciate it if someone could help me figure out how to read the characters without losing the encoding, so that the application runs on any platform regardless of the system's default encoding.

Thanks

PS: When run as a jar, the program reports the system encoding as Cp1252; when run inside the IDE, it reports UTF-8.


InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);

This is a transcoding error. You encode the string as "EUC-KR" and then decode the bytes using the system default encoding (Cp1252 when run from the jar, according to your PS), which results in junk. To avoid this, you would have to pass the same encoding to the InputStreamReader.
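To make the mismatch concrete, here is a minimal sketch that encodes a Korean string as EUC-KR and decodes it twice: once with the matching charset and once with windows-1252, standing in for the Cp1252 default. The class name TranscodingDemo and the helper roundTrip are made up for illustration, and the sample text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class TranscodingDemo {

    // Hypothetical helper: encode text as EUC-KR, then decode the bytes
    // with whatever charset the caller passes in.
    public static String roundTrip(String text, Charset decodeWith) throws IOException {
        byte[] bytes = text.getBytes(Charset.forName("EUC-KR"));
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String korean = "\uD55C\uAD6D\uC5B4"; // three Hangul syllables
        Charset eucKr = Charset.forName("EUC-KR");
        Charset cp1252 = Charset.forName("windows-1252");

        // Matching charsets: the text survives the round trip.
        System.out.println(korean.equals(roundTrip(korean, eucKr)));  // prints true
        // Mismatched charsets: the bytes are misread as Latin characters.
        System.out.println(korean.equals(roundTrip(korean, cp1252))); // prints false
    }
}
```

The one-argument InputStreamReader constructor behaves like the second call whenever the platform default is not EUC-KR, which is why the result changes between the IDE and the jar.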

However, it would be better to avoid all that encoding and decoding and just use a StringReader.
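A rough sketch of the StringReader approach, assuming the HTML has already been decoded into a String (for example by the BufferedReader in the question). The class name StrongTextExtractor and the method extractStrong are made up for illustration. Setting the "IgnoreCharsetDirective" document property is an extra precaution that the question does not mention: without it, the parser may throw a ChangedCharSetException when it meets the page's charset meta tag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class StrongTextExtractor {

    // Hypothetical helper: parse already-decoded HTML and collect the
    // text of every <strong> element.
    public static List<String> extractStrong(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // The text is already decoded, so a charset <meta> tag in the
        // page should not interrupt parsing.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // No bytes are involved here, so the platform default encoding
        // never comes into play.
        kit.read(new StringReader(html), doc, 0);

        List<String> chaps = new ArrayList<>();
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
        while (it.isValid()) {
            int start = it.getStartOffset();
            chaps.add(doc.getText(start, it.getEndOffset() - start).trim());
            it.next();
        }
        return chaps;
    }

    public static void main(String[] args) throws Exception {
        String korean = "\uD55C\uAD6D\uC5B4"; // sample Hangul text
        String html = "<html><body><p>a <strong>" + korean + "</strong> b</p></body></html>";
        // Compare in memory instead of printing the text, so the
        // console's encoding cannot affect the check.
        System.out.println(extractStrong(html).contains(korean));
    }
}
```

Because the String goes straight into the parser, the only place a charset needs to be named is where the bytes are first read from the URL.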
