Character encoding in a web page using Java
How can I find out the character encoding of a web page using Java?
Open a connection to the URL (using URL.openConnection()), and then parse the content type returned by the getContentType() method, which should contain the charset. If the charset is not present in this header, you might have to parse the HTML content and look for a tag such as
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
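Here is a rough sketch of that header-then-meta-tag lookup. The class name CharsetSniffer, its method names, the regular expressions, and the 4096-byte read limit are all my own assumptions for illustration. The regex is a simplification, not a full HTML parser, and readNBytes(int) needs Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of any library.
public class CharsetSniffer {

    // Matches "charset=..." in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s\"';/>]+)", Pattern.CASE_INSENSITIVE);

    // Matches a single <meta ...> tag.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if absent. */
    public static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset declared by the first matching <meta> tag, or null. */
    public static String charsetFromMeta(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String cs = charsetFromContentType(tags.group());
            if (cs != null) {
                return cs;
            }
        }
        return null;
    }

    /** Tries the HTTP header first, then the start of the page body. */
    public static String detect(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        String cs = charsetFromContentType(conn.getContentType());
        if (cs != null) {
            return cs;
        }
        try (InputStream in = conn.getInputStream()) {
            // Assumption: the <meta> tag sits within the first 4096 bytes.
            byte[] head = in.readNBytes(4096);
            // ISO-8859-1 maps every byte to a char, so ASCII markup survives.
            return charsetFromMeta(new String(head, StandardCharsets.ISO_8859_1));
        }
    }
}
```

Calling CharsetSniffer.detect with a page URL returns the declared charset name, or null if neither the header nor a meta tag declares one.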
I believe this does exactly what you need; it has both code and an explanation: http://nadeausoftware.com/node/73
A quick summary is as follows:
Create a WebFile class where:
- The constructor public WebFile( String urlString ) opens a URLConnection and reads in the headers, including the character encoding. If the encoding is not present, then you'll have to read the encoding from the web page itself. If it is not present there either, you could try your luck with a character encoding detection algorithm.
- The method private Object readStream(int length, java.io.InputStream stream) reads the page data from the stream and returns a String using the character encoding, i.e. return new String( bytes, charset ), or returns the byte array created by reading the stream if there is no encoding present or if there's an encoding exception.
- You have getters and setters for the page content (e.g. one invokes readStream just once, another returns the encoding).
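To illustrate the "String if we know the charset, raw bytes otherwise" idea, here is a small sketch. It is not the code from the linked article: the class name PageDecoder and the decode helper are hypothetical, and this readStream takes the charset as a parameter instead of the length argument described above. readAllBytes() needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical class sketching the decode-or-fall-back behaviour.
public class PageDecoder {

    /**
     * Returns a String decoded with the given charset name, or the original
     * byte array if no charset is known or the JVM does not support it.
     */
    public static Object decode(byte[] bytes, String charset) {
        if (charset == null) {
            return bytes; // no encoding information: hand back raw bytes
        }
        try {
            return new String(bytes, charset);
        } catch (UnsupportedEncodingException e) {
            return bytes; // unknown charset name: fall back to raw bytes
        }
    }

    /** Reads the whole stream, then decodes it as described above. */
    public static Object readStream(InputStream stream, String charset)
            throws IOException {
        return decode(stream.readAllBytes(), charset);
    }
}
```

The caller then checks the result with instanceof String to see whether decoding succeeded.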