How do I write generic code to read an HTML page that may use different encodings?

I'm trying to write code that reads the content of a web page, but I don't know in advance which encoding the page uses. How can I write generic code that returns the correct string, without strange symbols? The encoding might be "UTF-8", "windows-1256", and so on. I've tried UTF-8, but when the page is encoded with windows-1256 I get strange symbols.

Here is the code I'm using:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("SOME-URL");
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8);
string content = streamReader.ReadToEnd();

And here is a link that causes the problem: http://forum.khleeg.com/144828.html


You need to examine the response text and look for this tag:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />

The characters in this tag are decoded correctly whichever encoding you start with, because they are all in the ASCII range. Using the charset named in this tag, create your Encoding object with the GetEncoding method, like this:

var enc1 = Encoding.GetEncoding("windows-1256");
var enc2 = Encoding.GetEncoding(1256);

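Another way is to read the charset from the response headers. Note that the .ContentEncoding property of HttpWebResponse reflects the Content-Encoding header (typically a compression method such as gzip), not the character set, so it is not a suitable argument for Encoding.GetEncoding. Use the .CharacterSet property instead:

HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string charset = response.CharacterSet;
var enc1 = Encoding.GetEncoding(charset);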
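
The CharacterSet lookup can be wrapped with a fallback so that a missing or unrecognized charset does not throw. The following is only a minimal sketch: the names PageReader, ResolveEncoding and Download are hypothetical, and falling back to UTF-8 is an assumption, not something the server guarantees. If the server omits the charset from its Content-Type header, the value reported by CharacterSet may not match the page's real encoding, so the meta tag check is still worth doing.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageReader
{
    // Hypothetical helper: maps a charset name to an Encoding,
    // falling back to UTF-8 (an assumption) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8"
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Thrown when the name is not a supported encoding
            return Encoding.UTF8;
        }
    }

    // Downloads a page and decodes it with the charset from the response headers.
    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling PageReader.Download("SOME-URL") would then replace the hard-coded Encoding.UTF8 in the question's code. On .NET Core and later, code pages such as windows-1256 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).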


The page you mention tells you exactly which encoding it uses. Here is the string found there:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />

Can't you search for a string like this one and act upon this information?
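
One way to act on that tag is to read the raw bytes first, scan the start of the document for a charset declaration, and only then decode. This is a rough sketch under stated assumptions: MetaCharsetSniffer, FindDeclaredCharset and Decode are hypothetical names, the 4096-byte scan window and the UTF-8 fallback are arbitrary choices, and the regular expression is a simplification rather than a full HTML parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MetaCharsetSniffer
{
    // Matches charset=... inside a meta tag,
    // e.g. charset=windows-1256 or charset="utf-8"
    static readonly Regex CharsetPattern = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the charset name declared in the HTML, or null if none is found.
    public static string FindDeclaredCharset(byte[] html)
    {
        // Tag markup is ASCII, so an ASCII decode of the first bytes is enough
        // to find the declaration; non-ASCII bytes simply become '?'.
        int count = Math.Min(html.Length, 4096);
        string head = Encoding.ASCII.GetString(html, 0, count);
        Match match = CharsetPattern.Match(head);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Decodes the bytes with the declared charset, or UTF-8 if none is usable.
    public static string Decode(byte[] html)
    {
        // On .NET Core and later, windows-125x code pages first require:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.UTF8;
        string charset = FindDeclaredCharset(html);
        if (charset != null)
        {
            try { encoding = Encoding.GetEncoding(charset); }
            catch (ArgumentException) { /* unsupported name: keep UTF-8 */ }
        }
        return encoding.GetString(html);
    }
}
```

To use it with the question's code, copy the response stream into a MemoryStream (for example with Stream.CopyTo), call ToArray(), and pass the resulting byte array to MetaCharsetSniffer.Decode instead of wrapping the stream in a StreamReader with a fixed encoding.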
