How to write generic code to read HTML pages encoded with different encodings?
I'm trying to write code to read the content of a web page, but I don't know which encoding the page uses. How can I write generic code that returns the correct string without strange symbols? The encoding might be "UTF-8", "windows-1256", and so on. I've tried UTF-8, but when the page is encoded with windows-1256 I get strange symbols.
Here is the code I'm using:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("SOME-URL");
request.Method = "GET";
WebResponse response = request.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8);
string content = streamReader.ReadToEnd();
And here is a link that causes the problem: http://forum.khleeg.com/144828.html
You must examine the response text to check this field:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
The characters of this tag will be decoded correctly even if you first read the page with the wrong encoding, because they are plain ASCII.
Based on the charset named in this tag, create your Encoding object with the GetEncoding method like this:
var enc1 = Encoding.GetEncoding("windows-1256");
var enc2 = Encoding.GetEncoding(1256);
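The meta-tag approach above could be sketched roughly as follows. This is only a minimal sketch: the HtmlCharset class, its Detect and Decode methods, the 4096-byte prefix, and the regular expression are all illustrative assumptions, not part of any library, and the pattern is a simplification rather than a full HTML parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class HtmlCharset
{
    // Simplified pattern (assumption): matches charset=... inside a <meta> tag,
    // covering both the http-equiv form and the shorter <meta charset="..."> form.
    static readonly Regex CharsetPattern = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the encoding declared in the markup, or the fallback.
    public static Encoding Detect(byte[] html, Encoding fallback)
    {
        // The tag itself is plain ASCII, so a lossy ASCII decode of the
        // first few kilobytes is enough to find it.
        int prefixLength = Math.Min(html.Length, 4096);
        string head = Encoding.ASCII.GetString(html, 0, prefixLength);

        Match match = CharsetPattern.Match(head);
        if (!match.Success) return fallback;

        try
        {
            // On .NET Core / .NET 5+, code pages such as windows-1256 may first need
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            return fallback; // unknown or unsupported charset name
        }
    }

    // Decodes the raw bytes with the declared encoding, assuming UTF-8 otherwise.
    public static string Decode(byte[] html)
    {
        return Detect(html, Encoding.UTF8).GetString(html);
    }
}
```

To use it, read the response stream into a byte array first (for example by copying it into a MemoryStream) instead of wrapping it in a StreamReader, then call HtmlCharset.Decode(bytes).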
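Another way is to read the charset from the HTTP response headers. Use the .CharacterSet property of the HttpWebResponse, which is taken from the charset parameter of the Content-Type header:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string charset = response.CharacterSet;
var enc1 = Encoding.GetEncoding(charset);
Do not use the .ContentEncoding property for this. It reflects the Content-Encoding header (a compression method such as gzip), not the character set, so its value is not a valid argument for Encoding.GetEncoding. If the server sends no charset in the header, CharacterSet is not reliable either, and you should fall back to the meta tag.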
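Putting the header-based approach together with the code from the question might look something like this. It is a sketch under assumptions: PageReader, ResolveCharset, and Download are hypothetical names, and falling back to UTF-8 is just one possible default when the charset is missing or unknown.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageReader
{
    // Hypothetical helper: maps a charset name to an Encoding,
    // or returns the fallback if the name is missing or not recognized.
    public static Encoding ResolveCharset(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Some servers quote the value, so strip surrounding quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    // Downloads a page and decodes it with the charset from the Content-Type header.
    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(
            stream, ResolveCharset(response.CharacterSet, Encoding.UTF8)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Keep in mind that the header can be absent or can disagree with the page's own meta tag, so this works only for servers that report the charset correctly.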
The page you mention does tell you EXACTLY which encoding it uses. Here is the string found there:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
Can't you search for a string like this one and act upon this information?