C# WebClient - Getting question-mark-inside-a-square characters instead of øæå when downloading a page
I'm using WebClient to download a webpage from a Norwegian website. In the downloaded data, all special characters (øæå) are missing and have been replaced by a question-mark-type character.
I used to have this issue on my own webpage before I added a "" in my HTML file; this is present here.
If I open a browser and browse to the address everything looks fine.
I have used Fiddler to see exactly which headers I need to send, and I am sure I'm sending exactly the same ones as my browser.
So by deduction I believe that WebClient is the offender and somehow cripples the data before returning it to me, and I'm not sure how to stop it from doing this.
For more information this is my code to get the webpage:
string result = string.Empty;
using (WebClient client = new WebClient())
{
    client.Headers["Accept"] = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
    client.Headers["Referer"] = "http://mywebsite.no/forum/viewforum.php?f=7";
    client.Headers["Accept-Language"] = "nb-NO";
    client.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; AskTbFXTV5/5.9.1.14019)";
    client.Headers["Accept-Encoding"] = "gzip, deflate";
    using (Stream stream = client.OpenRead(new Uri(textBox1.Text)))
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            result = reader.ReadToEnd();
        }
    }
}
Any tips?
As others have said, you might not have set the correct encoding. See how to detect encoding of the response body which shows how to guess the encoding from the response headers or the HTML META tag in the response body.
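As a rough illustration of the header-based approach, the sketch below pulls the charset parameter out of a Content-Type value and maps it to an Encoding. The class and method names (CharsetFromHeader, Parse) are made up for this example, and the fallback encoding is an assumption you would choose yourself.

```csharp
using System;
using System.Text;

static class CharsetFromHeader
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // value, or the fallback when the parameter is missing or unknown.
    public static Encoding Parse(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType)) return fallback;

        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Strip optional quotes around the charset name.
                string name = p.Substring("charset=".Length).Trim(' ', '"', '\'');
                try { return Encoding.GetEncoding(name); }
                catch (ArgumentException) { return fallback; }
            }
        }
        return fallback;
    }
}
```

With WebClient, the header value should be available as client.ResponseHeaders["Content-Type"] once OpenRead has returned, so something like CharsetFromHeader.Parse(client.ResponseHeaders["Content-Type"], Encoding.UTF8) could pick the encoding to hand to the StreamReader.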
Have you tried setting the encoding on the response?
string result = string.Empty;
using (WebClient client = new WebClient())
{
    client.Headers["Accept"] = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
    client.Headers["Referer"] = "http://mywebsite.no/forum/viewforum.php?f=7";
    client.Headers["Accept-Language"] = "nb-NO";
    client.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; AskTbFXTV5/5.9.1.14019)";
    client.Headers["Accept-Encoding"] = "gzip, deflate";
    using (Stream stream = client.OpenRead(new Uri("")))
    {
        // Read the raw bytes first, then decode them with an explicit encoding.
        byte[] resultBytes = StreamUtil.ReadToEnd(stream);
        // Replace UTF8 with the page's actual encoding if it differs.
        result = System.Text.Encoding.UTF8.GetString(resultBytes);
    }
}
internal class StreamUtil
{
    internal static byte[] ReadToEnd(System.IO.Stream stream)
    {
        byte[] readBuffer = new byte[4096];
        int totalBytesRead = 0;
        int bytesRead;
        while ((bytesRead = stream.Read(readBuffer, totalBytesRead, readBuffer.Length - totalBytesRead)) > 0)
        {
            totalBytesRead += bytesRead;
            if (totalBytesRead == readBuffer.Length)
            {
                int nextByte = stream.ReadByte();
                if (nextByte != -1)
                {
                    byte[] temp = new byte[readBuffer.Length * 2];
                    Buffer.BlockCopy(readBuffer, 0, temp, 0, readBuffer.Length);
                    Buffer.SetByte(temp, totalBytesRead, (byte)nextByte);
                    readBuffer = temp;
                    totalBytesRead++;
                }
            }
        }
        byte[] buffer = readBuffer;
        if (readBuffer.Length != totalBytesRead)
        {
            buffer = new byte[totalBytesRead];
            Buffer.BlockCopy(readBuffer, 0, buffer, 0, totalBytesRead);
        }
        return buffer;
    }
}
Try using a StreamReader constructor that specifies the encoding.
http://msdn.microsoft.com/en-us/library/ms143456.aspx http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
To figure out the encoding of the page, in Firefox you can right-click and select View Page Info. The encoding should be listed there.
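A minimal sketch of the StreamReader suggestion, assuming the page turns out to be ISO-8859-1 (that encoding is only a guess for illustration; use whatever the page info reports). ExplicitEncodingReader and ReadAll are hypothetical names.

```csharp
using System.IO;
using System.Text;

static class ExplicitEncodingReader
{
    // Reads the whole stream as text using the given encoding instead of
    // StreamReader's default (UTF-8).
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code this would replace the inner StreamReader block, for example result = ExplicitEncodingReader.ReadAll(stream, Encoding.GetEncoding("iso-8859-1"));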
There are two likely reasons:
- You are not using the correct encoding for the StreamReader.
- You are displaying the result using a font that doesn't support the characters.
If you know what the encoding is, and know that it will stay the same, you can just provide the encoding when you create the StreamReader object.
If not, you would have to read the first part of the page into a byte buffer and decode enough of it using a plain ASCII encoding to find a content meta tag, so that you can determine the encoding from that. Then you can decode the buffer and the rest of the page using the correct encoding.
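The meta-tag approach described above could look roughly like this. It is a simplified sketch: it assumes the whole page is already in a byte array, looks only at the first 2048 bytes, and uses a loose regular expression rather than a real HTML parser. MetaCharsetSniffer and Decode are names invented for the example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MetaCharsetSniffer
{
    // Matches charset=NAME, with or without quotes around the name.
    static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    public static string Decode(byte[] body, Encoding fallback)
    {
        // Decode only the head of the page as ASCII; tag names and charset
        // names are plain ASCII, so this is enough to find the declaration.
        int headLength = Math.Min(body.Length, 2048);
        string head = Encoding.ASCII.GetString(body, 0, headLength);

        Encoding encoding = fallback;
        Match match = CharsetPattern.Match(head);
        if (match.Success)
        {
            try { encoding = Encoding.GetEncoding(match.Groups[1].Value); }
            catch (ArgumentException) { /* unknown name: keep the fallback */ }
        }

        // Decode the full page with the encoding that was found.
        return encoding.GetString(body);
    }
}
```

The byte array could come from client.DownloadData(url) or from a helper that reads the response stream to the end.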
As you are saying "question-mark-inside-a-square characters" and not just question marks, I suspect that displaying the content might actually be the problem, not decoding it. A decoding problem would produce regular question marks, while fonts contain a special character for missing glyphs that looks exactly as you describe.
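One way to tell the two cases apart is to look at the code points in the downloaded string rather than at how it is rendered. The helper below is a small diagnostic sketch; CodePointDump and Describe are hypothetical names.

```csharp
using System;
using System.Text;

static class CodePointDump
{
    // Lists each UTF-16 code unit of the text as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

If the output for the affected characters shows U+00F8, U+00E6 and U+00E5, the string was decoded correctly and the display or font is at fault. If it shows U+003F (a plain question mark) or U+FFFD (the replacement character that .NET's UTF-8 decoder substitutes for invalid bytes), the characters were lost while decoding.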