C# WebClient - Getting question-mark-inside-a-square characters instead of øæå when downloading a page

2023-02-01 22:59 Q&A author:

I'm using WebClient to download a webpage from a Norwegian website. In the downloaded data, all the special characters (øæå) are missing and have been replaced by a question-mark-type character.

I used to have this issue on my own webpage before I added a "" to my HTML file; that is already present on this page.

If I open a browser and browse to the address everything looks fine.

I have used Fiddler to see exactly which headers I need to send, and I am sure I'm sending exactly the same ones as my browser.

So by power of deduction I believe that WebClient is the offender and somehow cripples the data before returning it to me, and I'm not sure how to stop it from doing this.

For more information this is my code to get the webpage:

string result = string.Empty;

using (WebClient client = new WebClient())
{     
     client.Headers["Accept"] = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
     client.Headers["Referer"] = "http://mywebsite.no/forum/viewforum.php?f=7";
     client.Headers["Accept-Language"] = "nb-NO";
     client.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; AskTbFXTV5/5.9.1.14019)";
     client.Headers["Accept-Encoding"] = "gzip, deflate";

     using (Stream stream = client.OpenRead(new Uri(textBox1.Text))) 
     { 
          using (StreamReader reader = new StreamReader(stream)) 
          {
               result = reader.ReadToEnd();
          } 
     } 
}

Any tips?

As others have said, you might not have set the correct encoding. See how to detect encoding of the response body which shows how to guess the encoding from the response headers or the HTML META tag in the response body.
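To illustrate the response-header half of that suggestion, here is a minimal sketch, assuming the server names the charset in its Content-Type header. The class and method names (CharsetFromHeader, FromContentType) are made up for this example; only System.Net.Mime.ContentType and Encoding.GetEncoding are framework APIs.

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class CharsetFromHeader
{
    // Hypothetical helper: returns the encoding named in a Content-Type
    // header value, or the supplied fallback when no usable charset is found.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
            return fallback;

        try
        {
            // ContentType parses values such as "text/html; charset=ISO-8859-1".
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // malformed header value
        }
        catch (ArgumentException)
        {
            return fallback; // charset name not known to this runtime
        }
    }
}
```

With WebClient, the header value should be available as client.ResponseHeaders["Content-Type"] once OpenRead has returned, and the resulting Encoding can then be passed to the StreamReader. If the header carries no charset, you are back to guessing from the META tag.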

Have you tried setting the encoding on the response?

        string result = string.Empty;

        using (WebClient client = new WebClient())
        {
            client.Headers["Accept"] = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
            client.Headers["Referer"] = "http://mywebsite.no/forum/viewforum.php?f=7";
            client.Headers["Accept-Language"] = "nb-NO";
            client.Headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; AskTbFXTV5/5.9.1.14019)";
            client.Headers["Accept-Encoding"] = "gzip, deflate";

            using (Stream stream = client.OpenRead(new Uri("")))
            {
                byte[] resultBytes = StreamUtil.ReadToEnd(stream);
                result = System.Text.Encoding.UTF8.GetString(resultBytes);
            }
        }

internal class StreamUtil
{
    internal static byte[] ReadToEnd(System.IO.Stream stream)
    {
        byte[] readBuffer = new byte[4096];

        int totalBytesRead = 0;
        int bytesRead;

        while ((bytesRead = stream.Read(readBuffer, totalBytesRead, readBuffer.Length - totalBytesRead)) > 0)
        {
            totalBytesRead += bytesRead;

            if (totalBytesRead == readBuffer.Length)
            {
                int nextByte = stream.ReadByte();
                if (nextByte != -1)
                {
                    byte[] temp = new byte[readBuffer.Length * 2];
                    Buffer.BlockCopy(readBuffer, 0, temp, 0, readBuffer.Length);
                    Buffer.SetByte(temp, totalBytesRead, (byte)nextByte);
                    readBuffer = temp;
                    totalBytesRead++;
                }
            }
        }

        byte[] buffer = readBuffer;
        if (readBuffer.Length != totalBytesRead)
        {
            buffer = new byte[totalBytesRead];
            Buffer.BlockCopy(readBuffer, 0, buffer, 0, totalBytesRead);
        }
        return buffer;
    }
}

Try using a StreamReader constructor that specifies the encoding.

http://msdn.microsoft.com/en-us/library/ms143456.aspx http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

To figure out the encoding of the page, in Firefox you can right-click and select View Page Info. The encoding should be listed there.
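A small sketch of what that looks like, assuming the page turns out to be ISO-8859-1 (that is only a guess for a Norwegian site, so check Page Info first). ExplicitEncodingDemo and its methods are illustrative names, and a MemoryStream stands in for the stream you get from client.OpenRead.

```csharp
using System;
using System.IO;
using System.Text;

static class ExplicitEncodingDemo
{
    // Reads the whole stream as text using the given encoding instead of
    // StreamReader's default (UTF-8).
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Simulates a response body containing "øæå" encoded as ISO-8859-1
    // and reads it back with the matching encoding.
    public static string Demo()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] body = latin1.GetBytes("\u00F8\u00E6\u00E5"); // øæå
        return ReadAll(new MemoryStream(body), latin1);
    }
}
```

In the code from the question, the only change would be new StreamReader(stream, Encoding.GetEncoding("ISO-8859-1")) in place of new StreamReader(stream).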

There are two likely reasons:

You are not using the correct encoding for the StreamReader.
You are displaying the result using a font that doesn't support the characters.

If you know what the encoding is, and know that it will stay the same, you can just provide the encoding when you create the StreamReader object.

If not, you would have to read the first part of the page into a byte buffer, so that you can decode enough of it using a plain ASCII encoding to find a content meta tag and determine the encoding from that. Then you can decode the buffer and the rest of the page using the correct encoding.
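A rough sketch of that approach, simplified to take the whole page as a byte array rather than buffering only the first part. MetaCharsetSniffer, the 4096-byte look-ahead and the regular expression are all assumptions made for this example, and a regex is not a real HTML parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MetaCharsetSniffer
{
    // Matches charset=... inside a meta tag, covering both
    // <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    static readonly Regex CharsetPattern = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Decodes the page with the charset declared in its meta tag,
    // or with the fallback when none is found or the name is unknown.
    public static string Decode(byte[] page, Encoding fallback)
    {
        // HTML markup is ASCII-compatible in the common encodings, so an
        // ASCII decode of the head is enough to find the declaration.
        int headLength = Math.Min(page.Length, 4096);
        string head = Encoding.ASCII.GetString(page, 0, headLength);

        Encoding encoding = fallback;
        Match match = CharsetPattern.Match(head);
        if (match.Success)
        {
            try
            {
                encoding = Encoding.GetEncoding(match.Groups[1].Value);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the fallback.
            }
        }

        return encoding.GetString(page);
    }
}
```

The byte array could come from client.DownloadData or from a helper such as StreamUtil.ReadToEnd. This will not work for encodings that are not ASCII-compatible, such as UTF-16.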

As you are saying "question-mark-inside-a-square characters" and not just question marks, I suspect that displaying the content might actually be the problem, not decoding it. A decoding problem would produce regular question marks, while fonts contain a special character for missing glyphs that looks exactly as you describe.
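One way to tell the two cases apart is to look at what the string actually contains rather than at how it is drawn. The sketch below uses illustrative names (DecodeDiagnostics and its methods). It relies on the fact that .NET's default UTF-8 decoder substitutes U+FFFD (the replacement character) for bytes it cannot interpret, and that character is itself often drawn as a question mark in a diamond or box. If the dump shows U+00F8, U+00E6 and U+00E5, the decoding is fine and the font or control is the thing to look at.

```csharp
using System;
using System.Text;

static class DecodeDiagnostics
{
    // True when the decoder substituted U+FFFD for undecodable bytes,
    // which points at a wrong encoding rather than a font problem.
    public static bool HasReplacementChars(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }

    // Lists the UTF-16 code units of the string as "U+XXXX" values so you
    // can see what it really contains, independent of the display font.
    public static string DumpCodeUnits(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

For example, calling DecodeDiagnostics.HasReplacementChars(result) right after reader.ReadToEnd() in the original code would show whether the damage has already happened before anything is displayed.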
