Encoding problem when reading a website: three different encodings

I have a problem with a WebRequest in C#. The target is a Google page.

The header states

text/html; charset=ISO-8859-1

The website states

<meta http-equiv=content-type content="text/html; charset=utf-8">

And finally, I only get the expected result, both in the debugger and in my regular expression, when I use Encoding.Default, which resolves to System.Text.SBCSCodePageEncoding.

Now what do I do? Do you have any hints as to how this could happen, or how I could solve it?

The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, but not with a Windows code page and not with Latin-1.

The URL is this

The characters that break are the € sign and all German umlauts.

Thanks in advance for your help on this problem which is making me seriously crazy!

Update: when I output the string via

// create a writer and open the file
TextWriter tw = new StreamWriter("test.txt");

// write a line of text to the file
tw.WriteLine(html);

// close the stream
tw.Close();

it all works fine.

So the problem seems to be that neither the debugger nor the regular expression handles the encoding correctly.

How do I tell C# to handle the RegEx as UTF-8?


Rather than parsing HTML, why not use the Google Query API?

BTW, before parsing HTML using regexes, read this ;-)

EDIT: In answer to your comment:

  1. The API works for Google Desktop as well.
  2. Is this encoding issue specific to the Google page?
  3. In addition to the problem you have now, who knows what problems you'll run into later, in production, because of subtle changes in the HTML of these pages or in the headers sent back by the web server. A web page is meant to be friendly to the human eye, not to a computer. The only thing you can expect to stay friendly is the appearance and rendered content of the page, not the underlying HTML structure. An API, by contrast, is meant to be computer-friendly.
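
If you do keep reading the HTML, one thing worth checking is the decoding step rather than the regex. .NET strings are UTF-16 and Regex has no encoding setting of its own, so it only ever sees whatever the reader produced from the raw bytes. Below is a minimal sketch, assuming the body really is UTF-8 as the browser suggests. DecodeDemo, ReadAll and the sample text are made up for illustration, and a MemoryStream stands in for the real response.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class DecodeDemo
{
    // Decodes a byte stream with an explicitly chosen encoding instead of
    // relying on whatever charset the HTTP header announced.
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Hypothetical stand-in for the response body: the UTF-8 bytes of
        // "Preis: 5 € für Müller" (escapes used to avoid source-file encoding issues).
        byte[] bytes = Encoding.UTF8.GetBytes("Preis: 5 \u20AC f\u00FCr M\u00FCller");

        // Same bytes, decoded once as UTF-8 and once as ISO-8859-1.
        string good = ReadAll(new MemoryStream(bytes), Encoding.UTF8);
        string bad = ReadAll(new MemoryStream(bytes), Encoding.GetEncoding("ISO-8859-1"));

        // The regex itself is identical in both cases; only the decoding differs.
        var pattern = new Regex(@"\d+ \u20AC");
        Console.WriteLine(pattern.IsMatch(good)); // True
        Console.WriteLine(pattern.IsMatch(bad));  // False
    }
}
```

With the WebRequest from the question, the stream would come from the response's GetResponseStream() instead of a MemoryStream. If the result still looks wrong with UTF-8, the bytes are probably in a different encoding than assumed.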