
How to retrieve codepage from cURL HTTP response?

I'm using libcurl as an HTTP client to retrieve various pages (it can be any URL, for that matter).

Usually the data comes as a UTF-8 string, and then I just call MultiByteToWideChar and it works well.

However, some web pages still use a legacy code-page encoding, and I see gibberish if I try to convert those pages as if they were UTF-8.

Is there an easy way to retrieve the code page from the data, or will I have to scan it manually (for "encoding=") and then convert accordingly?

If so, how do I get the code-page ID from its name (see the Code Page Identifiers list)?
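Mapping an IANA charset name (as it appears in headers or meta tags) to a Windows code-page identifier is just a table lookup. Here is a minimal, illustrative sketch covering a handful of common names; `CodePageFromName` is a hypothetical helper, and the full mapping is in the Code Page Identifiers list on MSDN:

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: translate a few common IANA charset names into
// Windows code-page identifiers. Returns 0 for an unknown name.
unsigned int CodePageFromName(std::string name)
{
    // Charset names are case-insensitive, so normalize to lowercase.
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    static const std::map<std::string, unsigned int> table = {
        { "utf-8",        65001 },
        { "iso-8859-1",   28591 },
        { "windows-1252", 1252  },
        { "shift_jis",    932   },
        { "gb2312",       936   },
        { "euc-kr",       51949 },
    };
    auto it = table.find(name);
    return it != table.end() ? it->second : 0;
}
```

The returned identifier can then be passed as the first argument to MultiByteToWideChar in place of CP_UTF8.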

Thanks,

Omer


There are several locations where a document can state its encoding:

  • the Content-Type HTTP header
  • the (optional) XML declaration
  • the Content-Type meta tag inside the document header
  • for HTML5 documents, the charset meta tag.

There are probably even more I've forgotten.
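The first location is the easiest to check: with libcurl, the response's Content-Type value is available via `curl_easy_getinfo(handle, CURLINFO_CONTENT_TYPE, &ct)`. Pulling the charset parameter out of a value like `text/html; charset=ISO-8859-1` might look like this (an illustrative sketch; `CharsetFromContentType` is not part of libcurl):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Extract the charset parameter from a Content-Type header value,
// e.g. "text/html; charset=ISO-8859-1" -> "ISO-8859-1".
// Returns an empty string when no charset is declared.
std::string CharsetFromContentType(const std::string& contentType)
{
    // Case-insensitive search for "charset=".
    std::string lower;
    for (char c : contentType)
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return "";                      // header did not declare a charset

    pos += key.size();
    std::size_t end;
    if (pos < contentType.size() && contentType[pos] == '"') {
        ++pos;                          // quoted form: charset="utf-8"
        end = contentType.find('"', pos);
    } else {
        end = contentType.find_first_of("; \t", pos);
    }
    return contentType.substr(
        pos, end == std::string::npos ? std::string::npos : end - pos);
}
```

The same kind of scan can be run over the document body for the meta tags, but note the header can disagree with the document, which is one reason real browsers layer heuristics on top.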

In the end, detecting the actual encoding is rather hard. You really shouldn't do this yourself; use high-level libraries for retrieving and parsing HTML content. I'm sure they are available even for C++, even if they have to be lifted from a browser environment. :)


I used DetectInputCodepage in the IMultiLanguage2 interface and it worked great!
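For reference, using MLang's detection might look like the following Windows-only sketch (untested here; it assumes COM has already been initialized with CoInitialize, and `GuessCodePage` is a hypothetical wrapper):

```cpp
#include <windows.h>
#include <mlang.h>

// Ask MLang to guess the code page of a raw byte buffer.
// Returns the highest-confidence guess, or 0 on failure.
// Caller must have initialized COM (CoInitialize/CoInitializeEx).
UINT GuessCodePage(char* data, int size)
{
    UINT result = 0;
    IMultiLanguage2* pML = nullptr;
    if (SUCCEEDED(CoCreateInstance(CLSID_CMultiLanguage, nullptr,
                                   CLSCTX_INPROC_SERVER,
                                   IID_IMultiLanguage2,
                                   reinterpret_cast<void**>(&pML))))
    {
        DetectEncodingInfo info[8] = {};
        INT count = 8;
        // MLDETECTCP_NONE applies no restrictions to the detection.
        if (SUCCEEDED(pML->DetectInputCodepage(MLDETECTCP_NONE, 0,
                                               data, &size,
                                               info, &count))
            && count > 0)
        {
            result = info[0].nCodePage;   // best guess comes first
        }
        pML->Release();
    }
    return result;
}
```

The result is a code-page identifier that can go straight into MultiByteToWideChar.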

