How to download webpage text with the correct (Chinese) encoding in R

2023-01-28 22:23 Q&A author:

I would like to know how to set the encoding parameter so that downloaded text looks the same as it does in the page source in a web browser. For example:

readLines("http://www.baidu.com/s?wd=r+project")[132]
[1] "<div id=\"foot\">&copy;2010 Baidu <span>´ËÄÚÈÝÏµ°Ù¶È¸ù¾ÝÄúµÄÖ¸Áî×Ô¶¯ËÑË÷µÄ½á¹û£¬²»´ú±í°Ù¶ÈÔÞ³É±»ËÑË÷ÍøÕ¾µÄÄÚÈÝ»òÁ¢³¡</span></div>"

It should be displayed as:

> <div id="foot">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果，不代表百度赞成被搜索网站的内容或立场</span></div>

Any help would be much appreciated!

# Windows 7
sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] XML_3.2-0.1    RCurl_1.4-4.1  bitops_1.0-4.1 rcom_2.2-3.1   rscproxy_1.3-1

loaded via a namespace (and not attached):
[1] tools_2.12.0

con <- url("http://www.baidu.com/s?wd=r+project", encoding = "gb2312")
readLines(con)[132]
[1] "<div id=\"foot\">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果，不代表百度赞成被搜索网站的内容或立场</span></div>"
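As an offline sketch of the same idea (not the answerer's code), the snippet below writes two GB2312-encoded characters to a temporary file and reads them back through a connection that declares its encoding. The temporary file stands in for the web page, and the names `original`, `gb_bytes`, and `decoded` are made up for illustration. It assumes your iconv build supports GB2312 and that the session locale can represent Chinese text, since a connection re-encodes its input into the native encoding.

```r
# The two characters of "Baidu" in Chinese, written as Unicode escapes so
# the script does not depend on the encoding of its own source file.
original <- "\u767e\u5ea6"

# Encode the text as GB2312 bytes (assumes iconv supports "GB2312").
gb_bytes <- iconv(original, from = "UTF-8", to = "GB2312", toRaw = TRUE)[[1]]

# Write the bytes plus a newline to a temporary file that stands in for
# a page served with charset=gb2312.
tmp <- tempfile(fileext = ".html")
writeBin(c(gb_bytes, as.raw(0x0a)), tmp)

# Declare the encoding on the connection; readLines() then receives text
# re-encoded into the native encoding of the session.
con <- file(tmp, encoding = "GB2312")
decoded <- readLines(con, warn = FALSE)
close(con)
unlink(tmp)

print(decoded)
```

If the session locale cannot represent the characters (for example a Windows-1252 locale like the one in the question), the result may contain escapes or substitutions rather than readable Chinese.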

The webpage declares its encoding at the top:

<meta http-equiv="content-type" content="text/html;charset=gb2312">

Wikipedia describes this encoding as follows:

GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters

That seems appropriate, although the declared charset could still be wrong.

To find out the supported encodings on your platform:

iconvlist()

On my machine this includes "GB2312", so let's convert the text using iconv:

> a <- readLines("http://www.baidu.com/s?wd=r+project")[132]
> iconv(a, from="gb2312")
[1] "<div id=\"foot\">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果，不代表百度赞成被搜索网站的内容或立场</span></div>"
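The call above leaves `to` at its default, which targets the current locale. A minimal variant, assuming you would rather get UTF-8 regardless of locale, names the target explicitly. The bytes used here are the GB2312 encoding of the two characters that show up as "°Ù¶È" in the garbled output in the question; the variable names are illustrative only.

```r
# GB2312 bytes for the two characters of "Baidu" in Chinese.
gb_bytes <- as.raw(c(0xb0, 0xd9, 0xb6, 0xc8))

# rawToChar() yields a string whose bytes are GB2312 but whose encoding
# is not declared, similar to what readLines() returns without an
# encoding argument.
undeclared <- rawToChar(gb_bytes)

# Convert explicitly from GB2312 to UTF-8 instead of relying on the
# locale default for `to`.
fixed <- iconv(undeclared, from = "GB2312", to = "UTF-8")

# Compare with the expected text, written as Unicode escapes.
print(identical(fixed, "\u767e\u5ea6"))
print(Encoding(fixed))
```

Whether the console then renders the characters still depends on the terminal and fonts, but the string itself is correctly encoded.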

Here's a screenshot for good measure: [image not available]

In the long run, you will need to locate the charset declared by each web page you download and pass it as the encoding parameter.

The encoding argument corresponds to the charset declared in the HTML. The page you link to specifies charset=gb2312, so passing encoding = "gb2312" brings the source back correctly.

However, R will likely not display it this way. You're not displaying HTML in R, just getting the source of the web page. You need a web browser to display the HTML.
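Locating the declared charset for each page can be sketched roughly as follows. `detect_charset` is a hypothetical helper, not part of base R or any package mentioned here, and it only looks for a `charset=` declaration in the HTML source; it ignores HTTP headers and byte-order marks, so treat it as a starting point.

```r
# Hypothetical helper: return the first charset declared in the HTML
# lines, or NA if no declaration is found.
detect_charset <- function(html_lines) {
  pattern <- "charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)"
  # useBytes = TRUE matches byte-wise, so undecoded non-ASCII bytes
  # elsewhere in the page do not trigger invalid-string errors.
  m <- regexpr(pattern, html_lines, ignore.case = TRUE,
               perl = TRUE, useBytes = TRUE)
  hits <- regmatches(html_lines, m)
  if (length(hits) == 0L) return(NA_character_)
  # Keep only the captured charset name from the first match.
  sub(pattern, "\\1", hits[1], ignore.case = TRUE, perl = TRUE)
}

# A stand-in for the first lines of a downloaded page.
page <- c("<html><head>",
          "<meta http-equiv=\"content-type\" content=\"text/html;charset=gb2312\">",
          "</head></html>")

charset <- detect_charset(page)
print(charset)
```

The detected name could then be passed on, for example as `url(page_url, encoding = charset)`, where `page_url` is whatever address you are downloading.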
