
How to download webpage text with the correct (Chinese) encoding in R

I would like to know how to set the encoding parameter so that downloaded text looks the same as it does in the page source shown by a web browser. For example:

readLines("http://www.baidu.com/s?wd=r+project")[132]
[1] "<div id=\"foot\">&copy;2010 Baidu <span>´ËÄÚÈÝϵ°Ù¶È¸ù¾ÝÄúµÄÖ¸Áî×Ô¶¯ËÑË÷µÄ½á¹û£¬²»´ú±í°Ù¶ÈÔ޳ɱ»ËÑË÷ÍøÕ¾µÄÄÚÈÝ»òÁ¢³¡</span></div>"

When it should be displayed as:

> <div id="foot">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果,不代表百度赞成被搜索网站的内容或立场</span></div> 

Any help would be much appreciated!

# windows 7
sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] XML_3.2-0.1    RCurl_1.4-4.1  bitops_1.0-4.1 rcom_2.2-3.1   rscproxy_1.3-1

loaded via a namespace (and not attached):
[1] tools_2.12.0


con <- url("http://www.baidu.com/s?wd=r+project", encoding = "gb2312")
page <- readLines(con)
close(con)  # release the connection once the page has been read
page[132]
[1] "<div id=\"foot\">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果,不代表百度赞成被搜索网站的内容或立场</span></div>"
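The same mechanism can be tried without a network connection. This is only a minimal sketch, assuming a UTF-8 session locale and an iconv build that knows "GB2312"; in a Windows-1252 locale like the one in the question, the converted characters may not be representable. The temporary file and the two-character sample are made up for illustration:

```r
# Bytes of the two characters U+4E2D U+6587 as encoded in GB2312,
# followed by a newline so that readLines() sees one complete line.
gb_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4, 0x0A))

tmp <- tempfile(fileext = ".html")
writeBin(gb_bytes, tmp)

# Declaring the encoding on the connection asks R to re-encode the
# input into the session's native encoding while reading.
con <- file(tmp, encoding = "GB2312")
txt <- readLines(con)
close(con)
unlink(tmp)

print(txt)
```

url() accepts the same encoding argument as file(), so this is the behaviour the call above relies on.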


The web page declares its character set at the top:

<meta http-equiv="content-type" content="text/html;charset=gb2312"> 

which Wikipedia describes as follows:

GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters

That seems appropriate, although the declared charset of a page can still be wrong.

To find out the supported encodings on your platform:

iconvlist()

On mine, this includes "GB2312". Let's convert the line using iconv:

> a <- readLines("http://www.baidu.com/s?wd=r+project")[132]
> iconv(a, from="gb2312")
[1] "<div id=\"foot\">&copy;2010 Baidu <span>此内容系百度根据您的指令自动搜索的结果,不代表百度赞成被搜索网站的内容或立场</span></div>"

Here's a screenshot for good measure: [screenshot omitted]
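To see the conversion in isolation, here is a small offline sketch. It assumes your iconv supports "GB2312" (check with iconvlist()); the two-character sample is illustrative and not taken from the Baidu page. Converting to "UTF-8" explicitly, rather than to the native encoding, may be the safer choice in a non-Chinese locale:

```r
# The characters U+4E2D U+6587 as raw GB2312 bytes.
gb_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))

# rawToChar() yields a string whose bytes R cannot interpret correctly
# yet, similar to what readLines() returns without an encoding.
garbled <- rawToChar(gb_bytes)

# Re-encode from GB2312 to UTF-8; the result is marked as UTF-8.
fixed <- iconv(garbled, from = "GB2312", to = "UTF-8")

print(Encoding(fixed))
print(charToRaw(fixed))
```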

In the long run, you will need to find the charset declared by each web page you download and pass it as the encoding in order to read the text correctly.
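One way to automate that lookup is to search the raw lines for a charset declaration before converting. The helper below is a rough sketch: detect_charset and its default argument are hypothetical names, it only looks at the first "charset=..." it finds in the markup, and it ignores the HTTP Content-Type header, which may also carry the encoding.

```r
# Hypothetical helper: return the first charset declared in the HTML
# lines, or `default` when no declaration is found.
detect_charset <- function(html_lines, default = "UTF-8") {
  pattern <- "charset\\s*=\\s*[\"']?[A-Za-z0-9_-]+"
  # useBytes = TRUE avoids errors on lines that are not valid in the
  # current locale; the declaration itself is plain ASCII.
  hits <- regmatches(
    html_lines,
    regexpr(pattern, html_lines, ignore.case = TRUE, perl = TRUE,
            useBytes = TRUE)
  )
  if (length(hits) == 0) return(default)
  # Strip the "charset=" prefix and any opening quote.
  sub("charset\\s*=\\s*[\"']?", "", hits[1], ignore.case = TRUE, perl = TRUE)
}

sample_page <- c(
  "<html><head>",
  "<meta http-equiv=\"content-type\" content=\"text/html;charset=gb2312\">",
  "</head></html>"
)

charset <- detect_charset(sample_page)
print(charset)
```

The detected value could then be passed as the from argument of iconv(), or as the encoding argument of url(), after checking that it appears in iconvlist().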


'encoding' is the 'charset' used in the HTML.

In the page you link to, the charset "charset=gb2312" is specified.

Specifying encoding = "gb2312" returns the page source correctly.

However, R will likely not display it this way. You're not displaying HTML in R, just getting the source of the web page. You need a web browser to display the HTML.
