strange characters: interaction of R and Windows locale?
WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island          Nickname             Location
1 Hawaiʻi[7]      The Big Island       19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8]         The Valley Isle      20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 KahoÊ»olawe[9]  The Target Isle      20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10]     The Pineapple Isle   20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 MolokaÊ»i[11]   The Friendly Isle    21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 OÊ»ahu[12]      The Gathering Place  21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 KauaÊ»i[13]     The Garden Isle      22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 NiÊ»ihau[14]    The Forbidden Isle   21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
sessionInfo()
gives
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to make R use another setting by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
Is there a way to make R override the windows settings or can the issue be solved otherwise? I have also tried other websites, and the issue occurs every time when there is an é, ü, ä, î, et cetera in the text-to-be-scraped.
Thank you, Roger
Not quite an answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)
iconv(Islands$Island, "windows-1252", "UTF-8")
Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
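One conversion that may be worth trying is the reverse direction. The sketch below assumes (this is not confirmed for the asker's machine) that the strings hold UTF-8 bytes that are being read as Windows-1252. It first reproduces the damage on a single name and then undoes it; the variable names are made up for illustration.

```r
# A correctly encoded name, written with a Unicode escape so the script stays ASCII.
correct <- "Kaho\u02bbolawe"

# Simulate the damage: treat the UTF-8 bytes as if they were Windows-1252.
garbled <- iconv(correct, from = "windows-1252", to = "UTF-8")

# Undo it: convert back to Windows-1252 to recover the original bytes,
# then mark those bytes as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "windows-1252")
Encoding(repaired) <- "UTF-8"

identical(repaired, correct)
```

This round trip only works when every byte has a Windows-1252 mapping. The ā in Lānaʻi contains the byte 0x81, which Windows-1252 leaves undefined, so that name may not survive the conversion on every platform. If the strings are merely unmarked UTF-8, setting Encoding(x) <- "UTF-8" on the character column may be enough on its own.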
It is possible to simply strip out the offending characters, though this isn't ideal.
iconv(Islands$Island, "windows-1252", "ASCII", "")
I was unable to replicate the error, but the examples in the help file for Sys.setlocale are useful:
Sys.setlocale("LC_TIME", "de") # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8") # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # ditto
Sys.setlocale("LC_TIME", "de_DE") # OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows
On Windows you should use locale names like "English" or "Dutch_Netherlands.1252" to change these settings.
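Because the accepted names differ per operating system, one option is to try several candidates in turn. The helper below is a hypothetical sketch (try_locales is not part of R); it relies only on the fact that Sys.setlocale returns an empty string, with a warning, when a request cannot be honored.

```r
# Hypothetical helper: try locale names in order and keep the first one the OS accepts.
try_locales <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (plus a warning) when the OS refuses the name.
    res <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(res)) {
      return(res)
    }
  }
  NA_character_
}

old <- Sys.getlocale("LC_CTYPE")   # remember the current setting

# Windows-style name first, then a Unix-style name, then "C" as a last resort.
chosen <- try_locales(c("Dutch_Netherlands.1252", "nl_NL.UTF-8", "C"))
print(chosen)

invisible(Sys.setlocale("LC_CTYPE", old))   # restore the original setting
```

The candidate list is only an example; which names succeed depends on the locales installed on the machine.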
I tried to replicate your setup:
> Sys.setlocale("LC_ALL","Dutch_Netherlands.1252")
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
> Sys.getlocale()
[1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
However, I do not get the funny characters in the console. In my own locale the ʻ was marked as , but all functionality remained.
> Islands[1,1]
[1] Hawaiʻi[27]
8 Levels: Hawaiʻi[27] Kahoʻolawe[34] Kauaʻi[30] Lānaʻi[32] Maui[28] ... Oʻahu[29]
These characters can also be read without trouble and matched in the table.
> Encoding(as.character("Hawaiʻi"))
[1] "UTF-8"
> Encoding(as.character(Islands[1,1]))
[1] "UTF-8"
> grep("Hawaiʻi", as.character(Islands[1,1]))
[1] 1
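The same checks can be run without downloading the page. The sketch below builds a small stand-in data frame (the real table comes from readHTMLTable) and inspects one value at the byte and code-point level, which does not depend on how the console happens to display it.

```r
# Stand-in for the scraped table; the names are written with Unicode escapes.
islands <- data.frame(Island = c("Hawai\u02bbi", "Maui"),
                      stringsAsFactors = TRUE)

# Factors must be converted to character before looking at the encoding.
name <- as.character(islands$Island[1])

Encoding(name)        # declared encoding of the string
charToRaw(name)       # underlying bytes; the okina is stored as ca bb in UTF-8
utf8ToInt(name)       # Unicode code points; the okina is U+02BB

# Literal (non-regex) search for the okina.
has_okina <- grepl("\u02bb", name, fixed = TRUE)
has_okina
```

If the bytes and code points are right but the console still shows odd characters, that would suggest a display issue rather than damaged data.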
If you still have problems, the cause lies elsewhere. Note, however, that to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). On Windows, "Dutch" is probably enough.