How to detect the right encoding for read.csv?
I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with rea开发者_如何学运维d.csv
. But I am not able to detect the correct encoding. It seems to be a kind of UTF-8. I am using R 2.12.1 on an WindowsXP Machine.
Any Help?
First of all based on more general question on StackOverflow it is not possible to detect encoding of file in 100% certainty.
I've struggle this many times and come to non-automatic solution:
Use iconvlist
to get all possible encodings:
codepages <- setNames(iconvlist(), iconvlist())
Then read data using each of them
x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
fileEncoding=enc,
nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warning here
Important here is to know structure of file (separator, headers). Set encoding using fileEncoding
argument. Read only few rows.
Now you could lookup on results:
unique(do.call(rbind, sapply(x, dim)))
# [,1] [,2]
# 437 14 2
# CP1200 3 29
# CP12000 0 1
Seems like correct one is that with 3 rows and 29 columns, so lets see them:
maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"
You could look on data too
x[maybe_ok]
For your file all this encodings returns identical data (partially because there is some redundancy as you see).
If you don't know specific of your file you need to use readLines
with some changes in workflow (e.g. you can't use fileEncoding
, must use length
instead of dim
, do more magic to find correct ones).
The package readr
, https://cran.r-project.org/web/packages/readr/readr.pdf, includes a function called guess_encoding
that calculates the probability of a file of being encoded in several encodings:
guess_encoding("your_file", n_max = 1000)
First, you have to figure out what is the encoding of the file, what cannot be done in R (at least as I know). You can use external tools for it e.g. from Perl, python or eg. the file
utility under Linux/UNIX.
As @ssmit suggested, you have an UTF-16LE (Unicode) encoding here, so load the file with this encoding and use readLines
to see what you have in the first (e.g.) 10 lines:
> f <- file('encoding.asc', open="r", encoding="UTF-16LE") # UTF-16LE, which is "called" Unicode in Windows
> readLines(f,10)
[1] "\tFe 2\tZn\tO\tC\tSi\tMn\tP\tS\tAl\tN\tCr\tNi\tMo\tCu\tV\tNb 2\tTi\tB\tZr\tCa\tH\tCo\tMg\tPb 2\tW\tCl\tNa 3\tAr"
[2] ""
[3] "0\t0,003128\t3,82E-05\t0,0004196\t0\t0,001869\t0,005836\t0,004463\t0,002861\t0,02148\t0\t0,004768\t0,0003052\t0\t0,0037\t0,0391\t0,06409\t0,1157\t0,004654\t0\t0\t0\t0,00824\t7,63E-05\t0,003891\t0,004501\t0\t0,001335\t0,01175"
[4] "0,0005\t0,003265\t3,05E-05\t0,0003662\t0\t0,001709\t0,005798\t0,004395\t0,002808\t0,02155\t0\t0,004578\t0,0002441\t0\t0,003601\t0,03897\t0,06406\t0,1158\t0,0047\t0\t0\t0\t0,008026\t6,10E-05\t0,003876\t0,004425\t0\t0,001343\t0,01157"
[5] "0,001\t0,003332\t2,54E-05\t0,0003052\t0\t0,001704\t0,005671\t0,0044\t0,002823\t0,02164\t0\t0,004603\t0,0003306\t0\t0,003611\t0,03886\t0,06406\t0,1159\t0,004705\t0\t0\t0\t0,008036\t5,09E-05\t0,003815\t0,004501\t0\t0,001246\t0,01155"
[6] "0,0015\t0,003313\t2,18E-05\t0,0002616\t0\t0,001678\t0,005689\t0,004447\t0,002921\t0,02171\t0\t0,004621\t0,0003488\t0\t0,003597\t0,03889\t0,06404\t0,1158\t0,004752\t0\t0\t0\t0,008022\t4,36E-05\t0,003815\t0,004578\t0\t0,001264\t0,01144"
[7] "0,002\t0,003313\t2,18E-05\t0,0002834\t0\t0,001591\t0,005646\t0,00436\t0,003008\t0,0218\t0\t0,004643\t0,0003488\t0\t0,003619\t0,03895\t0,06383\t0,1159\t0,004752\t0\t0\t0\t0,008\t4,36E-05\t0,003771\t0,004643\t0\t0,001351\t0,01142"
[8] "0,0025\t0,003488\t2,18E-05\t0,000218\t0\t0,001657\t0,00558\t0,004338\t0,002986\t0,02175\t0\t0,004469\t0,0002616\t0\t0,00351\t0,03889\t0,06374\t0,1159\t0,004621\t0\t0\t0\t0,008131\t4,36E-05\t0,003771\t0,004708\t0\t0,001243\t0,01125"
[9] "0,003\t0,003619\t0\t0,0001526\t0\t0,001591\t0,005668\t0,004207\t0,00303\t0,02169\t0\t0,00449\t0,0002834\t0\t0,00351\t0,03874\t0,06383\t0,116\t0,004665\t0\t0\t0\t0,007956\t0\t0,003749\t0,004796\t0\t0,001286\t0,01125"
[10] "0,0035\t0,003422\t0\t4,36E-05\t0\t0,001482\t0,005711\t0,004185\t0,003292\t0,02156\t0\t0,004665\t0,0003488\t0\t0,003553\t0,03852\t0,06391\t0,1158\t0,004708\t0\t0\t0\t0,007717\t0\t0,003597\t0,004905\t0\t0,00133\t0,01136"
From this, it can be seen, that we have a header, and a blank line in the second row (which will be skipped by default using the read.table
function), the separator is \t
and the decimal character is ,
.
> f <- file('encoding.asc', open="r", encoding="UTF-16LE")
> df <- read.table(f, sep='\t', dec=',', header=TRUE)
And see what we have:
> head(df)
X Fe.2 Zn O C Si Mn P S
1 0.0000 0.003128 3.82e-05 0.0004196 0 0.001869 0.005836 0.004463 0.002861
2 0.0005 0.003265 3.05e-05 0.0003662 0 0.001709 0.005798 0.004395 0.002808
3 0.0010 0.003332 2.54e-05 0.0003052 0 0.001704 0.005671 0.004400 0.002823
4 0.0015 0.003313 2.18e-05 0.0002616 0 0.001678 0.005689 0.004447 0.002921
5 0.0020 0.003313 2.18e-05 0.0002834 0 0.001591 0.005646 0.004360 0.003008
6 0.0025 0.003488 2.18e-05 0.0002180 0 0.001657 0.005580 0.004338 0.002986
Al N Cr Ni Mo Cu V Nb.2 Ti B Zr
1 0.02148 0 0.004768 0.0003052 0 0.003700 0.03910 0.06409 0.1157 0.004654 0
2 0.02155 0 0.004578 0.0002441 0 0.003601 0.03897 0.06406 0.1158 0.004700 0
3 0.02164 0 0.004603 0.0003306 0 0.003611 0.03886 0.06406 0.1159 0.004705 0
4 0.02171 0 0.004621 0.0003488 0 0.003597 0.03889 0.06404 0.1158 0.004752 0
5 0.02180 0 0.004643 0.0003488 0 0.003619 0.03895 0.06383 0.1159 0.004752 0
6 0.02175 0 0.004469 0.0002616 0 0.003510 0.03889 0.06374 0.1159 0.004621 0
Ca H Co Mg Pb.2 W Cl Na.3 Ar
1 0 0 0.008240 7.63e-05 0.003891 0.004501 0 0.001335 0.01175
2 0 0 0.008026 6.10e-05 0.003876 0.004425 0 0.001343 0.01157
3 0 0 0.008036 5.09e-05 0.003815 0.004501 0 0.001246 0.01155
4 0 0 0.008022 4.36e-05 0.003815 0.004578 0 0.001264 0.01144
5 0 0 0.008000 4.36e-05 0.003771 0.004643 0 0.001351 0.01142
6 0 0 0.008131 4.36e-05 0.003771 0.004708 0 0.001243 0.01125
In addition to using the readr-package, you may also choose to use stringi::stri_enc_detect2. This function is particularly efficient if the locale is known and if you are dealing with some form of UTF or ASCII: "..it turns out that (empirically) stri_enc_detect2 works better than the ICU-based one [stringi::stri_enc_detect used by the guess_encoding] if UTF-* text is provided."
Details on stringi::stri_enc_detect.
Details on stringi::stri_enc_detect2.
Change-request for guess_encoding
This file has UTF-16LE encoding with BOM (byte order mark). You probably should use encoding = "UTF-16LE"
My tidy update to @marek's solution, since I'm running into the same problem in 2020:
#Libraries
library(magrittr)
library(purrr)
#Make a vector of all the encodings supported by R
encodings <- set_names(iconvlist(), iconvlist())
#Make a simple reader function
reader <- function(encoding, file) {
read.csv(file, fileEncoding = encoding, nrows = 3, header = TRUE)
}
#Create a "safe" version so we only get warnings, but errors don't stop it
# (May not always be necessary)
safe_reader <- safely(reader)
#Use the safe function with the encodings and the file being interrogated
map(encodings, safe_reader, `<TEST FILE LOCATION GOES HERE>`) %>%
#Return just the results
map("result") %>%
#Keep only results that are dataframes
keep(is.data.frame) %>%
#Keep only results with more than one column
#This predicate will need to change with the data
#I knew this would work, because I could open in a text editor
keep(~ ncol(.x) > 1) %>%
#Return the names of the encodings
names()
@enrique-pérez-herrero answer is great. With guess_encoding("your_file", n_max = 1000)
you will get the most likely encodings. Then you can read the file using that encoding: readr::read_csv(file_path, locale = locale(encoding = "ENCODING_CODE"))
Just for the sake of completeness, here is a function that tries to read the file with all the potential encodings and outputs a list with two data frames:
- Nested data frame with the content of the file in all the working encodings
- The data frame with the most likely one.
detect_file_encoding <- function(file_path) {
library(cli)
library(dplyr)
library(purrr)
library(readr)
library(stringi)
# Read file in UTF-8 and detect encodings present
file_raw = readr::read_file(file_path, locale = locale(encoding = "UTF-8"))
encodings_found = stringi::stri_enc_detect(file_raw)
# Function to read the file using all the encodings found
try_all_encodings <- function(file_path, ENCODING) {
FILE = read_file(file_path, locale = locale(encoding = ENCODING))
HAS_BAD_CHARS = grepl("\u0086", FILE)
if (!HAS_BAD_CHARS) {
tibble(encoding = ENCODING,
content_file = list(FILE))
} else {
tibble(encoding = ENCODING,
content_file = list("BAD_CHARS detected"))
}
}
# Safe version of function
try_all_encodings_safely = safely(try_all_encodings)
# Loop through all the encodings
OUT = 1:length(encodings_found[[1]]$Encoding) %>%
purrr::map(~ try_all_encodings_safely(file_path, encodings_found[[1]]$Encoding[.x]))
# Create nested clean tibble with all the working encodings and contents
OUT_clean = 1:length(OUT) %>% purrr::map(~ OUT[[.x]]$result) %>% dplyr::bind_rows() %>% dplyr::left_join(encodings_found[[1]] %>% dplyr::as_tibble(), by = c("encoding" = "Encoding"))
# Read file with the most likely working encoding
DF_proper_encoding = suppressMessages(readr::read_csv(file_path, skip = 12, locale = locale(encoding = encodings_found[[1]]$Encoding[1]), show_col_types = FALSE, name_repair = "unique"))
# Output list
OUT_final = list(OUT_clean = OUT_clean,
DF_proper_encoding = DF_proper_encoding)
# Output message
cli::cli_alert_info("Found {nrow(OUT_clean)} potential encodings: {paste(OUT_clean$encoding)} \n - DF_proper_encoding stored using {OUT_clean$encoding[1]}")
return(OUT_final)
}
精彩评论