Using Ruby's FasterCSV with character encodings
Using Ruby 1.8.7, I want to accept CSVs into my system. Even though this is an admin application, it seems I can get several different types of CSV. On my Mac, if I export from Excel using the "Windows CSV" option, FasterCSV can read it by default. On Windows I seem to be getting UTF-16 encoded CSVs (which I haven't figured out how to parse yet).
It seems like a pretty common thing to let users upload a CSV that could be in UTF-8, UTF-16, ASCII, etc., then detect the encoding and parse it. Has anyone figured this out?
I started to look at UniversalDetector to help me detect the encoding, then use Iconv to convert, but this seems to be tricky and I was hoping someone had figured it out :)
According to FasterCSV's docs, the initialize method takes an :encoding option:
The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: 'n' or 'N' for none, 'e' or 'E' for EUC, 's' or 'S' for SJIS, and 'u' or 'U' for UTF-8 (see Regexp.new()).
Because its list is limited, you might want to look into using iconv to pre-process the contents, then pass them to the CSV parser. You can use Ruby's interface to iconv ("Iconv") or the command-line version of it. Iconv is very powerful and flexible, and is capable of converting UTF-16, among other things.
Actually detecting the encoding of the document is more problematic, but the command-line version can help you there. If I remember right it can help identify the encoding. It can also convert between encodings, or, if you want, it can be told to convert to ASCII, converting to the closest matching characters, or ignoring them entirely.
Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues of dealing with character-sets and multibyte characters you should read James Gray's blogs.
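As a rough illustration of the detect-then-convert idea on a newer Ruby, here is a minimal sketch. It assumes Ruby 1.9.3 or later (for String#encode and String#byteslice), so it will not run on 1.8.7. The helper name to_utf8, the BOMS table and the Windows-1252 fallback are my own assumptions for the example, not anything FasterCSV provides. It only looks for a byte-order mark and otherwise checks whether the bytes are valid UTF-8, so it is a heuristic rather than real detection (for example, it ignores UTF-32).

```ruby
# Hypothetical helper, not part of FasterCSV or the standard library.
# Byte-order marks we look for, built with pack so they are binary strings.
BOMS = [
  [[0xEF, 0xBB, 0xBF].pack('C*'), 'UTF-8'],
  [[0xFF, 0xFE].pack('C*'),       'UTF-16LE'],
  [[0xFE, 0xFF].pack('C*'),       'UTF-16BE']
]

# Returns the uploaded bytes as a UTF-8 string.
# fallback is an assumed default for files with no BOM that are not valid UTF-8.
def to_utf8(raw, fallback = 'Windows-1252')
  # Work on a binary copy so the caller's string is left untouched.
  bytes = raw.dup.force_encoding('BINARY')

  bom, encoding = BOMS.find { |mark, _| bytes.start_with?(mark) }
  if bom
    # Strip the BOM so it does not end up in the first CSV field.
    bytes = bytes.byteslice(bom.bytesize, bytes.bytesize - bom.bytesize)
  else
    # No BOM: treat the data as UTF-8 if it validates, otherwise use the fallback.
    valid_utf8 = bytes.dup.force_encoding('UTF-8').valid_encoding?
    encoding = valid_utf8 ? 'UTF-8' : fallback
  end

  # Tag the bytes with the chosen encoding and transcode to UTF-8.
  bytes.force_encoding(encoding).encode('UTF-8')
end
```

The returned string could then be handed to FasterCSV.parse (or CSV.parse on 1.9 and later). On 1.8.7 the same shape should work with Iconv doing the conversion step instead of String#encode.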