Why does Rails 3 think \xE2\x80\x99 means â \x80 \x99
I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)
However, when I load the value into a hash and yield it to a page tagged as UTF-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters \x80 and \x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
How do I make Rails interpret a 3-byte UTF-8 sequence as a single character?
Ruby strings are sequences of bytes instead of characters:
$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to make sure that you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-ASCII characters to HTML entities; in your case to
O&#8217;Reilly
This takes some work, but it should help in cases where you send your HTML in UTF-8 but your end user has set his or her browser to override that and show Latin-1 or some other silly restricted charset.
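As a rough illustration of the byte/character distinction and the entity conversion described above, here is a minimal sketch. It assumes Ruby 1.9 or later (where strings carry an encoding) and a UTF-8 source file; `escape_non_ascii` is a hypothetical helper name, not part of Rails or Ruby. It does not escape `&`, `<`, or `>`, which you would still need to handle separately.

```ruby
# Hypothetical helper: replace every non-ASCII character with a numeric
# HTML character reference, leaving plain ASCII untouched.
def escape_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

# The three bytes E2 80 99 are the UTF-8 encoding of U+2019.
name = "O\xE2\x80\x99Reilly"

puts name.bytesize            # 10 (bytes)
puts name.length              # 8 (characters)
puts escape_non_ascii(name)   # O&#8217;Reilly
```

Because the output is pure ASCII, it renders the same way whichever charset the browser ends up using.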
Ultimately this was caused by loading a syck file (generated by an external script) with psych (in rails). Loading with syck solved the issue:
# in the plain Ruby environment
puts YAML::ENGINE.yamler  # => syck
# in Rails
puts YAML::ENGINE.yamler  # => psych
# in the web app: switch to syck just for this file, then switch back
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name]  # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
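One plausible mechanism for the mismatch (an assumption on my part, not something confirmed in this thread): in a double-quoted YAML scalar, `\xE2` is an escape for the single code point U+00E2, not for a raw byte, so a spec-following parser such as Psych reads the three escapes as three separate characters. The sketch below uses a current Ruby where Psych is the only engine and `YAML::ENGINE` no longer exists; `load_name` and `undo_byte_escapes` are hypothetical helper names, and a plain string key is used instead of `:name:` to keep the example minimal.

```ruby
require "yaml"

# Hypothetical helper: parse a one-entry YAML document and return its "name".
def load_name(yaml_text)
  YAML.load(yaml_text)["name"]
end

# Hypothetical helper: turn each code point below U+0100 back into a single
# byte, then relabel those bytes as UTF-8.
def undo_byte_escapes(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

# Single quotes keep the backslashes literal, so the YAML parser sees \xE2 etc.
yaml_text = 'name: "O\xE2\x80\x99Reilly"'

loaded = load_name(yaml_text)
p loaded.codepoints.map { |c| format("U+%04X", c) }
p undo_byte_escapes(loaded) == "O\u2019Reilly"
```

If the assumption holds, the first line printed should list U+00E2, U+0080 and U+0099 after the O, and the round trip through ISO-8859-1 should recover the intended apostrophe. This kind of repair only makes sense for files known to have been written with byte-wise escapes.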
"I assume this means my app is outputting three UTF-16 characters instead of one UTF-8."
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.
If this is the case, you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess, this.
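To make the difference between relabelling and transcoding concrete, here is a small sketch, assuming Ruby 2.0 or later (for `String#b`). `relabel_as_utf8` and `misread_as_latin1` are hypothetical helper names used only for illustration.

```ruby
# Hypothetical helper: keep the bytes, change only the encoding label.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding("UTF-8")
end

# Hypothetical helper: treat each byte as an ISO-8859-1 character and
# transcode to UTF-8, which reproduces the mangling described above.
def misread_as_latin1(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

raw = "O\xE2\x80\x99Reilly".b          # binary (ASCII-8BIT) copy of the bytes

puts relabel_as_utf8(raw).length       # 8: the apostrophe is one character
puts misread_as_latin1(raw).length     # 10: every byte became its own character
```

`force_encoding` never changes the bytes, so it is the right tool when the data is already valid UTF-8 and only carries the wrong label; `encode` rewrites the bytes and will faithfully preserve a wrong interpretation.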