Ruby: Convert encoded character to actual UTF-8 character
Ruby is not playing nicely with UTF-8 strings. I am passing data in an XML file, and although the XML document is declared as UTF-8, Ruby treats each byte of a multi-byte character (two bytes per character here) as an individual character.
I have started encoding the input strings in the '\uXXXX' format, but I cannot figure out how to convert this to an actual UTF-8 character. I have been searching all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.
Basically, I want to convert the string '\u03a3' -> "Σ".
What I had is:
data.gsub /\\u([a-zA-Z0-9]{4})/, $1.hex.to_i.chr
Which of course gives "931 out of char range" error.
Thank you Tim
Try this:
[0x50].pack("U")
where 0x50 is the Unicode code point (in hex) of the character you want; pack("U") returns it encoded as UTF-8.
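Applied to the question's '\uXXXX' strings, the pack("U") idea could look roughly like the sketch below. The helper name decode_u_escapes is made up for illustration, and the pattern is narrowed to hex digits. The replacement is given as a block because a block runs once per match, so $1 refers to the current match; an expression passed as the second argument is evaluated once, before gsub starts. Four-digit escapes only cover the Basic Multilingual Plane, so surrogate pairs are not handled here.

```ruby
# Hypothetical helper (not from the thread): replaces each \uXXXX escape
# with the UTF-8 encoding of that code point.
def decode_u_escapes(text)
  # The block is evaluated for every match, so $1 is the current hex group.
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

# Single quotes keep the backslash literal, as in the question's input.
puts decode_u_escapes('Sum: \u03a3')
```

Array#pack with the "U" directive is available in Ruby 1.8 as well, so this should not require an upgrade.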
Does something break because Ruby strings treat UTF-8 encoded code points as two characters? If not, then you should not worry too much about it. If something does break, please add a comment to let us know. It is probably better to solve that problem than to look for a workaround.
If you need to do conversions, look at the Iconv library.
In any case, the XML numeric character reference &#x3a3; (Σ) could be a better alternative to \u03a3. \uXXXX is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how a JSON library does it.
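If the strings are already written with '\uXXXX' escapes, turning them into XML numeric character references is a plain text substitution, and a conforming XML parser then resolves the references itself. A minimal sketch, assuming a made-up helper name u_escapes_to_xml_refs:

```ruby
# Hypothetical helper (not from the thread): rewrites \uXXXX escapes as
# XML numeric character references of the form &#xXXXX;.
def u_escapes_to_xml_refs(text)
  # Reuse the four hex digits as-is; XML allows leading zeros in &#x...;.
  text.gsub(/\\u([0-9a-fA-F]{4})/) { "&#x#{$1};" }
end

puts u_escapes_to_xml_refs('Sum: \u03a3')
```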
Ruby (at least 1.8.6) doesn't have full Unicode support. Integer#chr only supports ASCII characters and otherwise only values up to 255, shown in octal notation ('\377').
To demonstrate:
irb(main):001:0> 255.chr
=> "\377"
irb(main):002:0> 256.chr
RangeError: 256 out of char range
from (irb):2:in `chr'
from (irb):2
You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded, though the examples stop at 255.
Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how well it'll help.
Otherwise, I don't think you can do quite what you want in Ruby, currently.
You can pass an encoding to Integer#chr:
chr([encoding]) → string
Returns a string containing the character represented by the int's value according to encoding.
65.chr #=> "A"
230.chr #=> "\xE6"
255.chr(Encoding::UTF_8) #=> "\u00FF"
So instead of using .chr, use .chr(Encoding::UTF_8).
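Put together with the gsub from the question, that might look like the sketch below. It assumes Ruby 1.9 or later, since the Encoding constants do not exist in 1.8.6, and the helper name decode_with_chr is made up for illustration. The replacement is written as a block so that $1 is read for each match rather than once before gsub runs.

```ruby
# Hypothetical helper (not from the thread); needs Ruby 1.9+ for Encoding.
def decode_with_chr(text)
  # Convert the four hex digits to an Integer, then to a UTF-8 character.
  text.gsub(/\\u([0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
end

# Single quotes keep the backslash literal, as in the question's input.
puts decode_with_chr('Sigma: \u03a3')
```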