Ruby: Convert encoded character to actual UTF-8 character

Ruby will not play nice with UTF-8 strings. I am passing data in an XML file, and although the XML document is declared as UTF-8, Ruby treats each multi-byte character (two bytes per character here) as separate single-byte characters.

I have started encoding the input strings in the '\uXXXX' format, but I cannot figure out how to convert this to an actual UTF-8 character. I have been searching all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.

Basically, I want to convert the string '\u03a3' -> "Σ".

What I had is:

data.gsub /\\u([a-zA-Z0-9]{4})/,  $1.hex.to_i.chr

Which of course gives "931 out of char range" error.

Thank you Tim


Try this:

[0x50].pack("U")

where 0x50 is the hex value of the character's Unicode code point (0x50 is the letter "P"). For Σ you would use [0x03a3].pack("U").
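Applied to the question's input, the pack("U") idea could look like the sketch below. The helper name decode_unicode_escapes is hypothetical, and the sketch assumes every escape is exactly four hex digits (surrogate pairs are not handled). It uses the block form of gsub because in the question's version the replacement argument is evaluated once, before gsub runs, so $1 does not refer to the current match.

```ruby
# Hypothetical helper: replaces literal \uXXXX escapes in a string with the
# UTF-8 encoding of the corresponding Unicode code point.
def decode_unicode_escapes(text)
  # The block is evaluated once per match, so $1 holds that match's hex digits.
  # pack("U") encodes an integer code point as UTF-8.
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

puts decode_unicode_escapes('Sigma: \u03a3')  # prints "Sigma: Σ"
```

Note that the character class is narrowed to hex digits; the original [a-zA-Z0-9] would also match non-hex letters such as "z".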


Does something break because Ruby strings treat a UTF-8 encoded code point as two characters? If not, then you should not worry too much about it. If something does break, please add a comment to let us know. It is probably better to solve that problem than to look for a workaround.

If you need to do conversions, look at the Iconv library.

In any case, the XML character reference &#x03a3; could be a better alternative to \u03a3. The \uXXXX format is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how a JSON library does it.


Ruby (at least, 1.8.6) doesn't have full Unicode support. Integer#chr only accepts values from 0 to 255; values beyond the ASCII range are displayed in octal notation ('\377').

To demonstrate:

irb(main):001:0> 255.chr
=> "\377"
irb(main):002:0> 256.chr
RangeError: 256 out of char range
        from (irb):2:in `chr'
        from (irb):2

You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded -- though the examples stop at 255.

Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how much it will help.

Otherwise, I don't think you can do quite what you want in Ruby, currently.


In Ruby 1.9 and later, you can pass an encoding to Integer#chr:

chr([encoding]) → string

Returns a string containing the character represented by the int's value according to encoding.

65.chr    #=> "A"
230.chr   #=> "\xE6"
255.chr(Encoding::UTF_8)   #=> "\u00FF"

So instead of using .chr, use .chr(Encoding::UTF_8).
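Put together with the substitution from the question, that might look like the following sketch. It assumes Ruby 1.9 or later (the Encoding class does not exist in 1.8.6), the helper name decode_with_chr is hypothetical, and only four-digit escapes are handled.

```ruby
# Hypothetical helper: converts literal \uXXXX escapes to UTF-8 characters
# using Integer#chr with an explicit encoding (Ruby 1.9+).
def decode_with_chr(text)
  # Block form of gsub: $1 is the hex digits captured for the current match.
  text.gsub(/\\u([0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
end

puts decode_with_chr('\u03a3')  # prints "Σ"
```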
