Ruby hexacode to unicode conversion

I crawled a website that contains Unicode, and the results look something like this when written as code:

a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"

How do I convert this back to the original Unicode text, encoded as UTF-8, in Ruby?


If you have Ruby 1.9, you can try:

a.force_encoding('UTF-8')

Otherwise, if you are on a version older than 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
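
A minimal sketch of what force_encoding does, assuming Ruby 1.9 or later. It only relabels the bytes already in the string and converts nothing, so it helps when the bytes are real UTF-8 that arrived tagged as binary; it will not turn a literal backslash-u sequence into a character. The helper name label_as_utf8 is made up for illustration.

```ruby
# Hypothetical helper: relabel raw bytes as UTF-8 and check that they are valid.
# force_encoding changes only the encoding tag, never the bytes themselves.
def label_as_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

# Three bytes tagged as binary, as they might arrive from a socket.
raw = "\xE2\x99\xA5".dup.force_encoding("ASCII-8BIT")
text = label_as_utf8(raw)

puts raw.length    # 3, because a binary string counts bytes
puts text.length   # 1, because the same bytes are now read as one character
puts text.encoding # UTF-8
```

If the string holds the six ASCII characters of an escape such as \u2665, the result is still those six characters, so this only applies when the crawler returned real UTF-8 bytes.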


Short answer: you should be able to run 'puts a' and see the string printed out. For me, at least, that string prints in both 1.8.7 and 1.9.2.

Long answer: first of all, it depends on whether you're using Ruby 1.8.7 or 1.9.2, since the way strings and encodings are handled changed between them.

In 1.8.7, strings are just lists of bytes. If your OS can handle it, you can just 'puts a' and it should print correctly. If you do a[0], you'll get the first byte. If you want to get each character, things are pretty darn tricky.

In 1.9.2, strings are lists of bytes with an encoding attached. If the web page was sent with the correct encoding, your string should already be encoded correctly. If not, you'll have to set it (as per Mike Lewis's answer). If you do a[0], you'll get the first character (the heart). If you want each byte, you can do a.bytes.
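
To make the character versus byte distinction concrete, here is a small sketch for Ruby 1.9 or later. It assumes the string already holds real characters rather than literal escapes, and string_summary is a hypothetical helper, not part of Ruby.

```ruby
# Hypothetical helper for illustration: report how Ruby 1.9+ sees a string.
def string_summary(str)
  {
    :encoding      => str.encoding.name,
    :first_char    => str[0],              # indexing returns a character, not a byte
    :chars         => str.length,
    :bytes         => str.bytesize,
    :leading_bytes => str.bytes.first(3)   # raw bytes behind the first character
  }
end

summary = string_summary("\u2665 \uc624")
puts summary[:first_char]   # the heart
puts summary[:chars]        # 3 (heart, space, one Hangul syllable)
puts summary[:bytes]        # 7 (3 + 1 + 3 bytes in UTF-8)
p summary[:leading_bytes]   # [226, 153, 165]
```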


If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P

Here's what you can do:

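a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }

This scans for the literal ASCII sequence '\u' followed by exactly four hexadecimal digits and replaces each match with the corresponding Unicode character. Limiting the match to four hex digits keeps it from swallowing ordinary letters or digits that come right after an escape.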
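
Here is the same idea as a runnable sketch applied to the string from the question. The method name decode_unicode_escapes is made up for illustration, and it assumes every escape is a single four-digit code unit; surrogate pairs for characters outside the Basic Multilingual Plane are not handled.

```ruby
# Hypothetical helper: turn literal "\uXXXX" escapes into UTF-8 characters.
# Assumes each escape is exactly four hex digits (no surrogate pairs).
def decode_unicode_escapes(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) do
    # $1 holds the four hex digits; pack("U") encodes the code point as UTF-8.
    [$1.hex].pack("U")
  end
end

a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
decoded = decode_unicode_escapes(a)

puts decoded            # ♥ 오 빠! 죽 기 전 에
puts decoded.encoding   # UTF-8
```

Text without any escapes passes through unchanged, so it should be safe to run this over mixed content.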


You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889

Compared to Mike's solution, this avoids trouble if you forget to force the encoding before exposing the string to the rest of your application, which matters when your module or class has several mechanisms for retrieving strings. However, if you begin crawling SJIS or KOI-8 encoded websites, Mike's solution will be easier to adapt to the character encoding name returned by the web server in its headers.
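
As a rough sketch of declaring the encoding at the IO level (Ruby 1.9 or later), the mode string "r:UTF-8" tells Ruby which external encoding to tag the data with as it is read. The helper read_with_encoding and the temporary file are assumptions for the demo; a crawler would more likely read from a socket or an HTTP response.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file, declaring its encoding up front
# so every string that comes out is already tagged correctly.
def read_with_encoding(path, encoding = "UTF-8")
  File.open(path, "r:#{encoding}") { |io| io.read }
end

Tempfile.open("encoding_demo") do |file|
  file.binmode
  file.write("\xE2\x99\xA5")   # the UTF-8 bytes of the heart character
  file.flush

  text = read_with_encoding(file.path)
  puts text.encoding   # UTF-8
  puts text.length     # 1
end
```

Passing a different name as the second argument is one way to adapt this to other encodings, provided the Ruby build supports that encoding.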
