How do I escape a Unicode string with Ruby?
I need to encode/convert a Un开发者_JS百科icode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump
:
If you already have the character in a Ruby String
object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump
method on it. For example, with a file called unicode.txt
that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$
(plus a trailing newline) – running the following code (executed either in irb
or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that €
is U+20AC, £
is U+00A3, and ¥
is U+00A5. ($
is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect
instead of #dump
. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect'
prints an unhappy face for me, rather than an escape sequence. Changing inspect
to dump
, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode
and rescue
:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of
精彩评论