How do I escape a Unicode string with Ruby?

2023-02-22 03:22 Q&A author:

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?

In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.

>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"

>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""

>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil

In Ruby 1.9, if you want multi-byte characters to have their component bytes escaped, you might want to say something like:

>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
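On Ruby 2.0 and later, a shorter route to the same byte-escaped form may be String#b, which returns a binary (ASCII-8BIT) copy of the string. A minimal sketch, assuming a UTF-8 source string; the variable name is only for illustration:

```ruby
# Assumes Ruby 2.0+ (String#b) and a UTF-8 encoded string.
multi_byte_str = "hello\u0639!"

# String#b returns a copy tagged as binary, so #inspect escapes
# every non-ASCII byte individually instead of showing the character.
byte_escaped = multi_byte_str.b.inspect
puts byte_escaped
```

This avoids building one single-byte string per byte and joining them, but it relies on #inspect's formatting, so treat the exact output as something to check on your Ruby version.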

In both Ruby 1.8 and 1.9, if you are instead interested in the (escaped) Unicode code points, you could do this (though it escapes printable characters too):

>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
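If escaping the printable ASCII characters is not wanted, the same unpack('U*') idea can be narrowed to non-ASCII code points only. This is a rough sketch, not part of the original answer; escape_non_ascii is a made-up helper name, and it assumes Ruby 1.9+ with a valid UTF-8 input string:

```ruby
# Hypothetical helper: escape only non-ASCII code points,
# leaving ASCII characters exactly as they are.
def escape_non_ascii(str)
  str.unpack('U*').map { |cp|
    if cp < 0x80
      [cp].pack('U')          # ASCII: keep the character itself
    elsif cp <= 0xFFFF
      format('\\u%04x', cp)   # BMP: backslash-u plus four hex digits
    else
      format('\\u{%x}', cp)   # above U+FFFF: braces form
    end
  }.join
end

puts escape_non_ascii("hello\u0639!")
```

The braces branch is there because four hex digits cannot represent code points above U+FFFF.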

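To use a Unicode character in a Ruby string literal (1.9 and later), use the "\uXXXX" escape, where XXXX is the character's Unicode code point written as four hexadecimal digits. Code points above U+FFFF need the braces form, "\u{XXXXX}". See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/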
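For illustration, here is a small sketch of both escape forms in double-quoted literals (Ruby 1.9+). The variable names are arbitrary, and the code points are ones mentioned elsewhere on this page:

```ruby
micro = "\u00b5"      # U+00B5, four hex digits
face  = "\u{1F61E}"   # U+1F61E, above U+FFFF, so the braces form is used

# Print each code point back as hex to confirm what the literals contain.
puts micro.unpack('U*').map { |cp| cp.to_s(16) }.join(' ')
puts face.unpack('U*').map { |cp| cp.to_s(16) }.join(' ')
```

Note that these escapes are only interpreted in double-quoted strings; in single quotes the backslash and the letters are kept literally.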

If you have Rails kicking around you can use the JSON encoder for this:

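require 'active_support'
require 'active_support/json' # loads the JSON encoder explicitly
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5" (wrapped in JSON double quotes) on the Rails version this was
# written against; newer ActiveSupport versions may leave non-ASCII characters unescaped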

The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
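That is the default behaviour. As far as I know, the standard library's json generator can be asked to escape non-ASCII characters through its ascii_only option. A minimal sketch, assuming a reasonably recent json library; the string is wrapped in an array so it also works on older versions that only generate arrays and objects:

```ruby
require 'json'

# ascii_only: true asks the generator to emit \uXXXX escapes
# for anything outside ASCII.
escaped = JSON.generate(["\u00b5"], ascii_only: true)
puts escaped
```

Characters above U+FFFF come out as UTF-16 surrogate pairs, since that is what JSON's \u escape allows.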

There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.

Finding the value:

Method 1a: from Ruby with `String#dump`:

If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the REPL (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):

s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.

... should print out:

"\u20AC\u00A3\u00A5$\n"

Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)

(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
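To make the #dump behaviour concrete, here is a small sketch using the same character as in the note. It assumes a UTF-8 string; the round trip through String#undump additionally assumes Ruby 2.5 or later:

```ruby
s = "\u{1F61E}"

# #dump escapes non-ASCII characters regardless of the
# external encoding settings that influence #inspect.
dumped = s.dump
puts dumped

# Ruby 2.5+ only: #undump reverses #dump.
restored = dumped.undump
```

Because #dump does not depend on the terminal or default encodings, it is the safer choice when the escaped form is going to be stored or compared.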

Method 1b: with Ruby using `String#encode` and `rescue`:

If you try this with a larger input file, the dumped output may prove unwieldy – it may be hard to even find the escape sequences in mostly-ASCII text, or to tell which sequence goes with which character. In that case, one might replace the second line of the code above with the following:

encodings = {} # hash to store mappings in
s.each_char do |c| # loop through each character
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError => e # but if that fails
    encodings[c] = e.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end

With the same input as above, this would then print:

€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
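The same mapping can probably be built without relying on an exception, by asking each character whether it is ASCII. This is an alternative sketch, not the original answer's method; it assumes Ruby 1.9+ and uses an inline string in place of the file contents:

```ruby
# Stand-in for the contents of unicode.txt from the example above.
s = "\u20AC\u00A3\u00A5$\n"

encodings = {}
s.each_char do |c|
  # String#ascii_only? is true for plain ASCII characters, which are skipped.
  encodings[c] = c.dump unless c.ascii_only?
end

encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
```

Skipping the encode/rescue round trip keeps the loop shorter, at the cost of no longer exercising the actual conversion to ASCII.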

Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of
