Why are two strings with the same bytes and encoding not identical in Ruby 1.9?
In Ruby 1.9.2, I found a way to make two strings that have the same bytes, the same encoding, and are equal, yet have a different length and return different characters from [].
Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
Below is the code that reproduces this behavior. The comments that start with #=> show the output I am getting from this script, and the parenthetical words give my judgment of that output.
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
This is a bug in Ruby, and it was fixed in r29848.
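If you want to check whether your own interpreter has the fix, here is a minimal sketch. The helper name consistent_with_bytes? is my own invention, not part of Ruby or of the thread. It rebuilds a string from its raw bytes and compares the character-level results with the original. I set the encodings explicitly so the result does not depend on the source file encoding, and I assume that a fixed interpreter treats 0xA2 as a code point when appending to a UTF-8 string.

```ruby
# coding: utf-8

# Hypothetical helper (not from the thread): rebuild the string from its raw
# bytes, give it the same encoding, and compare length and characters.
def consistent_with_bytes?(str)
  rebuilt = str.bytes.to_a.pack("C*").force_encoding(str.encoding)
  str.length == rebuilt.length && str.chars.to_a == rebuilt.chars.to_a
end

# dup first so the string literals themselves are never mutated.
string1 = "\xC2\xA2".dup.force_encoding("UTF-8")
string2 = "".dup.force_encoding("UTF-8").concat(0xA2)

puts consistent_with_bytes?(string1)
puts consistent_with_bytes?(string2)
```

On the 1.9.2p0 build from the question, I would expect the second check to print false, because string2.length reports 2 while the rebuilt string reports 1.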
Matz mentioned this question via Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."
I think the problem is in the string's encoding. Check out James Gray's Shades of Gray: Ruby 1.9's String article on Unicode encoding.
Additional odd behavior:
# coding: utf-8
string1 = "\xC2\xA2"
string2 = "".concat(0xA2)
string3 = 0xC2.chr + 0xA2.chr
string1.bytes.to_a # => [194, 162]
string2.bytes.to_a # => [194, 162]
string3.bytes.to_a # => [194, 162]
string1.encoding.name # => "UTF-8"
string2.encoding.name # => "UTF-8"
string3.encoding.name # => "ASCII-8BIT"
string1 == string2 # => true
string1 == string3 # => false
string2 == string3 # => true
string1.length # => 1
string2.length # => 2
string3.length # => 2
string1[0] # => "¢"
string2[0] # => "\xC2"
string3[0] # => "\xC2"
string3.unpack('C*') # => [194, 162]
string4 = string3.unpack('C*').pack('C*') # => "\xC2\xA2"
string4.encoding.name # => "ASCII-8BIT"
string4.force_encoding('UTF-8') # => "¢"
string3.force_encoding('UTF-8') # => "¢"
string3.encoding.name # => "UTF-8"
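The string3 and string4 lines above come down to one thing: force_encoding only changes the encoding label on a string and leaves the bytes alone. A small sketch of just that part, using my own variable names raw and utf8:

```ruby
# coding: utf-8

# Two raw bytes, tagged as binary (ASCII-8BIT) by pack.
raw = [0xC2, 0xA2].pack("C*")
puts raw.encoding == Encoding::ASCII_8BIT  # true
puts raw.length                            # 2, one "character" per byte

# Relabel a copy as UTF-8. No bytes are converted or moved.
utf8 = raw.dup.force_encoding("UTF-8")
puts utf8.bytes.to_a == raw.bytes.to_a     # true
puts utf8.valid_encoding?                  # true
puts utf8.length                           # 1, the two bytes form one character

# Same bytes, different encodings, non-ASCII content: not equal.
puts raw == utf8                           # false
```

This matches string1 == string3 being false in the listing. What it does not explain is string2 having a UTF-8 label and a length of 2, which is the part that looks like the bug.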