Active Record serialized attr losing string encodings (probably YAML issue), workarounds?
I'm using Rails 2.3.8 with Ruby 1.9.1 and I'm having a problem with serialized attributes in Active Record not preserving string encodings. The underlying problem is probably YAML, but I'm wondering if anyone has any good ideas on how to handle this. The app I'm working on has numerous serialized fields, some of which contain deep structures of arrays and hashes. Getting back an ASCII-8BIT string (that's actually UTF-8) deep within those structures wreaks havoc later...
This is perhaps best illustrated by example: if I save l to a serialized attribute in an Active Record model, I get back l2 on reading from the database.
    >> l
    => ["English", "Türkçe", "Русский"]
    >> l.map(&:encoding)
    => [#<Encoding:UTF-8>, #<Encoding:UTF-8>, #<Encoding:UTF-8>]
    >> l.map(&:valid_encoding?)
    => [true, true, true]
    >> l.to_yaml
    => "--- \n- English\n- !binary |\n VMO8cmvDp2U=\n\n- \"\\xD0\\xA0\\xD1\\x83\\xD1\\x81\\xD1\\x81\\xD0\\xBA\\xD0\\xB8\\xD0\\xB9\"\n"
    >> l2 = YAML.load(l.to_yaml)
    => ["English", "T\xC3\xBCrk\xC3\xA7e", "Русский"]
    >> l2.map(&:encoding)
    => [#<Encoding:UTF-8>, #<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]
Does anyone know how YAML decides whether to store a string as binary or as an escaped string? Both of the last two strings above contain non-ASCII characters, but only the first of them ("Türkçe") is stored as binary...
My current thinking is to hook the active record deserialization routine, walk hashes and arrays and force encoding on all the string elements. Not terribly safe or general, but would probably work for my use case, though I also wonder if anyone's patched YAML to be smarter here...
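The walking approach described above could look roughly like the sketch below. The helper name force_utf8_deep is hypothetical, and the sketch assumes that every ASCII-8BIT string in the structure is really UTF-8 text. It only re-tags a string when the bytes are valid UTF-8, so genuine binary data is left alone. Wiring it into Active Record's deserialization is not shown here.

```ruby
# Hypothetical helper (not part of Rails or YAML): recursively walk arrays
# and hashes and re-tag ASCII-8BIT strings as UTF-8.
# Assumption: any ASCII-8BIT string that is valid UTF-8 is really text.
def force_utf8_deep(value)
  case value
  when String
    if value.encoding == Encoding::ASCII_8BIT
      # Work on a copy so the caller's string is not mutated.
      fixed = value.dup.force_encoding(Encoding::UTF_8)
      # Keep the original if the bytes are not valid UTF-8 (real binary data).
      fixed.valid_encoding? ? fixed : value
    else
      value
    end
  when Array
    value.map { |item| force_utf8_deep(item) }
  when Hash
    # Rebuild the hash, fixing both keys and values.
    value.each_with_object({}) do |(key, item), result|
      result[force_utf8_deep(key)] = force_utf8_deep(item)
    end
  else
    value
  end
end

# Example: a string that came back from YAML tagged as ASCII-8BIT.
broken = "T\xC3\xBCrk\xC3\xA7e".dup.force_encoding(Encoding::ASCII_8BIT)
fixed  = force_utf8_deep(["English", broken, { "name" => broken }])
puts fixed.flatten.first(2).map(&:encoding).inspect
```

Note that this returns new arrays and hashes rather than modifying them in place, so things like a hash's default value are not carried over. It would be applied to the result of YAML.load, or to the value read from the serialized attribute.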
I did come up with one solution:
Monkey patching String can force YAML to use \ escaping rather than binary, and therefore return strings in the default encoding (UTF-8 for me) rather than ASCII-8BIT.
    class String
      def is_binary_data?
        encoding == Encoding::ASCII_8BIT unless empty?
      end
    end
Originally this routine uses some heuristics around which would be shorter, \ escaping or binary encoding of the string, which is why only some of the international strings I had were having problems.