Ruby 1.9, YAML, and string encodings: how to lead a life of sanity?

It seems to me that the YAML library that ships with Ruby 1.9 is encoding-deaf.

What this means is that when generating YAML, it'll take any string of bytes, and escape any byte sequence that doesn't output clean ASCII. That's lame, but acceptable.

My problem is the other way around: loading content back from said YAML dump.

In the example that follows I create a UTF-8 string and dump it; it's dumped with the type !binary. When I load it back, it has the encoding ASCII-8BIT. At the end of the example I try to concatenate both the original and the reloaded string with another UTF-8 string. The latter fails with an Encoding::CompatibilityError.

require 'yaml'
s0 = "Iñtërnâtiônàlizætiøn"
y  = s0.to_yaml
s1 = YAML::load y
puts s0                 # => Iñtërnâtiônàlizætiøn
puts s0.encoding        # => UTF-8
puts s1                 # => Iñtërnâtiônàlizætiøn
puts s1.encoding        # => ASCII-8BIT
puts y                  # => --- !binary |
                        #    ScOxdMOrcm7DonRpw7Ruw6BsaXrDpnRpw7hu
puts "ñårƒ" + s0        # => ñårƒIñtërnâtiônàlizætiøn
puts "ñårƒ" + s1        # => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

I think it's clear how this will quickly lead to trouble when you're dealing with some YAML source containing nested hashes and arrays with leaf strings.

Currently I have some code that traverses all hashes and arrays and calls force_encoding on each string. That, to say the least, is unsightly.
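
For reference, the traversal described above could look roughly like the sketch below. The helper name force_utf8 is made up for illustration and isn't part of any library, and the whole thing assumes you already know every string in the dump really is UTF-8. The reloaded data is simulated by re-tagging a literal as binary, so the snippet doesn't depend on which YAML engine is active.

```ruby
# encoding: UTF-8

# Hypothetical helper (the name is an assumption, not a library method):
# walks nested hashes and arrays and returns a copy in which every
# string is tagged as UTF-8. Bytes are not changed, only the encoding tag.
def force_utf8(obj)
  case obj
  when String
    # dup first so frozen strings and the caller's data are left alone
    obj.dup.force_encoding(Encoding::UTF_8)
  when Array
    obj.map { |item| force_utf8(item) }
  when Hash
    obj.each_with_object({}) do |(key, value), result|
      result[force_utf8(key)] = force_utf8(value)
    end
  else
    obj # numbers, nil, symbols etc. pass through untouched
  end
end

# Simulate what the loader hands back: UTF-8 bytes tagged as binary.
binary = "Iñtërnâtiônàlizætiøn".dup.force_encoding(Encoding::BINARY)
data   = { "title" => binary, "tags" => [binary, 42] }

fixed = force_utf8(data)
puts fixed["title"].encoding
puts "ñårƒ" + fixed["title"]
```

It returns a new structure rather than mutating in place, which costs some copying but avoids surprises with frozen strings such as hash keys.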

What I'm looking for right now is a way to tell YAML::load that any string that comes in should be treated as, and therefore have its encoding set to UTF-8.


Ideally, Ruby's YAML should just annotate the strings it dumps with the proper encoding. There's a Ya2YAML project that attempts to dump UTF-8 safe YAML. I'm not sure how far along it is. If anyone has played with it, I welcome any thoughts.

Regardless of that, I still have these dumps without any encoding information to deal with, although I know they are all UTF-8.


Consider upgrading your Ruby to the latest 1.9.2.

I found that bug in 1.9.1 but not in 1.9.2.


YAML::ENGINE.yamler='psych'
'Résumé'.to_yaml # => "--- Résumé\n...\n"
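
A quick way to check the round trip, not just the dump, is sketched below. It assumes Psych is the engine actually in use; on Rubies where Psych is already the default the YAML::ENGINE line isn't needed, so it's left out here.

```ruby
# encoding: UTF-8
require 'yaml'

original = "Résumé"
dumped   = original.to_yaml    # with Psych this is plain text, no !binary tag
reloaded = YAML.load(dumped)

puts dumped
puts reloaded.encoding
puts reloaded == original
puts "ñårƒ" + reloaded         # concatenates without a CompatibilityError
```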


First, the text file that you're attempting to read (this should be your YAML file) must be UTF-8 encoded.

Then add this line to the top of your Ruby file, hash sign and all:

# encoding: UTF-8

This makes UTF-8 the source encoding for that file, so string literals 'like this', including any text you pass to YAML.dump('text'), are encoded as UTF-8, and all should work well from here on in. Note that the comment only governs literals written in that file; strings that arrive from elsewhere keep whatever encoding they were given.
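
A small sketch of what the magic comment does and doesn't cover. The variable names are just for illustration; the runtime-built string stands in for any data that doesn't come from a literal in the file.

```ruby
# encoding: UTF-8

literal = "Résumé"                   # a literal in this file
raw     = [0xC3, 0xA9].pack("C*")    # the UTF-8 bytes of "é", built at runtime

puts __ENCODING__                    # source encoding of this file
puts literal.encoding                # literals follow the source encoding
puts raw.encoding                    # runtime bytes stay binary

# Re-tagging the bytes is still needed for data like this.
retagged = raw.dup.force_encoding(Encoding::UTF_8)
puts retagged == "é"
```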


Evgeny's answer still shows binary for me, but this works ('syck' instead of 'psych'):

YAML::ENGINE.yamler='syck'
'Résumé'.to_yaml # => "--- "R\xE9sum\xE9"

I'm using Ruby 1.9. Note for my purposes having the specials be escaped is fine - I just needed it to not show !binary... for normal words. Thank god .to_yaml is functional for me again - used to use it all the time. How to load a life of sanity indeed :)
