
libxml converts accented characters into backslash-x escapes, and JSON is not happy

I have the following attribute in an XML node I'm reading with libxml. It prints normally, accented character included, if I print reader.node.

reader = XML::Reader.new(File.open("somefile.xml", "r"))
reader.read
reader.read
...
p reader.node

=> ... Full_Name="Univisión Network - East Feed" ...

If I do this, though, it comes out escaped.

p reader.node["Full_Name"]
=> "Univisi\xC3\xB3n Network - East Feed"

And when I try to convert this value to JSON later, I get the following error.

Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
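The error can be reproduced without libxml. This is only a minimal sketch: the `raw` string is a hypothetical stand-in for the attribute value, built by hand as UTF-8 bytes carrying a binary (ASCII-8BIT) label, and the explicit `encode` call stands in for the conversion that to_json appears to attempt.

```ruby
# Hypothetical stand-in for the value returned by reader.node["Full_Name"]:
# the bytes are valid UTF-8, but the string is labelled binary (ASCII-8BIT).
raw = "Univisi\xC3\xB3n Network - East Feed".b

puts raw.encoding

error_message = nil
begin
  # Transcoding from binary to UTF-8 has no mapping for bytes above 0x7F,
  # so this raises instead of returning a string.
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error_message = e.message
end

puts error_message
```

The bytes themselves are fine. Only the label on the string is wrong, which is why transcoding fails on the first byte above 0x7F.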

Here is the XML declaration at the top of the document:

<?xml version="1.0" encoding="ISO-8859-1"?>

I don't have control over the XML document itself. How can I get that Unicode character back into JSON, or into a format JSON understands?

EDIT: Oh, I forgot to mention - this is how it looks in the actual XML document

Full_Name="Univisi&#243;n Network - East Feed" 


So, I'm still completely lost as to why I couldn't figure out the "Right" way to do it, but this thread helped to find the force_encoding method on the String class. Since my code involves copying attributes into a hash anyway, it's not a big deal to call force_encoding when I copy the value.

I doubly made sure I had saved the file as UTF-8, and put the right xml declaration at the top. It still failed.

Anyway, until I can figure out how to fix the actual problem, this code fixed it.

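  object = { type: node.name }
  node.attributes.each do |attribute|
    # Strip underscores from the attribute name (Full_Name becomes FullName).
    name = attribute.name.delete("_")
    # Relabel the bytes as UTF-8; the bytes themselves are not changed.
    value = attribute.value.force_encoding('UTF-8')

    object[name] = value
  end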

Note that this would not be worth the trouble if I weren't already copying the node into a hash. If I then do

object.to_json

It works without a problem. Thanks for all your help ax! Do you have any idea how I can force the encoding on the xml?


EDIT
so i've been trying to figure this out for quite some time now. funny thing: your code works without error in ruby 1.8 (at least here), so i think the error has to do with ruby 1.9's new encoding handling. somehow it cannot figure out that the parsed XML is in (libxml's internal) utf-8 format. the document encoding doesn't matter here: in 1.8 it works with both iso-8859-1 and utf-8, even with the wrong xml encoding declaration. instead, ruby 1.9 treats the string as ASCII-8BIT, also known as BINARY. in other words, it doesn't know the encoding, which is why to_json fails when it tries to convert the string to utf-8.

your easiest way to solve it might be to downgrade to ruby 1.8.

alternatively, your approach of force_encoding('UTF-8') seems to be reasonable.
EDIT END
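a rough sketch of the force_encoding approach, with one guard added. the `raw` string is again a hand-built stand-in for the libxml value, the "FullName" key is only illustrative, and the valid_encoding? check is an extra safety step that is not part of the original code.

```ruby
require "json"

# Hypothetical stand-in for a libxml attribute value:
# valid UTF-8 bytes carrying a binary (ASCII-8BIT) label.
raw = "Univisi\xC3\xB3n Network - East Feed".b

# force_encoding only swaps the label; dup keeps the original string untouched.
fixed = raw.dup.force_encoding("UTF-8")

# Assumed extra guard: refuse to continue if the bytes are not really UTF-8.
raise ArgumentError, "value is not valid UTF-8" unless fixed.valid_encoding?

json = { "FullName" => fixed }.to_json
puts json
```

force_encoding does not transcode anything, so it is only safe when the bytes already are UTF-8, which is what libxml hands back here.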

you can try passing the proper encoding to the reader:

reader = XML::Reader.new(File.open("somefile.xml", "r"), 
  XML::Encoding::ISO_8859_1)


If I do this, though, it comes out escaped.

Not quite. What you're seeing is UTF-8 output interpreted as a string of bytes.

The problem is that your XML document says it's ISO-8859-1, while it is really UTF-8. Fix the encoding problems and it should work.
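To see why the \xC3\xB3 pair in the question is UTF-8 and not ISO-8859-1, compare how each encoding stores the character. This is just an illustrative sketch; the variable names are made up, and 243 is the code point from the &#243; reference in the document.

```ruby
# Build the character from code point 243 (U+00F3); pack("U") yields UTF-8.
o_acute = [243].pack("U")

utf8_bytes   = o_acute.bytes                       # two bytes in UTF-8
latin1_bytes = o_acute.encode("ISO-8859-1").bytes  # one byte in ISO-8859-1

puts utf8_bytes.map { |b| format("\\x%02X", b) }.join
puts latin1_bytes.map { |b| format("\\x%02X", b) }.join
```

The two-byte form matches the escapes Ruby printed, so the string already holds UTF-8 data regardless of what the XML declaration says.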
