libxml converts accented characters into backslash-x escapes, and JSON is not happy
I have the following attribute in an xml node I'm reading with libxml. It prints out normally with the accented character if I print out reader.node.
reader = XML::Reader.new(File.open("somefile.xml", "r"))
reader.read
reader.read
...
p reader.node
=> ... Full_Name="Univisión Network - East Feed" ...
If I do this, though, it comes out escaped.
p reader.node["Full_Name"]
=> "Univisi\xC3\xB3n Network - East Feed"
And when I try to convert this value to JSON later, I get the following error.
Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
Here is the xml line in the document
<?xml version="1.0" encoding="ISO-8859-1"?>
I don't have control over the xml document itself. How can I get that unicode character back into json, or into a format json understands?
EDIT: Oh, I forgot to mention - this is how it looks in the actual XML document
Full_Name="Univisión Network - East Feed"
So, I'm still completely lost as to why I couldn't figure out the "right" way to do it, but this thread helped me find the force_encoding method on the String class. Since my code copies the attributes into a hash anyway, it's not a big deal to call force_encoding when I copy each value.
I doubly made sure I had saved the file as UTF-8, and put the right xml declaration at the top. It still failed.
Anyway, until I can figure out how to fix the actual problem, this code fixed it.
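object = { type: node.name }
node.attributes.each do |attribute|
  # Strip underscores from the attribute name (Full_Name becomes FullName).
  name = attribute.name.gsub(/_/, "")
  # The value arrives tagged ASCII-8BIT; relabel the same bytes as UTF-8.
  value = attribute.value.force_encoding('UTF-8')
  object[name] = value
end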
Note that this would not be appropriate if I didn't already need to copy the node into a hash, since it definitely wouldn't be worth the trouble. If I then do
object.to_json
It works without a problem. Thanks for all your help ax! Do you have any idea how I can force the encoding on the xml?
EDIT
so i've been trying to figure this out for quite some time now. funny thing: your code works without error in ruby 1.8 (at least here). so i think the error has to do with ruby 1.9's new encoding handling. somehow it cannot figure out that the parsed and read XML is in (libxml's internal) utf-8 format (the document encoding doesn't matter here: in 1.8 it works with both iso-8859-1 and utf-8, even with the wrong xml encoding declaration). instead, it treats it as ASCII-8BIT, or BINARY. in other words, it doesn't know the encoding, which is why to_json fails when it tries to convert the value to utf-8.
your easiest way to solve it might be to downgrade to ruby 1.8.
alternatively, your approach of force_encoding('UTF-8') seems to be reasonable.
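to make the ASCII-8BIT point concrete, here is a minimal sketch in plain ruby (2.0 or later, no libxml needed). it assumes the attribute value comes back as the utf-8 bytes for "Univisión" tagged as binary, which is what the error message suggests; the variable names are made up for illustration.

```ruby
require "json"

# Assumption: the value comes back as UTF-8 bytes tagged ASCII-8BIT (binary).
raw = "Univisi\xC3\xB3n Network - East Feed".b

# Transcoding binary to UTF-8 fails, because Ruby cannot know which
# character the byte 0xC3 is supposed to stand for.
conversion_failed =
  begin
    raw.encode("UTF-8")
    false
  rescue Encoding::UndefinedConversionError
    true
  end

# force_encoding leaves the bytes alone and only changes the encoding label.
fixed = raw.dup.force_encoding("UTF-8")

json = JSON.generate({ "FullName" => fixed })

puts conversion_failed
puts fixed.valid_encoding?
puts json
```

the thing to notice is that encode tries to translate the bytes, while force_encoding only relabels them, so the second one is the right tool when the bytes are already correct and only the label is wrong.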
EDIT END
you can try passing the proper encoding to the reader:
reader = XML::Reader.new(File.open("somefile.xml", "r"),
XML::Encoding::ISO_8859_1)
If I do this, though, it comes out escaped.
Not quite. What you're seeing is UTF-8 output interpreted as a string of bytes.
The problem is that your XML document says it's ISO-8859-1, while it is really UTF-8. Fix the encoding problems and it should work.
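One way to check what the file really contains is to look at the raw bytes: in ISO-8859-1 the character ó is the single byte 0xF3, while in UTF-8 it is the pair 0xC3 0xB3. The sketch below is a heuristic rather than a definitive test, and `to_utf8` is a hypothetical helper name. It assumes that bytes forming valid UTF-8 are UTF-8, and otherwise falls back to the declared encoding.

```ruby
# Hypothetical helper: return a UTF-8 string for bytes whose encoding label
# cannot be trusted. Valid UTF-8 is kept as is; anything else is transcoded
# from the declared encoding (ISO-8859-1 by default).
def to_utf8(bytes, declared = "ISO-8859-1")
  candidate = bytes.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding(declared).encode("UTF-8")
end

utf8_bytes   = "Univisi\xC3\xB3n".b  # what the file apparently contains
latin1_bytes = "Univisi\xF3n".b      # what a real ISO-8859-1 file would contain

puts to_utf8(utf8_bytes)
puts to_utf8(latin1_bytes)
```

Both calls should yield the same text. Accented ISO-8859-1 text is almost never valid UTF-8 by accident, which is why the validity check is a reasonable first guess.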