How to feed only a string to Nokogiri
I have the following sample XML:
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
I want to parse the XML provided in the info attribute of the <reg> tag, but I don't know how to feed the content of the info attribute to Nokogiri.
This is what I have now:
doc = Nokogiri::HTML(URI.open(mylink))
node = doc.xpath('//houses/reg')
puts node[0]['info'].class # String
# Content of the info attribute as a string. This is what I want to feed to Nokogiri as XML.
puts node[0]['info']
How can I do this?
You need to get the text of the info attribute and use the CGI class to unescape the HTML. Then you can feed the string to Nokogiri::XML and it will be parsed. Something like this:
require "nokogiri"
require "open-uri"
require "cgi"
doc = Nokogiri::HTML(URI.open("http://example.com/foo.xml"))
node = doc.xpath("//houses/reg")
info_string = CGI.unescapeHTML(node[0]['info'])
info_doc = Nokogiri::XML(info_string)
# Now you can have a Nokogiri document from that attribute.
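To see what the unescaping step does on its own, here is a minimal sketch using only the standard library. The entity-encoded string is a hypothetical value (with the closing tag written as </root>), standing in for an attribute that still carries entities when you read it:

```ruby
require "cgi"

# Hypothetical attribute value that is still entity-encoded (an assumption for illustration).
encoded = "&lt;root&gt;&lt;h level=&quot;2&quot; i=&quot;1&quot;&gt; something &lt;/h&gt;&lt;/root&gt;"

# CGI.unescapeHTML turns &lt; &gt; &quot; and &amp; back into literal characters.
decoded = CGI.unescapeHTML(encoded)
puts decoded

# A string with no entities left is returned unchanged.
puts CGI.unescapeHTML(decoded) == decoded
```

If the value you get back from Nokogiri already contains literal angle brackets, the unescape call changes nothing for this sample and can be dropped.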
require 'nokogiri'
xml = "<all>
<houses>
<reg info='<root><h level=\"2\" i=\"1\"> something </h><root>'
other=\"test\"
something
</reg>
</houses>
</all>"
doc = Nokogiri::HTML(xml)
node = doc.xpath('//houses/reg')
puts node[0]['info'].class #string
puts node[0]['info']
inner_xml = node[0]['info']
inner_doc = Nokogiri::XML(inner_xml)
puts inner_doc.xpath('root/h')[0].text
Here are some things to note:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
EOT
doc.errors # => [#<Nokogiri::XML::SyntaxError: Unescaped '<' not allowed in attributes values>, #<Nokogiri::XML::SyntaxError: attributes construct error>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag reg line 3>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: root line 3 and reg>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: root line 3 and houses>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: houses line 2 and all>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag all line 1>]
doc.at('reg')['info'] # => ""
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <all>
# >> <houses>
# >> <reg info=""/><root><h level="2" i="1"> something </h><root>'
# >> other="test"
# >> something
# >> </root>
# >> </root>
# >> </houses>
# >> </all>
Parsing XML should normally use Nokogiri::XML, as XML is a strict specification. This markup is malformed, so Nokogiri correctly flags the errors and then attempts to fix up the markup and continue parsing.
Using Nokogiri::HTML loosens the reins and lets the parser be more lenient about what it sees; HTML is notoriously badly written, so Nokogiri tries to be more accommodating:
doc = Nokogiri::HTML(<<EOT)
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
EOT
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag all invalid>, #<Nokogiri::XML::SyntaxError: Tag houses invalid>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: Tag reg invalid>]
doc.at('reg')['info'] # => "<root><h level=\"2\" i=\"1\"> something </h><root>"
puts doc.to_xml
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <all>
# >> <houses>
# >> <reg info='<root><h level="2" i="1"> something </h><root>' other="test" something>
# >> </reg></houses>
# >> </all>
# >> </body></html>
Notice how Nokogiri now:
- correctly HTML-encodes the content of info.
- correctly extracts and decodes the content of info.
- has wrapped the XML in HTML <html><body> tags due to parsing the content as HTML.
Extracting the fixed XML requires peeling back a couple of layers:
puts doc.at('all').to_xml
# >> <all>
# >> <houses>
# >> <reg info="&lt;root&gt;&lt;h level=&quot;2&quot; i=&quot;1&quot;&gt; something &lt;/h&gt;&lt;root&gt;" other="test" something="">
# >> </reg></houses>
# >> </all>
I'm not sure if Nokogiri's behavior changed since the question was originally asked, but the current behavior in v.1.6.7.2 handles the decoding correctly without needing to use CGI.
node[0].attr('info') gives you the value of the info attribute.