How to prevent Nokogiri from adding <DOCTYPE> tags?
I noticed something strange while using Nokogiri recently: all of the HTML I parsed came back with a DOCTYPE and opening and closing <html> and <body> tags added:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
How can I prevent Nokogiri from doing this?
That is, when I do:
doc = Nokogiri::HTML("<div>some content</div>")
doc.to_s
or:
doc.to_html
I get the original wrapped in extra tags:
<html blah><body><div>some content</div></body></html>
The problem occurs because you're using the wrong method in Nokogiri to parse your content.
require 'nokogiri'
doc = Nokogiri::HTML('<p>foobar</p>')
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>foobar</p></body></html>
Rather than using Nokogiri::HTML, which results in a complete document, use Nokogiri::HTML.fragment, which tells Nokogiri you only want the fragment parsed:
doc = Nokogiri::HTML.fragment('<p>foobar</p>')
puts doc.to_html
# >> <p>foobar</p>
The to_s method on a Nokogiri::HTML::Document outputs a valid HTML page, complete with its required elements. This is not necessarily what was passed in to the parser.
If you want to output less than a complete document, use methods such as inner_html, inner_text, etc., on a node.
Edit: if you are not expecting to parse a complete, well-formed XML document as input, then theTinMan's answer is best.