How to prevent Nokogiri from adding <DOCTYPE> tags?

2023-02-05 10:12

I noticed something strange using Nokogiri recently. All of the HTML I parsed came back wrapped in a DOCTYPE plus opening and closing <html> and <body> tags:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>

How can I prevent Nokogiri from doing this?

For example, when I do:

doc = Nokogiri::HTML("<div>some content</div>")
doc.to_s

or:

doc.to_html

I get the original wrapped in extra markup:

<html blah><body><div>some content</div></body></html>

The problem occurs because you're using the wrong method in Nokogiri to parse your content.

require 'nokogiri'

doc = Nokogiri::HTML('<p>foobar</p>')
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>foobar</p></body></html>

Rather than using Nokogiri::HTML, which always builds a complete document, use Nokogiri::HTML.fragment, which tells Nokogiri to parse only the fragment you pass in:

doc = Nokogiri::HTML.fragment('<p>foobar</p>')
puts doc.to_html
# >> <p>foobar</p>

The to_s method on a Nokogiri::HTML::Document outputs a valid HTML page, complete with its required elements. This is not necessarily what was passed in to the parser.

If you want to output less than a complete document, you use methods such as inner_html, inner_text, etc., on a node.

Edit: if you are not expecting to parse a complete, well-formed XML document as input, then theTinMan's answer is best.
