Cleaning HTML with Nokogiri (instead of Tidy)
The tidy gem is no longer maintained and has multiple memory leak issues.
Some people suggested using Nokogiri.
I'm currently cleaning the HTML using:
Nokogiri::HTML::DocumentFragment.parse(html).to_html
I've got two issues though:
Nokogiri removes the DOCTYPE.
Is there an easy way to force the cleaned HTML to have html and body tags?
If you are processing a full document, you want:
Nokogiri::HTML(html).to_html
That will force html and body tags, and introduce or preserve the DOCTYPE:
puts Nokogiri::HTML('<p>Hi!</p>').to_html
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
#=> "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><p>Hi!</p></body></html>
puts Nokogiri::HTML('<!DOCTYPE html><p>Hi!</p>').to_html
#=> <!DOCTYPE html>
#=> <html><body><p>Hi!</p></body></html>
Note that the output is not guaranteed to be syntactically valid. For example, if I provide a broken document that lies and claims that it is HTML 4.01 Strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section:
dtd = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
puts Nokogiri::HTML("#{dtd}<p>Hi!</p>").to_html
#=> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
#=> "http://www.w3.org/TR/html4/strict.dtd">
#=> <html><body><p>Hi!</p></body></html>
The Tidy gem might not be supported, but the underlying tidy app is maintained, and that is what you really need. It's flexible and has quite a list of options.
You can pass HTML to it in many different ways, and define its configuration in a .tidyrc file or pass the options on the command line. You could use Ruby's %x{} to pass it a file, or use IO.popen or IO.pipe to treat it as a pipe.
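Treating tidy as a pipe with IO.popen could look roughly like this. It is a minimal sketch, assuming a tidy executable is on your PATH; the helper name pipe_html and the particular set of options are my own choices for illustration, so check them against your tidy version.

```ruby
# Hypothetical helper: write an HTML string to an external command's
# stdin and return whatever the command prints on stdout.
# The default command assumes a `tidy` binary is installed.
def pipe_html(html, command = ['tidy', '-quiet', '-asxhtml', '--show-warnings', 'no'])
  IO.popen(command, 'r+') do |io|
    io.write(html)  # send the markup to the child process
    io.close_write  # signal end of input so the child can finish
    io.read         # collect the cleaned output
  end
end

# Example call (commented out because it needs tidy installed):
# puts pipe_html('<p>Hi!')
```

Passing the command as an array avoids going through a shell, so nothing in the command needs quoting. After the block returns, $?.exitstatus holds tidy's exit code; tidy exits with 1 when it only had warnings and 2 when it hit errors, so a non-zero status does not always mean the output is unusable. Writing all the input before reading is fine for typical pages, but for very large input Open3.capture2 from the standard library is a safer way to avoid a blocked pipe.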