How do I translate this Hpricot code to Nokogiri?

 Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove

I found it at http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/


Nokogiri and Hpricot are pretty much interchangeable for this kind of task, i.e. Nokogiri(html) is the equivalent of Hpricot(html). I'm not really sure what else the linked article is trying to achieve, but its stated goal is to:

Extract text from HTML body which includes ignoring large white spaces between tags and words.

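An easier approach in Hpricot is to select just the body in the first place, which removes the need for the hpricot.search("script").remove lines (at least for elements that live in the head; a script or style element inside the body would still contribute its text):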

Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

And in Nokogiri:

Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
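
The gsub/split/join chain at the end of both one-liners only normalizes whitespace, so it can be tried on a plain string without either parser. A minimal sketch follows; the helper names `normalize_whitespace` and `normalize_whitespace_short` and the sample string are made up for illustration and are not from the linked article:

```ruby
# Hypothetical helper (name is an assumption, not from the original post):
# the same chain as in the one-liners above, applied to an ordinary string.
def normalize_whitespace(text)
  text.gsub("\r", " ").gsub("\n", " ").split(" ").join(" ")
end

# String#split with no argument also splits on runs of whitespace,
# so for ordinary text this shorter form should give the same result.
def normalize_whitespace_short(text)
  text.split.join(" ")
end

# Made-up sample text standing in for the result of inner_text.
sample = "  Hello\r\n   world \n\n from\tHpricot  "
puts normalize_whitespace(sample)
puts normalize_whitespace_short(sample)
```

If the shorter form behaves the same on your input, the Nokogiri line could probably be reduced to Nokogiri(html).search('body').inner_text.split.join(" ").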