Convert HTML to plain text and maintain structure/formatting, with Ruby

Posted: 2023-03-07 02:14

I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.

The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images).

I could put together a couple of regexes that get me 80% of the way there, but I figured there might be existing solutions with more intelligence.

First, don't try to use regex for this. The odds are really good you'll end up with a brittle solution that breaks when the HTML changes, or one that is very hard to manage and maintain.

You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:

require 'nokogiri'

html = '
<html>
<body>
  <p>This is
  some text.</p>
  <p>This is some more text.</p>
  <pre>
  This is
  preformatted
  text.
  </pre>
</body>
</html>
'

doc = Nokogiri::HTML(html)
puts doc.text

>>  This is
>>  some text.
>>  This is some more text.
>>  
>>  This is
>>  preformatted
>>  text.

This works because Nokogiri returns the text nodes: the text contained in the tags, along with the whitespace surrounding the tags. If you do a pre-flight cleanup of the HTML using tidy, you can sometimes get much nicer output.

The problem shows up when you compare the output of a parser, or of any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, even when the HTML is horribly malformed and broken. The parser is not designed to do that.

You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of its tutorials.

If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.

An alternative approach would be to capture the output of one of the text browsers, like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text browsers that let me grab the rendered output that way. I don't have the source available, so I can't check which one it was.
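A sketch of the text-browser route, assuming lynx is installed and on the PATH. The helper name render_text is made up, and the flag combination (-dump, -nolist, -stdin, -force_html) is one plausible way to invoke lynx, not necessarily what the original setup used.

```ruby
require 'open3'

# Assumed invocation: dump the rendered page to stdout, skip the trailing
# link list, read the document from stdin and treat it as HTML.
LYNX_COMMAND = %w[lynx -dump -nolist -stdin -force_html].freeze

# Hypothetical helper: pipe HTML through an external renderer and return
# whatever it prints. The command is a parameter so another text browser
# can be swapped in.
def render_text(html, command = LYNX_COMMAND)
  output, status = Open3.capture2(*command, stdin_data: html)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?

  output
end

# Usage (requires lynx):
#   puts render_text('<p>Hello<br>world</p>')
```

The advantage of this route is that the browser already decides how lists, tables, and paragraphs should look as text; the cost is an external dependency and a process spawn per document.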

Tags: hpricot, html-parsing, nokogiri, ruby
